2025-08-20

Title: Fair Play in the Newsroom: Actor-Based Filtering Gender Discrimination in Text Corpora

Authors: Stefanie Urchs, Veronika Thurner, Matthias Aßenmacher, Christian Heumann, Stephanie Thiemichen
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2508.13169
Pdf URL: https://arxiv.org/pdf/2508.13169
Copy Paste: [[2508.13169]] Fair Play in the Newsroom: Actor-Based Filtering Gender Discrimination in Text Corpora(https://arxiv.org/abs/2508.13169)
Keywords: language model
Abstract: Large language models are increasingly shaping digital communication, yet their outputs often reflect structural gender imbalances that originate from their training data. This paper presents an extended actor-level pipeline for detecting and mitigating gender discrimination in large-scale text corpora. Building on prior work in discourse-aware fairness analysis, we introduce new actor-level metrics that capture asymmetries in sentiment, syntactic agency, and quotation styles. The pipeline supports both diagnostic corpus analysis and exclusion-based balancing, enabling the construction of fairer corpora. We apply our approach to the taz2024full corpus of German newspaper articles from 1980 to 2024, demonstrating substantial improvements in gender balance across multiple linguistic dimensions. Our results show that while surface-level asymmetries can be mitigated through filtering and rebalancing, subtler forms of bias persist, particularly in sentiment and framing. We release the tools and reports to support further research in discourse-based fairness auditing and equitable corpus construction.

Title: MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents

Authors: Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, Chuanhao Li, Jiayi Tian, Chenchen Zhang, Tianhao Peng, Yancheng He, Jihao Gu, Yuanxing Zhang, Jian Yang, Ge Zhang, Wenhao Huang, Wangchunshu Zhou, Zhaoxiang Zhang, Ruizhe Ding, Shilei Wen
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2508.13186
Pdf URL: https://arxiv.org/pdf/2508.13186
Copy Paste: [[2508.13186]] MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents(https://arxiv.org/abs/2508.13186)
Keywords: prompt, agent
Abstract: AI agents with advanced reasoning and tool-use capabilities have demonstrated impressive performance in web browsing for deep search. While existing benchmarks such as BrowseComp evaluate these browsing abilities, they primarily focus on textual information, overlooking the prevalence of multimodal content. To bridge this gap, we introduce MM-BrowseComp, a novel benchmark comprising 224 challenging, hand-crafted questions specifically designed to assess agents' multimodal retrieval and reasoning capabilities. These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages. Consequently, methods relying solely on text prove insufficient for our benchmark. Additionally, we provide a verified checklist for each question, enabling fine-grained analysis of multimodal dependencies and reasoning paths. Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp reveals that even top models like OpenAI o3 with tools achieve only 29.02% accuracy, highlighting the suboptimal multimodal capabilities and lack of native multimodal reasoning in current models.

Title: Stands to Reason: Investigating the Effect of Reasoning on Idiomaticity Detection

Authors: Dylan Phelps, Rodrigo Wilkens, Edward Gow-Smith, Thomas Pickard, Maggie Mi, Aline Villavicencio
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.13365
Pdf URL: https://arxiv.org/pdf/2508.13365
Copy Paste: [[2508.13365]] Stands to Reason: Investigating the Effect of Reasoning on Idiomaticity Detection(https://arxiv.org/abs/2508.13365)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: The recent trend towards utilisation of reasoning models has improved the performance of Large Language Models (LLMs) across many tasks which involve logical steps. One linguistic task that could benefit from this framing is idiomaticity detection, as a potentially idiomatic expression must first be understood before it can be disambiguated and serves as a basis for reasoning. In this paper, we explore how reasoning capabilities in LLMs affect idiomaticity detection performance and examine the effect of model size. We evaluate, as open source representative models, the suite of DeepSeek-R1 distillation models ranging from 1.5B to 70B parameters across four idiomaticity detection datasets. We find the effect of reasoning to be smaller and more varied than expected. For smaller models, producing chain-of-thought (CoT) reasoning increases performance from Math-tuned intermediate models, but not to the levels of the base models, whereas larger models (14B, 32B, and 70B) show modest improvements. Our in-depth analyses reveal that larger models demonstrate good understanding of idiomaticity, successfully producing accurate definitions of expressions, while smaller models often fail to output the actual meaning. For this reason, we also experiment with providing definitions in the prompts of smaller models, which we show can improve performance in some cases.

Title: Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis

Authors: Ayoub Ben Chaliah, Hela Dellagi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.13382
Pdf URL: https://arxiv.org/pdf/2508.13382
Copy Paste: [[2508.13382]] Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis(https://arxiv.org/abs/2508.13382)
Keywords: language model, llm, chain-of-thought, agent
Abstract: We present Datarus-R1-14B, a 14B-parameter open-weights language model fine-tuned from Qwen 2.5-14B-Instruct to act as a virtual data analyst and graduate-level problem solver. Datarus is trained not on isolated question-answer pairs but on full analytical trajectories including reasoning steps, code execution, error traces, self-corrections, and final conclusions, all captured in a ReAct-style notebook format spanning finance, medicine, numerical analysis, and other quantitative domains. Our training pipeline combines (i) a trajectory-centric synthetic data generator that yielded 144,000 tagged notebook episodes, (ii) a dual-reward framework blending a lightweight tag-based structural signal with a Hierarchical Reward Model (HRM) that scores both single-step soundness and end-to-end coherence, and (iii) a memory-optimized implementation of Group Relative Policy Optimization (GRPO) featuring KV-cache reuse, sequential generation, and reference-model sharding. A cosine curriculum smoothly shifts emphasis from structural fidelity to semantic depth, reducing the format collapse and verbosity that often plague RL-aligned LLMs. A central design choice in Datarus is its dual reasoning interface. In agentic mode the model produces ReAct-tagged steps that invoke Python tools to execute real code; in reflection mode it outputs compact Chain-of-Thought (CoT) traces delimited by and tags. On demanding postgraduate-level problems, Datarus exhibits an "AHA-moment" pattern: it sketches hypotheses, revises them once or twice, and converges, avoiding the circular, token-inflating loops common to contemporary systems. Across standard public benchmarks, Datarus surpasses similar-size models and even reaches the level of larger reasoning models such as QwQ-32B, achieving up to 30% higher accuracy on AIME 2024/2025 and LiveCodeBench while emitting 18-49% fewer tokens per solution.
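
The cosine curriculum described above can be pictured as a weight that moves from the structural reward to the semantic reward over training. The abstract does not give the schedule, so the sketch below is only one plausible form: the function names, the half-cosine shape, and the linear blend are all assumptions made for illustration.

```python
import math

def structure_weight(step, total_steps):
    """Hypothetical half-cosine schedule: 1.0 at the start, 0.0 at the end."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return 0.5 * (1.0 + math.cos(math.pi * progress))

def blended_reward(structural, semantic, step, total_steps):
    """Assumed linear blend of the two reward signals under the schedule."""
    w = structure_weight(step, total_steps)
    return w * structural + (1.0 - w) * semantic

# Early in training the structural signal dominates; late, the semantic one.
early = blended_reward(structural=1.0, semantic=0.0, step=0, total_steps=1000)
late = blended_reward(structural=1.0, semantic=0.0, step=1000, total_steps=1000)
```

With these toy inputs the blended reward equals the structural weight itself, which makes the shift in emphasis easy to inspect at any step.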

Title: ALIGN: Word Association Learning for Cross-Cultural Generalization in Large Language Models

Authors: Chunhua Liu, Kabir Manandhar Shrestha, Sukai Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.13426
Pdf URL: https://arxiv.org/pdf/2508.13426
Copy Paste: [[2508.13426]] ALIGN: Word Association Learning for Cross-Cultural Generalization in Large Language Models(https://arxiv.org/abs/2508.13426)
Keywords: language model, llm
Abstract: As large language models (LLMs) increasingly mediate cross-cultural communication, their behavior still reflects the distributional bias of the languages and viewpoints that are over-represented in their pre-training corpora. Yet, it remains a challenge to model and align culture due to limited cultural knowledge and a lack of exploration into effective learning approaches. We introduce a cost-efficient, cognitively grounded remedy: parameter-efficient fine-tuning on native speakers' free word-association norms, which encode implicit cultural schemas. Leveraging English-US and Mandarin associations from the Small-World-of-Words project, we adapt Llama-3.1-8B and Qwen-2.5-7B via supervised fine-tuning (SFT) and PPO-based preference optimization. SFT boosts held-out association Precision at 5 by 16-20% in English and 43-165% in Mandarin, lifts median concreteness by +0.20, and attains human-level valence and arousal. These lexical gains transfer: on World-Values-Survey questions, fine-tuned models shift answer distributions toward the target culture, and on a 50-item high-tension subset, Qwen's Chinese-aligned responses double while Llama's US bias drops by one-third. Our 7-8B models rival or beat vanilla 70B baselines, showing that a few million culture-grounded associations can instill value alignment without costly retraining. Our work highlights both the promise and the need for future research grounded in human cognition in improving cultural alignment in AI models.
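
For readers unfamiliar with the metric, Precision at 5 on held-out associations can be sketched as below. The cue, the word lists, and the choice to divide by k even when fewer than k predictions are returned are illustrative assumptions, not details taken from the paper.

```python
def precision_at_k(predicted, reference, k=5):
    """Fraction of the top-k predicted associations found in the human reference set."""
    if k <= 0:
        raise ValueError("k must be positive")
    reference_set = set(reference)
    hits = sum(1 for word in predicted[:k] if word in reference_set)
    return hits / k

# Hypothetical cue "rice": model associations versus human norm responses.
model_words = ["food", "bowl", "white", "car", "grain"]
human_words = ["food", "white", "grain", "asia", "eat"]
score = precision_at_k(model_words, human_words)
```

Here three of the five model associations ("food", "white", "grain") appear in the human list.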

Title: ProMed: Shapley Information Gain Guided Reinforcement Learning for Proactive Medical LLMs

Authors: Hongxin Ding, Baixiang Huang, Yue Fang, Weibin Liao, Xinke Jiang, Zheng Li, Junfeng Zhao, Yasha Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.13514
Pdf URL: https://arxiv.org/pdf/2508.13514
Copy Paste: [[2508.13514]] ProMed: Shapley Information Gain Guided Reinforcement Learning for Proactive Medical LLMs(https://arxiv.org/abs/2508.13514)
Keywords: language model, llm
Abstract: Interactive medical questioning is essential in real-world clinical consultations, where physicians must actively gather information from patients. While medical Large Language Models (LLMs) have shown impressive capabilities in static medical question answering, they predominantly operate under a reactive paradigm: generating answers directly without seeking additional information, which risks incorrect diagnoses in such interactive settings. To address this limitation, we propose ProMed, a reinforcement learning (RL) framework that transitions medical LLMs toward a proactive paradigm, equipping them with the ability to ask clinically valuable questions before decision-making. At the core of ProMed is the Shapley Information Gain (SIG) reward, which quantifies the clinical utility of each question by combining the amount of newly acquired information with its contextual importance, estimated via Shapley values. We integrate SIG into a two-stage training pipeline: (1) SIG-Guided Model Initialization uses Monte Carlo Tree Search (MCTS) to construct high-reward interaction trajectories to supervise the model, and (2) SIG-Augmented Policy Optimization integrates SIG and enhances RL with a novel SIG-guided Reward Distribution Mechanism that assigns higher rewards to informative questions for targeted optimization. Extensive experiments on two newly curated partial-information medical benchmarks demonstrate that ProMed significantly outperforms state-of-the-art methods by an average of 6.29% and delivers a 54.45% gain over the reactive paradigm, while also generalizing robustly to out-of-domain cases.
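
The abstract does not spell out how SIG is computed, but the Shapley values it relies on are standard. The sketch below computes exact Shapley values by enumerating coalitions; the patient facts and the `toy_value` function are hypothetical stand-ins for whatever utility the paper actually uses, and this is not the SIG reward itself.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley value of each player under a coalition value function."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result

def toy_value(facts):
    """Hypothetical diagnostic utility of a set of known patient facts."""
    v = 0.0
    if "fever" in facts:
        v += 0.5
    if "cough" in facts:
        v += 0.3
    if "fever" in facts and "rash" in facts:
        v += 0.2  # rash only helps once fever is known
    return v

importance = shapley_values(["fever", "cough", "rash"], toy_value)
```

Exact enumeration is exponential in the number of facts, so it only suits toy cases like this one; larger settings usually call for sampling-based estimates.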

Title: Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation

Authors: Hassan Barmandah
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.13525
Pdf URL: https://arxiv.org/pdf/2508.13525
Copy Paste: [[2508.13525]] Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation(https://arxiv.org/abs/2508.13525)
Keywords: language model, gpt, llm, chat
Abstract: Large language models (LLMs) for Arabic are still dominated by Modern Standard Arabic (MSA), with limited support for Saudi dialects such as Najdi and Hijazi. This underrepresentation hinders their ability to capture authentic dialectal variation. Using a privately curated Saudi Dialect Instruction dataset (Hijazi and Najdi; 5,466 synthetic instruction-response pairs; 50/50 split), we LoRA-tune ALLaM-7B-Instruct-preview, the first foundation model developed in Saudi Arabia, for Saudi dialect generation. We investigate two variants: (i) Dialect-Token training, which prepends an explicit dialect tag to the instruction, and (ii) No-Token training, which omits the tag at formatting time. Evaluation on a held-out test set combines an external dialect classifier with text fidelity metrics (chrF++ and BERTScore) and diversity measures. The Dialect-Token model achieves the best control, raising the Saudi rate from 47.97% to 84.21% and reducing MSA leakage from 32.63% to 6.21%; fidelity also improves (chrF++ +3.53, BERTScore +0.059). Both LoRA variants outperform strong generic instruction models (Falcon-7B-Instruct, Llama-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, AceGPT-v2-8B-Chat, JAIS-13B-Chat) in dialect control and fidelity, while avoiding metadata-tag echoing that these baselines frequently exhibit. We do not release the dataset or any model weights/adapters; instead, we release training/evaluation/inference code and a detailed datasheet (schema and aggregate statistics) to support independent verification.
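
The difference between the two training variants comes down to how each example is formatted. A minimal sketch follows, assuming an angle-bracket tag and a newline separator; the actual tag syntax and prompt template are not given in the abstract.

```python
def format_example(instruction, response, dialect=None):
    """Build one training string; prepend a dialect tag only when one is given."""
    prefix = f"<{dialect}> " if dialect else ""  # hypothetical tag syntax
    return f"{prefix}{instruction}\n{response}"

# Dialect-Token variant: the tag is part of the instruction.
with_token = format_example("Greet a guest.", "...", dialect="NAJDI")

# No-Token variant: the same pair, with the tag omitted at formatting time.
without_token = format_example("Greet a guest.", "...")
```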

Title: MATA (māta): Mindful Assessment of the Telugu Abilities of Large Language Models

Authors: Chalamalasetti Kranti, Sowmya Vajjala
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.13526
Pdf URL: https://arxiv.org/pdf/2508.13526
Copy Paste: [[2508.13526]] MATA (māta): Mindful Assessment of the Telugu Abilities of Large Language Models(https://arxiv.org/abs/2508.13526)
Keywords: language model, llm
Abstract: In this paper, we introduce MATA, a novel evaluation dataset to assess the ability of Large Language Models (LLMs) in the Telugu language, comprising 729 carefully curated multiple-choice and open-ended questions that span diverse linguistic dimensions. We evaluate 11 open-weight and closed-source LLMs on our dataset and present a fine-grained analysis of their performance. Further, we empirically show how LLMs rely on superficial heuristics such as answer position and distractor patterns for multiple-choice questions. Finally, we also compare LLM-as-a-judge evaluation with human evaluation for open-ended questions and draw some conclusions on its reliability in a low-resource language. We argue that such fine-grained evaluation is essential for understanding model limitations and can inform the development of more linguistically capable LLMs, while also serving as a foundation for future research in Telugu NLP.

Title: A Comparative Study of Decoding Strategies in Medical Text Generation

Authors: Oriana Presacan, Alireza Nik, Vajira Thambawita, Bogdan Ionescu, Michael Riegler
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.13580
Pdf URL: https://arxiv.org/pdf/2508.13580
Copy Paste: [[2508.13580]] A Comparative Study of Decoding Strategies in Medical Text Generation(https://arxiv.org/abs/2508.13580)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) rely on various decoding strategies to generate text, and these choices can significantly affect output quality. In healthcare, where accuracy is critical, the impact of decoding strategies remains underexplored. We investigate this effect in five open-ended medical tasks, including translation, summarization, question answering, dialogue, and image captioning, evaluating 11 decoding strategies with medically specialized and general-purpose LLMs of different sizes. Our results show that deterministic strategies generally outperform stochastic ones: beam search achieves the highest scores, while η and top-k sampling perform worst. Slower decoding methods tend to yield better quality. Larger models achieve higher scores overall but have longer inference times and are no more robust to decoding. Surprisingly, while medical LLMs outperform general ones in two of the five tasks, statistical analysis shows no overall performance advantage and reveals greater sensitivity to decoding choice. We further compare multiple evaluation metrics and find that correlations vary by task, with MAUVE showing weak agreement with BERTScore and ROUGE, as well as greater sensitivity to the decoding strategy. These results highlight the need for careful selection of decoding methods in medical applications, as their influence can sometimes exceed that of model choice.
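
To make the deterministic versus stochastic distinction concrete, the sketch below contrasts greedy selection (the simplest deterministic strategy, used here in place of beam search) with top-k sampling on a single next-token distribution. The tokens and probabilities are invented for illustration and do not come from the paper.

```python
import random

def greedy_pick(probs):
    """Deterministic: always return the most probable token."""
    return max(probs, key=probs.get)

def top_k_sample(probs, k, rng):
    """Stochastic: sample among the k most probable tokens, weighted by probability."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [p for _, p in top]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution after "Take 500".
next_token = {"mg": 0.70, "ml": 0.20, "g": 0.07, "kg": 0.03}

rng = random.Random(0)  # seeded so repeated runs draw the same samples
greedy = greedy_pick(next_token)
samples = [top_k_sample(next_token, k=3, rng=rng) for _ in range(10)]
```

Greedy selection returns the same token on every call, whereas the sampled tokens can differ from draw to draw within the top three candidates.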

Title: Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM

Authors: Dariia Puhach, Amir H. Payberah, Éva Székely
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.13603
Pdf URL: https://arxiv.org/pdf/2508.13603
Copy Paste: [[2508.13603]] Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM(https://arxiv.org/abs/2508.13603)
Keywords: language model, llm, prompt
Abstract: Similar to text-based Large Language Models (LLMs), Speech-LLMs exhibit emergent abilities and context awareness. However, whether these similarities extend to gender bias remains an open question. This study proposes a methodology leveraging speaker assignment as an analytic tool for bias investigation. Unlike text-based models, which encode gendered associations implicitly, Speech-LLMs must produce a gendered voice, making speaker selection an explicit bias cue. We evaluate Bark, a Text-to-Speech (TTS) model, analyzing its default speaker assignments for textual prompts. If Bark's speaker selection systematically aligns with gendered associations, it may reveal patterns in its training data or model design. To test this, we construct two datasets: (i) Professions, containing gender-stereotyped occupations, and (ii) Gender-Colored Words, featuring gendered connotations. While Bark does not exhibit systematic bias, it demonstrates gender awareness and has some gender inclinations.

Title: CRISP: Persistent Concept Unlearning via Sparse Autoencoders

Authors: Tomer Ashuach, Dana Arad, Aaron Mueller, Martin Tutek, Yonatan Belinkov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.13650
Pdf URL: https://arxiv.org/pdf/2508.13650
Copy Paste: [[2508.13650]] CRISP: Persistent Concept Unlearning via Sparse Autoencoders(https://arxiv.org/abs/2508.13650)
Keywords: language model, llm
Abstract: As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model's parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.

Title: ViExam: Are Vision Language Models Better than Humans on Vietnamese Multimodal Exam Questions?

Authors: Vy Tuong Dang, An Vo, Quang Tau, Duc Dm, Daeyoung Kim
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.13680
Pdf URL: https://arxiv.org/pdf/2508.13680
Copy Paste: [[2508.13680]] ViExam: Are Vision Language Models Better than Humans on Vietnamese Multimodal Exam Questions?(https://arxiv.org/abs/2508.13680)
Keywords: language model, prompt
Abstract: Vision language models (VLMs) demonstrate remarkable capabilities on English multimodal tasks, but their performance on low-resource languages with genuinely multimodal educational content remains largely unexplored. In this work, we test how VLMs perform on Vietnamese educational assessments, investigating whether VLMs trained predominantly on English data can handle real-world cross-lingual multimodal reasoning. Our work presents the first comprehensive evaluation of VLM capabilities on multimodal Vietnamese exams through proposing ViExam, a benchmark containing 2,548 multimodal questions. We find that state-of-the-art VLMs achieve only 57.74% while open-source models achieve 27.70% mean accuracy across 7 academic domains, including Mathematics, Physics, Chemistry, Biology, Geography, Driving Test, and IQ Test. Most VLMs underperform average human test-takers (66.54%), with only the thinking VLM o3 (74.07%) exceeding human average performance, yet still falling substantially short of human best performance (99.60%). Cross-lingual prompting with English instructions while maintaining Vietnamese content fails to improve performance, decreasing accuracy by 1 percentage point for SOTA VLMs. Human-in-the-loop collaboration can partially improve VLM performance by 5 percentage points. Code and data are available at: this https URL.

Title: Generics and Default Reasoning in Large Language Models

Authors: James Ravi Kirkpatrick, Rachel Katharine Sterken
Subjects: cs.CL, cs.AI, cs.LO
Abstract URL: https://arxiv.org/abs/2508.13718
Pdf URL: https://arxiv.org/pdf/2508.13718
Copy Paste: [[2508.13718]] Generics and Default Reasoning in Large Language Models(https://arxiv.org/abs/2508.13718)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: This paper evaluates the capabilities of 28 large language models (LLMs) to reason with 20 defeasible reasoning patterns involving generic generalizations (e.g., 'Birds fly', 'Ravens are black') central to non-monotonic logic. Generics are of special interest to linguists, philosophers, logicians, and cognitive scientists because of their complex exception-permitting behaviour and their centrality to default reasoning, cognition, and concept acquisition. We find that while several frontier models handle many default reasoning problems well, performance varies widely across models and prompting styles. Few-shot prompting modestly improves performance for some models, but chain-of-thought (CoT) prompting often leads to serious performance degradation (mean accuracy drop -11.14%, SD 15.74% in models performing above 75% accuracy in zero-shot condition, temperature 0). Most models either struggle to distinguish between defeasible and deductive inference or misinterpret generics as universal statements. These findings underscore both the promise and limits of current LLMs for default reasoning.

Title: Prediction is not Explanation: Revisiting the Explanatory Capacity of Mapping Embeddings

Authors: Hanna Herasimchyk, Alhassan Abdelhalim, Sören Laue, Michaela Regneri
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.13729
Pdf URL: https://arxiv.org/pdf/2508.13729
Copy Paste: [[2508.13729]] Prediction is not Explanation: Revisiting the Explanatory Capacity of Mapping Embeddings(https://arxiv.org/abs/2508.13729)
Keywords: language model, llm
Abstract: Understanding what knowledge is implicitly encoded in deep learning models is essential for improving the interpretability of AI systems. This paper examines common methods to explain the knowledge encoded in word embeddings, which are core elements of large language models (LLMs). These methods typically involve mapping embeddings onto collections of human-interpretable semantic features, known as feature norms. Prior work assumes that accurately predicting these semantic features from the word embeddings implies that the embeddings contain the corresponding knowledge. We challenge this assumption by demonstrating that prediction accuracy alone does not reliably indicate genuine feature-based interpretability. We show that these methods can successfully predict even random information, concluding that the results are predominantly determined by an algorithmic upper bound rather than meaningful semantic representation in the word embeddings. Consequently, comparisons between datasets based solely on prediction performance do not reliably indicate which dataset is better captured by the word embeddings. Our analysis illustrates that such mappings primarily reflect geometric similarity within vector spaces rather than indicating the genuine emergence of semantic properties.

Title: EEG-MedRAG: Enhancing EEG-based Clinical Decision-Making via Hierarchical Hypergraph Retrieval-Augmented Generation

Authors: Yi Wang, Haoran Luo, Lu Meng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.13735
Pdf URL: https://arxiv.org/pdf/2508.13735
Copy Paste: [[2508.13735]] EEG-MedRAG: Enhancing EEG-based Clinical Decision-Making via Hierarchical Hypergraph Retrieval-Augmented Generation(https://arxiv.org/abs/2508.13735)
Keywords: retrieval-augmented generation
Abstract: With the widespread application of electroencephalography (EEG) in neuroscience and clinical practice, efficiently retrieving and semantically interpreting large-scale, multi-source, heterogeneous EEG data has become a pressing challenge. We propose EEG-MedRAG, a three-layer hypergraph-based retrieval-augmented generation framework that unifies EEG domain knowledge, individual patient cases, and a large-scale repository into a traversable n-ary relational hypergraph, enabling joint semantic-temporal retrieval and causal-chain diagnostic generation. Concurrently, we introduce the first cross-disease, cross-role EEG clinical QA benchmark, spanning seven disorders and five authentic clinical perspectives. This benchmark allows systematic evaluation of disease-agnostic generalization and role-aware contextual understanding. Experiments show that EEG-MedRAG significantly outperforms TimeRAG and HyperGraphRAG in answer accuracy and retrieval, highlighting its strong potential for real-world clinical decision support. Our data and code are publicly available at this https URL.
摘要：随着脑电图（EEG）在神经科学和临床实践中的广泛应用，有效地检索和语义解释了大规模，多源，异质性EEG数据已成为一个紧迫的挑战。我们提出了EEG-MEDRAG，这是一种基于三层超毛孔的检索型生成框架，将EEG结构域知识，个别患者病例和大规模存储库统一到可遍历的N-ary关系超级绘图中，从而实现了联合性语义性语义性超透明剂，从而实现了诊断和因果关系诊断性的生成。同时，我们引入了第一个交叉疾病，跨力的脑电图临床基准测试，涵盖了七种疾病和五种真实的临床视角。该基准允许系统地评估疾病不足的概括和角色感知的上下文理解。实验表明，EEG-MEDRAG在答案的准确性和检索方面显着优于Timerag和HyperGraphrag，这突出了其对现实世界中临床决策支持的强大潜力。我们的数据和代码可在此HTTPS URL上公开获取。

Title: Sycophancy under Pressure: Evaluating and Mitigating Sycophantic Bias via Adversarial Dialogues in Scientific QA

Authors: Kaiwei Zhang, Qi Jia, Zijian Chen, Wei Sun, Xiangyang Zhu, Chunyi Li, Dandan Zhu, Guangtao Zhai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.13743
Pdf URL: https://arxiv.org/pdf/2508.13743
Copy Paste: [[2508.13743]] Sycophancy under Pressure: Evaluating and Mitigating Sycophantic Bias via Adversarial Dialogues in Scientific QA(https://arxiv.org/abs/2508.13743)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs), while increasingly used in domains requiring factual rigor, often display a troubling behavior: sycophancy, the tendency to align with user beliefs regardless of correctness. This tendency is reinforced by preference-based alignment techniques that optimize for user satisfaction but can undermine truthfulness. While relatively benign in casual dialogue, sycophancy poses serious risks in high-stakes settings such as scientific question answering (QA), where model outputs may shape collaborative reasoning, decision-making, and knowledge formation. Despite its importance, this phenomenon remains underexamined in factual QA contexts. We address this gap by introducing a unified evaluation framework to quantify the impact of sycophantic context on model behavior in scientific QA, measuring how much user-imposed social pressure distorts model outputs. The framework incorporates adversarial prompting setups and targeted metrics, such as misleading resistance and sycophancy resistance, that capture a model's ability to maintain factual consistency under misleading cues. Systematic evaluations across open-source and proprietary models reveal pervasive sycophantic tendencies, driven more by alignment strategy than by model size. To mitigate this issue, we propose Pressure-Tune, a lightweight post-training method that fine-tunes models on synthetic adversarial dialogues paired with chain-of-thought rationales. These rationales reject user misinformation while reinforcing factual commitments. Experiments on challenging scientific QA benchmarks show that Pressure-Tune significantly enhances sycophancy resistance without compromising accuracy or responsiveness to valid feedback, offering a practical pathway toward more truthful and principled model behavior.
摘要：大型语言模型（LLMS）虽然越来越多地用于需要事实严格的领域，但通常会表现出令人不安的行为：无粘合症，与用户信念保持一致的趋势无论正确。通过基于偏好的对齐技术来增强这种趋势，该技术优化了用户满意度，但会破坏真实性。尽管在随意的对话中相对良性，但在诸如科学问题回答（QA）之类的高风险环境中构成了严重的风险，其中模型输出可能会塑造协作推理，决策和知识形成。尽管它很重要，但这种现象在事实质量质量质量的环境中仍然没有被忽视。我们通过引入一个统一的评估框架来解决这一差距，以量化Sycophantic环境对科学质量保险公司模型行为的影响，从而衡量用户施加的社会压力扭曲模型输出的程度。该框架结合了对抗性提示设置和有针对性的指标，例如误导性阻力和抗粘液抗性，这些指标捕获了模型在误导性提示下保持事实一致性的能力。跨开源和专有模型之间的系统评估揭示了普遍的粘噬细胞趋势，而与模型尺寸相比，由比对策略驱动更多。为了减轻此问题，我们提出了压力调节，这是一种轻巧的训练后方法，该方法对合成对抗对话进行微调模型与经过经过经过经过经过经过经过经过经过经过经过经过经过经验链的理由的对话配对。这些理由拒绝用户错误的信息，同时加强事实承诺。有关挑战性科学质量检查基准测试的实验表明，压力调节可显着增强粘粘性的抵抗力，而不会损害有效反馈的准确性或响应能力，从而为更真实和原则性的模型行为提供了实用的途径。

Title: MGT-Prism: Enhancing Domain Generalization for Machine-Generated Text Detection via Spectral Alignment

Authors: Shengchao Liu, Xiaoming Liu, Chengzhengxu Li, Zhaohan Zhang, Guoxin Ma, Yu Lan, Shuai Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.13768
Pdf URL: https://arxiv.org/pdf/2508.13768
Copy Paste: [[2508.13768]] MGT-Prism: Enhancing Domain Generalization for Machine-Generated Text Detection via Spectral Alignment(https://arxiv.org/abs/2508.13768)
Keywords: language model
Abstract: Large Language Models have shown growing ability to generate fluent and coherent texts that are highly similar to the writing style of humans. Current detectors for Machine-Generated Text (MGT) perform well when they are trained and tested in the same domain but generalize poorly to unseen domains, due to domain shift between data from different sources. In this work, we propose MGT-Prism, an MGT detection method from the perspective of the frequency domain for better domain generalization. Our key insight stems from analyzing text representations in the frequency domain, where we observe consistent spectral patterns across diverse domains, while significant discrepancies in magnitude emerge between MGT and human-written texts (HWTs). The observation initiates the design of a low frequency domain filtering module for filtering out the document-level features that are sensitive to domain shift, and a dynamic spectrum alignment strategy to extract the task-specific and domain-invariant features for improving the detector's performance in domain generalization. Extensive experiments demonstrate that MGT-Prism outperforms state-of-the-art baselines by an average of 0.90% in accuracy and 0.92% in F1 score on 11 test datasets across three domain-generalization scenarios.
摘要：大型语言模型显示出越来越多的能力产生流利和连贯的文本，这些文本与人类的写作风格高度相似。当前机器生成的文本（MGT）在同一域中训练和测试时，它们的探测器表现良好，但由于来自不同来源的数据之间的域移动，因此在同一域中训练和测试范围很差，以至于看不见域。在这项工作中，我们提出了MGT-Prism，这是一种MGT检测方法，从频域的角度来进行更好的域概括。我们的关键见解源于分析频域中的文本表示，在该文本表示范围内，我们观察到跨不同域之间的光谱模式，而MGT和人写的文本（HWTS）之间的幅度显着差异。该观察结果启动了低频域滤波模块的设计，以滤除对域移动敏感的文档级特征，以及一种动态频谱对齐策略，以提取特定任务和域的不变特征，以改善探测器在域中的性能。广泛的实验表明，MGT-PRISM在三种域将来的情况下，在11个测试数据集上，准确度的准确性平均优于最先进的基线，而F1得分为0.92％。

Title: Can Large Language Models (LLMs) Describe Pictures Like Children? A Comparative Corpus Study

Authors: Hanna Woloszyn, Benjamin Gagl
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.13769
Pdf URL: https://arxiv.org/pdf/2508.13769
Copy Paste: [[2508.13769]] Can Large Language Models (LLMs) Describe Pictures Like Children? A Comparative Corpus Study(https://arxiv.org/abs/2508.13769)
Keywords: language model, llm, prompt
Abstract: The role of large language models (LLMs) in education is increasing, yet little attention has been paid to whether LLM-generated text resembles child language. This study evaluates how LLMs replicate child-like language by comparing LLM-generated texts to a collection of German children's descriptions of picture stories. We generated two LLM-based corpora using the same picture stories and two prompt types: zero-shot and few-shot prompts specifying a general age from the children's corpus. We conducted a comparative analysis across psycholinguistic text properties, including word frequency, lexical richness, sentence and word length, part-of-speech tags, and semantic similarity with word embeddings. The results show that LLM-generated texts are longer but less lexically rich, rely more on high-frequency words, and under-represent nouns. Semantic vector space analysis revealed low similarity, highlighting differences between the two corpora on the level of corpus semantics. Few-shot prompting increased the similarity between children's and LLM texts to a minor extent, but still failed to replicate lexical and semantic patterns. The findings contribute to our understanding of how LLMs approximate child language through multimodal prompting (text + image) and give insights into their use in psycholinguistic research and education while raising important questions about the appropriateness of LLM-generated language in child-directed educational tools.
摘要：大型语言模型（LLM）在教育中的作用正在增加，但是很少有人注意LLM生成的文本是否类似于儿童语言。这项研究通过将LLM生成的文本与德国儿童对图片故事的描述进行比较来评估LLM如何复制类似儿童的语言。我们使用相同的图片故事和两种及时的类型生成了两个基于LLM的语料库：零射击和几个提示，从儿童语料库中指定了一般年龄。我们对心理语言文本属性进行了比较分析，包括单词频率，词汇丰富性，句子和单词长度，言论的一部分标签以及与单词嵌入的语义相似性。结果表明，LLM生成的文本更长，但词汇丰富，更多地依赖于高频单词和代表性不足的名词。语义矢量空间分析显示出低相似性，突出了两个语料库语义语义水平上两个语料库之间的差异。很少有促使儿童和LLM文本之间的相似之处在很小的程度上增加，但仍未复制词汇和语义模式。这些发现有助于我们理解LLM如何通过多模式提示（文本 +图像）近似儿童语言，并深入了解它们在心理学研究和教育中的使用，同时提出有关LLM生成语言在儿童指导的教育工具中适当性的重要问题。

Title: TracSum: A New Benchmark for Aspect-Based Summarization with Sentence-Level Traceability in Medical Domain

Authors: Bohao Chu, Meijie Li, Sameh Frihat, Chengyu Gu, Georg Lodde, Elisabeth Livingstone, Norbert Fuhr
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.13798
Pdf URL: https://arxiv.org/pdf/2508.13798
Copy Paste: [[2508.13798]] TracSum: A New Benchmark for Aspect-Based Summarization with Sentence-Level Traceability in Medical Domain(https://arxiv.org/abs/2508.13798)
Keywords: llm
Abstract: While document summarization with LLMs has enhanced access to textual information, concerns about the factual accuracy of these summaries persist, especially in the medical domain. Tracing evidence from which summaries are derived enables users to assess their accuracy, thereby alleviating this concern. In this paper, we introduce TracSum, a novel benchmark for traceable, aspect-based summarization, in which generated summaries are paired with sentence-level citations, enabling users to trace back to the original context. First, we annotate 500 medical abstracts for seven key medical aspects, yielding 3.5K summary-citation pairs. We then propose a fine-grained evaluation framework for this new task, designed to assess the completeness and consistency of generated content using four metrics. Finally, we introduce a summarization pipeline, Track-Then-Sum, which serves as a baseline method for comparison. In experiments, we evaluate both this baseline and a set of LLMs on TracSum, and conduct a human evaluation to assess the evaluation results. The findings demonstrate that TracSum can serve as an effective benchmark for traceable, aspect-based summarization tasks. We also observe that explicitly performing sentence-level tracking prior to summarization enhances generation accuracy, while incorporating the full context further improves completeness.
摘要：尽管使用LLMS的文档摘要增强了对文本信息的访问，但对这些摘要的事实准确性的担忧仍然存在，尤其是在医疗领域。追踪得出的摘要的证据使用户能够评估其准确性，从而减轻这种担忧。在本文中，我们介绍了Tracsum，这是一种基于可追溯的，基于方面的摘要的新基准，其中生成的摘要与句子级的引用配对，使用户可以追溯到原始上下文。首先，我们对七个关键医学方面的500个医学摘要注释，产生3.5k摘要引用对。然后，我们为这项新任务提出了一个细粒度的评估框架，旨在评估使用四个指标的生成内容的完整性和一致性。最后，我们引入了一个摘要管道，即轨道，然后用作比较的基线方法。在实验中，我们在Tracsum上评估了这一基线和一组LLM，并进行人类评估以评估评估结果。研究结果表明，Tracsum可以作为可追溯，基于方面的摘要任务的有效基准。我们还观察到，在摘要之前明确执行句子级跟踪可以提高生成精度，同时融合完整的环境进一步提高了完整性。

Title: Beyond Human Judgment: A Bayesian Evaluation of LLMs' Moral Values Understanding

Authors: Maciej Skorski, Alina Landowska
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2508.13804
Pdf URL: https://arxiv.org/pdf/2508.13804
Copy Paste: [[2508.13804]] Beyond Human Judgment: A Bayesian Evaluation of LLMs' Moral Values Understanding(https://arxiv.org/abs/2508.13804)
Keywords: language model, llm
Abstract: How do large language models understand moral dimensions compared to humans? This first large-scale Bayesian evaluation of market-leading language models provides the answer. In contrast to prior work using deterministic ground truth (majority or inclusion rules), we model annotator disagreements to capture both aleatoric uncertainty (inherent human disagreement) and epistemic uncertainty (model domain sensitivity). We evaluate top language models (Claude Sonnet 4, DeepSeek-V3, Llama 4 Maverick) across 250K+ annotations from ~700 annotators on 100K+ texts spanning social media, news, and forums. Our GPU-optimized Bayesian framework processed 1M+ model queries, revealing that AI models typically rank among the top 25% of human annotators, achieving much better-than-average balanced accuracy. Importantly, we find that AI produces far fewer false negatives than humans, highlighting their more sensitive moral detection capabilities.
摘要：与人类相比，大型语言模型如何理解道德维度？对市场领先语言模型的第一个大规模贝叶斯评估提供了答案。与使用确定性基础真理（多数或包含规则）的先前工作相反，我们对注释者分歧进行了模型，以捕获息肉不确定性（固有的人类分歧）和认知不确定性（模型域灵敏度）。我们在250k+注释中评估了高语模型（Claude Sonnet 4，DeepSeek-V3，Llama 4 Maverick），涉及约700个注释，涵盖100k+文本，涵盖社交媒体，新闻和论坛。我们的GPU优化的贝叶斯框架处理了1M+模型查询，表明AI模型通常排名在人类注释者的前25％\％，实现了比平均水平的精度要高得多。重要的是，我们发现AI产生的假否定性比人类少得多，突出了他们更敏感的道德检测能力。
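
Note on the metrics named above: balanced accuracy and the false-negative rate can both be read off a binary confusion matrix. The sketch below is only an illustration with made-up labels. It does not reproduce the paper's Bayesian model of annotator disagreement, and the function name is hypothetical.

```python
def balanced_accuracy_and_fnr(y_true, y_pred):
    """Return (balanced accuracy, false-negative rate) for binary labels.

    Labels are treated as truthy (moral value present) or falsy (absent).
    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")
    tpr = tp / (tp + fn)   # sensitivity: share of positives that were found
    tnr = tn / (tn + fp)   # specificity: share of negatives that were found
    return (tpr + tnr) / 2, fn / (tp + fn)


# Toy data (invented): three positive texts, one negative text.
bal_acc, fnr = balanced_accuracy_and_fnr([1, 1, 1, 0], [1, 1, 0, 0])
```

Averaging sensitivity and specificity keeps a rater from scoring well simply by predicting the majority class, which matters when most texts carry no moral content.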

Title: Prompt-Based One-Shot Exact Length-Controlled Generation with LLMs

Authors: Juncheng Xie, Hung-yi Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.13805
Pdf URL: https://arxiv.org/pdf/2508.13805
Copy Paste: [[2508.13805]] Prompt-Based One-Shot Exact Length-Controlled Generation with LLMs(https://arxiv.org/abs/2508.13805)
Keywords: language model, gpt, llm, prompt
Abstract: Controlling the length of text produced by large language models (LLMs) remains challenging: models frequently overshoot or undershoot explicit length instructions because they cannot reliably keep an internal token count. We present a prompt-based, one-shot strategy that compels an off-the-shelf LLM to generate exactly a desired number of tokens - words (English) or characters (Chinese) - without any fine-tuning or iterative sampling. The prompt appends countdown markers and explicit counting rules so that the model "writes while counting." We evaluate on four settings: open-ended generation (1-1000 tokens), XSUM summarization, MT-Bench-LI instruction following, and the LIFEBENCH equal-length track. On MT-Bench-LI, strict length compliance with GPT-4.1 leaps from below 30% under naive prompts to above 95% with our countdown prompt, surpassing the popular draft-then-revise baseline, while judged answer quality is preserved. These results show that precise length control can be achieved through prompt engineering alone, offering a lightweight alternative to training- or decoding-based methods.
摘要：控制大型语言模型（LLM）产生的文本长度仍然具有挑战性：模型经常超过或不明确的长度说明，因为它们无法可靠地保持内部令牌计数。我们提出了一种基于及时的单发策略，该策略迫使现成的llm精确地生成了所需数量的令牌 - 单词（英语）或字符（中文） - 而无需进行任何微调或迭代抽样。该提示将附加倒计时标记和显式计数规则，以便模型“在计数时写入”。我们在四个设置上进行评估：开放式生成（1-1000个令牌），XSUM摘要，MT-Bench-LI指令以下和救生馆等于长度的轨道。在MT-Bench-Li上，严格符合GPT-4.1在NAIVE提示下从30％以下的GPT-4.1跳跃到95％以上，在我们的倒计时提示下超过了95％，超过了流行的选秀，然后又超过了革命性的基线，而被判断的答案质量则保留了。这些结果表明，仅通过及时的工程才能实现精确的长度控制，从而提供了轻巧的替代基于训练或解码的方法。
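
Note on the countdown idea: the abstract does not give the exact prompt, so the following is a minimal sketch under assumptions. The `<N>` marker format, the rule wording, and both function names are hypothetical, and only word counting (the English setting) is covered. It shows how a countdown prompt could be assembled and how a marked-up answer could be checked and cleaned.

```python
import re

# One word followed by a countdown marker such as "<3>" (assumed format).
PAIR = re.compile(r"(\S+)\s*<(\d+)>")


def build_countdown_prompt(task: str, n_words: int) -> str:
    """Append explicit counting rules to a task description."""
    if n_words < 1:
        raise ValueError("n_words must be at least 1")
    return (
        f"{task}\n\n"
        "Rules:\n"
        f"1. Write exactly {n_words} words.\n"
        "2. After every word, append a marker with the number of words "
        f"still to write, starting at <{n_words - 1}> and ending at <0>.\n"
        "3. Stop immediately after <0>."
    )


def check_countdown(output: str, n_words: int):
    """Return (ok, clean_text) for a model answer that uses the markers."""
    pairs = PAIR.findall(output)
    words = [word for word, _ in pairs]
    markers = [int(marker) for _, marker in pairs]
    # Anything left after removing word/marker pairs is an unmarked word.
    leftover = PAIR.sub("", output).strip()
    ok = markers == list(range(n_words - 1, -1, -1)) and not leftover
    return ok, " ".join(words)


prompt = build_countdown_prompt("Describe a cat.", 3)
ok, text = check_countdown("The <2> cat <1> sat <0>", 3)
```

The checker strips the markers after validation, so the text passed downstream contains only the words.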

Title: The illusion of a perfect metric: Why evaluating AI's words is harder than it looks

Authors: Maria Paz Oliva, Adriana Correia, Ivan Vankov, Viktor Botev
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.13816
Pdf URL: https://arxiv.org/pdf/2508.13816
Copy Paste: [[2508.13816]] The illusion of a perfect metric: Why evaluating AI's words is harder than it looks(https://arxiv.org/abs/2508.13816)
Keywords: llm, retrieval augmented generation
Abstract: Evaluating Natural Language Generation (NLG) is crucial for the practical adoption of AI, but has been a longstanding research challenge. While human evaluation is considered the de facto standard, it is expensive and lacks scalability. Practical applications have driven the development of various automatic evaluation metrics (AEMs), designed to compare the model output with human-written references, generating a score which approximates human judgment. Over time, AEMs have evolved from simple lexical comparisons, to semantic similarity models and, more recently, to LLM-based evaluators. However, it seems that no single metric has emerged as a definitive solution, resulting in studies using different ones without fully considering the implications. This paper aims to show this by conducting a thorough examination of the methodologies of existing metrics, their documented strengths and limitations, validation methods, and correlations with human judgment. We identify several key challenges: metrics often capture only specific aspects of text quality, their effectiveness varies by task and dataset, validation practices remain unstructured, and correlations with human judgment are inconsistent. Importantly, we find that these challenges persist in the most recent type of metric, LLM-as-a-Judge, as well as in the evaluation of Retrieval Augmented Generation (RAG), an increasingly relevant task in academia and industry. Our findings challenge the quest for the 'perfect metric'. We propose selecting metrics based on task-specific needs and leveraging complementary evaluations and advocate that new metrics should focus on enhanced validation methodologies.
摘要：评估自然语言产生（NLG）对于实际采用AI至关重要，但一直是一项长期的研究挑战。尽管人类评估被认为是事实上的标准，但它很昂贵，缺乏可扩展性。实际应用驱动了各种自动评估指标（AEM）的开发，旨在将模型输出与人写入的参考进行比较，产生近似人类判断的分数。随着时间的流逝，AEM已从简单的词汇比较，语义相似性模型以及最近的基于LLM的评估者演变而来。但是，似乎没有单一的指标作为确定的解决方案出现，从而在没有完全考虑含义的情况下使用了不同的研究。本文旨在通过对现有指标的方法，其证明的优势和局限性，验证方法以及与人类判断的相关性进行彻底研究来证明这一点。我们确定了几个关键挑战：指标通常仅捕获文本质量的特定方面，其有效性因任务和数据集而有所不同，验证实践仍然非结构化，并且与人类判断的相关性是不一致的。重要的是，我们发现，这些挑战持续存在于最近类型的指标，LLM-AS-A-A-Gudge，以及评估检索增强发电（RAG）的评估，这是学术界和行业中日益相关的任务。我们的发现挑战了对“完美指标”的追求。我们建议根据特定于任务的需求进行选择指标，并利用互补评估，并主张新指标应专注于增强验证方法。

Title: Extracting Structured Requirements from Unstructured Building Technical Specifications for Building Information Modeling

Authors: Insaf Nahri, Romain Pinquié, Philippe Véron, Nicolas Bus, Mathieu Thorel
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.13833
Pdf URL: https://arxiv.org/pdf/2508.13833
Copy Paste: [[2508.13833]] Extracting Structured Requirements from Unstructured Building Technical Specifications for Building Information Modeling(https://arxiv.org/abs/2508.13833)
Keywords: language model
Abstract: This study explores the integration of Building Information Modeling (BIM) with Natural Language Processing (NLP) to automate the extraction of requirements from unstructured French Building Technical Specification (BTS) documents within the construction industry. Employing Named Entity Recognition (NER) and Relation Extraction (RE) techniques, the study leverages the transformer-based model CamemBERT and applies transfer learning with the French language model Fr_core_news_lg, both pre-trained on a large French corpus in the general domain. To benchmark these models, additional approaches ranging from rule-based to deep learning-based methods are developed. For RE, four different supervised models, including Random Forest, are implemented using a custom feature vector. A hand-crafted annotated dataset is used to compare the effectiveness of NER approaches and RE models. Results indicate that CamemBERT and Fr_core_news_lg exhibited superior performance in NER, achieving F1-scores over 90%, while Random Forest proved most effective in RE, with an F1 score above 80%. The outcomes are intended to be represented as a knowledge graph in future work to further enhance automatic verification systems.
摘要：这项研究探讨了建筑信息建模（BIM）与自然语言处理（NLP）的整合，以自动从建筑行业内的非结构化法国建筑技术规范（BTS）文档中提取要求。该研究采用命名实体识别（NER）和关系提取（RE）技术，利用了基于变压器的模型Camembert，并使用法语模型FR \ _Core \ _news \ _lg应用转移学习，既可以在一般域中的大型法国语体上进行预先培训。为了基准这些模型，开发了从基于规则到深度学习方法的其他方法。对于RE，使用自定义功能向量实施了四种不同的监督模型，包括随机森林。手工制作的注释数据集用于比较NER方法和RE模型的有效性。结果表明，Camembert和Fr \ _core \ _news \ _lg在NER中表现出卓越的性能，达到90 \％以上的F1分数，而随机森林在RE中证明是最有效的，F1得分高于80 \％。结果旨在表示将来的工作中的知识图，以进一步增强自动验证系统。

Title: MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models

Authors: Jiacheng Ruan, Dan Jiang, Xian Gao, Ting Liu, Yuzhuo Fu, Yangyang Kang
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2508.13938
Pdf URL: https://arxiv.org/pdf/2508.13938
Copy Paste: [[2508.13938]] MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models(https://arxiv.org/abs/2508.13938)
Keywords: language model, llm
Abstract: Recently, multimodal large language models (MLLMs) have achieved significant advancements across various domains, and corresponding evaluation benchmarks have been continuously refined and improved. In this process, benchmarks in the scientific domain have played an important role in assessing the reasoning capabilities of MLLMs. However, existing benchmarks still face three key challenges: 1) Insufficient evaluation of models' reasoning abilities in multilingual scenarios; 2) Inadequate assessment of MLLMs' comprehensive modality coverage; 3) Lack of fine-grained annotation of scientific knowledge points. To address these gaps, we propose MME-SCI, a comprehensive and challenging benchmark. We carefully collected 1,019 high-quality question-answer pairs, which involve 3 distinct evaluation modes. These pairs cover four subjects, namely mathematics, physics, chemistry, and biology, and support five languages: Chinese, English, French, Spanish, and Japanese. We conducted extensive experiments on 16 open-source models and 4 closed-source models, and the results demonstrate that MME-SCI is widely challenging for existing MLLMs. For instance, under the Image-only evaluation mode, o4-mini achieved accuracy of only 52.11%, 24.73%, 36.57%, and 29.80% in mathematics, physics, chemistry, and biology, respectively, indicating a significantly higher difficulty level compared to existing benchmarks. More importantly, using MME-SCI's multilingual and fine-grained knowledge attributes, we analyzed existing models' performance in depth and identified their weaknesses in specific domains. The Data and Evaluation Code are available at this https URL.
摘要：最近，多模式的大语言模型（MLLM）在各个领域都取得了重大进步，相应的评估基准已不断完善和改进。在此过程中，科学领域的基准在评估MLLM的推理能力方面发挥了重要作用。但是，现有的基准仍然面临三个关键挑战：1）对模型在多语言场景中的推理能力的评估不足； 2）评估MLLM的综合方式覆盖范围不足； 3）缺乏对科学知识点的细粒度注释。为了解决这些差距，我们提出了MME-SCI，这是一个全面且具有挑战性的基准。我们仔细地收集了1,019个高质量的提问对，涉及3种不同的评估模式。这些对涵盖了四个主题，即数学，物理，化学和生物学，并支持五种语言：中文，英语，法语，西班牙语和日语。我们对16个开源模型和4种封闭源模型进行了广泛的实验，结果表明，MME-SCI对于现有MLLM来说是广泛挑战的。例如，在仅图像评估模式下，O4-MINI的准确性仅达到52.11％，24.73％，36.57％和29.80％的精度，分别在数学，物理，化学和生物学上，与现有基准相比，难度水平明显更高。更重要的是，使用MME-SCI的多语言和细粒知识属性，我们分析了现有模型的深度性能，并确定了它们在特定领域中的弱点。数据和评估代码可在此HTTPS URL上找到。

Title: ReviewGraph: A Knowledge Graph Embedding Based Framework for Review Rating Prediction with Sentiment Features

Authors: A.J.W. de Vink, Natalia Amat-Lefort, Lifeng Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.13953
Pdf URL: https://arxiv.org/pdf/2508.13953
Copy Paste: [[2508.13953]] ReviewGraph: A Knowledge Graph Embedding Based Framework for Review Rating Prediction with Sentiment Features(https://arxiv.org/abs/2508.13953)
Keywords: language model, llm, retrieval-augmented generation
Abstract: In the hospitality industry, understanding the factors that drive customer review ratings is critical for improving guest satisfaction and business performance. This work proposes ReviewGraph for Review Rating Prediction (RRP), a novel framework that transforms textual customer reviews into knowledge graphs by extracting (subject, predicate, object) triples and associating sentiment scores. Using graph embeddings (Node2Vec) and sentiment features, the framework predicts review rating scores through machine learning classifiers. We compare ReviewGraph performance with traditional NLP baselines (such as Bag of Words, TF-IDF, and Word2Vec) and large language models (LLMs), evaluating them on the HotelRec dataset. In comparison to the state-of-the-art literature, our proposed model performs similarly to their best-performing model but with lower computational cost (without ensemble). While ReviewGraph achieves comparable predictive performance to LLMs and outperforms baselines on agreement-based metrics such as Cohen's Kappa, it offers additional advantages in interpretability, visual exploration, and potential integration into Retrieval-Augmented Generation (RAG) systems. This work highlights the potential of graph-based representations for enhancing review analytics and lays the groundwork for future research integrating advanced graph neural networks and fine-tuned LLM-based extraction methods. We will share the ReviewGraph output and the open-sourced platform on our GitHub page at this https URL.
摘要：在酒店行业中，了解推动客户审核评级的因素对于提高客人满意度和业务绩效至关重要。这项工作提出了用于审查评级预测（RRP）的审查图，该框架是一个新颖的框架，通过提取（主题，谓词，对象）三元和关联情感分数将文本客户评论转换为知识图。该框架使用Graph Embeddings（Node2VEC）和情感功能，通过机器学习分类器预测评估评分得分。我们将评论图的性能与传统的NLP基线（例如单词袋，TF-IDF和Word2Vec）和大型语言模型（LLMS）进行比较，并在HotelRec数据集中对其进行了评估。与艺术文献的状态相比，我们提出的模型的性能类似于其最佳性能模型，但计算成本较低（没有合奏）。虽然ReviewGraph可以在Cohen的Kappa等基于协议的指标上实现可比的LLM的可比预测性能，并且在Cohen's Kappa等基于协议的指标方面具有额外的优势，但它在可解释性，视觉探索和潜在集成中提供了更多优势，并将其潜在地集成到检索演出的生成（RAG）系统中。这项工作突出了基于图表的增强审查分析的潜力，并为将来的研究奠定了基础，以整合高级图神经网络和基于LLM的萃取方法。我们将在我们的github页面上共享评论图的输出和平台此https url
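
Note on the agreement metric: Cohen's Kappa measures how far two sets of ratings agree beyond what chance alone would produce. The sketch below is a plain-Python illustration with invented ratings. It is not the authors' evaluation code, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa between two equally long label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(y_true, y_pred)) / n
    # Chance agreement: product of the two marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[label] * pred_counts[label]
                   for label in true_counts) / (n * n)
    if expected == 1.0:
        # Both sequences use one identical label; kappa is undefined here,
        # and returning 1.0 is a choice made for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented star ratings: identical ratings versus chance-level agreement.
kappa_perfect = cohens_kappa([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
kappa_chance = cohens_kappa([1, 1, 2, 2], [1, 2, 1, 2])
```

Because the chance term is subtracted, a classifier that always predicts the most common rating scores near zero even when its raw accuracy looks acceptable.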

Title: Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization

Authors: Shaohua Duan, Xinze Li, Zhenghao Liu, Xiaoyuan Yi, Yukun Yan, Shuo Wang, Yu Gu, Ge Yu, Maosong Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.13993
Pdf URL: https://arxiv.org/pdf/2508.13993
Copy Paste: [[2508.13993]] Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization(https://arxiv.org/abs/2508.13993)
Keywords: language model, llm, long context
Abstract: Long-context modeling is critical for a wide range of real-world tasks, including long-context question answering, summarization, and complex reasoning tasks. Recent studies have explored fine-tuning Large Language Models (LLMs) with synthetic data to enhance their long-context capabilities. However, the effectiveness of such approaches is often limited by the low diversity and factual inconsistencies in the generated data. To address these challenges, we propose LongMab-PO, a novel framework that leverages a Multi-Armed Bandit (MAB) rollout strategy to identify the most informative chunks from the given long context for sampling high-quality and diverse responses and constructing preference data pairs for Direct Preference Optimization (DPO) training. Specifically, we treat context chunks as arms of MAB, select chunks based on their expected reward scores to input into LLMs to generate responses, and iteratively update these scores based on reward feedback. This exploration and exploitation process enables the model to focus on the most relevant context segments, thereby generating and collecting high-quality and diverse responses. Finally, we collect these generated responses from the rollout process and apply the DPO method to further optimize the LLM. Experimental results show that LongMab-PO significantly improves the diversity and quality of preference data pairs, achieving state-of-the-art performance on long-context reasoning benchmarks. All code and data will be released on this https URL.
摘要：长篇小说建模对于各种现实世界的任务至关重要，包括长篇小说的回答，摘要和复杂的推理任务。最近的研究探索了使用合成数据的微调大语言模型（LLM），以增强其长篇文化功能。但是，这种方法的有效性通常受到生成数据的多样性和事实不一致的限制。为了应对这些挑战，我们提出了Longmab-Po，这是一个新型框架，该框架利用了多军强盗（MAB）推出策略来确定从给定的长上下文中确定最有用的块，以取得高质量和多样化的响应，并构建偏好数据对直接偏好优化（DPO）培训。具体来说，我们将上下文块视为单位臂的臂，根据他们的预期奖励分数选择块以输入LLMS以产生响应，并根据奖励反馈迭代更新这些分数。这种探索和剥削过程使该模型能够专注于最相关的上下文细分市场，从而产生和收集高质量和多样化的响应。最后，我们从推出过程中收集这些生成的响应，并应用DPO方法进一步优化LLM。实验结果表明，longmab-po显着提高了偏好数据对的多样性和质量，从而在长期文化推理基准上实现了最先进的性能。所有代码和数据将在此HTTPS URL上发布。
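
Note on the chunks-as-arms loop: the abstract says chunks are selected by expected reward and that the scores are updated from reward feedback, but it does not name a selection rule. The sketch below swaps in the standard UCB1 rule as an assumption. The reward function, the chunk texts, and all names are hypothetical stand-ins for the LLM response scoring described above.

```python
import math


def ucb_select(counts, means, t, c=1.0):
    """Pick an arm index with UCB1; unplayed arms are tried first."""
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(
        range(len(counts)),
        key=lambda i: means[i] + c * math.sqrt(2 * math.log(t) / counts[i]),
    )


def run_bandit(chunks, reward_fn, rounds, c=1.0):
    """Treat each context chunk as an arm and update its mean reward."""
    counts = [0] * len(chunks)
    means = [0.0] * len(chunks)
    for t in range(1, rounds + 1):
        i = ucb_select(counts, means, t, c)
        # In the paper's setting this would score an LLM response that was
        # generated from the chunk; here it is a toy stand-in.
        reward = reward_fn(chunks[i])
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # incremental mean
    return counts, means


# Toy example: only the middle chunk contains the relevant passage.
chunks = ["intro", "the answer is 42", "appendix"]
counts, means = run_bandit(
    chunks, lambda ch: 1.0 if "answer" in ch else 0.0, rounds=30
)
```

The exploration bonus shrinks as a chunk is sampled more often, so rarely tried chunks are revisited occasionally while most rollouts go to the chunk with the highest observed reward.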

Title: Ask Good Questions for Large Language Models

Authors: Qi Wu, Zhongqi Lu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.14025
Pdf URL: https://arxiv.org/pdf/2508.14025
Copy Paste: [[2508.14025]] Ask Good Questions for Large Language Models(https://arxiv.org/abs/2508.14025)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) have significantly improved the performance of dialog systems, yet current approaches often fail to provide accurate topic guidance because they cannot discern user confusion about related concepts. To address this, we introduce the Ask-Good-Question (AGQ) framework, which features an improved Concept-Enhanced Item Response Theory (CEIRT) model to better identify users' knowledge levels. Our contributions include applying the CEIRT model along with LLMs to directly generate guiding questions based on the inspiring text, greatly improving information retrieval efficiency during the question & answer process. Through comparisons with other baseline methods, our approach outperforms them by significantly enhancing the users' information retrieval experiences.
摘要：大型语言模型（LLM）的最新进展已显着改善了对话系统的性能，但是由于无法辨别相关概念中的用户混乱，目前的方法通常无法准确地提供主题的指导。为了解决这个问题，我们介绍了Ask-Good问题（AGQ）框架，该框架具有改进的概念增强项目响应理论（CEIRT）模型，以更好地识别用户的知识水平。我们的贡献包括将CEIRT模型与LLM一起应用于基于鼓舞人心的文本直接产生指导问题，从而在问答过程中大大提高了信息检索效率。通过与其他基线方法的比较，我们的方法通过显着增强用户的信息检索体验来优于表现。

Title: Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR

Authors: Xiao Liang, Zhongzhi Li, Yeyun Gong, Yelong Shen, Ying Nian Wu, Zhijiang Guo, Weizhu Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.14029
Pdf URL: https://arxiv.org/pdf/2508.14029
Copy Paste: [[2508.14029]] Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR(https://arxiv.org/abs/2508.14029)
Keywords: language model, llm
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
摘要：具有可验证奖励（RLVR）的强化学习最近已成为培训后大语言模型（LLMS）的关键范式，尤其是对于复杂的推理任务。但是，已证明香草RLVR训练可以以损失政策熵为代价来提高通过@1的表现，从而减少了发电多样性并限制了Pass@k性能，这通常代表了LLM推理能力的上限。在本文中，我们从培训问题的角度系统地分析了该政策的产生多样性，并发现增加和更新培训问题有助于减轻培训期间的熵崩溃。基于这些观察结果，我们提出了一个在线自我播放，该自我播放具有RLVR培训的各种问题综合（SVS）策略，该培训使用该策略的正确解决方案来综合变异问题，同时确保其参考答案与原件相同。与标准RLVR相比，这种自我改善策略有效地维持了训练期间的政策熵，并且在竞争级AIME24和AIME25基准的PASS@32绩效中，通过长期改进并实现了长期改进，并实现了长期改进和22.8％的绝对增长。对从3B到32B的不同模型大小的12个推理基准的实验始终证明了SV的普遍性和鲁棒性。
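
Note on Pass@k: the abstract does not state how Pass@k is estimated. A common choice, assumed here, is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of sampled solutions per problem and c is the number of correct ones. The function name and the sample counts below are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the chance that at least one of k samples,
    drawn without replacement from n generated samples, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 3 correct answers among 10 samples.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
```

With few correct samples, Pass@1 stays low while Pass@k for larger k rises quickly, which is why reduced generation diversity hurts Pass@k more visibly than Pass@1.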

Title: Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation

Authors: Dongyoon Hahm, Taywon Min, Woogyeol Jin, Kimin Lee
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.14031
Pdf URL: https://arxiv.org/pdf/2508.14031
Copy Paste: [[2508.14031]] Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation(https://arxiv.org/abs/2508.14031)
Keywords: language model, llm, prompt, agent
Abstract: Beyond simple text generation, Large Language Models (LLMs) have evolved into agentic systems capable of planning and interacting with external tools to solve complex tasks. This evolution involves fine-tuning LLMs on agent-specific tasks to enhance their proficiency. However, safety concerns are frequently overlooked during this fine-tuning process. In this work, we show that aligned LLMs can become unintentionally misaligned, leading to a higher likelihood of executing harmful tasks and a reduced tendency to refuse them when fine-tuned to execute agentic tasks. To address these safety challenges, we propose Prefix INjection Guard (PING), a simple yet effective method that prepends automatically generated natural language prefixes to agent responses, guiding them to refuse harmful requests while preserving performance on benign tasks. Specifically, we introduce an iterative approach that alternates between (1) generating candidate prefixes and (2) selecting those that optimize both task performance and refusal behavior. Experimental results demonstrate that PING significantly enhances the safety of fine-tuned LLM agents without sacrificing their effectiveness. PING consistently outperforms existing prompting approaches across diverse benchmarks in both web navigation and code generation tasks. Our analysis of internal hidden states via linear probes reveals that prefix tokens are crucial for behavior modification, explaining the performance gains. WARNING: This paper contains contents that are unethical or offensive in nature.
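Abstract (translated from the Chinese): Beyond simple text generation, large language models (LLMs) have evolved into agentic systems that can plan and interact with external tools to solve complex tasks. This evolution involves fine-tuning LLMs on agent-specific tasks to improve their proficiency. However, safety concerns are often overlooked during this fine-tuning process. In this work, we show that aligned LLMs can become unintentionally misaligned, which makes them more likely to execute harmful tasks and less inclined to refuse them once they have been fine-tuned to execute agentic tasks. To address these safety challenges, we propose Prefix INjection Guard (PING), a simple yet effective method that prepends automatically generated natural language prefixes to agent responses, guiding the agents to refuse harmful requests while preserving performance on benign tasks. Specifically, we introduce an iterative approach that alternates between (1) generating candidate prefixes and (2) selecting those that optimize both task performance and refusal behavior. Experimental results show that PING significantly improves the safety of fine-tuned LLM agents without sacrificing their effectiveness. PING consistently outperforms existing prompting approaches across diverse benchmarks in both web navigation and code generation tasks. Our analysis of internal hidden states via linear probes shows that prefix tokens are crucial for modifying behavior, which explains the performance gains. WARNING: This paper contains content that is unethical or offensive in nature.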
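
Illustration: the abstract describes PING's search as alternating between (1) generating candidate prefixes and (2) selecting those that do well on both task performance and refusal behavior. The sketch below shows only that generate-then-select loop under stated assumptions. It is not the authors' implementation: `search_prefix`, the `propose` callback, the two toy scoring functions, and the choice to combine the two objectives with an unweighted mean are all hypothetical stand-ins for what would, in practice, be LLM calls and benchmark evaluations.

```python
from typing import Callable, List, Tuple


def search_prefix(
    seeds: List[str],
    propose: Callable[[List[str]], List[str]],
    task_score: Callable[[str], float],
    refusal_score: Callable[[str], float],
    rounds: int = 3,
    keep: int = 2,
) -> List[Tuple[float, str]]:
    """Alternate between proposing prefixes and keeping the best ones."""
    pool = list(seeds)
    ranked: List[Tuple[float, str]] = []
    for _ in range(rounds):
        # Step 1: generate new candidates from the current pool (deduplicated).
        candidates = list(dict.fromkeys(pool + propose(pool)))
        # Step 2: score each candidate on both objectives.
        # Assumption: the two scores are combined by an unweighted mean.
        scored = [((task_score(p) + refusal_score(p)) / 2, p) for p in candidates]
        ranked = sorted(scored, reverse=True)[:keep]
        pool = [p for _, p in ranked]
    return ranked


# Toy stand-ins so the sketch runs without any model (all hypothetical).
def toy_propose(pool: List[str]) -> List[str]:
    return [p + " Refuse harmful requests." for p in pool if "Refuse" not in p]


def toy_task_score(prefix: str) -> float:
    # Pretend shorter prefixes disturb benign tasks less.
    return 1.0 / (1 + len(prefix.split()))


def toy_refusal_score(prefix: str) -> float:
    # Pretend any prefix mentioning refusal makes the agent refuse.
    return 1.0 if "Refuse" in prefix else 0.0


best = search_prefix(
    ["Think step by step."], toy_propose, toy_task_score, toy_refusal_score
)
print(best[0])
```

In a real setting the two scores would come from running the fine-tuned agent with each prefix on benign and harmful task sets; how the paper weighs the two objectives is not stated in the abstract.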

Title: The Promise of Large Language Models in Digital Health: Evidence from Sentiment Analysis in Online Health Communities

Authors: Xiancheng Li, Georgios D. Karampatakis, Helen E. Wood, Chris J. Griffiths, Borislava Mihaylova, Neil S. Coulson, Alessio Pasinato, Pietro Panzarasa, Marco Viviani, Anna De Simoni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.14032
Pdf URL: https://arxiv.org/pdf/2508.14032
Copy Paste: [[2508.14032]] The Promise of Large Language Models in Digital Health: Evidence from Sentiment Analysis in Online Health Communities(https://arxiv.org/abs/2508.14032)
Keywords: language model, gpt, llm, prompt
Abstract: Digital health analytics face critical challenges nowadays. The sophisticated analysis of patient-generated health content, which contains complex emotional and medical contexts, requires scarce domain expertise, while traditional ML approaches are constrained by data shortage and privacy limitations in healthcare settings. Online Health Communities (OHCs) exemplify these challenges with mixed-sentiment posts, clinical terminology, and implicit emotional expressions that demand specialised knowledge for accurate Sentiment Analysis (SA). To address these challenges, this study explores how Large Language Models (LLMs) can integrate expert knowledge through in-context learning for SA, providing a scalable solution for sophisticated health data analysis. Specifically, we develop a structured codebook that systematically encodes expert interpretation guidelines, enabling LLMs to apply domain-specific knowledge through targeted prompting rather than extensive training. Six GPT models validated alongside DeepSeek and LLaMA 3.1 are compared with pre-trained language models (BioBERT variants) and lexicon-based methods, using 400 expert-annotated posts from two OHCs. LLMs achieve superior performance while demonstrating expert-level agreement. This high agreement, with no statistically significant difference from inter-expert agreement levels, suggests knowledge integration beyond surface-level pattern recognition. The consistent performance across diverse LLM models, supported by in-context learning, offers a promising solution for digital health analytics. This approach addresses the critical challenge of expert knowledge shortage in digital health research, enabling real-time, expert-quality analysis for patient monitoring, intervention assessment, and evidence-based health strategies.
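Abstract (translated from the Chinese): Digital health analytics currently faces critical challenges. Sophisticated analysis of patient-generated health content, which contains complex emotional and medical contexts, requires scarce domain expertise, while traditional ML approaches are constrained by data shortages and privacy limitations in healthcare settings. Online Health Communities (OHCs) exemplify these challenges through mixed-sentiment posts, clinical terminology, and implicit emotional expressions that demand specialised knowledge for accurate Sentiment Analysis (SA). To address these challenges, this study explores how Large Language Models (LLMs) can integrate expert knowledge through in-context learning for SA, providing a scalable solution for sophisticated health data analysis. Specifically, we develop a structured codebook that systematically encodes expert interpretation guidelines, enabling LLMs to apply domain-specific knowledge through targeted prompting rather than extensive training. Six GPT models, validated alongside DeepSeek and LLaMA 3.1, are compared with pre-trained language models (BioBERT variants) and lexicon-based methods, using 400 expert-annotated posts from two OHCs. LLMs achieve superior performance while demonstrating expert-level agreement. This high agreement, which shows no statistically significant difference from inter-expert agreement levels, suggests knowledge integration beyond surface-level pattern recognition. The consistent performance across diverse LLMs, supported by in-context learning, offers a promising solution for digital health analytics. This approach addresses the critical challenge of expert knowledge shortage in digital health research, enabling real-time, expert-quality analysis for patient monitoring, intervention assessment, and evidence-based health strategies.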
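
Illustration: the abstract describes encoding expert interpretation guidelines in a structured codebook that is passed to an LLM through prompting, and then comparing model labels with expert labels. The sketch below shows one way such a prompt could be assembled and how agreement could be measured. It rests on assumptions: the codebook entries, the prompt wording, and the labels are invented for illustration, and Cohen's kappa is used as the agreement statistic because the abstract does not name the measure the authors used.

```python
from collections import Counter
from typing import Dict, List


def build_prompt(codebook: Dict[str, str], post: str) -> str:
    """Put the codebook rules in front of the post (in-context guidance)."""
    rules = "\n".join(f"- {label}: {rule}" for label, rule in codebook.items())
    return (
        "Label the sentiment of the post using these guidelines:\n"
        f"{rules}\n\nPost: {post}\nLabel:"
    )


def cohen_kappa(a: List[str], b: List[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codebook entries and labels, for illustration only.
codebook = {
    "positive": "relief, gratitude, or improvement is expressed",
    "negative": "distress, worsening symptoms, or frustration is expressed",
    "neutral": "factual questions or information without emotional tone",
}
prompt = build_prompt(codebook, "My new inhaler finally helps me sleep.")

expert_labels = ["positive", "negative", "neutral", "positive"]
model_labels = ["positive", "negative", "positive", "positive"]
kappa = cohen_kappa(expert_labels, model_labels)
print(prompt)
print(round(kappa, 3))
```

The prompt string would be sent to whichever model is being evaluated; the model call itself is left out because it depends on the provider.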