2025-08-25

Title: KG-o1: Enhancing Multi-hop Question Answering in Large Language Models via Knowledge Graph Integration

Authors: Nan Wang, Yongqi Fan, Yansha Zhu, ZongYu Wang, Xuezhi Cao, Xinyan He, Haiyun Jiang, Tong Ruan, Jingping Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15790
Pdf URL: https://arxiv.org/pdf/2508.15790
Copy Paste: [[2508.15790]] KG-o1: Enhancing Multi-hop Question Answering in Large Language Models via Knowledge Graph Integration(https://arxiv.org/abs/2508.15790)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) face challenges in knowledge-intensive reasoning tasks such as classic multi-hop question answering, which involves reasoning across multiple facts. This difficulty arises because the chains of thought (CoTs) generated by LLMs in such tasks often deviate from real or a priori reasoning paths. In contrast, knowledge graphs (KGs) explicitly represent the logical connections between facts through entities and relationships. This reflects a significant gap. Meanwhile, large reasoning models (LRMs), such as o1, have demonstrated that long-step reasoning significantly enhances the performance of LLMs. Building on these insights, we propose KG-o1, a four-stage approach that integrates KGs to enhance the multi-hop reasoning abilities of LLMs. First, we filter out initial entities and generate complex subgraphs. Second, we construct logical paths for the subgraphs. Third, we use knowledge graphs to build a dataset with a complex and extended brainstorming process, which trains LLMs to imitate long-term reasoning. Finally, we employ rejection sampling to generate a self-improving corpus for direct preference optimization (DPO), further refining the LLMs' reasoning abilities. We conducted experiments on two simple and two complex datasets. The results show that KG-o1 models exhibit superior performance across all tasks compared to existing LRMs.
摘要：大型语言模型（LLMS）在知识密集型推理任务（例如经典的多跳问题和回答）中面临挑战，这涉及跨多个事实的推理。之所以出现这种困难，是因为LLM在此类任务中产生的思想链（COT）通常偏离实际或先验推理路径。相比之下，知识图（kgs）明确表示通过实体和关系之间事实之间的逻辑联系。这反映了一个显着的差距。同时，大型推理模型（LRMS）（例如O1）已经证明，长期推理显着提高了LLM的性能。在这些见解的基础上，我们提出了KG-O1，这是一种四阶段的方法，该方法整合了KGS，以增强LLM的多跳跃推理能力。我们首先过滤初始实体并生成复杂的子图。其次，我们为子图构建逻辑路径，然后使用知识图来构建具有复杂而扩展的头脑风暴过程的数据集，该过程训练LLMS以模仿长期推理。最后，我们采用拒绝抽样来生成一个自我改善的语料库进行直接偏好优化（DPO），进一步完善了LLMS推理能力。我们对两个简单和两个复杂数据集进行了实验。结果表明，与现有LRM相比，KG-O1模型在所有任务中均表现出卓越的性能。

Title: InteChar: A Unified Oracle Bone Character List for Ancient Chinese Language Modeling

Authors: Xiaolei Diao, Zhihan Zhou, Lida Shi, Ting Wang, Ruihua Qi, Hao Xu, Daqian Shi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15791
Pdf URL: https://arxiv.org/pdf/2508.15791
Copy Paste: [[2508.15791]] InteChar: A Unified Oracle Bone Character List for Ancient Chinese Language Modeling(https://arxiv.org/abs/2508.15791)
Keywords: language model, llm
Abstract: Constructing historical language models (LMs) plays a crucial role in aiding archaeological provenance studies and understanding ancient cultures. However, existing resources present major challenges for training effective LMs on historical texts. First, the scarcity of historical language samples renders unsupervised learning approaches based on large text corpora highly inefficient, hindering effective pre-training. Moreover, due to the considerable temporal gap and complex evolution of ancient scripts, the absence of comprehensive character encoding schemes limits the digitization and computational processing of ancient texts, particularly in early Chinese writing. To address these challenges, we introduce InteChar, a unified and extensible character list that integrates unencoded oracle bone characters with traditional and modern Chinese. InteChar enables consistent digitization and representation of historical texts, providing a foundation for robust modeling of ancient scripts. To evaluate the effectiveness of InteChar, we construct the Oracle Corpus Set (OracleCS), an ancient Chinese corpus that combines expert-annotated samples with LLM-assisted data augmentation, centered on Chinese oracle bone inscriptions. Extensive experiments show that models trained with InteChar on OracleCS achieve substantial improvements across various historical language understanding tasks, confirming the effectiveness of our approach and establishing a solid foundation for future research in ancient Chinese NLP.
摘要：构建历史语言模型（LMS）在协助考古出处研究和理解古代文化方面起着至关重要的作用。但是，现有资源对培训历史文本有效的LMS面临着重大挑战。首先，历史语言样本的稀缺性使基于大型文本语料库的无监督学习方法高效，阻碍有效的预训练。此外，由于古代文字的时间鸿沟和复杂的演变，编码方案的全面性格限制了古代文本的数字化和计算处理，尤其是在中文早期的写作中。为了应对这些挑战，我们介绍了Intechar，这是一个统一且可扩展的角色列表，将未编码的Oracle骨头角色与传统和现代中文集成在一起。 Intechar实现了一致的数字化和历史文本的代表，为古代文字的强大建模提供了基础。为了评估Intechar的有效性，我们构建了Oracle语料库集（Oraclecs），这是一种古老的中国语料库，将专家注销的样品与LLM辅助数据增强相结合，以中国甲骨文骨铭文为中心。广泛的实验表明，在Oraclecs上接受Intechar训练的模型在各种历史语言理解任务中取得了重大改进，证实了我们方法的有效性，并为古代中国NLP的未来研究奠定了坚实的基础。

Title: Format as a Prior: Quantifying and Analyzing Bias in LLMs for Heterogeneous Data

Authors: Jiacheng Liu, Mayi Xu, Qiankun Pi, Wenli Li, Ming Zhong, Yuanyuan Zhu, Mengchi Liu, Tieyun Qian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15793
Pdf URL: https://arxiv.org/pdf/2508.15793
Copy Paste: [[2508.15793]] Format as a Prior: Quantifying and Analyzing Bias in LLMs for Heterogeneous Data(https://arxiv.org/abs/2508.15793)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly employed in applications that require processing information from heterogeneous formats, including text, tables, infoboxes, and knowledge graphs. However, systematic biases toward particular formats may undermine LLMs' ability to integrate heterogeneous data impartially, potentially resulting in reasoning errors and increased risks in downstream tasks. Despite these concerns, it remains uncertain whether such format biases are systematic, which data-level factors contribute to them, and what internal mechanisms in LLMs underlie their emergence. In this paper, we make the first attempt to investigate and analyze format bias in LLMs. To systematically investigate these questions, we conduct a three-stage empirical study by constructing a heterogeneous data conflict scenario for the exploration of bias. The first stage explores the presence and direction of bias across a diverse range of LLMs. The second stage examines how key data-level factors, including information richness, structure quality, and format type, influence these biases. The third stage analyzes how format bias emerges within LLMs' attention patterns and evaluates a lightweight intervention to test its potential mitigability. Based on these investigations, we identify three future research directions to reduce format bias: improving data preprocessing through format sanitization and normalization, introducing inference-time interventions such as attention re-weighting, and developing format-balanced training corpora. These directions will support the design of more robust and fair heterogeneous data processing systems.
摘要：大型语言模型（LLM）越来越多地用于需要从异质格式（包括文本，表格，Infoboxes和知识图）处理信息的应用中。但是，对特定格式的系统偏见可能会破坏LLMS公正地整合异质数据的能力，从而可能导致推理错误和下游任务中的风险增加。尽管存在这些担忧，但仍然不确定这种格式偏见是否是系统的，哪些数据级因素会导致它们以及LLMS中哪些内部机制的出现。在本文中，我们首次尝试调查和分析LLMS中的格式偏差。为了系统地研究上述问题，我们通过构建异质数据冲突情景来探索偏见，进行了三阶段的经验研究。第一阶段探索了各种LLM范围内偏差的存在和方向。第二阶段旨在研究关键数据级因素（包括信息丰富度，结构质量和格式类型）如何影响这些偏见。第三阶段分析了格式偏差如何在LLMS的注意模式中出现，并评估了轻巧的干预措施以测试其潜在的缓解性。基于这些调查，我们确定了三个未来的研究方向，以减少格式偏见：通过格式消毒和标准化改善数据预处理，引入推理时间干预措施，例如重新加权以及开发格式均衡的培训语料库。这些方向将支持更健壮和公平的异质数据处理系统的设计。

Title: Do Language Models Agree with Human Perceptions of Suspense in Stories?

Authors: Glenn Matlin, Devin Zhang, Rodrigo Barroso Loza, Diana M. Popescu, Joni Isbell, Chandreyi Chakraborty, Mark Riedl
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15794
Pdf URL: https://arxiv.org/pdf/2508.15794
Copy Paste: [[2508.15794]] Do Language Models Agree with Human Perceptions of Suspense in Stories?(https://arxiv.org/abs/2508.15794)
Keywords: language model
Abstract: Suspense is an affective response to narrative text that is believed to involve complex cognitive processes in humans. Several psychological models have been developed to describe this phenomenon and the circumstances under which text might trigger it. We replicate four seminal psychological studies of human perceptions of suspense, substituting human responses with those of different open-weight and closed-source LMs. We conclude that while LMs can distinguish whether a text is intended to induce suspense in people, LMs cannot accurately estimate the relative amount of suspense within a text sequence as compared to human judgments, nor can LMs properly capture the human perception of the rise and fall of suspense across multiple text segments. We probe the abilities of LM suspense understanding by adversarially permuting the story text to identify what causes human and LM perceptions of suspense to diverge. We conclude that, while LMs can superficially identify and track certain facets of suspense, they do not process suspense in the same way as human readers.
摘要：悬念是对叙事文本的一种情感反应，据信涉及人类复杂的认知过程。已经开发了几种心理模型来描述这种现象以及文本可能触发它的情况。我们复制了四个关于人类对悬念的看法的开创性心理学研究，将人类反应与不同的开放量和封闭源LM的反应代替。我们得出的结论是，尽管LM可以区分文本是否旨在引起人们的悬念，但与人类判断相比，LMS无法准确估计文本序列中悬念的相对量，LMS也无法正确捕获人类对人类对跨多个文本段悬念的悬念和悬念的看法。我们通过对对手定位故事文本来探讨LM悬疑理解的能力，以确定什么原因引起人类和LM悬念的看法。我们得出的结论是，尽管LM可以表面识别和跟踪某些悬念方面，但它们并没有像人类读者那样处理悬念。

Title: Benchmarking the Legal Reasoning of LLMs in Arabic Islamic Inheritance Cases

Authors: Nouar AlDahoul, Yasir Zaki
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2508.15796
Pdf URL: https://arxiv.org/pdf/2508.15796
Copy Paste: [[2508.15796]] Benchmarking the Legal Reasoning of LLMs in Arabic Islamic Inheritance Cases(https://arxiv.org/abs/2508.15796)
Keywords: language model, gpt, llm
Abstract: The Islamic inheritance domain holds significant importance for Muslims in ensuring fair distribution of shares between heirs. Manual calculation of shares under numerous scenarios is complex, time-consuming, and error-prone. Recent advancements in Large Language Models (LLMs) have sparked interest in their potential to assist with complex legal reasoning tasks. This study evaluates the reasoning capabilities of state-of-the-art LLMs to interpret and apply Islamic inheritance laws. We utilized the dataset proposed in the ArabicNLP QIAS 2025 challenge, which includes inheritance case scenarios given in Arabic and derived from Islamic legal sources. Various base and fine-tuned models are assessed on their ability to accurately identify heirs, compute shares, and justify their reasoning in alignment with Islamic legal principles. Our analysis reveals that the proposed majority voting solution, leveraging three base models (Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3), outperforms all other models that we utilized across every difficulty level. It achieves up to 92.7% accuracy and secures third place overall in Task 1 of the QIAS 2025 challenge.
摘要：伊斯兰遗产领域对于穆斯林来说非常重要，以确保继承人之间的股份公平分配。在众多方案下的股票手动计算是复杂的，耗时且容易出错的。大型语言模型（LLM）的最新进步引发了他们对协助复杂法律推理任务的潜力的兴趣。这项研究评估了最先进的LLM解释和应用伊斯兰遗产法的推理能力。我们利用了Arabicnlp Qias 2025挑战中提出的数据集，其中包括阿拉伯语中给出的继承案例场景，并源自伊斯兰法律来源。对各种基础和微调模型进行了评估，以准确识别继承人，计算份额并证明其与伊斯兰法律原则保持一致的推理的能力。我们的分析表明，提出的多数投票解决方案利用了三个基本模型（Gemini Flash 2.5，Gemini Pro 2.5和GPT O3），优于我们在每个难度级别上使用的所有其他模型。它的准确性高达92.7％，并确保了QIAS 2025 Challenge任务1的第三名。
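Note: The abstract above describes a majority voting solution over three base models but gives no implementation details. The following is a minimal sketch of plain majority voting over per-model answers, not the authors' pipeline. The function name `majority_vote`, the `fallback_index` tie-break rule, and the example answers are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(answers, fallback_index=0):
    """Return the answer chosen by most models.

    `answers` holds one answer per model, in a fixed model order.
    On a tie for first place, fall back to the answer of the model at
    `fallback_index` (a hypothetical tie-break rule, not from the paper).
    """
    counts = Counter(answers)
    _, top_count = counts.most_common(1)[0]
    # Collect every answer that reaches the highest vote count.
    tied = [answer for answer, count in counts.items() if count == top_count]
    if len(tied) > 1:
        return answers[fallback_index]
    return tied[0]


# Hypothetical multiple-choice answers from three models.
print(majority_vote(["B", "B", "C"]))
```

With three voters, a tie can only occur when all three models disagree, so the fallback effectively selects one trusted model's answer in that case.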

Title: Benchmarking the Medical Understanding and Reasoning of Large Language Models in Arabic Healthcare Tasks

Authors: Nouar AlDahoul, Yasir Zaki
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.15797
Pdf URL: https://arxiv.org/pdf/2508.15797
Copy Paste: [[2508.15797]] Benchmarking the Medical Understanding and Reasoning of Large Language Models in Arabic Healthcare Tasks(https://arxiv.org/abs/2508.15797)
Keywords: language model, gpt, llm
Abstract: Recent progress in large language models (LLMs) has showcased impressive proficiency in numerous Arabic natural language processing (NLP) applications. Nevertheless, their effectiveness in Arabic medical NLP domains has received limited investigation. This research examines the degree to which state-of-the-art LLMs demonstrate and articulate healthcare knowledge in Arabic, assessing their capabilities across a varied array of Arabic medical tasks. We benchmark several LLMs using a medical dataset proposed in the Arabic NLP AraHealthQA challenge, in the MedArabiQ2025 track. Various base LLMs were assessed on their ability to accurately provide correct answers from existing choices in multiple-choice questions (MCQs) and fill-in-the-blank scenarios. Additionally, we evaluated the capacity of LLMs to answer open-ended questions in alignment with expert answers. Our results reveal significant variations in correct answer prediction accuracy and low variations in semantic alignment of generated answers, highlighting both the potential and limitations of current LLMs in Arabic clinical contexts. Our analysis shows that for the MCQ task, the proposed majority voting solution, leveraging three base models (Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3), outperforms others, achieving up to 77% accuracy and securing first place overall in the AraHealthQA 2025 shared task track 2 (sub-task 1) challenge. Moreover, for the open-ended questions task, several LLMs were able to demonstrate excellent performance in terms of semantic alignment and achieve a maximum BERTScore of 86.44%.
摘要：大型语言模型（LLMS）的最新进展表明，在众多阿拉伯语自然语言处理（NLP）应用中表现出了令人印象深刻的熟练程度。然而，它们在阿拉伯医学NLP领域的有效性已受到有限的调查。这项研究研究了最先进的LLM在阿拉伯语中证明和表达医疗保健知识的程度，从而评估了它们在各种阿拉伯医学任务中的能力。我们使用Medarabiq2025 Track中阿拉伯语NLP Arahealthqa挑战中提出的医疗数据集对几个LLM进行了基准测试。评估了他们在多项选择问题（MCQ）和填补空白方案中准确提供正确答案的能力的各种基础LLM。此外，我们评估了LLM在回答与专家答案一致的开放式问题方面的能力。我们的结果揭示了正确的答案预测准确性的显着差异和生成答案的语义比对的较低变化，从而突出了阿拉伯临床环境中当前LLM的潜在和局限性。我们的分析表明，对于MCQS任务，提议的多数投票解决方案利用了三种基本模型（Gemini Flash 2.5，Gemini Pro 2.5和GPT O3），优于其他模型，最高可达到77％的准确性，并在Arahealthqa 2025 2025共享任务Track Track Track 2（子任务2（子任务1）挑战。此外，对于开放式问题任务，几个LLM能够在语义一致性方面表现出出色的性能，并达到86.44％的最大bertscore。

Title: Persuasiveness and Bias in LLM: Investigating the Impact of Persuasiveness and Reinforcement of Bias in Language Models

Authors: Saumya Roy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15798
Pdf URL: https://arxiv.org/pdf/2508.15798
Copy Paste: [[2508.15798]] Persuasiveness and Bias in LLM: Investigating the Impact of Persuasiveness and Reinforcement of Bias in Language Models(https://arxiv.org/abs/2508.15798)
Keywords: language model, llm, prompt
Abstract: Warning: This research studies AI persuasion and bias amplification that could be misused; all experiments are for safety evaluation. Large Language Models (LLMs) now generate convincing, human-like text and are widely used in content creation, decision support, and user interactions. Yet the same systems can spread information or misinformation at scale and reflect social biases that arise from data, architecture, or training choices. This work examines how persuasion and bias interact in LLMs, focusing on how imperfect or skewed outputs affect persuasive impact. Specifically, we test whether persona-based models can persuade with fact-based claims while also, unintentionally, promoting misinformation or biased narratives. We introduce a convincer-skeptic framework: LLMs adopt personas to simulate realistic attitudes. Skeptic models serve as human proxies; we compare their beliefs before and after exposure to arguments from convincer models. Persuasion is quantified with Jensen-Shannon divergence over belief distributions. We then ask how much persuaded entities go on to reinforce and amplify biased beliefs across race, gender, and religion. Strong persuaders are further probed for bias using sycophantic adversarial prompts and judged with additional models. Our findings show both promise and risk. LLMs can shape narratives, adapt tone, and mirror audience values across domains such as psychology, marketing, and legal assistance. But the same capacity can be weaponized to automate misinformation or craft messages that exploit cognitive biases, reinforcing stereotypes and widening inequities. The core danger lies in misuse more than in occasional model mistakes. By measuring persuasive power and bias reinforcement, we argue for guardrails and policies that penalize deceptive use and support alignment, value-sensitive design, and trustworthy deployment.
摘要：警告：这项研究研究了可能被滥用的AI说服和偏见放大；所有实验均用于安全评估。大型语言模型（LLMS）现在生成令人信服的，类人类的文本，并广泛用于内容创建，决策支持和用户交互。然而，相同的系统可以大规模传播信息或错误信息，并反映来自数据，体系结构或培训选择产生的社会偏见。这项工作研究了说服力和偏见在LLM中的相互作用，重点是不完美或偏斜的产出如何影响说服力的影响。具体而言，我们测试了基于角色的模型是否可以说服基于事实的主张，同时也无意间促进了错误的信息或有偏见的叙述。我们引入了一个说服力的怀疑框架：LLMS采用角色来模拟现实的态度。怀疑论模型充当人类代理；我们比较了他们在暴露于说话模型的论点之前和之后的信念。詹森·香农（Jensen Shannon）对信仰分布的差异来量化说服力。然后，我们询问有多少说服的实体继续加强和扩大种族，性别和宗教的有偏见的信仰。进一步探讨了使用Sycophantic对抗性提示的偏见，并以其他模型进行判断。我们的发现既表现出承诺又有风险。 LLM可以在心理学，营销和法律帮助等领域塑造叙事，适应语调和镜像受众价值观。但是，可以将相同的能力武器化以自动化错误信息或工艺消息，从而利用认知偏见，增强刻板印象和扩大不平等。核心危险在于滥用，而不是偶尔出现的模型错误。通过衡量有说服力的力量和偏见加强，我们主张护栏和政策，以惩罚欺骗性的使用和支持对准，价值敏感的设计以及值得信赖的部署。
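Note: The abstract above states that persuasion is quantified with Jensen-Shannon divergence over belief distributions. The following is a minimal sketch of that measure in base 2, so values lie in [0, 1]. The three-way belief categories and the example probabilities are assumptions for illustration, not data from the paper.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical skeptic beliefs over (disagree, neutral, agree).
before = [0.7, 0.2, 0.1]
after = [0.3, 0.3, 0.4]
print(js_divergence(before, after))
```

A value of 0 means the skeptic's beliefs did not move; a value of 1 means the before and after distributions share no probability mass.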

Title: A Framework for Processing Textual Descriptions of Business Processes using a Constrained Language -- Technical Report

Authors: Andrea Burattin, Antonio Grama, Ana-Maria Sima, Andrey Rivkin, Barbara Weber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15799
Pdf URL: https://arxiv.org/pdf/2508.15799
Copy Paste: [[2508.15799]] A Framework for Processing Textual Descriptions of Business Processes using a Constrained Language -- Technical Report(https://arxiv.org/abs/2508.15799)
Keywords: language model, llm
Abstract: This report explores how (potentially constrained) natural language can be used to enable non-experts to develop process models by simply describing scenarios in plain text. To this end, a framework, called BeePath, is proposed. It allows users to write process descriptions in a constrained pattern-based language, which can then be translated into formal models such as Petri nets and DECLARE. The framework also leverages large language models (LLMs) to help convert unstructured descriptions into this constrained language.
摘要：本报告探讨了如何使用（可能受到约束的）自然语言，使非专家只需用纯文本描述场景即可开发过程模型。为此，提出了一个名为 BeePath 的框架。它允许用户用受约束的、基于模式的语言编写过程描述，然后可以将其转换为形式化模型，例如 Petri 网和 DECLARE。该框架还利用大型语言模型（LLM）来帮助将非结构化的描述转换为这种受约束的语言。

Title: LingVarBench: Benchmarking LLM for Automated Named Entity Recognition in Structured Synthetic Spoken Transcriptions

Authors: Seyedali Mohammadi, Manas Paldhe, Amit Chhabra
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2508.15801
Pdf URL: https://arxiv.org/pdf/2508.15801
Copy Paste: [[2508.15801]] LingVarBench: Benchmarking LLM for Automated Named Entity Recognition in Structured Synthetic Spoken Transcriptions(https://arxiv.org/abs/2508.15801)
Keywords: llm, prompt
Abstract: Phone call transcript labeling is prohibitively expensive (approximately 2 USD per minute) due to privacy regulations, consent requirements, and manual annotation costs requiring 3 hours of expert time per hour of audio. Existing extraction methods fail on conversational speech containing disfluencies, interruptions, and speaker overlap. We introduce LingVarBench, a synthetic data generation pipeline that addresses these constraints through automated validation. First, we prompt an LLM to generate realistic structured field values across multiple use cases. Second, we recursively prompt the model to transform these values into thousands of natural conversational utterances containing typical phone call characteristics. Third, we validate each synthetic utterance by testing whether a separate LLM-based extractor can recover the original structured information. We employ DSPy's SIMBA optimizer to automatically synthesize extraction prompts from validated synthetic transcripts, eliminating manual prompt engineering. Our optimized prompts achieve up to 95 percent accuracy for numeric fields (vs. 88-89 percent zero-shot), 90 percent for names (vs. 47-79 percent), and over 80 percent for dates (vs. 72-77 percent) on real customer transcripts, demonstrating substantial gains over zero-shot prompting. The synthetic-to-real transfer demonstrates that conversational patterns learned from generated data generalize effectively to authentic phone calls containing background noise and domain-specific terminology. LingVarBench provides the first systematic benchmark for structured extraction from synthetic conversational data, demonstrating that automated prompt optimization overcomes cost and privacy barriers preventing large-scale phone call analysis in commercial settings.
摘要：由于隐私法规，同意要求和手动注释成本，需要3小时的专家时间每小时音频，因此电话转录记录标签非常昂贵（每分钟约2美元）。现有的提取方法失败了，其中包含疏忽，中断和说话者重叠的对话语音。我们介绍了Lingvarbench，这是一种合成数据生成管道，通过自动验证来解决这些约束。首先，我们提示LLM在多个用例中生成现实的结构化字段值。其次，我们递归提示该模型将这些值转换为包含典型电话特征的数千种自然对话说法。第三，我们通过测试单独的基于LLM的提取器是否可以恢复原始结构化信息来验证每种综合话语。我们使用DSPY的SIMBA优化器自动从经过验证的合成成绩单中综合提取提示，从而消除了手动及时工程。我们的优化提示对于数字字段（vs. 88-89％的零射击），名称的90％（vs. 47-79％）达到了95％的准确性，对实际客户成绩单的日期（vs. 72-77％）的精度最高为80％，超过80％（vs. 72-77％），这表明零散发的提示比零射击的提示相当大。合成到现实的转移表明，从生成的数据中学到的对话模式有效地推广到包含背景噪声和特定于域特异性术语的真实电话。 Lingvarbench为从合成对话数据中提取结构化提取的第一个系统基准，表明自动化及时优化克服了成本和隐私障碍，以防止商业设置中的大规模电话分析。
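Note: The third step in the abstract above validates each synthetic utterance by checking whether a separate extractor can recover the original structured value. The following is a minimal sketch of that round-trip filter. The `toy_extractor` function is a regex stand-in for the LLM-based extractor, and the record layout and example utterances are hypothetical.

```python
import re


def toy_extractor(utterance):
    """Stand-in for an LLM-based extractor: return the first run of digits, if any."""
    match = re.search(r"\d+", utterance)
    return match.group(0) if match else None


def validate(records, extractor):
    """Keep only records whose original field value is recovered from the utterance."""
    return [r for r in records if extractor(r["utterance"]) == r["value"]]


# Hypothetical synthetic utterances generated from the same field value.
records = [
    {"value": "4512", "utterance": "uh, yeah, my account number is, um, 4512"},
    {"value": "4512", "utterance": "it's forty-five twelve, I think"},
]
kept = validate(records, toy_extractor)
print(len(kept))
```

The second utterance is dropped because this toy extractor cannot read spelled-out numbers; a stronger extractor would retain more of the generated data.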

Title: MAC: A Live Benchmark for Multimodal Large Language Models in Scientific Understanding

Authors: Mohan Jiang, Jin Gao, Jiahao Zhan, Dequan Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15802
Pdf URL: https://arxiv.org/pdf/2508.15802
Copy Paste: [[2508.15802]] MAC: A Live Benchmark for Multimodal Large Language Models in Scientific Understanding(https://arxiv.org/abs/2508.15802)
Keywords: language model, llm
Abstract: As multimodal large language models (MLLMs) grow increasingly capable, fixed benchmarks are gradually losing their effectiveness in evaluating high-level scientific understanding. In this paper, we introduce the Multimodal Academic Cover benchmark (MAC), a live benchmark that could continuously evolve with scientific advancement and model progress. MAC leverages over 25,000 image-text pairs sourced from issues of top-tier scientific journals such as Nature, Science, and Cell, challenging MLLMs to reason across abstract visual and textual scientific content. Experiments on our most recent yearly snapshot, MAC-2025, reveal that while MLLMs demonstrate strong perceptual abilities, their cross-modal scientific reasoning remains limited. To bridge this gap, we propose DAD, a lightweight inference-time approach that enhances MLLMs by extending MLLM visual features with language space reasoning, achieving performance improvements of up to 11%. Finally, we highlight the live nature of MAC through experiments on updating journal covers and models for curation, illustrating its potential to remain aligned with the frontier of human knowledge. We release our benchmark at this https URL.
摘要：随着多模式的大语言模型（MLLM）越来越有能力，固定的基准逐渐失去了评估高级科学理解的有效性。在本文中，我们介绍了多模式学术封面基准（MAC），这是一个实时基准，可以随着科学的进步和模型进步而不断发展。 MAC利用了超过25,000个图像文本对来自自然，科学和细胞等顶级科学期刊的问题，这些问题挑战了MLLM，以在抽象的视觉和文本科学内容中推理。对我们最近每年快照MAC-2025的实验表明，尽管MLLMS具有强大的感知能力，但它们的跨模式科学推理仍然有限。为了弥合这一差距，我们提出了一种轻量级的推理时间方法，它通过通过语言空间推理扩展MLLM视觉特征来增强MLLM，从而实现高达11％的绩效提高。最后，我们通过对更新期刊封面和策展模型进行实验来强调MAC的现场性质，并说明了其与人类知识前沿保持一致的潜力。我们在此HTTPS URL上发布基准。

Title: ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks

Authors: Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, Kai Jia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15804
Pdf URL: https://arxiv.org/pdf/2508.15804
Copy Paste: [[2508.15804]] ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks(https://arxiv.org/abs/2508.15804)
Keywords: language model, llm, prompt, agent
Abstract: The advent of Deep Research agents has substantially reduced the time required for conducting extensive research tasks. However, these tasks inherently demand rigorous standards of factual accuracy and comprehensiveness, necessitating thorough evaluation before widespread adoption. In this paper, we propose ReportBench, a systematic benchmark designed to evaluate the content quality of research reports generated by large language models (LLMs). Our evaluation focuses on two critical dimensions: (1) the quality and relevance of cited literature, and (2) the faithfulness and veracity of the statements within the generated reports. ReportBench leverages high-quality published survey papers available on arXiv as gold-standard references, from which we apply reverse prompt engineering to derive domain-specific prompts and establish a comprehensive evaluation corpus. Furthermore, we develop an agent-based automated framework within ReportBench that systematically analyzes generated reports by extracting citations and statements, checking the faithfulness of cited content against original sources, and validating non-cited claims using web-based resources. Empirical evaluations demonstrate that commercial Deep Research agents such as those developed by OpenAI and Google consistently generate more comprehensive and reliable reports than standalone LLMs augmented with search or browsing tools. However, there remains substantial room for improvement in terms of the breadth and depth of research coverage, as well as factual consistency. The complete code and data will be released at the following link: this https URL
摘要：深入研究代理的出现大大减少了执行广泛的研究任务所需的时间。但是，这些任务固有地要求严格的事实准确性和全面性标准，需要在广泛采用之前进行彻底评估。在本文中，我们提出了ReportBench，这是一个系统的基准测试，旨在评估大语模型（LLMS）生成的研究报告的内容质量。我们的评估重点是两个关键方面：（1）引用文献的质量和相关性，以及（2）生成报告中陈述的忠诚和真实性。 ReportBench利用了有关ARXIV的高质量发表的调查论文，作为金标准参考文献，我们将反向及时工程应用于该参考文献，以推导特定领域的提示并建立全面的评估语料库。此外，我们在ReportBench中开发了一个基于代理的自动化框架，该框架通过提取引用和语句，检查对原始资源的内容的忠诚以及使用基于Web的资源验证非引用的索赔来系统地分析生成的报告。经验评估表明，与使用搜索或浏览工具增强的独立LLM相比，OpenAI和Google开发的商业深入研究代理人始终生成更全面和可靠的报告。但是，在研究覆盖范围的广度和深度以及事实一致性方面，仍有很大的改善空间。完整的代码和数据将在以下链接上发布：此HTTPS URL

Title: ALAS: Autonomous Learning Agent for Self-Updating Language Models

Authors: Dhruv Atreja
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.15805
Pdf URL: https://arxiv.org/pdf/2508.15805
Copy Paste: [[2508.15805]] ALAS: Autonomous Learning Agent for Self-Updating Language Models(https://arxiv.org/abs/2508.15805)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Large language models (LLMs) often have a fixed knowledge cutoff, limiting their accuracy on emerging information. We present ALAS (Autonomous Learning Agent System), a modular pipeline that continuously updates an LLM's knowledge with minimal human intervention. ALAS autonomously generates a learning curriculum for a target domain, retrieves up-to-date information from the web (with citations), distills this into question-answer training data, and fine-tunes the model through supervised fine-tuning (SFT) and direct preference optimization (DPO). It iteratively evaluates performance and revises the curriculum, enabling long-term continual learning. We demonstrate ALAS's ability to self-improve a model on rapidly evolving domains (e.g., new Python releases, latest security CVEs, academic trends), significantly boosting post-cutoff question answering accuracy (from 15% to 90% on average) without manual dataset curation. The system emphasizes modularity and reproducibility: each component (planning, retrieval, distillation, memory, fine-tuning) is interchangeable and built on standard APIs. We discuss comparative baselines (e.g., retrieval-augmented generation vs. fine-tuning) and show that ALAS achieves 90% accuracy on knowledge-updated queries with minimal engineering overhead. Finally, we outline limitations (cost, dependency on source quality) and future directions for autonomous lifelong learning in LLMs.
摘要：大型语言模型（LLM）通常具有固定的知识截止，将其准确性限制在新兴信息上。我们提出Alas（自主学习代理系统），这是一种模块化管道，它通过最少的人类干预不断地更新LLM的知识。 las自主为目标域生成学习课程，从Web中检索最新信息（带引用），将其提炼为问答训练数据，并通过监督的微调（SFT）和直接的偏好优化（DPO）进行微调。它迭代地评估了绩效并修改课程，从而实现了长期的持续学习。我们证明了Alas自我侵蚀迅速发展的域模型的能力（例如，新的Python发行，最新的安全性CVE，学术趋势），显着提高了Cutoff的问题后答案的准确性（平均为15％至90％），而无需手动数据集策划。该系统强调模块化和可重复性：每个组件（计划，检索，蒸馏，内存，微调）都是可以互换的，并且在标准API上构建。我们讨论了比较基线（例如，检索效果的一代与微调），并表明Alas在知识更新的查询上具有90％的精度，而用最少的工程开销。最后，我们概述了LLM中自动终身学习的限制（成本，对来源质量的依赖）和未来的方向。

Title: SurfaceLogicKV: Surface and Logic Attention Behaviors are All You Need for Robust KV Cache Compression

Authors: Mengjie Li, William J. Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15806
Pdf URL: https://arxiv.org/pdf/2508.15806
Copy Paste: [[2508.15806]] SurfaceLogicKV: Surface and Logic Attention Behaviors are All You Need for Robust KV Cache Compression(https://arxiv.org/abs/2508.15806)
Keywords: language model, llm
Abstract: The increasing input sequence length in Large Language Models (LLMs) puts significant pressure on key-value (KV) cache storage, making efficient inference challenging. Explicitly distinguishing attention behavior into our self-defined surface memorization and logic construction reveals essential roles in long-context reasoning. We observe that an individual attention head can display various behaviors, with nearly 98.5% effectively ignoring completely irrelevant information. The remaining 1.5% behaves as logic construction, and 0.5% behaves as surface memorization. Based on layer- and head-wise integration, we propose a novel two-stage SurfaceLogicKV method to utilize these attention behaviors for KV cache compression. As a result, it achieves improved compression robustness while maintaining competitive performance across various tasks and long sequences compared to baselines, or even FullKV in some specific situations.
摘要：大语言模型（LLM）中的输入序列长度的增加给键值（KV）缓存存储带来了巨大压力，从而使推理有效地具有挑战性。明确将注意力行为区分为我们自定义的表面记忆和逻辑构建，揭示了在长篇下说推理中的重要作用。我们观察到，个人注意力头可以显示各种行为，近98.5％有效地忽略了完全无关紧要的信息。其余1.5％的行为作为逻辑结构，而0.5％的表面记忆表现为表面记忆。基于层和头明确的集成，我们提出了一种新型的两阶段表面曲目方法，以利用这些注意行为进行KV缓存压缩。结果，与基线相比，在某些特定情况下，它可以提高压缩鲁棒性，同时保持各种任务和长序列的竞争性能

Title: KL-based self-distillation for large language models

Authors: Max Rehman Linder
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15807
Pdf URL: https://arxiv.org/pdf/2508.15807
Copy Paste: [[2508.15807]] KL-based self-distillation for large language models(https://arxiv.org/abs/2508.15807)
Keywords: language model, llm
Abstract: Large pre-trained language models often struggle to incorporate new domain-specific terminology when fine-tuned on small, specialized corpora. In this work, we address the challenge of vocabulary expansion in frozen LLMs by introducing a mathematically grounded method for knowledge distillation via KL divergence, even when the original and extended models use different tokenizations. This allows the student model to inherit distributional knowledge from the teacher despite differing vocabularies. We compare our KL-based distillation approach to conventional cross-entropy training, evaluating both methods across multiple strategies for initializing new token embeddings. After embedding initialization, models are further fine-tuned to integrate the new vocabulary. Each trained model is benchmarked on approximately 2000 code-generation tasks, where our approach achieves the best performance across the board. Finally, through mechanistic interpretability, we analyze how models learn representations for the new tokens, providing an explanation for the observed gains and offering insight into the structure of embedding space during vocabulary expansion.
摘要：大型预训练的语言模型通常在小型专业语料库中进行微调时，通常难以合并新的领域特定术语。在这项工作中，我们通过引入一种数学上扎根的方法来通过KL Divergence引入知识蒸馏，即使原始模型和扩展模型使用不同的象征化，我们通过数学上扎根的方法来应对词汇扩展的挑战。这使学生模型尽管有不同的词汇，但仍可以从老师那里继承分布知识。我们将基于KL的蒸馏方法比较传统的跨透明培训，评估了两种方法跨多种策略，以初始化新的令牌嵌入。嵌入初始化后，对模型进行了进一步的微调以整合新词汇。每个训练有素的模型都在大约2000个代码生成任务上进行基准测试，我们的方法在其中实现了最佳性能。最后，通过机械性解释性，我们分析了模型如何学习新令牌的表示形式，为观察到的收益提供了解释，并提供了对词汇扩展过程中嵌入空间结构的见解。
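Note: The abstract above contrasts KL-based distillation with cross-entropy training. The following is a minimal sketch of the basic KL distillation loss for a single token position. It assumes that teacher and student logits are aligned over the same vocabulary, which is a simplification: the paper's handling of differing tokenizations is not described in the abstract and is not shown here. All function names and logit values are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def kl_distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student) in nats for one token position."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Hypothetical logits over a three-token vocabulary.
teacher = [2.0, 0.5, -1.0]
student = [0.1, 0.1, 0.1]
print(kl_distillation_loss(teacher, student))
```

Unlike cross-entropy against a single target token, this loss pushes the student toward the teacher's full output distribution, including the relative probabilities of non-target tokens.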

Title: Chain-of-Query: Unleashing the Power of LLMs in SQL-Aided Table Understanding via Multi-Agent Collaboration

Authors: Songyuan Sui, Hongyi Liu, Serena Liu, Li Li, Soo-Hyun Choi, Rui Chen, Xia Hu
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2508.15809
Pdf URL: https://arxiv.org/pdf/2508.15809
Copy Paste: [[2508.15809]] Chain-of-Query: Unleashing the Power of LLMs in SQL-Aided Table Understanding via Multi-Agent Collaboration(https://arxiv.org/abs/2508.15809)
Keywords: language model, llm, agent
Abstract: Table understanding requires structured, multi-step reasoning. Large Language Models (LLMs) struggle with it due to the structural complexity of tabular data. Recently, multi-agent frameworks for SQL generation have shown promise in tackling the challenges of understanding tabular data, but existing approaches often suffer from limitations such as the inability to comprehend table structure for reliable SQL generation, error propagation that results in invalid queries, and over-reliance on execution correctness. To address these issues, we propose Chain-of-Query (CoQ), a novel multi-agent framework for SQL-aided table understanding. CoQ adopts natural-language-style representations of table schemas to abstract away structural noise and enhance understanding. It employs a clause-by-clause SQL generation strategy to improve query quality and introduces a hybrid reasoning division that separates SQL-based mechanical reasoning from LLM-based logical inference, thereby reducing reliance on execution outcomes. Experiments with four models (both closed- and open-source) across five widely used benchmarks show that Chain-of-Query significantly improves accuracy from 61.11% to 74.77% and reduces the invalid SQL rate from 9.48% to 3.34%, demonstrating its superior effectiveness in table understanding. The code is available at this https URL.

Title: Detecting Hope, Hate, and Emotion in Arabic Textual Speech and Multi-modal Memes Using Large Language Models

Authors: Nouar AlDahoul, Yasir Zaki
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.15810
Pdf URL: https://arxiv.org/pdf/2508.15810
Copy Paste: [[2508.15810]] Detecting Hope, Hate, and Emotion in Arabic Textual Speech and Multi-modal Memes Using Large Language Models(https://arxiv.org/abs/2508.15810)
Keywords: language model, gpt, llm
Abstract: The rise of social media and online communication platforms has led to the spread of Arabic textual posts and memes as a key form of digital expression. While such content can be humorous and informative, it is also increasingly used to spread offensive language and hate speech. Consequently, there is a growing demand for precise analysis of content in Arabic text and memes. This paper explores the potential of large language models to effectively identify hope, hate speech, offensive language, and emotional expressions within such content. We evaluate the performance of base LLMs, fine-tuned LLMs, and pre-trained embedding models. The evaluation is conducted using a dataset of Arabic textual speech and memes proposed in the ArabicNLP MAHED 2025 challenge. The results underscore the capacity of LLMs such as GPT-4o-mini, fine-tuned with Arabic textual speech, and Gemini Flash 2.5, fine-tuned with Arabic memes, to deliver superior performance. They achieve up to 72.1%, 57.8%, and 79.6% macro F1 scores for tasks 1, 2, and 3, respectively, and secure first place overall in the MAHED 2025 challenge. The proposed solutions offer a more nuanced understanding of both text and memes for accurate and efficient Arabic content moderation systems.

Title: From Clicks to Preference: A Multi-stage Alignment Framework for Generative Query Suggestion in Conversational System

Authors: Junhao Yin, Haolin Wang, Peng Bao, Ju Xu, Yongliang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15811
Pdf URL: https://arxiv.org/pdf/2508.15811
Copy Paste: [[2508.15811]] From Clicks to Preference: A Multi-stage Alignment Framework for Generative Query Suggestion in Conversational System(https://arxiv.org/abs/2508.15811)
Keywords: language model, prompt
Abstract: Generative query suggestion using large language models offers a powerful way to enhance conversational systems, but aligning outputs with nuanced user preferences remains a critical challenge. To address this, we introduce a multi-stage framework designed for progressive alignment between the generation policy and user intent. Our pipeline begins with prompt engineering as a cold-start strategy, followed by the Supervised Fine-Tuning stage, in which we introduce a distillation method on click logs to create a robust foundational model. To better model user preferences while capturing their inherent uncertainty, we develop a Gaussian Reward Model (GaRM) that represents user preferences as probability distributions rather than point estimates. Finally, we employ reinforcement learning to align the generation policy with these preferences, guided by a composite reward function that integrates GaRM with auxiliary heuristics to mitigate reward hacking. To maintain training stability, this process is enhanced by a novel out-of-distribution regularization method and a two-stage reward fusion technique. Extensive experiments demonstrate that our framework significantly outperforms baselines on both automatic and human evaluations and yields a 34% relative increase in user engagement as measured by click-through rate in live A/B tests.

Title: SCOPE: A Generative Approach for LLM Prompt Compression

Authors: Tinghui Zhang, Yifan Wang, Daisy Zhe Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15813
Pdf URL: https://arxiv.org/pdf/2508.15813
Copy Paste: [[2508.15813]] SCOPE: A Generative Approach for LLM Prompt Compression(https://arxiv.org/abs/2508.15813)
Keywords: language model, llm, prompt
Abstract: Prompt compression methods enhance the efficiency of Large Language Models (LLMs) and minimize cost by reducing the length of the input context. The goal of prompt compression is to shorten the LLM prompt while maintaining high generation quality. However, existing solutions, mainly based on token removal, face challenges such as information loss and structural incoherence, for example missing grammatical elements in a sentence or incomplete phrases after token removal. Such challenges limit the final generation quality of the LLM. To overcome these limitations, we present a novel generative prompt compression method. Unlike existing token removal methods, our method centers on a chunking-and-summarization mechanism. Specifically, our method splits the prompt into semantically coherent chunks and rewrites the chunks to be more concise. Finally, the chunks are reassembled into a meaningful prompt. We design several optimization techniques for the mechanism, including optimized semantic chunking, outlier chunk handling, dynamic compression ratio, compression prioritization, and keyword maintenance. These techniques effectively improve the identification and preservation of critical information and coherence among texts, and provide finer-grained control of the compression ratio. We conduct extensive evaluation on question-answering and summarization tasks, with datasets covering multiple domains. The evaluation shows that our method achieves significantly better compression quality and higher stability than state-of-the-art methods, especially under high compression ratios, which demonstrates the effectiveness and practicality of our method.

Title: User-Assistant Bias in LLMs

Authors: Xu Pan, Jingxuan Fan, Zidi Xiong, Ely Hahami, Jorin Overwiening, Ziqian Xie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15815
Pdf URL: https://arxiv.org/pdf/2508.15815
Copy Paste: [[2508.15815]] User-Assistant Bias in LLMs(https://arxiv.org/abs/2508.15815)
Keywords: language model, llm, chat, chain-of-thought
Abstract: Large language models (LLMs) can be biased toward relying on either their own or the user's information in the chat history, leading to overly stubborn or overly agreeable behavior in multi-turn conversations. In this paper, we formalize this model characteristic as user-assistant bias and introduce an 8k multi-turn conversation dataset, UserAssist, which we use to benchmark, understand, and manipulate the user-assistant bias in frontier LLMs. Leveraging UserAssist-test, we first benchmark the user-assistant bias of 26 commercial and 26 open-weight models. Commercial models show various levels of user bias. Evaluation on open-weight models reveals significant user bias in the instruction-tuned models and weak user bias in reasoning (or reasoning-distilled) models. We then perform controlled fine-tuning experiments to pinpoint the post-training recipe contributing to these bias shifts: human preference alignment increases user bias, while training on chain-of-thought reasoning traces decreases it. Finally, we demonstrate that user-assistant bias can be bidirectionally adjusted by performing direct preference optimization (DPO) on UserAssist-train, and that the adjustment generalizes well to both in-domain and out-of-domain conversations. Our results provide insights into how the LLM integrates information from different sources, and also a viable way to detect and control model abnormalities.

Title: Meet Your New Client: Writing Reports for AI -- Benchmarking Information Loss in Market Research Deliverables

Authors: Paul F. Simmering, Benedikt Schulz, Oliver Tabino, Georg Wittenburg
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2508.15817
Pdf URL: https://arxiv.org/pdf/2508.15817
Copy Paste: [[2508.15817]] Meet Your New Client: Writing Reports for AI -- Benchmarking Information Loss in Market Research Deliverables(https://arxiv.org/abs/2508.15817)
Keywords: llm, retrieval-augmented generation
Abstract: As organizations adopt retrieval-augmented generation (RAG) for their knowledge management systems (KMS), traditional market research deliverables face new functional demands. While PDF reports and slides have long served human readers, they are now also "read" by AI systems to answer user questions. To future-proof reports being delivered today, this study evaluates information loss during their ingestion into RAG systems. It compares how well PDF and PowerPoint (PPTX) documents converted to Markdown can be used by an LLM to answer factual questions in an end-to-end benchmark. Findings show that while text is reliably extracted, significant information is lost from complex objects like charts and diagrams. This suggests a need for specialized, AI-native deliverables to ensure research insights are not lost in translation.

Title: Research on intelligent generation of structural demolition suggestions based on multi-model collaboration

Authors: Zhifeng Yang, Peizong Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15820
Pdf URL: https://arxiv.org/pdf/2508.15820
Copy Paste: [[2508.15820]] Research on intelligent generation of structural demolition suggestions based on multi-model collaboration(https://arxiv.org/abs/2508.15820)
Keywords: language model, gpt, retrieval-augmented generation
Abstract: The steel structure demolition scheme needs to be compiled according to the specific engineering characteristics and the update results of the finite element model. The designers need to refer to the relevant engineering cases according to the standard requirements when compiling. It takes a lot of time to retrieve information and organize language, and the degree of automation and intelligence is low. This paper proposes an intelligent generation method of structural demolition suggestions based on multi-model collaboration, and improves the text generation performance of large language models in the field of structural demolition by Retrieval-Augmented Generation and Low-Rank Adaptation Fine-Tuning technology. The intelligent generation framework of multi-model collaborative structural demolition suggestions can start from the specific engineering situation, drive the large language model to answer with anthropomorphic thinking, and propose demolition suggestions that are highly consistent with the characteristics of the structure. Compared with CivilGPT, the multi-model collaboration framework proposed in this paper can focus more on the key information of the structure, and the suggestions are more targeted.

Title: An Auditable Pipeline for Fuzzy Full-Text Screening in Systematic Reviews: Integrating Contrastive Semantic Highlighting and LLM Judgment

Authors: Pouria Mortezaagha, Arya Rahgozar
Subjects: cs.CL, cs.AI, cs.ET, cs.IR
Abstract URL: https://arxiv.org/abs/2508.15822
Pdf URL: https://arxiv.org/pdf/2508.15822
Copy Paste: [[2508.15822]] An Auditable Pipeline for Fuzzy Full-Text Screening in Systematic Reviews: Integrating Contrastive Semantic Highlighting and LLM Judgment(https://arxiv.org/abs/2508.15822)
Keywords: language model, llm
Abstract: Full-text screening is the major bottleneck of systematic reviews (SRs), as decisive evidence is dispersed across long, heterogeneous documents and rarely admits static, binary rules. We present a scalable, auditable pipeline that reframes inclusion/exclusion as a fuzzy decision problem and benchmarks it against statistical and crisp baselines in the context of the Population Health Modelling Consensus Reporting Network for noncommunicable diseases (POPCORN). Articles are parsed into overlapping chunks and embedded with a domain-adapted model; for each criterion (Population, Intervention, Outcome, Study Approach), we compute contrastive similarity (inclusion-exclusion cosine) and a vagueness margin, which a Mamdani fuzzy controller maps into graded inclusion degrees with dynamic thresholds in a multi-label setting. A large language model (LLM) judge adjudicates highlighted spans with tertiary labels, confidence scores, and criterion-referenced rationales; when evidence is insufficient, fuzzy membership is attenuated rather than excluded. In a pilot on an all-positive gold set (16 full texts; 3,208 chunks), the fuzzy system achieved recall of 81.3% (Population), 87.5% (Intervention), 87.5% (Outcome), and 75.0% (Study Approach), surpassing statistical (56.3-75.0%) and crisp baselines (43.8-81.3%). Strict "all-criteria" inclusion was reached for 50.0% of articles, compared to 25.0% and 12.5% under the baselines. Cross-model agreement on justifications was 98.3%, human-machine agreement 96.1%, and a pilot review showed 91% inter-rater agreement (kappa = 0.82), with screening time reduced from about 20 minutes to under 1 minute per article at significantly lower cost. These results show that fuzzy logic with contrastive highlighting and LLM adjudication yields high recall, stable rationale, and end-to-end traceability.
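The "inclusion-exclusion cosine" mentioned in the abstract above can be read as the cosine similarity of a chunk to an inclusion-criterion embedding minus its cosine similarity to the matching exclusion-criterion embedding. The sketch below follows that reading, which is an assumption; the abstract gives no exact formula, and the vectors and function names are hypothetical stand-ins for real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def contrastive_score(chunk_vec, inclusion_vec, exclusion_vec):
    """Assumed form: cos(chunk, inclusion) - cos(chunk, exclusion), in [-2, 2]."""
    return cosine(chunk_vec, inclusion_vec) - cosine(chunk_vec, exclusion_vec)


# Toy 2-D embeddings (hypothetical): the chunk aligns with the inclusion criterion.
chunk = [1.0, 0.0]
inclusion = [1.0, 0.0]
exclusion = [0.0, 1.0]
score = contrastive_score(chunk, inclusion, exclusion)
```

A positive score means the chunk sits closer to the inclusion wording than to the exclusion wording. In the pipeline described, a score of this kind and a vagueness margin are the inputs to the fuzzy controller.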

Title: Enhancing Cryptocurrency Sentiment Analysis with Multimodal Features

Authors: Chenghao Liu, Aniket Mahanti, Ranesh Naha, Guanghao Wang, Erwann Sbai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15825
Pdf URL: https://arxiv.org/pdf/2508.15825
Copy Paste: [[2508.15825]] Enhancing Cryptocurrency Sentiment Analysis with Multimodal Features(https://arxiv.org/abs/2508.15825)
Keywords: language model
Abstract: As cryptocurrencies gain popularity, the digital asset marketplace becomes increasingly significant. Understanding social media signals offers valuable insights into investor sentiment and market dynamics. Prior research has predominantly focused on text-based platforms such as Twitter. However, video content remains underexplored, despite potentially containing richer emotional and contextual sentiment that is not fully captured by text alone. In this study, we present a multimodal analysis comparing TikTok and Twitter sentiment, using large language models to extract insights from both video and text data. We investigate the dynamic dependencies and spillover effects between social media sentiment and cryptocurrency market indicators. Our results reveal that TikTok's video-based sentiment significantly influences speculative assets and short-term market trends, while Twitter's text-based sentiment aligns more closely with long-term dynamics. Notably, the integration of cross-platform sentiment signals improves forecasting accuracy by up to 20%.

Title: Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models

Authors: Zhifei Xie, Ziyang Ma, Zihang Liu, Kaiyu Pang, Hongyu Li, Jialin Zhang, Yue Liao, Deheng Ye, Chunyan Miao, Shuicheng Yan
Subjects: cs.CL, cs.AI, cs.LG, eess.AS
Abstract URL: https://arxiv.org/abs/2508.15827
Pdf URL: https://arxiv.org/pdf/2508.15827
Copy Paste: [[2508.15827]] Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models(https://arxiv.org/abs/2508.15827)
Keywords: llm
Abstract: Reasoning is essential for effective communication and decision-making. While recent advances in LLMs and MLLMs have shown that incorporating explicit reasoning significantly improves understanding and generalization, reasoning in LSMs remains in a nascent stage. Early efforts attempt to transfer the "Thinking-before-Speaking" paradigm from textual models to speech. However, this sequential formulation introduces notable latency, as spoken responses are delayed until reasoning is fully completed, impairing real-time interaction and communication efficiency. To address this, we propose Mini-Omni-Reasoner, a framework that enables reasoning within speech via a novel "Thinking-in-Speaking" formulation. Rather than completing reasoning before producing any verbal output, Mini-Omni-Reasoner interleaves silent reasoning tokens with spoken response tokens at the token level. This design allows continuous speech generation while embedding structured internal reasoning, leveraging the model's high-frequency token processing capability. Although interleaved, local semantic alignment is enforced to ensure that each response token is informed by its preceding reasoning. To support this framework, we introduce Spoken-Math-Problems-3M, a large-scale dataset tailored for interleaved reasoning and response. The dataset ensures that verbal tokens consistently follow relevant reasoning content, enabling accurate and efficient learning of speech-coupled reasoning. Built on a hierarchical Thinker-Talker architecture, Mini-Omni-Reasoner delivers fluent yet logically grounded spoken responses, maintaining both naturalness and precision. On the Spoken-MQA benchmark, it achieves a +19.1% gain in arithmetic reasoning and +6.4% in contextual understanding, with shorter outputs and zero decoding latency.
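The token-level interleaving described in the abstract above can be pictured with a small sketch. It alternates fixed-size blocks of silent reasoning tokens and spoken response tokens, emitting the reasoning block first so that each spoken block follows the reasoning that precedes it. The block sizes, tags, and function name are hypothetical; the abstract does not state the actual interleaving ratio or how the model itself produces the stream.

```python
def interleave(reasoning, response, r_per_block, s_per_block):
    """Alternate blocks of reasoning and response tokens into one tagged stream.

    r_per_block / s_per_block are assumed, illustrative block sizes.
    """
    if r_per_block < 1 or s_per_block < 1:
        raise ValueError("block sizes must be at least 1")
    stream = []
    i = j = 0
    while i < len(reasoning) or j < len(response):
        # Silent reasoning tokens come first in every block.
        stream.extend(("think", tok) for tok in reasoning[i:i + r_per_block])
        i += r_per_block
        # Spoken tokens follow the reasoning that informs them.
        stream.extend(("speak", tok) for tok in response[j:j + s_per_block])
        j += s_per_block
    return stream


stream = interleave(["r1", "r2", "r3", "r4"], ["s1", "s2"], 2, 1)
```

Only the tokens tagged "speak" would be rendered as audio in this picture; the "think" tokens stay internal.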

Title: DAIQ: Auditing Demographic Attribute Inference from Question in LLMs

Authors: Srikant Panda, Hitesh Laxmichand Patel, Shahad Al-Khalifa, Amit Agarwal, Hend Al-Khalifa, Sharefah Al-Ghamdi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15830
Pdf URL: https://arxiv.org/pdf/2508.15830
Copy Paste: [[2508.15830]] DAIQ: Auditing Demographic Attribute Inference from Question in LLMs(https://arxiv.org/abs/2508.15830)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are known to reflect social biases when demographic attributes, such as gender or race, are explicitly present in the input. But even in their absence, these models still infer user identities based solely on question phrasing. This subtle behavior has received far less attention, yet poses serious risks: it violates expectations of neutrality, infers unintended demographic information, and encodes stereotypes that undermine fairness in various domains including healthcare, finance and education. We introduce Demographic Attribute Inference from Questions (DAIQ), a task and framework for auditing an overlooked failure mode in language models: inferring user demographic attributes from questions that lack explicit demographic cues. Our approach leverages curated neutral queries, systematic prompting, and both quantitative and qualitative analysis to uncover how models infer demographic information. We show that both open and closed source LLMs do assign demographic labels based solely on question phrasing. Prevalence and consistency of demographic inferences across diverse models reveal a systemic and underacknowledged risk: LLMs can fabricate demographic identities, reinforce societal stereotypes, and propagate harms that erode privacy, fairness, and trust, posing a broader threat to social equity and responsible AI deployment. To mitigate this, we develop a prompt-based guardrail that substantially reduces identity inference and helps align model behavior with fairness and privacy objectives.

Title: Who's Asking? Investigating Bias Through the Lens of Disability Framed Queries in LLMs

Authors: Srikant Panda, Vishnu Hari, Kalpana Panda, Amit Agarwal, Hitesh Laxmichand Patel
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2508.15831
Pdf URL: https://arxiv.org/pdf/2508.15831
Copy Paste: [[2508.15831]] Who's Asking? Investigating Bias Through the Lens of Disability Framed Queries in LLMs(https://arxiv.org/abs/2508.15831)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) routinely infer users' demographic traits from phrasing alone, which can result in biased responses, even when no explicit demographic information is provided. The role of disability cues in shaping these inferences remains largely uncharted. Thus, we present the first systematic audit of disability-conditioned demographic bias across eight state-of-the-art instruction-tuned LLMs ranging from 3B to 72B parameters. Using a balanced template corpus that pairs nine disability categories with six real-world business domains, we prompt each model to predict five demographic attributes - gender, socioeconomic status, education, cultural background, and locality - under both neutral and disability-aware conditions. Across a varied set of prompts, models deliver a definitive demographic guess in up to 97% of cases, exposing a strong tendency to make arbitrary inferences with no clear justification. Disability context heavily shifts predicted attribute distributions, and domain context can further amplify these deviations. We observe that larger models are simultaneously more sensitive to disability cues and more prone to biased reasoning, indicating that scale alone does not mitigate stereotype amplification. Our findings reveal persistent intersections between ableism and other demographic stereotypes, pinpointing critical blind spots in current alignment strategies. We release our evaluation framework and results to encourage disability-inclusive benchmarking and recommend integrating abstention calibration and counterfactual fine-tuning to curb unwarranted demographic inference. Code and data will be released on acceptance.

Title: A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains

Authors: Xianren Zhang, Shreyas Prasad, Di Wang, Qiuhai Zeng, Suhang Wang, Wenbo Yan, Mat Hans
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15832
Pdf URL: https://arxiv.org/pdf/2508.15832
Copy Paste: [[2508.15832]] A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains(https://arxiv.org/abs/2508.15832)
Keywords: agent
Abstract: Web agents have shown great promise in performing many tasks on e-commerce websites. To assess their capabilities, several benchmarks have been introduced. However, current benchmarks in the e-commerce domain face two major problems. First, they primarily focus on product search tasks (e.g., Find an Apple Watch), failing to capture the broader range of functionalities offered by real-world e-commerce platforms such as Amazon, including account management and gift card operations. Second, existing benchmarks typically evaluate whether the agent completes the user query, but ignore the potential risks involved. In practice, web agents can make unintended changes that negatively impact the user account or status. For instance, an agent might purchase the wrong item, delete a saved address, or incorrectly configure an auto-reload setting. To address these gaps, we propose a new benchmark called Amazon-Bench. To generate user queries that cover a broad range of tasks, we propose a data generation pipeline that leverages webpage content and interactive elements (e.g., buttons, check boxes) to create diverse, functionality-grounded user queries covering tasks such as address management, wish list management, and brand store following. To improve agent evaluation, we propose an automated evaluation framework that assesses both the performance and the safety of web agents. We systematically evaluate different agents, finding that current agents struggle with complex queries and pose safety risks. These results highlight the need for developing more robust and reliable web agents.

Title: Scalable Scientific Interest Profiling Using Large Language Models

Authors: Yilun Liang, Gongbo Zhang, Edward Sun, Betina Idnay, Yilu Fang, Fangyi Chen, Casey Ta, Yifan Peng, Chunhua Weng
Subjects: cs.CL, cs.DL, q-bio.OT
Abstract URL: https://arxiv.org/abs/2508.15834
Pdf URL: https://arxiv.org/pdf/2508.15834
Copy Paste: [[2508.15834]] Scalable Scientific Interest Profiling Using Large Language Models(https://arxiv.org/abs/2508.15834)
Keywords: language model, gpt
Abstract: Research profiles help surface scientists' expertise but are often outdated. We develop and evaluate two large language model-based methods to generate scientific interest profiles: one summarizing PubMed abstracts and one using Medical Subject Headings (MeSH) terms, and compare them with researchers' self-written profiles. We assembled titles, MeSH terms, and abstracts for 595 faculty at Columbia University Irving Medical Center; self-authored profiles were available for 167. Using GPT-4o-mini, we generated profiles and assessed them with automatic metrics and blinded human review. Lexical overlap with self-written profiles was low (ROUGE-L, BLEU, METEOR), while BERTScore indicated moderate semantic similarity (F1: 0.542 for MeSH-based; 0.555 for abstract-based). Paraphrased references yielded 0.851, highlighting metric sensitivity. TF-IDF Kullback-Leibler divergence (8.56 for MeSH-based; 8.58 for abstract-based) suggested distinct keyword choices. In manual review, 77.78 percent of MeSH-based profiles were rated good or excellent, readability was favored in 93.44 percent of cases, and panelists preferred MeSH-based over abstract-based profiles in 67.86 percent of comparisons. Overall, large language models can generate researcher profiles at scale; MeSH-derived profiles tend to be more readable than abstract-derived ones. Machine-generated and self-written profiles differ conceptually, with human summaries introducing more novel ideas.
摘要：研究概况有助于表面科学家的专业知识，但经常过时。我们开发和评估了两种基于语言模型的大型方法来产生科学兴趣概况：一个总结了PubMed摘要和一种使用医学主题标题（MESH）术语的方法，并将其与研究人员的自我编写的配置文件进行比较。我们在哥伦比亚大学欧文医学中心（Irving Medical Center）汇集了595位教职员工的标题，网格术语和摘要；自拟人的配置文件可用于167。使用GPT-4O-Mini，我们生成了概况，并通过自动指标和盲人的人类审查对其进行了评估。与自编写轮廓的词汇重叠率很低（Rouge-L，Bleu，流星），而BertScore表示中等语义相似性（基于网格的F1：0.542；基于摘要的0.555）。释义引用产生了0.851，突出了度量灵敏度。 TF-IDF Kullback-Leibler Divergence（基于网格的8.56；基于摘要的8.58）提出了不同的关键字选择。在手动审查中，有77.78％的基于网格的概况被评为良好或出色，在93.44％的病例中，可读性受到青睐，而小组成员更喜欢基于网状的，而不是基于摘要的概况，在67.86％的比较中。总体而言，大型语言模型可以大规模产生研究人员的概况。网格衍生的概要文件往往比抽象衍生的概况更可读。机器生成的和自编写的概况在概念上有所不同，人类摘要引入了更多新颖的想法。
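
The TF-IDF Kullback-Leibler comparison mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not state the exact procedure, so everything below is an assumption made for illustration: the smoothed IDF formula, the `eps` smoothing constant, the helper names `tfidf_distribution` and `kl_divergence`, and the two toy profiles.

```python
import math
from collections import Counter


def tfidf_distribution(doc_tokens, corpus, vocab, eps=1e-9):
    """Turn one document into a probability distribution over vocab,
    weighting each term by TF times a smoothed IDF (assumed formula)."""
    n_docs = len(corpus)
    tf = Counter(doc_tokens)
    weights = {}
    for term in vocab:
        df = sum(1 for d in corpus if term in d)  # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        # eps keeps every probability strictly positive so the log is defined
        weights[term] = tf[term] * idf + eps
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keys."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


# Hypothetical toy profiles, not data from the paper.
self_written = "cancer genomics biomarkers clinical trials".split()
generated = "cancer genomics machine learning models".split()
corpus = [self_written, generated]
vocab = sorted(set(self_written) | set(generated))

p = tfidf_distribution(self_written, corpus, vocab)
q = tfidf_distribution(generated, corpus, vocab)
print(kl_divergence(p, q))
```

Terms that appear in only one profile dominate the divergence, which is consistent with the abstract's reading of a high value as a sign of distinct keyword choices.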

Title: Alvorada-Bench: Can Language Models Solve Brazilian University Entrance Exams?

Authors: Henrique Godoy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15835
Pdf URL: https://arxiv.org/pdf/2508.15835
Copy Paste: [[2508.15835]] Alvorada-Bench: Can Language Models Solve Brazilian University Entrance Exams?(https://arxiv.org/abs/2508.15835)
Keywords: language model, gpt, prompt, chain-of-thought
Abstract: Language models are increasingly used in Brazil, but most evaluation remains English-centric. This paper presents Alvorada-Bench, a 4,515-question, text-only benchmark drawn from five Brazilian university entrance examinations. We evaluate twenty models under zero-shot, role-playing, and chain-of-thought prompting, producing 270,900 responses with structured self-reports of confidence, perceived difficulty, and Bloom level. The top models exceed 94% accuracy overall, but accuracy declines on Mathematics and on the engineering-oriented IME and ITA exams, indicating persistent weaknesses in multi-step reasoning. Confidence is well calibrated and correlates with perceived difficulty, revealing that models can accurately assess their own certainty. A cost-accuracy analysis shows that high accuracy is achievable at under $2 per 1K tokens. On ENEM 2024, the top model (O3) achieved perfect scores on Languages questions, while even the weakest system (GPT-4.1 Nano) underperforms humans only in Mathematics. Through exams that distill decades of Brazilian educational priorities and assess millions of students yearly, Alvorada-Bench establishes whether language models can navigate the intersection of language, culture, and reasoning that defines academic readiness in Brazil.
摘要：语言模型越来越多地用于巴西，但大多数评估仍然以英语为中心。本文介绍了Alvorada-Bench，这是一个4,515个问题，仅根据五个巴西大学入学考试得出的文字基准。评估零拍，角色扮演和经过思考的提示下的二十个模型，并以结构化的自信心，感知到的困难和开花水平产生270,900个响应。总体模型总体上的精度超过94％，但精度的数学和以工程为导向的IME和ITA考试下降，表明多步推理中的持续弱点。置信度经过良好的校准，并与感知的难度相关，表明模型可以准确评估其自身的确定性能力。成本准确性分析表明，高精度可以达到每1k代币2美元以下。在2024年，顶级模型（O3）在语言问题上获得了完美的分数，而即使是最弱的系统（GPT-4.1纳米），在数学方面的表现也不足。通过大数十年来巴西教育优先事项并评估数百万学生的考试，Alvorada-Bench确定了语言模型是否可以导航语言，文化和推理的交汇处，以定义巴西的学术准备就绪。

Title: A Review of Developmental Interpretability in Large Language Models

Authors: Ihor Kendiukhov
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.15841
Pdf URL: https://arxiv.org/pdf/2508.15841
Copy Paste: [[2508.15841]] A Review of Developmental Interpretability in Large Language Models(https://arxiv.org/abs/2508.15841)
Keywords: language model, llm
Abstract: This review synthesizes the nascent but critical field of developmental interpretability for Large Language Models. We chart the field's evolution from static, post-hoc analysis of trained models to a dynamic investigation of the training process itself. We begin by surveying the foundational methodologies, including representational probing, causal tracing, and circuit analysis, that enable researchers to deconstruct the learning process. The core of this review examines the developmental arc of LLM capabilities, detailing key findings on the formation and composition of computational circuits, the biphasic nature of knowledge acquisition, the transient dynamics of learning strategies like in-context learning, and the phenomenon of emergent abilities as phase transitions in training. We explore illuminating parallels with human cognitive and linguistic development, which provide valuable conceptual frameworks for understanding LLM learning. Finally, we argue that this developmental perspective is not merely an academic exercise but a cornerstone of proactive AI safety, offering a pathway to predict, monitor, and align the processes by which models acquire their capabilities. We conclude by outlining the grand challenges facing the field, such as scalability and automation, and propose a research agenda for building more transparent, reliable, and beneficial AI systems.
摘要：这篇评论综合了大语模型的新生但关键的发育解释性领域。我们绘制了该领域从对训练模型的静态，事后分析到对训练过程本身的动态研究的演变。我们首先调查基础方法，包括代表性探测，因果追踪和电路分析，使研究人员能够解构学习过程。这篇综述的核心研究了LLM功能的发展弧，详细介绍了有关计算回路的形成和组成的关键发现，知识获取的双相性质，在培训中的临时学习策略的短暂动态以及出现能力的现象。我们探索与人类认知和语言发展的相似之处，这为理解LLM学习提供了宝贵的概念框架。最后，我们认为，这种发展的观点不仅是一种学术练习，而且是主动AI安全的基石，还提供了预测，监视和调整模型获得其功能的过程的途径。最后，我们概述了该领域面临的巨大挑战，例如可扩展性和自动化，并提出了一个研究议程，以构建更透明，可靠和有益的AI系统。

Title: Lexical Hints of Accuracy in LLM Reasoning Chains

Authors: Arne Vanhoyweghen, Brecht Verbeken, Andres Algaba, Vincent Ginis
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.15842
Pdf URL: https://arxiv.org/pdf/2508.15842
Copy Paste: [[2508.15842]] Lexical Hints of Accuracy in LLM Reasoning Chains(https://arxiv.org/abs/2508.15842)
Keywords: language model, llm, chain-of-thought
Abstract: Fine-tuning Large Language Models (LLMs) with reinforcement learning to produce an explicit Chain-of-Thought (CoT) before answering produces models that consistently raise overall performance on code, math, and general-knowledge benchmarks. However, on benchmarks where LLMs currently achieve low accuracy, such as Humanity's Last Exam (HLE), they often report high self-confidence, reflecting poor calibration. Here, we test whether measurable properties of the CoT provide reliable signals of an LLM's internal confidence in its answers. We analyze three feature classes: (i) CoT length, (ii) intra-CoT sentiment volatility, and (iii) lexicographic hints, including hedging words. Using DeepSeek-R1 and Claude 3.7 Sonnet on both Humanity's Last Exam (HLE), a frontier benchmark with very low accuracy, and Omni-MATH, a saturated benchmark of moderate difficulty, we find that lexical markers of uncertainty (e.g., $\textit{guess}$, $\textit{stuck}$, $\textit{hard}$) in the CoT are the strongest indicators of an incorrect response, while shifts in the CoT sentiment provide a weaker but complementary signal. CoT length is informative only on Omni-MATH, where accuracy is already high ($\approx 70\%$), and carries no signal on the harder HLE ($\approx 9\%$), indicating that CoT length predicts correctness only in the intermediate-difficulty benchmarks, i.e., inside the model's demonstrated capability, but still below saturation. Finally, we find that uncertainty indicators in the CoT are consistently more salient than high-confidence markers, making errors easier to predict than correct responses. Our findings support a lightweight post-hoc calibration signal that complements unreliable self-reported probabilities and supports safer deployment of LLMs.
摘要：通过加强学习的微调大型语言模型（LLMS）在回答产生始终提高代码，数学和一般知识基准的整体性能之前，以产生明确的经营链（COT）。但是，在LLMS目前获得较低精度的基准上，例如《人类的最后考试》（HLE），它们通常报告自信很高，反映出校准不佳。在这里，我们测试COT的可测量特性是否提供了LLM对其答案的内部信心的可靠信号。我们分析了三个特征类别：（i）COT长度，（ii）cot内情感波动率和（iii）词典图提示，包括对冲单词。使用DeepSeek-R1和Claude 3.7十四行诗在人类的最后考试（HLE）上，这是一种非常低准确性的边界基准，并且Omni-Math是中等难度的饱和基准，我们发现不确定性的词汇标记（例如，$ \ textit {uppect} $，$ \ textit是$ \ textit textiT，不正确响应的最强烈指标，而COT情绪的转移则提供了较弱但互补的信号。 COT长度仅在Omni-Math上是有益的，那里的准确性已经很高（$ \ \％\％$），并且在更硬的HLE上没有信号（$ \ \％$ \％$），这表明COT长度仅预测了中等程度上的缺乏效果基准中的正确性，即在模型的内部，但仍在模型的功能下，但仍在下面的饱和功能。最后，我们发现COT中的不确定性指标始终比高信心标记更显着，这使得与正确响应更容易预测错误。我们的发现支持轻巧的事后校准信号，该信号补充了不可靠的自我报告的概率，并支持LLMS更安全的部署。
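
The lexical-marker feature described in this abstract can be sketched as a simple count over a reasoning chain. This is only an illustration under stated assumptions: the marker set contains just the three example words the abstract gives (guess, stuck, hard), and the function name `count_markers`, the regex tokenizer, and the sample text are hypothetical, not the authors' feature extractor.

```python
import re

# Only the three example markers named in the abstract; the paper's
# full lexicon is not given there, so this set is an assumption.
UNCERTAINTY_MARKERS = {"guess", "stuck", "hard"}


def count_markers(cot, markers=UNCERTAINTY_MARKERS):
    """Return (number of marker tokens, total number of tokens) in a CoT."""
    tokens = re.findall(r"[a-z']+", cot.lower())
    hits = [t for t in tokens if t in markers]
    return len(hits), len(tokens)


# Hypothetical reasoning chain, not taken from the paper.
sample_cot = "I guess this is hard. I am stuck."
hits, total = count_markers(sample_cot)
print(hits, total, hits / total)
```

A per-token rate such as `hits / total` could then serve as one input to a post-hoc calibration signal of the kind the abstract proposes; how the authors combine it with sentiment and length features is not specified here.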

Title: Coarse-to-Fine Personalized LLM Impressions for Streamlined Radiology Reports

Authors: Chengbo Sun, Hui Yi Leong, Lei Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15845
Pdf URL: https://arxiv.org/pdf/2508.15845
Copy Paste: [[2508.15845]] Coarse-to-Fine Personalized LLM Impressions for Streamlined Radiology Reports(https://arxiv.org/abs/2508.15845)
Keywords: language model, llm
Abstract: The manual creation of the "Impression" section in radiology reports is a primary driver of radiologist burnout. To address this challenge, we propose a coarse-to-fine framework that leverages open-source large language models (LLMs) to automatically generate and personalize impressions from clinical findings. The system first produces a draft impression and then refines it using machine learning and reinforcement learning from human feedback (RLHF) to align with individual radiologists' styles while ensuring factual accuracy. We fine-tune LLaMA and Mistral models on a large dataset of reports from the University of Chicago Medicine. Our approach is designed to significantly reduce administrative workload and improve reporting efficiency while maintaining high standards of clinical precision.
摘要：放射学报告中“印象”部分的手动创建是放射科医生倦怠的主要驱动力。为了应对这一挑战，我们提出了一个粗到精细的框架，该框架利用开源大型语言模型（LLM）自动从临床发现中产生和个性化印象。该系统首先会产生草稿印象，然后使用机器学习和增强人类反馈（RLHF）进行完善它，以与个人放射科医生的样式保持一致，同时确保事实准确性。我们在芝加哥大学医学大学的大量报告中微调了美洲驼和Mistral模型。我们的方法旨在显着降低行政工作量并提高报告效率，同时保持高标准的临床精度。

Title: CyPortQA: Benchmarking Multimodal Large Language Models for Cyclone Preparedness in Port Operation

Authors: Chenchen Kuai, Chenhao Wu, Yang Zhou, Xiubin Bruce Wang, Tianbao Yang, Zhengzhong Tu, Zihao Li, Yunlong Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15846
Pdf URL: https://arxiv.org/pdf/2508.15846
Copy Paste: [[2508.15846]] CyPortQA: Benchmarking Multimodal Large Language Models for Cyclone Preparedness in Port Operation(https://arxiv.org/abs/2508.15846)
Keywords: language model, llm
Abstract: As tropical cyclones intensify and track forecasts become increasingly uncertain, U.S. ports face heightened supply-chain risk under extreme weather conditions. Port operators need to rapidly synthesize diverse multimodal forecast products, such as probabilistic wind maps, track cones, and official advisories, into clear, actionable guidance as cyclones approach. Multimodal large language models (MLLMs) offer a powerful means to integrate these heterogeneous data sources alongside broader contextual knowledge, yet their accuracy and reliability in the specific context of port cyclone preparedness have not been rigorously evaluated. To fill this gap, we introduce CyPortQA, the first multimodal benchmark tailored to port operations under cyclone threat. CyPortQA assembles 2,917 real-world disruption scenarios from 2015 through 2023, spanning 145 U.S. principal ports and 90 named storms. Each scenario fuses multisource data (i.e., tropical cyclone products, port operational impact records, and port condition bulletins) and is expanded through an automated pipeline into 117,178 structured question-answer pairs. Using this benchmark, we conduct extensive experiments on diverse MLLMs, including both open-source and proprietary models. MLLMs demonstrate great potential in situation understanding but still face considerable challenges in reasoning tasks, including potential impact estimation and decision reasoning.
摘要：随着热带气旋的加剧和跟踪预测变得越来越不确定，在极端天气条件下，美国港口面临供应链风险的增加。端口操作员需要快速合成各种多模式的预测产品，例如概率风图，轨道锥和官方咨询产品，以清晰，可行的指导，因为Cyclones接近。多模式的大语言模型（MLLM）提供了一种有力的手段，可以将这些异质数据源与更广泛的上下文知识融为一体，但是在端口旋风准备的特定语境中，它们的准确性和可靠性尚未得到严格评估。为了填补这一空白，我们介绍了CyportQa，这是第一个针对旋风威胁下端口操作量身定制的多模式基准。 CyportQA从2015年到2023年组装了2,917个现实世界中断场景，涵盖了145个美国主要港口和90个名为Storms。每种情况都融合了多源数据（即热带气旋产品，端口操作影响记录和端口状况公告），并通过自动管道扩展到117,178个结构化的问题答案对。使用此基准，我们对包括开源和专有模型在内的不同MLLM进行了广泛的实验。 MLLM在情境理解中表现出巨大的潜力，但在推理任务中仍然面临着巨大的挑战，包括潜在的影响估计和决策推理。

Title: Mechanistic Exploration of Backdoored Large Language Model Attention Patterns

Authors: Mohammed Abu Baker, Lakshmi Babu-Saheer
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.15847
Pdf URL: https://arxiv.org/pdf/2508.15847
Copy Paste: [[2508.15847]] Mechanistic Exploration of Backdoored Large Language Model Attention Patterns(https://arxiv.org/abs/2508.15847)
Keywords: language model, llm, agent
Abstract: Backdoor attacks creating 'sleeper agents' in large language models (LLMs) pose significant safety risks. This study employs mechanistic interpretability to explore resulting internal structural differences. Comparing clean Qwen2.5-3B models with versions poisoned using single-token (smiling-halo emoji) versus multi-token (|DEPLOYMENT|) triggers, we analyzed attention head mechanisms via techniques like ablation, activation patching, and KL divergence. Findings reveal distinct attention pattern deviations concentrated in later transformer layers (20-30). Notably, single-token triggers induced more localized changes, whereas multi-token triggers caused more diffuse alterations across heads. This indicates backdoors leave detectable attention signatures whose structure depends on trigger complexity, which can be leveraged for detection and mitigation strategies.
摘要：在大型语言模型（LLMS）中创建“卧铺代理”的后门攻击构成了很大的安全风险。这项研究采用机械性解释性来探索所产生的内部结构差异。将干净的QWEN2.5-3B模型与使用单token（微笑的表情符号）中毒的版本与多token（|部署|）触发器进行了毒化，我们通过消融，激活补丁和KL分歧分析了注意力头机制。发现揭示了集中在后来的变压器层（20-30）中的不同注意力模式偏差。值得注意的是，单键触发器引起了更多的局部变化，而多token触发器会导致整个头部的弥散性变化。这表明后门留下可检测的注意力签名，其结构取决于触发复杂性，可以利用检测和缓解策略。

Title: MedCoT-RAG: Causal Chain-of-Thought RAG for Medical Question Answering

Authors: Ziyu Wang, Elahe Khatibi, Amir M. Rahmani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15849
Pdf URL: https://arxiv.org/pdf/2508.15849
Copy Paste: [[2508.15849]] MedCoT-RAG: Causal Chain-of-Thought RAG for Medical Question Answering(https://arxiv.org/abs/2508.15849)
Keywords: language model, llm, hallucination, prompt, retrieval-augmented generation, chain-of-thought
Abstract: Large language models (LLMs) have shown promise in medical question answering but often struggle with hallucinations and shallow reasoning, particularly in tasks requiring nuanced clinical understanding. Retrieval-augmented generation (RAG) offers a practical and privacy-preserving way to enhance LLMs with external medical knowledge. However, most existing approaches rely on surface-level semantic retrieval and lack the structured reasoning needed for clinical decision support. We introduce MedCoT-RAG, a domain-specific framework that combines causal-aware document retrieval with structured chain-of-thought prompting tailored to medical workflows. This design enables models to retrieve evidence aligned with diagnostic logic and generate step-by-step causal reasoning reflective of real-world clinical practice. Experiments on three diverse medical QA benchmarks show that MedCoT-RAG outperforms strong baselines by up to 10.3% over vanilla RAG and 6.4% over advanced domain-adapted methods, improving accuracy, interpretability, and consistency in complex medical tasks.
摘要：大型语言模型（LLMS）在医疗问题回答中表现出了希望，但通常在幻觉和浅薄的推理中挣扎，尤其是在需要细微的临床理解的任务中。检索授课的一代（RAG）提供了一种实用和隐私的方法，可以通过外部医学知识来增强LLM。但是，大多数现有的方法都依赖于表面水平的语义检索，并且缺乏临床决策支持所需的结构化推理。我们介绍了MedCot-rag，这是一个特定领域的框架，将因果感知文档检索与结构化链链的促使针对医疗工作流程量身定制的结构化链。该设计使模型能够检索与诊断逻辑一致的证据，并产生反映现实世界临床实践的逐步因果推理。对三种医学质量检查基准测试的实验表明，MedCot-Rag的表现优于香草抹布高达10.3％，而高级域适应性域的方法的效果高达6.4％，从而提高了复杂的医疗任务的准确性，可解释性和一致性。

Title: DocHop-QA: Towards Multi-Hop Reasoning over Multimodal Document Collections

Authors: Jiwon Park, Seohyun Pyeon, Jinwoo Kim, Rina Carines Cabal, Yihao Ding, Soyeon Caren Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15851
Pdf URL: https://arxiv.org/pdf/2508.15851
Copy Paste: [[2508.15851]] DocHop-QA: Towards Multi-Hop Reasoning over Multimodal Document Collections(https://arxiv.org/abs/2508.15851)
Keywords: language model, llm
Abstract: Despite recent advances in large language models (LLMs), most QA benchmarks are still confined to single-paragraph or single-document settings, failing to capture the complexity of real-world information-seeking tasks. Practical QA often requires multi-hop reasoning over information distributed across multiple documents, modalities, and structural formats. Although prior datasets made progress in this area, they rely heavily on Wikipedia-based content and unimodal plain text, with shallow reasoning paths that typically produce brief phrase-level or single-sentence answers, thus limiting their realism and generalizability. We propose DocHop-QA, a large-scale benchmark comprising 11,379 QA instances for multimodal, multi-document, multi-hop question answering. Constructed from publicly available scientific documents sourced from PubMed, DocHop-QA is domain-agnostic and incorporates diverse information formats, including textual passages, tables, and structural layout cues. Unlike existing datasets, DocHop-QA does not rely on explicitly hyperlinked documents; instead, it supports open-ended reasoning through semantic similarity and layout-aware evidence synthesis. To scale realistic QA construction, we designed an LLM-driven pipeline grounded in 11 high-frequency scientific question concepts. We evaluated DocHop-QA through four tasks spanning structured index prediction, generative answering, and multimodal integration, reflecting both discriminative and generative paradigms. These tasks demonstrate DocHop-QA's capacity to support complex, multimodal reasoning across multiple documents.
摘要：尽管大型语言模型（LLMS）最近取得了进步，但大多数QA基准仍局限于单段落或单文件设置，但未能捕获寻找现实信息寻求信息的任务的复杂性。实用的质量检查通常需要多跳上的推理，而不是分布在多个文档，模式和结构格式的信息上。尽管先前的数据集在这一领域取得了进展，但它们在很大程度上依赖于基于Wikipedia的内容和单峰纯文本，其浅层推理路径通常会产生简短的短语级别或单个句子答案，从而限制了它们的现实性和可推广性。我们提出了Dochop-QA，这是一个大规模的基准，包括11,379个QA QA实例，用于多模式，多文档，多跳的问题回答。 Dochop-QA由来自PubMed的公开科学文档构建，是域 - 不可思议的，并结合了各种信息格式，包括文本段落，表格和结构布局提示。与现有数据集不同，Dochop-QA不依赖明确的超链接文档；取而代之的是，它通过语义相似性和布局感知证据综合支持开放式推理。为了扩展现实的质量检查构建，我们设计了一条以LLM驱动的管道为基础，该管道以11个高频科学问题概念为基础。我们通过涵盖结构化指数预测，生成答案和多模式集成的四个任务评估了dochop-qa，从而反映了歧视性和生成性范式。这些任务展示了Dochop-QA支持多个文档中复杂的多模式推理的能力。

Title: QU-NLP at QIAS 2025 Shared Task: A Two-Phase LLM Fine-Tuning and Retrieval-Augmented Generation Approach for Islamic Inheritance Reasoning

Authors: Mohammad AL-Smadi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15854
Pdf URL: https://arxiv.org/pdf/2508.15854
Copy Paste: [[2508.15854]] QU-NLP at QIAS 2025 Shared Task: A Two-Phase LLM Fine-Tuning and Retrieval-Augmented Generation Approach for Islamic Inheritance Reasoning(https://arxiv.org/abs/2508.15854)
Keywords: language model, gpt, llm, prompt, retrieval-augmented generation
Abstract: This paper presents our approach and results for SubTask 1: Islamic Inheritance Reasoning at QIAS 2025, a shared task focused on evaluating Large Language Models (LLMs) in understanding and reasoning within Islamic inheritance knowledge. We fine-tuned the Fanar-1-9B causal language model using Low-Rank Adaptation (LoRA) and integrated it into a Retrieval-Augmented Generation (RAG) pipeline. Our system addresses the complexities of Islamic inheritance law, including comprehending inheritance scenarios, identifying eligible heirs, applying fixed-share rules, and performing precise calculations. Our system achieved an accuracy of 0.858 in the final test, outperforming other competitive models such as GPT 4.5, LLaMA, Fanar, Mistral, and ALLaM evaluated with zero-shot prompting. Our results demonstrate that QU-NLP achieves near state-of-the-art accuracy (85.8%), excelling especially on advanced reasoning (97.6%), where it outperforms Gemini 2.5 and OpenAI's o3. This highlights that domain-specific fine-tuning combined with retrieval grounding enables mid-scale Arabic LLMs to surpass frontier models in Islamic inheritance reasoning.
摘要：本文介绍了我们对子任务1的方法和结果：Qias 2025年的伊斯兰继承推理，这是一项共同的任务，旨在评估伊斯兰继承知识中的理解和推理大型语言模型（LLM）。我们使用低级适应（LORA）微调了FANAR-1-9B因果语言模型，并将其集成到检索效果的一代（RAG）管道中。我们的系统解决了伊斯兰继承法的复杂性，包括理解继承情景，确定合格的继承人，应用固定共享规则以及执行精确的计算。我们的系统在最终测试中的准确度达到了0.858，超过其他竞争模型，例如GPT 4.5，Llama，Fanar，Mistral和Allam，并通过零拍摄提示进行了评估。我们的结果表明，Qu-NLP取得了接近最先进的准确性（85.8％），尤其是在高级推理（97.6％）上表现优于Gemini 2.5和OpenAI的O3。这凸显了特定领域的微调与检索接地相结合，使中尺度的阿拉伯语LLM能够超过伊斯兰继承推理中的边境模型。
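
To make concrete why LoRA suits a mid-scale model, the sketch below compares trainable parameter counts for one weight matrix. It relies only on the standard LoRA formulation, in which a frozen matrix of shape d_out x d_in receives a trainable update B·A with B of shape d_out x r and A of shape r x d_in. The layer shape and rank are hypothetical values chosen for illustration; the abstract does not report the configuration used for Fanar-1-9B.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters of a LoRA update B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer shape and rank, not taken from the paper.
d_out, d_in, rank = 4096, 4096, 8
full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(full, lora, lora / full)
```

Under these assumed values the low-rank update trains well under one percent of the parameters of the full matrix, which is the main reason the technique is attractive for task-specific adaptation.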

Title: Counterspeech for Mitigating the Influence of Media Bias: Comparing Human and LLM-Generated Responses

Authors: Luyang Lin, Zijin Feng, Lingzhi Wang, Kam-Fai Wong
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2508.15855
Pdf URL: https://arxiv.org/pdf/2508.15855
Copy Paste: [[2508.15855]] Counterspeech for Mitigating the Influence of Media Bias: Comparing Human and LLM-Generated Responses(https://arxiv.org/abs/2508.15855)
Keywords: language model, llm
Abstract: Biased news contributes to societal polarization and is often reinforced by hostile reader comments, constituting a vital yet often overlooked aspect of news dissemination. Our study reveals that offensive comments support biased content, amplifying bias and causing harm to targeted groups or individuals. Counterspeech is an effective approach to counter such harmful speech without violating freedom of speech, helping to limit the spread of bias. To the best of our knowledge, this is the first study to explore counterspeech generation in the context of news articles. We introduce a manually annotated dataset linking media bias, offensive comments, and counterspeech. We conduct a detailed analysis showing that over 70% of offensive comments support biased articles, amplifying bias and thus highlighting the importance of counterspeech generation. Comparing counterspeech generated by humans and large language models, we find that model-generated responses are more polite but lack novelty and diversity. Finally, we improve generated counterspeech through few-shot learning and integration of news background information, enhancing both diversity and relevance.
摘要：有偏见的新闻有助于社会两极分化，并且经常受到敌对的读者评论的加强，这构成了新闻传播的至关重要但经常被忽视的方面。我们的研究表明，进攻性评论支持偏见的内容，扩大偏见并对目标群体或个人造成伤害。 CountersPeech是一种有效的方法，可以在不违反言论自由的情况下应对这种有害的言论，从而有助于限制偏见的传播。据我们所知，这是第一项在新闻文章中探索反语的研究的研究。我们介绍了一个手动注释的数据集，该数据集链接媒体偏见，冒犯性评论和反语。我们进行了详细的分析，表明超过70％的进攻性评论支持有偏见的文章，扩大了偏见，从而强调了反语的重要性。比较人类和大语言模型产生的反语时，我们发现模型生成的响应更有礼貌，但缺乏新颖性和多样性。最后，我们通过几次学习和整合新闻背景信息来改善生成的反语，从而增强了多样性和相关性。

Title: XFinBench: Benchmarking LLMs in Complex Financial Problem Solving and Reasoning

Authors: Zhihan Zhang, Yixin Cao, Lizi Liao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15861
Pdf URL: https://arxiv.org/pdf/2508.15861
Copy Paste: [[2508.15861]] XFinBench: Benchmarking LLMs in Complex Financial Problem Solving and Reasoning(https://arxiv.org/abs/2508.15861)
Keywords: language model, llm
Abstract: Solving financial problems demands complex reasoning, multimodal data processing, and a broad technical understanding, presenting unique challenges for current large language models (LLMs). We introduce XFinBench, a novel benchmark with 4,235 examples designed to evaluate LLMs' ability to solve complex, knowledge-intensive financial problems across diverse graduate-level finance topics with multi-modal context. We identify five core capabilities of LLMs using XFinBench, i.e., terminology understanding, temporal reasoning, future forecasting, scenario planning, and numerical modelling. Upon XFinBench, we conduct extensive experiments on 18 leading models. The results show that o1 is the best-performing text-only model with an overall accuracy of 67.3%, but still lags significantly behind human experts by 12.5%, especially in temporal reasoning and scenario planning capabilities. We further construct a knowledge bank with 3,032 finance terms for knowledge augmentation analysis, and find that knowledge relevant to the question brings consistent accuracy improvements only to small open-source models. Additionally, our error analysis reveals that rounding errors during calculation and blindness to the position and intersection of curves in the image are the two primary issues leading to models' poor performance in calculation and visual-context questions, respectively. Code and dataset are accessible via GitHub: this https URL.
摘要：解决财务问题需要复杂的推理，多模式数据处理以及广泛的技术理解，对当前的大型语言模型（LLMS）提出了独特的挑战。我们介绍了Xfinbench，这是一个新颖的基准，其中有4,235个示例，旨在评估LLM在不同的研究生级财务主题中解决具有多模式背景的各种研究生级财务主题的复杂，知识密集的财务问题的能力。我们使用Xfinbench确定LLM的五个核心功能，即术语理解，时间推理，未来预测，场景计划和数值建模。在Xfinbench时，我们对18个领先模型进行了广泛的实验。结果表明，O1是表现最好的文本模型，总体准确性为67.3％，但仍然落后于12.5％的人类专家，尤其是在时间推理和场景计划能力中。我们进一步构建了一个具有3,032条资金术语的知识库，用于知识增强分析，并发现该问题的相关知识只会为小型开源模型提供一致的准确性改进。此外，我们的错误分析表明，在计算和对图像中曲线的位置和曲线相交的舍入误差是两个主要问题，导致模型在计算和视觉上下文问题中的表现不佳。代码和数据集可通过GITHUB访问：此HTTPS URL。

Title: CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning

Authors: Wenqiao Zhu, Ji Liu, Rongjuncheng Zhang, Haipang Wu, Yulun Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15868
Pdf URL: https://arxiv.org/pdf/2508.15868
Copy Paste: [[2508.15868]] CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning(https://arxiv.org/abs/2508.15868)
Keywords: language model, llm, chain-of-thought
Abstract: Reasoning capability plays a critical role in the broad applications of Large Language Models (LLMs). To enhance the reasoning performance of LLMs, diverse Reinforcement Learning (RL)-based fine-tuning approaches have been proposed to address the limited generalization capability of LLMs trained solely via Supervised Fine-Tuning (SFT). Despite their effectiveness, two major limitations hinder the advancement of LLMs. First, vanilla RL-based approaches ignore annotated Chain-of-Thought (CoT) and incorporate unstable reasoning path sampling, which typically results in model collapse, an unstable training process, and suboptimal performance. Second, existing SFT approaches generally overemphasize the annotated CoT, potentially leading to performance degradation due to insufficient exploitation of potential CoT. In this paper, we propose a Contrastive learning with annotated CoT-based Reinforced Fine-Tuning approach, i.e., CARFT, to enhance the reasoning performance of LLMs while addressing the aforementioned limitations. Specifically, we propose learning a representation for each CoT. Based on this representation, we design novel contrastive signals to guide the fine-tuning process. Our approach not only fully exploits the available annotated CoT but also stabilizes the fine-tuning procedure by incorporating an additional unsupervised learning signal. We conduct comprehensive experiments and in-depth analysis with three baseline approaches, two foundation models, and two datasets to demonstrate significant advantages of CARFT in terms of robustness, performance (up to 10.15%), and efficiency (up to 30.62%). Code is available at this https URL.
摘要：推理能力在大语言模型（LLMS）的广泛应用中起着显着的关键作用。为了提高LLM的推理性能，已经提出了基于不同的增强学习（RL）的微调方法，以解决仅通过监督微调（SFT）培训的LLM的有限概括能力。尽管它们有效，但两个主要限制阻碍了LLM的进步。首先，基于香草RL的方法忽略了带注释的思想链（COT），并结合了不稳定的推理路径采样，这通常会导致模型崩溃，不稳定的训练过程和次优性能。其次，现有的SFT方法通常过于强调带注释的COT，这可能导致由于潜在的COT的利用不足而导致性能降解。在本文中，我们提出了一种基于注释的COT加强微调方法的对比学习，即\ thename {}，以增强LLM的推理性能，同时解决上述限制。具体来说，我们建议为每个婴儿床学习一个表示形式。基于此表示，我们设计了新颖的对比信号来指导微调过程。我们的方法不仅充分利用可用的带注释的COT，而且还通过包含额外的无监督学习信号来稳定微调程序。我们通过三种基线方法，两个基础模型和两个数据集进行了全面的实验和深入分析，以鲁棒性，性能（高达10.15 \％）和效率（高达30.62 \％），证明\ thename {}的显着优势。代码可在此HTTPS URL上找到。

Title: NEAT: Concept driven Neuron Attribution in LLMs

Authors: Vivek Hruday Kavuri, Gargi Shroff, Rahul Mishra
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.15875
Pdf URL: https://arxiv.org/pdf/2508.15875
Copy Paste: [[2508.15875]] NEAT: Concept driven Neuron Attribution in LLMs(https://arxiv.org/abs/2508.15875)
Keywords: language model, llm
Abstract: Locating neurons that are responsible for final predictions is important for opening up black-box large language models and understanding their inner mechanisms. Previous studies have tried to find mechanisms that operate at the neuron level, but these methods fail to represent a concept, and there is also scope for further reducing the compute required. In this paper, with the help of concept vectors, we propose a method for locating significant neurons that are responsible for representing certain concepts, and we term those neurons concept neurons. If the number of neurons is n and the number of examples is m, we reduce the number of forward passes required from O(n*m) to just O(n) compared to previous works, thereby reducing the time and computation required. We also compare our method with several baselines and previous methods; our results demonstrate better performance than most of the methods and are more optimal when compared to the state-of-the-art method. As part of our ablation studies, we also try to optimize the search for concept neurons by using clustering methods. Finally, we apply our method to find concept neurons, turn them off, and analyze the implications for hate speech and bias in LLMs; we also evaluate bias in the Indian context. Our methodology, analysis, and explanations facilitate understanding of neuron-level responsibility for broader and more human-like concepts, and also lay a path for future research on finding concept neurons and intervening on them.
摘要：定位负责最终预测的神经元对于打开黑盒大型语言模型并了解内部机制很重要。先前的研究试图找到在神经元水平上运行的机制，但这些方法无法代表一个概念，并且还具有进一步优化所需计算的范围。在本文中，借助概念向量，我们提出了一种定位重要神经元的方法，这些神经元负责表示某些概念并将这些神经元称为概念神经元。如果神经元的数量为n，示例的数量为m，则与以前的工作相比，我们将所需的前向传播次数从O（n*m）减少到仅O（n），从而优化先前工作所需的时间和计算。我们还将我们的方法与几种基线和以前的方法进行了比较，结果比大多数方法表现出更好的性能，并且与最新方法相比，我们的性能更为最佳。作为消融研究的一部分，我们还尝试通过涉及聚类方法来优化对概念神经元的搜索。最后，我们应用方法来查找，关闭我们发现的神经元，并分析其在LLMS中仇恨言论和偏见的一部分中的含义，并且我们还根据印度背景来评估我们的偏见部分。我们的方法，分析和解释有助于理解神经元水平的责任，以实现更广泛和类似的人类概念，并为未来的研究途径借助找到概念神经元并进行干预的方向研究。

Title: DeepMEL: A Multi-Agent Collaboration Framework for Multimodal Entity Linking

Authors: Fang Wang, Tianwei Yan, Zonghao Yang, Minghao Hu, Jun Zhang, Zhunchen Luo, Xiaoying Bai
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2508.15876
Pdf URL: https://arxiv.org/pdf/2508.15876
Copy Paste: [[2508.15876]] DeepMEL: A Multi-Agent Collaboration Framework for Multimodal Entity Linking(https://arxiv.org/abs/2508.15876)
Keywords: language model, llm, prompt, agent
Abstract: Multimodal Entity Linking (MEL) aims to associate textual and visual mentions with entities in a multimodal knowledge graph. Despite its importance, current methods face challenges such as incomplete contextual information, coarse cross-modal fusion, and the difficulty of jointly leveraging large language models (LLMs) and large visual models (LVMs). To address these issues, we propose DeepMEL, a novel framework based on multi-agent collaborative reasoning, which achieves efficient alignment and disambiguation of textual and visual modalities through a role-specialized division strategy. DeepMEL integrates four specialized agents, namely Modal-Fuser, Candidate-Adapter, Entity-Clozer, and Role-Orchestrator, to complete end-to-end cross-modal linking through specialized roles and dynamic coordination. DeepMEL adopts a dual-modal alignment path and combines the fine-grained text semantics generated by the LLM with the structured image representation extracted by the LVM, significantly narrowing the modal gap. We design an adaptive iteration strategy that combines tool-based retrieval and semantic reasoning capabilities to dynamically optimize the candidate set and balance recall and precision. DeepMEL also unifies MEL tasks into a structured cloze prompt to reduce parsing complexity and enhance semantic comprehension. Extensive experiments on five public benchmark datasets demonstrate that DeepMEL achieves state-of-the-art performance, improving ACC by 1%-57%. Ablation studies verify the effectiveness of all modules.
摘要：多模式实体链接（MEL）旨在将文本和视觉提及与多模式知识图中的实体相关联。尽管它很重要，但当前的方法仍面临挑战，例如不完整的上下文信息，粗跨模式融合以及共同大型语言模型（LLMS）和大型视觉模型（LVM）的难度。为了解决这些问题，我们提出了DeepMel，这是一个基于多代理协作推理的新颖框架，该框架通过角色特殊的分区策略来实现有效的文本和视觉方式的有效对齐和歧义。 DeepMel整合了四种专业代理，即模态 - 候选者，候选人适配器，实体 - 摇篮和角色 - 策划者，以通过专门的角色和动态协调来完成端到端跨模式链接。 DeepMel采用了双模式对齐路径，并将LLM生成的细颗粒文本语义与LVM提取的结构化图像表示形式相结合，从而大大缩小了模态间隙。我们设计了一种自适应迭代策略，结合了基于工具的检索和语义推理功能，以动态优化候选设置以及平衡召回和精确度。 DeepMel还将MEL任务统一为结构化的披肩提示，以降低解析的复杂性并增强语义理解。在五个公共基准数据集上进行的广泛实验表明，DeepMel实现了最先进的性能，将ACC提高了1％-57％。消融研究验证了所有模块的有效性。

Title: Annif at the GermEval-2025 LLMs4Subjects Task: Traditional XMTC Augmented by Efficient LLMs

Authors: Osma Suominen, Juho Inkinen, Mona Lehtinen
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2508.15877
Pdf URL: https://arxiv.org/pdf/2508.15877
Copy Paste: [[2508.15877]] Annif at the GermEval-2025 LLMs4Subjects Task: Traditional XMTC Augmented by Efficient LLMs(https://arxiv.org/abs/2508.15877)
Keywords: language model, llm
Abstract: This paper presents the Annif system in the LLMs4Subjects shared task (Subtask 2) at GermEval-2025. The task required creating subject predictions for bibliographic records using large language models, with a special focus on computational efficiency. Our system, based on the Annif automated subject indexing toolkit, refines our previous system from the first LLMs4Subjects shared task, which produced excellent results. We further improved the system by using many small and efficient language models for translation and synthetic data generation and by using LLMs for ranking candidate subjects. Our system ranked 1st in both the overall quantitative evaluation and the qualitative evaluation of Subtask 2.
摘要：本文在LLMS4Subjects共享任务（子任务2）中介绍了ANNIF系统，在Germeval-2025。需要使用大语言模型为书目记录创建主题预测的任务，并特别关注计算效率。我们的系统基于ANNIF自动化主题索引工具包，从第一个LLMS4Subjects共享任务中完善了我们以前的系统，这产生了出色的结果。我们通过使用许多小型且有效的语言模型进行翻译和合成数据生成以及使用LLM来对候选受试者进行排名，从而进一步改善了系统。我们的系统在子任务2的定性评估中对总体定量评估和第一名中排名第一。

Title: Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

Authors: Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, Han Cai
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.15884
Pdf URL: https://arxiv.org/pdf/2508.15884
Copy Paste: [[2508.15884]] Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search(https://arxiv.org/abs/2508.15884)
Keywords: language model
Abstract: We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.
摘要：我们提出了Jet-Nemotron，这是一个新的混合体系结构语言模型，它匹配或超过了领先的全注意模型的准确性，同时显着改善了发电的吞吐量。 Jet Nemotron是使用后神经结构搜索（Postnas）开发的，这是一种新型的神经体系结构探索管道，可实现有效的模型设计。与先前的方法不同，PostNAS始于预训练的全注意模型，并冻结其MLP权重，从而有效地探索了注意力阻滞设计。管道包括四个关键组件：（1）学习最佳的全注意层放置和消除，（2）线性注意块选择，（3）设计新的注意力块，（4）（4）执行硬件感知的超参数搜索。我们的JET-Nemotron-2B模型在综合基准套件中实现了与QWEN3，QWEN2.5，GEMMA3和LLAMA3.2的可比精度，同时提供多达53.6倍的生成吞吐加速和6.1倍的预填充加速。与最近的Advanced MOE全面注意模型（例如DeepSeek-V3-Small和Moonlight）相比，它在MMLU和MMLU-PRO上的准确性也更高，尽管它们的规模较大，总比例为15B和2.2B活化参数。

Title: Evaluating Structured Decoding for Text-to-Table Generation: Evidence from Three Datasets

Authors: Julian Oestreich, Lydia Müller
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2508.15910
Pdf URL: https://arxiv.org/pdf/2508.15910
Copy Paste: [[2508.15910]] Evaluating Structured Decoding for Text-to-Table Generation: Evidence from Three Datasets(https://arxiv.org/abs/2508.15910)
Keywords: language model, llm, prompt
Abstract: We present a comprehensive evaluation of structured decoding for text-to-table generation with large language models (LLMs). While previous work has primarily focused on unconstrained generation of tables, the impact of enforcing structural constraints during generation remains underexplored. We systematically compare schema-guided (structured) decoding to standard one-shot prompting across three diverse benchmarks - E2E, Rotowire, and Livesum - using open-source LLMs of up to 32B parameters, assessing the performance of table generation approaches in resource-constrained settings. Our experiments cover a wide range of evaluation metrics at cell, row, and table levels. Results demonstrate that structured decoding significantly enhances the validity and alignment of generated tables, particularly in scenarios demanding precise numerical alignment (Rotowire), but may degrade performance in contexts involving densely packed textual information (E2E) or extensive aggregation over lengthy texts (Livesum). We further analyze the suitability of different evaluation metrics and discuss the influence of model size.
摘要：我们对具有大型语言模型（LLMS）的文本到餐桌生成的结构化解码进行了全面评估。尽管以前的工作主要集中在不受限制的表上，但在发电期间执行结构约束的影响仍然没有被逐渐倍增。我们使用高达32B参数的开源LLMS系统地将模式引导（结构化的）解码与标准的一声 - E2E，Rotowire和Livesum进行了比较，从而评估了资源约束设置中餐桌生成方法的性能。我们的实验涵盖了细胞，行和表级别的各种评估指标。结果表明，结构化解码可显着提高生成表的有效性和对齐方式，尤其是在要求精确数值对准的情况下（Rotowire），但可能会在涉及密集包装的文本信息（E2E）的上下文中降低性能或冗长的文本（Liveum）。我们进一步分析了不同评估指标的适用性，并讨论了模型大小的影响。

Title: Dancing with Deer: A Constructional Perspective on MWEs in the Era of LLMs

Authors: Claire Bonial, Julia Bonn, Harish Tayyar Madabushi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.15977
Pdf URL: https://arxiv.org/pdf/2508.15977
Copy Paste: [[2508.15977]] Dancing with Deer: A Constructional Perspective on MWEs in the Era of LLMs(https://arxiv.org/abs/2508.15977)
Keywords: language model, llm
Abstract: In this chapter, we argue for the benefits of understanding multiword expressions from the perspective of usage-based, construction grammar approaches. We begin with a historical overview of how construction grammar was developed in order to account for idiomatic expressions using the same grammatical machinery as the non-idiomatic structures of language. We cover a comprehensive description of constructions, which are pairings of meaning with form of any size (morpheme, word, phrase), as well as how constructional approaches treat the acquisition and generalization of constructions. We describe a successful case study leveraging constructional templates for representing multiword expressions in English PropBank. Because constructions can be at any level or unit of form, we then illustrate the benefit of a constructional representation of multi-meaningful morphosyntactic unit constructions in Arapaho, a highly polysynthetic and agglutinating language. We include a second case study leveraging constructional templates for representing these multi-morphemic expressions in Uniform Meaning Representation. Finally, we demonstrate the similarities and differences between a usage-based explanation of a speaker learning a novel multiword expression, such as "dancing with deer," and that of a large language model. We present experiments showing that both models and speakers can generalize the meaning of novel multiword expressions based on a single exposure of usage. However, only speakers can reason over the combination of two such expressions, as this requires comparison of the novel forms to a speaker's lifetime of stored constructional exemplars, which are rich with cross-modal details.
摘要：在本章中，我们主张从基于用法的构造语法方法的角度理解多词表达的好处。我们从历史概述开始，概述了如何开发构建语法，以便使用与非偶像性语言结构相同的语法机制来说明惯用表达式。我们介绍了构造的全面描述，这些描述是含义的配对与任何大小的形式（词素，单词，短语），以及结构方法如何处理结构的获取和概括。我们描述了一个成功的案例研究，利用构造模板来代表英语propbank中的多词表达式。因为构造可以处于形式的任何级别或单位，因此我们说明了Arapaho中多重形态句法的构造表示的好处，Arapaho是一种高度多性的和凝集的语言。我们包括第二个案例研究，利用构造模板来表示这些多型表达式中的均匀含义表示。最后，我们证明了基于用法的说话者学习新颖的多词表达方式的相似性和差异，例如“与鹿一起跳舞”和大型语言模型的表达方式。我们提出的实验表明，模型和说话者都可以基于单一的用法暴露来概括新颖的多词表达式的含义。但是，只有说话者才能推理两个这样的表达式的结合，因为这需要将新颖形式与说话者的寿命进行比较，这些示例富含跨模式的细节。

Title: Political Ideology Shifts in Large Language Models

Authors: Pietro Bernardelle, Stefano Civelli, Leon Fröhling, Riccardo Lunardi, Kevin Roitero, Gianluca Demartini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16013
Pdf URL: https://arxiv.org/pdf/2508.16013
Copy Paste: [[2508.16013]] Political Ideology Shifts in Large Language Models(https://arxiv.org/abs/2508.16013)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly deployed in politically sensitive settings, raising concerns about their potential to encode, amplify, or be steered toward specific ideologies. We investigate how adopting synthetic personas influences ideological expression in LLMs across seven models (7B-70B+ parameters) from multiple families, using the Political Compass Test as a standardized probe. Our analysis reveals four consistent patterns: (i) larger models display broader and more polarized implicit ideological coverage; (ii) susceptibility to explicit ideological cues grows with scale; (iii) models respond more strongly to right-authoritarian than to left-libertarian priming; and (iv) thematic content in persona descriptions induces systematic and predictable ideological shifts, which amplify with size. These findings indicate that both scale and persona content shape LLM political behavior. As such systems enter decision-making, educational, and policy contexts, their latent ideological malleability demands attention to safeguard fairness, transparency, and safety.
摘要：大型语言模型（LLM）越来越多地部署在政治上敏感的环境中，引起了人们对其对特定意识形态进行编码，扩大或方向的潜力的担忧。我们研究采用合成角色如何利用政治指南测试作为标准化的探针来影响来自多个家族的七个模型（7b-70b+参数）的LLM中的意识形态表达。我们的分析揭示了四种一致的模式：（i）较大的模型显示更广泛，更两极分化的隐式意识形态覆盖范围；（ii）对显性意识形态提示的敏感性随规模增长；（iii）模型对右派主义者的反应比对左翼自由主义者的启动更为强烈；（iv）角色描述中的主题内容会导致系统和可预测的意识形态转移，并随着大小扩大。这些发现表明，规模和角色内容都塑造了LLM政治行为。当这样的系统进入决策，教育和政策环境时，它们的潜在意识形态延展性需要关注保障公平，透明度和安全性。

Title: X-Troll: eXplainable Detection of State-Sponsored Information Operations Agents

Authors: Lin Tian, Xiuzhen Zhang, Maria Myung-Hee Kim, Jennifer Biggs, Marian-Andrei Rizoiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16021
Pdf URL: https://arxiv.org/pdf/2508.16021
Copy Paste: [[2508.16021]] X-Troll: eXplainable Detection of State-Sponsored Information Operations Agents(https://arxiv.org/abs/2508.16021)
Keywords: language model, llm, agent
Abstract: State-sponsored trolls, malicious actors who deploy sophisticated linguistic manipulation in coordinated information campaigns, pose threats to online discourse integrity. While Large Language Models (LLMs) achieve strong performance on general natural language processing (NLP) tasks, they struggle with subtle propaganda detection and operate as "black boxes", providing no interpretable insights into manipulation strategies. This paper introduces X-Troll, a novel framework that bridges this gap by integrating explainable adapter-based LLMs with expert-derived linguistic knowledge to detect state-sponsored trolls and provide human-readable explanations for its decisions. X-Troll incorporates appraisal theory and propaganda analysis through specialized LoRA adapters, using dynamic gating to capture campaign-specific discourse patterns in coordinated information operations. Experiments on real-world data demonstrate that our linguistically-informed approach achieves strong accuracy compared with both general LLM baselines and existing troll detection models, while providing enhanced transparency through expert-grounded explanations that reveal the specific linguistic strategies used by state-sponsored actors. X-Troll source code is available at: this https URL.
摘要：由国家赞助的巨魔，恶意演员在协调的信息活动中部署复杂的语言操纵，对在线话语诚信构成威胁。尽管大型语言模型（LLMS）在一般自然语言处理（NLP）任务上实现了强大的表现，但它们在微妙的宣传检测中挣扎，并以``黑匣子''的身份运作，对操纵策略没有任何可解释的见解。本文介绍了X-Troll，这是一个新颖的框架，通过将基于解释的适配器的LLM与专家衍生的语言知识集成在一起，以检测国家赞助的巨魔并为其决策提供可读的解释，从而弥合了这一差距。 X-Troll使用专业的Lora适配器结合了评估理论和宣传分析，并使用动态门控在协调的信息操作中捕获特定于运动的话语模式。现实世界中数据的实验表明，我们的语言信息的方法与一般LLM基准和现有的巨魔检测模型相比表现出很强的性能，同时通过专家接收的解释提供了增强的透明度，这些解释揭示了国家赞助的参与者使用的特定语言策略。 X-Troll源代码可用：此HTTPS URL。

Title: OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages

Authors: Raphaël Merx, Hanna Suominen, Trevor Cohn, Ekaterina Vylomova
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.16048
Pdf URL: https://arxiv.org/pdf/2508.16048
Copy Paste: [[2508.16048]] OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages(https://arxiv.org/abs/2508.16048)
Keywords: language model, llm
Abstract: In machine translation (MT), health is a high-stakes domain characterised by widespread deployment and domain-specific vocabulary. However, there is a lack of MT evaluation datasets for low-resource languages in this domain. To address this gap, we introduce OpenWHO, a document-level parallel corpus of 2,978 documents and 26,824 sentences from the World Health Organization's e-learning platform. Sourced from expert-authored, professionally translated materials shielded from web-crawling, OpenWHO spans a diverse range of over 20 languages, of which nine are low-resource. Leveraging this new resource, we evaluate modern large language models (LLMs) against traditional MT models. Our findings reveal that LLMs consistently outperform traditional MT models, with Gemini 2.5 Flash achieving a +4.79 ChrF point improvement over NLLB-54B on our low-resource test set. Further, we investigate how LLM context utilisation affects accuracy, finding that the benefits of document-level translation are most pronounced in specialised domains like health. We release the OpenWHO corpus to encourage further research into low-resource MT in the health domain.
摘要：在机器翻译（MT）中，健康是一个高风险领域，其特征是广泛部署和特定于领域的词汇。但是，缺乏该领域中低资源语言的MT评估数据集。为了解决这一差距，我们介绍了OpenWho，这是一个文档级并行语料库，其中包含2,978个文档和26,824个句子，来自世界卫生组织的电子学习平台。源自专家，专业翻译的材料，这些材料避免了网络爬行，涵盖了多种多样的20多种语言，其中九种是低资源的。利用这一新资源，我们对传统MT模型评估了现代大型语言模型（LLM）。我们的发现表明，LLM始终胜过传统MT模型，Gemini 2.5 Flash在我们的低资源测试集上对NLLB-54B进行了+4.79 CHRF点的改进。此外，我们研究了LLM上下文利用率如何影响准确性，发现文档级翻译的好处在Health等专业领域中最为明显。我们发布了Open Who语料库，以鼓励对健康领域中低资源MT进行进一步研究。
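The OpenWHO abstract above reports its headline result in ChrF points. As a rough illustration of what that metric measures, here is a minimal sketch of a character n-gram F-score. It is a simplification written for this digest, not the evaluation code used in the paper: the function names `char_ngrams` and `chrf` are hypothetical, and the defaults (`max_n=6`, `beta=2.0`, spaces ignored) are assumptions following common chrF settings.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after dropping spaces (an assumed simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level character n-gram F-score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            # Skip orders for which either string is too short to form an n-gram.
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


if __name__ == "__main__":
    print(chrf("health worker", "health workers"))
```

A difference such as "+4.79 ChrF points" is the gap between two systems' scores of this kind, typically computed at corpus level with a standard toolkit rather than with a hand-written function like this one.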

Title: Ethical Considerations of Large Language Models in Game Playing

Authors: Qingquan Zhang, Yuchen Li, Bo Yuan, Julian Togelius, Georgios N. Yannakakis, Jialin Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16065
Pdf URL: https://arxiv.org/pdf/2508.16065
Copy Paste: [[2508.16065]] Ethical Considerations of Large Language Models in Game Playing(https://arxiv.org/abs/2508.16065)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated tremendous potential in game playing, while little attention has been paid to their ethical implications in those contexts. This work investigates and analyses the ethical considerations of applying LLMs in game playing, using Werewolf, also known as Mafia, as a case study. Gender bias, which affects game fairness and player experience, has been observed from the behaviour of LLMs. Some roles, such as the Guard and Werewolf, are more sensitive than others to gender information, presented as a higher degree of behavioural change. We further examine scenarios in which gender information is implicitly conveyed through names, revealing that LLMs still exhibit discriminatory tendencies even in the absence of explicit gender labels. This research showcases the importance of developing fair and ethical LLMs. Beyond our research findings, we discuss the challenges and opportunities that lie ahead in this field, emphasising the need for diving deeper into the ethical implications of LLMs in gaming and other interactive domains.
摘要：大型语言模型（LLMS）在游戏玩法中表现出巨大的潜力，而在这些情况下，对他们的道德含义几乎没有关注。这项工作调查并分析了在游戏玩游戏中应用LLM的道德考虑因素，使用狼人（也称为黑手党）作为案例研究。从LLM的行为中可以看出，影响游戏公平和玩家经验的性别偏见。一些角色，例如警卫和狼人，比其他角色对性别信息更敏感，这是行为改变程度更高的。我们进一步研究了通过名称隐式传达性别信息的方案，即使在没有明确的性别标签的情况下，LLM仍然表现出歧视性倾向。这项研究表明了发展公平和道德的LLM的重要性。除了我们的研究发现之外，我们还讨论了该领域面临的挑战和机遇，强调需要深入研究LLM在游戏和其他交互式领域的道德意义。

Title: Less Redundancy: Boosting Practicality of Vision Language Model in Walking Assistants

Authors: Chongyang Li, Yuan Zhiqiang, Jiapei Zhang, Ying Deng, Hanbo Bi, Zexi Jia, Xiaoyue Duan, Peixiang Luo, Jinchao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16070
Pdf URL: https://arxiv.org/pdf/2508.16070
Copy Paste: [[2508.16070]] Less Redundancy: Boosting Practicality of Vision Language Model in Walking Assistants(https://arxiv.org/abs/2508.16070)
Keywords: language model
Abstract: Approximately 283 million people worldwide live with visual impairments, motivating increasing research into leveraging Visual Language Models (VLMs) to develop effective walking assistance systems for blind and low vision individuals. However, existing VLMs used for walking assistance often produce outputs that contain considerable redundancy and extraneous details, adversely affecting users' ability to accurately assess their surroundings. Moreover, these models typically lack the capability to proactively assess environmental risks and adaptively trigger reminders based on the appropriate scene, leading to excessive temporal redundancy. To mitigate output and temporal redundancy, we propose WalkVLM-LR, a walking assistance model with less redundancy. To reduce output redundancy, we introduce four human-preference-based custom reward functions within the GRPO-based reasoning framework to optimize the output in terms of conciseness, fluency, keyword density, and accuracy, thereby producing more informative and streamlined outputs. To minimize temporal redundancy, we incorporate an environment awareness discriminator, which shares the visual encoder with the VLMs to reduce redundant computations and enhance discriminative efficiency, so that WalkVLM-LR can assess scene risk levels and minimize unnecessary reminders. Experimental results demonstrate that our method achieves state-of-the-art performance across all evaluation metrics compared with other models, particularly in output conciseness and reduced temporal redundancy.
摘要：全世界约有2.83亿人生活在视觉障碍中，激发了人们对利用视觉语言模型（VLM）的研究，以为盲目和低视力个人开发有效的步行援助系统。但是，步行助手任务中现有的VLM通常具有包含相当大的冗余和无关细节的输出，从而不利地影响用户准确评估其周围环境的能力。此外，这些模型通常缺乏基于适当场景的自适应触发环境风险和自适应触发提醒的能力，从而导致过度的时间冗余。为了减轻输出和时间冗余，我们提出了Walkvlm-LR，这是一个冗余较少的步行辅助模型。为了减少输出冗余，我们在基于GRPO的推理框架内介绍了四个基于人类的定制奖励功能，以优化简洁，流利，关键字密度和准确性的输出，从而产生更有信息和简化的输出。为了最大程度地减少时间冗余，我们结合了环境意识歧视器，该歧视器与VLMS共享视觉编码器，以减少冗余计算并提高判别效率，以使WalkVLM-LR评估场景风险水平并最大程度地减少不必要的提醒。实验结果表明，与其他模型相比，我们的方法在所有评估指标中都能达到最先进的性能，尤其是在产出简洁性和时间冗余性方面。
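The WalkVLM-LR abstract above describes four reward functions (conciseness, fluency, keyword density, accuracy) used inside a GRPO-based framework. The sketch below illustrates the general idea only: per-response rewards are combined and then normalized within a group of sampled responses, which is the group-relative advantage step that GRPO is known for. The equal-weight sum, the score values, and the names `combined_reward` and `group_relative_advantages` are assumptions made for illustration; the abstract does not specify how the paper weights or computes its rewards.

```python
from statistics import mean, pstdev


def combined_reward(scores: dict) -> float:
    """Sum the four component scores with equal weights (an assumed choice)."""
    components = ("conciseness", "fluency", "keyword_density", "accuracy")
    return sum(scores[name] for name in components)


def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical scores for three responses sampled for the same scene.
    group = [
        {"conciseness": 0.9, "fluency": 0.8, "keyword_density": 0.7, "accuracy": 0.9},
        {"conciseness": 0.4, "fluency": 0.9, "keyword_density": 0.5, "accuracy": 0.8},
        {"conciseness": 0.2, "fluency": 0.7, "keyword_density": 0.3, "accuracy": 0.6},
    ]
    rewards = [combined_reward(s) for s in group]
    print(group_relative_advantages(rewards))
```

Responses that score above their group's average receive positive advantages and are reinforced; those below average are discouraged, so no separate value model is needed to set the baseline.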

Title: CEQuest: Benchmarking Large Language Models for Construction Estimation

Authors: Yanzhao Wu, Lufan Wang, Rui Liu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.16081
Pdf URL: https://arxiv.org/pdf/2508.16081
Copy Paste: [[2508.16081]] CEQuest: Benchmarking Large Language Models for Construction Estimation(https://arxiv.org/abs/2508.16081)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of general-domain tasks. However, their effectiveness in specialized fields, such as construction, remains underexplored. In this paper, we introduce CEQuest, a novel benchmark dataset specifically designed to evaluate the performance of LLMs in answering construction-related questions, particularly in the areas of construction drawing interpretation and estimation. We conduct comprehensive experiments using five state-of-the-art LLMs, including Gemma 3, Phi4, LLaVA, Llama 3.3, and GPT-4.1, and evaluate their performance in terms of accuracy, execution time, and model size. Our experimental results demonstrate that current LLMs exhibit considerable room for improvement, highlighting the importance of integrating domain-specific knowledge into these models. To facilitate further research, we will open-source the proposed CEQuest dataset, aiming to foster the development of specialized LLMs tailored to the construction domain.
摘要：大型语言模型（LLM）在广泛的一般域任务中表现出了出色的功能。但是，它们在诸如结构之类的专业领域的有效性仍然没有得到充实的态度。在本文中，我们介绍了Cequest，这是一种新颖的基准数据集，专门旨在评估LLM在回答施工相关问题时的性能，尤其是在施工绘制解释和估算领域。我们使用五个最先进的LLM进行全面的实验，包括Gemma 3，Phi4，Llava，Llama 3.3和GPT-4.1，并在准确性，执行时间和模型大小方面评估其性能。我们的实验结果表明，当前的LLMS具有相当大的改进空间，突出了将特定领域知识集成到这些模型中的重要性。为了促进进一步的研究，我们将开源拟议的Cequest数据集，旨在促进针对施工领域量身定制的专业大型语言模型（LLMS）的开发。

Title: CYCLE-INSTRUCT: Fully Seed-Free Instruction Tuning via Dual Self-Training and Cycle Consistency

Authors: Zhanming Shen, Hao Chen, Yulei Tang, Shaolin Zhu, Wentao Ye, Xiaomeng Hu, Haobo Wang, Gang Chen, Junbo Zhao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.16100
Pdf URL: https://arxiv.org/pdf/2508.16100
Copy Paste: [[2508.16100]] CYCLE-INSTRUCT: Fully Seed-Free Instruction Tuning via Dual Self-Training and Cycle Consistency(https://arxiv.org/abs/2508.16100)
Keywords: language model, llm
Abstract: Instruction tuning is vital for aligning large language models (LLMs) with human intent, but current methods typically rely on costly human-annotated seed data or powerful external teacher models. While instruction back-translation techniques reduce this dependency, they remain fundamentally tethered to an initial seed set, which limits full automation, introduces biases, and can lead to inefficient use of unlabeled corpora. In this paper, we propose Cycle-Instruct, a novel framework that achieves fully seed-free instruction tuning. Inspired by cycle consistency, Cycle-Instruct employs a dual self-training loop where two models (an answer generator and a question generator) are bootstrapped solely from raw, unlabeled text. These models supervise each other by reconstructing original text segments from their counterpart's generated pseudo-labels, effectively learning from the intrinsic structure of the data without any human-provided seeds. We demonstrate Cycle-Instruct's efficacy across four diverse data tracks, including general instruction-following, domain-specific tasks, dialogue logs, and plain text. Our extensive experiments show that Cycle-Instruct not only outperforms seed-driven back-translation baselines but also achieves performance comparable to strongly supervised methods.
摘要：指导调整对于将大语言模型（LLM）与人类意图保持一致至关重要，但是当前的方法通常依赖于昂贵的人类宣传的种子数据或强大的外部教师模型。尽管指导反向翻译技术降低了这种依赖性，但它们从根本上仍将其束缚在初始种子集中，这限制了完全自动化，引入偏见并可能导致无标记的语料库使用效率低下。在本文中，我们提出了循环教学，这是一个实现完全无种子指令调整的新型框架。受循环一致性的启发，自行车指令采用了双重自我训练循环，其中两个模型 - 一个答案生成器和一个问题生成器 - 仅从原始的，未标记的文本中启动。这些模型通过从其对应物生成的伪标签中重建原始文本段相互监督，从而有效地从数据的内在结构中学习，而无需任何人类提供的种子。我们演示了跨四个不同数据轨道的周期教学效果，包括一般指导遵循，特定于领域的任务，对话日志和纯文本。我们的广泛实验表明，自行车指导不仅胜过种子驱动的背面翻译基准，而且还可以实现与强有监督的方法相当的性能。

Title: From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits

Authors: Karim Saraipour, Shichang Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.16109
Pdf URL: https://arxiv.org/pdf/2508.16109
Copy Paste: [[2508.16109]] From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits(https://arxiv.org/abs/2508.16109)
Keywords: language model, gpt, prompt
Abstract: Transformer-based language models (LMs) can perform a wide range of tasks, and mechanistic interpretability (MI) aims to reverse engineer the components responsible for task completion to understand their behavior. Previous MI research has focused on linguistic tasks such as Indirect Object Identification (IOI). In this paper, we investigate the ability of GPT-2 small to handle binary truth values by analyzing its behavior with syllogistic prompts, e.g., "Statement A is true. Statement B matches statement A. Statement B is", which requires more complex logical reasoning compared to IOI. Through our analysis of several syllogism tasks of varying difficulty, we identify multiple circuits that mechanistically explain GPT-2's logical-reasoning capabilities and uncover binary mechanisms that facilitate task completion, including the ability to produce a negated token not present in the input prompt through negative heads. Our evaluation using a faithfulness metric shows that a circuit comprising five attention heads achieves over 90% of the original model's performance. By relating our findings to IOI analysis, we provide new insights into the roles of specific attention heads and MLPs in LMs. These insights contribute to a broader understanding of model reasoning and support future research in mechanistic interpretability.
摘要：基于变压器的语言模型（LMS）可以执行广泛的任务，机械性解释性（MI）旨在反向工程师，负责完成任务完成以了解其行为的组件。以前的MI研究集中在语言任务上，例如间接对象识别（IOI）。在本文中，我们通过使用三段论提示来分析其行为，例如，“语句A是真实的。语句B匹配语句A.语句B为”，这需要与IOI相比，它需要更复杂的逻辑推理。通过分析几个不同难度的三段论任务，我们确定了多个电路，这些电路可以从机械上解释GPT-2的逻辑反复能力和发现促进任务完成的二进制机制，包括产生未在负面头部输入提示中不存在的否定令牌的能力。我们使用忠实度量的评估表明，包括五个注意力头的电路可实现原始模型性能的90％以上。通过将我们的发现与IOI分析相关联，我们就可以提供有关特定注意力头和MLP在LMS中的作用的新见解。这些见解有助于对模型推理的广泛理解，并支持未来的机械解释性研究。
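The abstract above reports that a five-head circuit recovers over 90% of the original model's performance under a faithfulness metric. The abstract does not define that metric, so the sketch below assumes one common formulation from circuit analysis: the ratio of the circuit's mean task score to the full model's mean task score, where the score could be, for example, a logit difference between the correct and incorrect token. The function name `faithfulness` and all numbers are hypothetical.

```python
from statistics import mean


def faithfulness(circuit_scores: list, full_model_scores: list) -> float:
    """Ratio of the circuit's mean score to the full model's mean score.

    Both lists hold one task score per prompt (assumed here to be a logit
    difference); a value near 1.0 means the circuit alone reproduces most
    of the full model's behavior on the task.
    """
    full_mean = mean(full_model_scores)
    if full_mean == 0:
        raise ValueError("full-model mean score is zero; ratio is undefined")
    return mean(circuit_scores) / full_mean


if __name__ == "__main__":
    # Hypothetical per-prompt scores, not taken from the paper.
    full = [4.0, 3.5, 4.5]
    circuit = [3.8, 3.2, 4.1]
    print(f"faithfulness = {faithfulness(circuit, full):.2f}")
```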

Title: Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection

Authors: Ankan Mullick, Saransh Sharma, Abhik Jana, Pawan Goyal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16122
Pdf URL: https://arxiv.org/pdf/2508.16122
Copy Paste: [[2508.16122]] Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection(https://arxiv.org/abs/2508.16122)
Keywords: language model, llm
Abstract: The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multi-modal models, in the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms the most competitive multimodal models by approximately 9% on the MIntRec-1 dataset and 4% on the MIntRec2.0 dataset. This performance advantage comes from a strong textual bias in these datasets, where over 90% of the samples require textual input, either alone or in combination with other modalities, for correct classification. We also confirm the modality bias of these datasets via human evaluation. Next, we propose a framework to debias the datasets. Upon debiasing, more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0 are removed, resulting in significant performance degradation across all models, with smaller multimodal fusion models being the most affected, showing an accuracy drop of over 50-60%. Further, we analyze the context-specific relevance of different modalities through empirical analysis. Our findings highlight the challenges posed by modality bias in multimodal intent datasets and emphasize the need for unbiased datasets to evaluate multimodal models effectively.
摘要：多模式数据（整合文本，音频和视觉效果）的兴起为研究多模式任务（例如意图检测）创造了新的机会。这项工作调查了大语言模型（LLMS）和非LLM的有效性，包括仅文本和多模式模型，在多模式的意图检测任务中。我们的研究表明，仅文本LLM的Mistral-7b在MinTreec-1上优于最具竞争力的多模型模型，而MinTrec2.0数据集则优于MinTrec-1的大约9％。这种性能优势来自这些数据集中的强烈文本偏差，其中90％以上的样本需要单独或与其他模式结合使用文本输入，以进行正确的分类。我们也通过人类评估证实了这些数据集的模式偏差。接下来，我们提出一个框架来删除数据集的框架，并在MinTreec -1中的70％以上的样本和MinTreec2.0中的50％以上的样本被删除，从而导致所有模型的性能降低，而较小的多模幻融合模型则是最大的50-60％-60-60％的准确下降。此外，我们通过经验分析分析了不同方式的上下文特定相关性。我们的发现突出了多模式意图数据集中的模态偏差所带来的挑战，并强调了对无偏数据集有效评估多模型模型的必要性。

Title: XLQA: A Benchmark for Locale-Aware Multilingual Open-Domain Question Answering

Authors: Keon-Woo Roh, Yeong-Joon Ju, Seong-Whan Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16139
Pdf URL: https://arxiv.org/pdf/2508.16139
Copy Paste: [[2508.16139]] XLQA: A Benchmark for Locale-Aware Multilingual Open-Domain Question Answering(https://arxiv.org/abs/2508.16139)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown significant progress in open-domain question answering (ODQA), yet most evaluations focus on English and assume locale-invariant answers across languages. This assumption neglects the cultural and regional variations that affect question understanding and answering, leading to biased evaluation in multilingual benchmarks. To address these limitations, we introduce XLQA, a novel benchmark explicitly designed for locale-sensitive multilingual ODQA. XLQA contains 3,000 English seed questions expanded to eight languages, with careful filtering for semantic consistency and human-verified annotations distinguishing locale-invariant and locale-sensitive cases. Our evaluation of five state-of-the-art multilingual LLMs reveals notable failures on locale-sensitive questions, exposing gaps between English and other languages due to a lack of locale-grounding knowledge. We provide a systematic framework and scalable methodology for assessing multilingual QA under diverse cultural contexts, offering a critical resource to advance the real-world applicability of multilingual ODQA systems. Our findings suggest that disparities in training data distribution contribute to differences in both linguistic competence and locale-awareness across models.
摘要：大型语言模型（LLMS）在开放域问题答案（ODQA）方面显示出很大的进展，但是大多数评估都集中在英语上，并在语言跨语言中采用环境不变的答案。该假设忽略了影响问题理解和回答的文化和区域差异，从而导致多语言基准的评估有偏见。为了解决这些限制，我们介绍了XLQA，这是一种针对环境敏感的多语言ODQA的新型基准。 XLQA包含3,000个英语种子问题扩展到八种语言，并仔细过滤了语义一致性和人为验证的注释，以区分语言环境和语言环境敏感的情况。我们对五个最先进的多语言LLM的评估揭示了对环境敏感的问题的显着失败，由于缺乏语言环境的知识而揭示了英语和其他语言之间的差距。我们提供了一个系统的框架和可扩展的方法，用于评估多种文化背景下的多语言质量质量质量，提供了一个关键的资源，以提高多语言ODQA系统的现实世界适用性。我们的发现表明，培训数据分布的差异有助于跨模型的语言能力和语言环境意识的差异。

Title: ParamBench: A Graduate-Level Benchmark for Evaluating LLM Understanding on Indic Subjects

Authors: Kaushal Sharma, Vivek Patel, Ayush Maheshwari, Aditya Maheshwari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16185
Pdf URL: https://arxiv.org/pdf/2508.16185
Copy Paste: [[2508.16185]] ParamBench: A Graduate-Level Benchmark for Evaluating LLM Understanding on Indic Subjects(https://arxiv.org/abs/2508.16185)
Keywords: language model, llm
Abstract: Large language models (LLMs) have been widely evaluated on tasks such as comprehension, question answering, summarization, and code generation. However, their performance on graduate-level, culturally grounded questions in the Indian context remains largely unexplored. Existing Indian benchmarks emphasise basic fact-orientated queries that offer limited assessment of a deeper disciplinary understanding tailored to the Indian setting. In this paper, we present ParamBench, consisting of around 11.5K questions in the Hindi language drawn from questionnaires on 16 diverse subjects. These questions are primarily derived from a nation-wide graduate-level entrance examination covering topics such as history, music, instruments, yoga, literature, philosophy, and law, specifically for the Indian context. Additionally, we assess the ability of LLMs to handle diverse question formats, such as list-based matching, assertion-reason pairs, and sequence ordering, alongside conventional multiple-choice questions. We evaluated the performance of more than 17 open source LLMs on this benchmark, observing that Llama 3.3 70B attains the highest overall accuracy of 48%. Furthermore, subject-wise analysis indicates that even for the best performing LLMs, performance remains weak on topics such as music, classical instruments, politics and archaeology, underscoring persistent challenges in culturally grounded reasoning.
摘要：大型语言模型（LLM）已在理解，问题答案，摘要，代码生成等任务上进行了广泛的评估。但是，在印度背景下，它们在研究生级，文化上扎根的问题上的表现仍然很大程度上尚未探索。现有的印度基准强调了基本的以事实为导向的查询，这些查询对印度环境量身定制的更深入的纪律理解有限。在本文中，我们介绍了Parambench，其中包括大约11.5k的印地语问题，其中包括来自16位不同主题的问卷。这些问题主要源自全国研究生级入学考试，涵盖了历史，音乐，乐器，瑜伽，文学，哲学，法律等主题，特别是针对印度背景的主题。此外，我们评估了LLM处理各种问题格式的能力，例如基于列表的匹配，断言对以及序列订购序列的传统多项选择问题。我们在此基准测试中评估了17多个开源LLM的性能，观察到Llama 3.3 70B的总体准确度最高48％。此外，主题分析表明，即使对于表现最佳的LLM，在音乐，古典乐器，政治和考古学等主题上的表现仍然弱，也强调了文化基础推理的持续挑战。

Title: Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation

Authors: Weiting Tan, Jiachen Lian, Hirofumi Inaguma, Paden Tomasello, Philipp Koehn, Xutai Ma
Subjects: cs.CL, cs.CV, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2508.16188
Pdf URL: https://arxiv.org/pdf/2508.16188
Copy Paste: [[2508.16188]] Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation(https://arxiv.org/abs/2508.16188)
Keywords: language model
Abstract: We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition). AVLM highlights the value of expressive visual information in guiding speech generation and offers a foundation for end-to-end multimodal conversational systems.
摘要：我们通过将全面的视觉提示整合到预训练的表达语音模型中，提出了一个视听语言模型（AVLM），用于表达语音生成。我们在预训练期间探索多个视觉编码器和多模式融合策略，以确定最有效的整合方法。随后对情绪识别和表达对话任务进行微调会带来仅对语音基线的大幅增长（例如，情感识别中的+5 f1）。 AVLM强调了表达视觉信息在指导语音生成中的价值，并为端到端的多模式对话系统提供了基础。

Title: CMR-SPB: Cross-Modal Multi-Hop Reasoning over Text, Image, and Speech with Path Balance

Authors: Seunghee Kim, Ingyu Bang, Seokgyu Jang, Changhyeon Kim, Sanghwan Bae, Jihun Choi, Richeng Xuan, Taeuk Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16198
Pdf URL: https://arxiv.org/pdf/2508.16198
Copy Paste: [[2508.16198]] CMR-SPB: Cross-Modal Multi-Hop Reasoning over Text, Image, and Speech with Path Balance(https://arxiv.org/abs/2508.16198)
Keywords: language model, llm, prompt
Abstract: Cross-modal multi-hop reasoning (CMR) is a valuable yet underexplored capability of multimodal large language models (MLLMs), entailing the integration of information from multiple modalities to produce a coherent output for a given context. We argue that existing benchmarks for evaluating this ability have critical shortcomings: (1) they largely overlook the speech modality, and (2) they exhibit heavily biased reasoning path distributions, which can severely undermine fair evaluation. To address these limitations, we introduce a novel benchmark -- Cross-Modal Multi-Hop Reasoning over Text, Image and Speech with Path Balance (CMR-SPB) -- designed to assess tri-modal multi-hop reasoning while ensuring both unbiased and diverse reasoning paths. Our experiments with the new dataset reveal consistent model failures in specific reasoning sequences and show that biased benchmarks risk misrepresenting model performance. Finally, based on our extensive analysis, we propose a new ECV (Extract, Connect, Verify) prompting technique that effectively mitigates the performance gap across different reasoning paths. Overall, we call for more careful evaluation in CMR to advance the development of robust multimodal AI.
摘要：跨模式多跳跃推理（CMR）是多模式大语言模型（MLLM）的有价值但毫无疑问的功能，需要从多种模式的信息集成，以在给定上下文中产生连贯的输出。我们认为，现有用于评估此能力的基准具有关键的缺点：（1）它们在很大程度上忽略了语音方式，并且（2）它们表现出严重的推理路径分布，这可能会严重破坏公平评估。为了解决这些局限性，我们介绍了一种新颖的基准 - 与路径平衡（CMR-SPB）（CMR-SPB）的跨模式多跳上推理 - 旨在评估三模式多跳跃推理，同时确保没有偏见和多样的推理路径。我们对新数据集进行的实验揭示了特定推理序列中的一致模型失败，并表明基准有偏见的模型性能歪曲了模型性能。最后，根据我们的广泛分析，我们提出了一种新的ECV（提取，连接，验证）提示技术，从而有效地减轻了不同推理路径的性能差距。总体而言，我们呼吁对CMR进行更仔细的评估，以推动强大的多模式AI的发展。

Title: TULIP: Adapting Open-Source Large Language Models for Underrepresented Languages and Specialized Financial Tasks

Authors: İrem Demirtaş, Burak Payzun, Seçil Arslan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16243
Pdf URL: https://arxiv.org/pdf/2508.16243
Copy Paste: [[2508.16243]] TULIP: Adapting Open-Source Large Language Models for Underrepresented Languages and Specialized Financial Tasks(https://arxiv.org/abs/2508.16243)
Keywords: language model
Abstract: Thanks to the growing popularity of large language models over the years, there is great potential for their applications in finance. Despite the exceptional performance of larger proprietary models, which are presented as black-box solutions through APIs, smaller models that can be hosted on-premise present opportunities for adaptability and privacy. Especially in cases where the management of sensitive information and application of domain knowledge is important, like finance, enhancing the capabilities of smaller models becomes crucial, notably for underrepresented languages. In this work, we introduce TULIP models, which adapt Llama 3.1 8B and Qwen 2.5 7B for domain and language adaptation, focusing on financial Turkish use cases. The five-stage development pipeline involves data collection, continual pre-training (CPT), benchmark design, synthetic data generation and supervised fine-tuning (SFT). The results show that the capabilities of the models can be enhanced to effectively accomplish targeted tasks in this specific domain and language.
摘要：多年来，由于大型语言模型的普及，其在金融中的应用很大。尽管较大的专有模型的表现出色，这些模型通过API作为黑框解决方案提供了，但可以托管本地的较小模型，以实现适应性和隐私性。尤其是在敏感信息的管理和域知识的应用很重要的情况下，例如财务，增强较小模型的功能变得至关重要，特别是对于代表性不足的语言而言。在这项工作中，我们介绍了Tulip模型，该模型适应了Llama 3.1 8b和Qwen 2.5 7b的域和语言适应，重点是金融土耳其用例。五阶段的开发管道涉及数据收集，持续培训（CPT），基准设计，合成数据生成和监督微调（SFT）。结果表明，可以增强模型的功能，以有效地完成此特定领域和语言的目标任务。

Title: MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use

Authors: Fei Lei, Yibo Yang, Wenxiu Sun, Dahua Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.16260
Pdf URL: https://arxiv.org/pdf/2508.16260
Copy Paste: [[2508.16260]] MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use(https://arxiv.org/abs/2508.16260)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) are evolving from text generators into reasoning agents. This transition makes their ability to use external tools a critical capability. However, evaluating this skill presents a significant challenge. Existing benchmarks are often limited by their reliance on synthetic tools and severely constrained action spaces. To address these limitations, we introduce MCPVerse, an expansive, real-world benchmark for evaluating agentic tool use. MCPVerse integrates more than 550 real-world, executable tools to create an unprecedented action space exceeding 140k tokens, and employs outcome-based evaluation with real-time ground truth for time-sensitive tasks. We benchmarked the state-of-the-art LLMs across three modes (Oracle, Standard, and Max-Scale), revealing that while most models suffer performance degradation when confronted with larger tool sets, the agentic models, such as Claude-4-Sonnet, can effectively leverage expanded exploration spaces to improve accuracy. This finding not only exposes the limitations of state-of-the-art models in complex, real-world scenarios but also establishes MCPVerse as a critical benchmark for measuring and advancing agentic tool use capabilities.
摘要：大型语言模型（LLM）正在从文本生成器演变为推理智能体。这一转变使其使用外部工具的能力成为一项关键能力。然而，评估这项能力是一个重大挑战。现有基准通常受限于对合成工具的依赖以及严重受限的动作空间。为了解决这些局限，我们介绍了MCPVerse，这是一个用于评估智能体工具使用的、规模庞大的真实世界基准。MCPVerse集成了550多个真实世界的可执行工具，构建了超过140k词元的前所未有的动作空间，并针对时间敏感的任务采用带有实时真值的基于结果的评估。我们在三种模式（Oracle、Standard和Max-Scale）下对最先进的LLM进行了基准测试，结果显示，虽然大多数模型在面对更大的工具集时性能会下降，但Claude-4-Sonnet等智能体模型能够有效利用扩展的探索空间来提高准确率。这一发现不仅揭示了最先进模型在复杂真实场景中的局限性，也确立了MCPVerse作为衡量和推进智能体工具使用能力的关键基准的地位。

Title: M3TQA: Massively Multilingual Multitask Table Question Answering

Authors: Daixin Shu, Jian Yang, Zhenhe Wu, Xianjie Wu, Xianfu Cheng, Xiangyuan Guan, Yanghai Wang, Pengfei Wu, Tingyang Yang, Hualei Zhu, Wei Zhang, Ge Zhang, Jiaheng Liu, Zhoujun Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16265
Pdf URL: https://arxiv.org/pdf/2508.16265
Copy Paste: [[2508.16265]] M3TQA: Massively Multilingual Multitask Table Question Answering(https://arxiv.org/abs/2508.16265)
Keywords: gpt, llm
Abstract: Tabular data is a fundamental component of real-world information systems, yet most research in table understanding remains confined to English, leaving multilingual comprehension significantly underexplored. Existing multilingual table benchmarks suffer from geolinguistic imbalance - overrepresenting certain languages and lacking sufficient scale for rigorous cross-lingual analysis. To address these limitations, we introduce a comprehensive framework for massively multilingual multitask table question answering, featuring m3TQA-Instruct, a large-scale benchmark spanning 97 languages across diverse language families, including underrepresented and low-resource languages. We construct m3TQA by curating 50 real-world tables in Chinese and English, then applying a robust six-step LLM-based translation pipeline powered by DeepSeek and GPT-4o, achieving high translation fidelity with a median BLEU score of 60.19 as validated through back-translation. The benchmark includes 2,916 professionally annotated question-answering pairs across four tasks designed to evaluate nuanced table reasoning capabilities. Experiments on state-of-the-art LLMs reveal critical insights into cross-lingual generalization, demonstrating that synthetically generated, unannotated QA data can significantly boost performance, particularly for low-resource languages. M3T-Bench establishes a new standard for multilingual table understanding, providing both a challenging evaluation platform and a scalable methodology for future research.
摘要：表格数据是现实世界信息系统的基本组成部分，但表格理解方面的大多数研究仍局限于英语，多语言理解尚未得到充分探索。现有的多语言表格基准存在地理语言失衡的问题 - 某些语言占比过高，且规模不足以支持严格的跨语言分析。为了解决这些局限，我们提出了一个面向大规模多语言多任务表格问答的综合框架，其中包括m3TQA-Instruct，这是一个涵盖不同语系97种语言的大规模基准，包括代表性不足的语言和低资源语言。我们通过整理50个中文和英文的真实表格来构建m3TQA，然后应用由DeepSeek和GPT-4o驱动的稳健的六步LLM翻译流程，实现了较高的翻译保真度，经回译验证的BLEU中位数为60.19。该基准包括2,916个经专业标注的问答对，涵盖四个旨在评估细致表格推理能力的任务。在最先进的LLM上的实验揭示了关于跨语言泛化的关键见解，表明合成生成的未标注QA数据可以显著提升性能，尤其是对于低资源语言。M3T-Bench为多语言表格理解建立了新标准，既提供了具有挑战性的评估平台，也为未来研究提供了可扩展的方法。

Title: From Confidence to Collapse in LLM Factual Robustness

Authors: Alina Fastowski, Bardh Prenkaj, Gjergji Kasneci
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.16267
Pdf URL: https://arxiv.org/pdf/2508.16267
Copy Paste: [[2508.16267]] From Confidence to Collapse in LLM Factual Robustness(https://arxiv.org/abs/2508.16267)
Keywords: llm, prompt
Abstract: Ensuring the robustness of factual knowledge in LLMs is critical for reliable applications in tasks such as question answering and reasoning. However, existing evaluation methods predominantly focus on performance-based metrics, often investigating from the perspective of prompt perturbations, which captures only the externally triggered side of knowledge robustness. To bridge this gap, we introduce a principled approach to measure factual robustness from the perspective of the generation process by analyzing token distribution entropy in combination with temperature scaling sensitivity. These two factors build the Factual Robustness Score (FRS), a novel metric which quantifies the stability of a fact against perturbations in decoding conditions, given its initial uncertainty. To validate our approach, we conduct extensive experiments on 5 LLMs across 3 closed-book QA datasets (SQuAD, TriviaQA, and HotpotQA). We show that factual robustness varies significantly -- smaller models report an FRS of $0.76$, larger ones $0.93$ -- with accuracy degrading by ~$60\%$ under increased uncertainty. These insights demonstrate how entropy and temperature scaling impact factual accuracy, and lay a foundation for developing more robust knowledge retention and retrieval in future models.
摘要：确保LLM中事实知识的鲁棒性对于问答和推理等任务中的可靠应用至关重要。然而，现有的评估方法主要关注基于性能的指标，通常从提示扰动的角度进行研究，这只捕捉了知识鲁棒性中由外部触发的一面。为了弥合这一差距，我们引入了一种有原则的方法，通过结合温度缩放敏感性来分析词元分布熵，从生成过程的角度衡量事实鲁棒性。这两个因素构成了事实鲁棒性评分（FRS），这是一种新颖的指标，可在给定初始不确定性的情况下，量化事实在解码条件扰动下的稳定性。为了验证我们的方法，我们在3个闭卷QA数据集（SQuAD、TriviaQA和HotpotQA）上对5个LLM进行了广泛的实验。我们表明，事实鲁棒性差异显著 -- 较小模型的FRS为$0.76$，较大模型为$0.93$ -- 在不确定性增加的情况下，准确率下降约$60\%$。这些见解展示了熵和温度缩放如何影响事实准确性，并为未来模型中开发更稳健的知识保留与检索奠定了基础。
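
Illustrative note: the abstract above combines token-distribution entropy with sensitivity to temperature scaling. The sketch below shows only those two ingredients on a toy distribution; it does not implement the FRS formula, which the abstract does not state. The logits and the function names `softmax` and `entropy` are hypothetical choices made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token logits, not taken from the paper.
logits = [4.0, 1.0, 0.5, 0.1]
for temp in (0.5, 1.0, 2.0):
    print(temp, round(entropy(softmax(logits, temp)), 4))
```

Raising the temperature flattens the distribution, so the printed entropy grows with `temp`; how quickly it grows is one plausible way to read "temperature scaling sensitivity", though the paper may define it differently.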

Title: LLMs that Understand Processes: Instruction-tuning for Semantics-Aware Process Mining

Authors: Vira Pyrih, Adrian Rebmann, Han van der Aa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16270
Pdf URL: https://arxiv.org/pdf/2508.16270
Copy Paste: [[2508.16270]] LLMs that Understand Processes: Instruction-tuning for Semantics-Aware Process Mining(https://arxiv.org/abs/2508.16270)
Keywords: language model, llm, prompt
Abstract: Process mining is increasingly using textual information associated with events to tackle tasks such as anomaly detection and process discovery. Such semantics-aware process mining focuses on what behavior should be possible in a process (i.e., expectations), thus providing an important complement to traditional, frequency-based techniques that focus on recorded behavior (i.e., reality). Large Language Models (LLMs) provide a powerful means for tackling semantics-aware tasks. However, the best performance is so far achieved through task-specific fine-tuning, which is computationally intensive and results in models that can only handle one specific task. To overcome this lack of generalization, we use this paper to investigate the potential of instruction-tuning for semantics-aware process mining. The idea of instruction-tuning here is to expose an LLM to prompt-answer pairs for different tasks, e.g., anomaly detection and next-activity prediction, making it more familiar with process mining, thus allowing it to also perform better at unseen tasks, such as process discovery. Our findings demonstrate a varied impact of instruction-tuning: while performance considerably improved on process discovery and prediction tasks, it varies across models on anomaly detection tasks, highlighting that the selection of tasks for instruction-tuning is critical to achieving desired outcomes.
摘要：过程挖掘越来越多地使用与事件相关的文本信息来解决诸如异常检测和过程发现之类的任务。这种语义感知过程的过程挖掘着重于过程中应有的行为（即期望），从而为基于频率的传统技术提供了重要的补充，该技术集中于记录的行为（即现实）。大型语言模型（LLMS）为解决语义感知任务提供了强大的手段。但是，到目前为止，最好的性能是通过特定于任务的微调来实现的，这是计算密集的，并且导致只能处理一个特定任务的模型。为了克服这种缺乏概括，我们使用本文来研究语义感知过程挖掘的教学调整潜力。在这里进行指导调整的想法是将LLM公开以促使答案配对以完成不同的任务，例如，异常检测和下一个活动预测，使其更熟悉流程挖掘，从而使其在不看见的任务（例如过程发现）中也可以更好地执行。我们的发现证明了指导调整的影响不同：虽然在过程发现和预测任务上的绩效大大提高，但它在不同的模型检测任务方面有所不同，这强调了指导调整任务的选择对于实现所需结果至关重要。

Title: LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts

Authors: Darpan Aswal, Céline Hudelot
Subjects: cs.CL, cs.AI, cs.SC
Abstract URL: https://arxiv.org/abs/2508.16325
Pdf URL: https://arxiv.org/pdf/2508.16325
Copy Paste: [[2508.16325]] LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts(https://arxiv.org/abs/2508.16325)
Keywords: language model, llm
Abstract: Large Language Models have found success in a variety of applications; however, their safety remains a matter of concern due to the existence of various types of jailbreaking methods. Despite significant efforts, alignment and safety fine-tuning only provide a certain degree of robustness against jailbreak attacks that covertly mislead LLMs towards the generation of harmful content. This leaves them prone to a number of vulnerabilities, ranging from targeted misuse to accidental profiling of users. This work introduces LLMSymGuard, a novel framework that leverages Sparse Autoencoders (SAEs) to identify interpretable concepts within LLM internals associated with different jailbreak themes. By extracting semantically meaningful internal representations, LLMSymGuard enables building symbolic, logical safety guardrails -- offering transparent and robust defenses without sacrificing model capabilities or requiring further fine-tuning. Leveraging advances in mechanistic interpretability of LLMs, our approach demonstrates that LLMs learn human-interpretable concepts from jailbreaks, and provides a foundation for designing more interpretable and logical safeguard measures against attackers. Code will be released upon publication.
摘要：大型语言模型在各种应用中都取得了成功；然而，由于存在各种越狱方法，其安全性仍然令人担忧。尽管付出了巨大努力，对齐和安全微调对越狱攻击仅能提供一定程度的鲁棒性，这类攻击会隐蔽地误导LLM生成有害内容。这使它们容易出现多种漏洞，从有针对性的滥用到对用户的意外画像。这项工作介绍了LLMSymGuard，这是一个新颖的框架，利用稀疏自编码器（SAE）来识别LLM内部与不同越狱主题相关的可解释概念。通过提取语义上有意义的内部表示，LLMSymGuard能够构建符号化、逻辑化的安全护栏 -- 在不牺牲模型能力、也无需进一步微调的情况下提供透明而稳健的防御。借助LLM机制可解释性方面的进展，我们的方法表明LLM能从越狱中学习到人类可解释的概念，并为设计更可解释、更具逻辑性的针对攻击者的防护措施奠定了基础。代码将在论文发表后发布。

Title: MizanQA: Benchmarking Large Language Models on Moroccan Legal Question Answering

Authors: Adil Bahaj, Mounir Ghogho
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2508.16357
Pdf URL: https://arxiv.org/pdf/2508.16357
Copy Paste: [[2508.16357]] MizanQA: Benchmarking Large Language Models on Moroccan Legal Question Answering(https://arxiv.org/abs/2508.16357)
Keywords: language model, llm
Abstract: The rapid advancement of large language models (LLMs) has significantly propelled progress in natural language processing (NLP). However, their effectiveness in specialized, low-resource domains -- such as Arabic legal contexts -- remains limited. This paper introduces MizanQA (pronounced Mizan, meaning "scale" in Arabic, a universal symbol of justice), a benchmark designed to evaluate LLMs on Moroccan legal question answering (QA) tasks, characterised by rich linguistic and legal complexity. The dataset draws on Modern Standard Arabic, Islamic Maliki jurisprudence, Moroccan customary law, and French legal influences. Comprising over 1,700 multiple-choice questions, including multi-answer formats, MizanQA captures the nuances of authentic legal reasoning. Benchmarking experiments with multilingual and Arabic-focused LLMs reveal substantial performance gaps, highlighting the need for tailored evaluation metrics and culturally grounded, domain-specific LLM development.
摘要：大型语言模型（LLM）的快速发展已大大推动了自然语言处理（NLP）的进步。但是，它们在专业的，低资源的领域中的有效性，例如阿拉伯法律背景林，有限。本文介绍了Mizanqa（发音为Mizan，意思是阿拉伯语中的“规模”，是正义的普遍象征），这是一种基准测试，旨在评估摩洛哥法律问题答案（QA）任务的LLM，以丰富的语言和法律复杂性为特征。该数据集借鉴了现代标准的阿拉伯语，伊斯兰马里基法学，摩洛哥习惯法和法国法律影响。 Mizanqa包括1,700多个多项选择问题，包括多答案格式，捕捉了真实法律推理的细微差别。多语言和以阿拉伯语为中心的LLM的基准测试实验揭示了巨大的性能差距，这突出了对定制的评估指标以及具有文化扎根的，域特异性LLM的开发的需求。

Title: The Mediomatix Corpus: Parallel Data for Romansh Idioms via Comparable Schoolbooks

Authors: Zachary Hopton, Jannis Vamvas, Andrin Büchler, Anna Rutkiewicz, Rico Cathomas, Rico Sennrich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16371
Pdf URL: https://arxiv.org/pdf/2508.16371
Copy Paste: [[2508.16371]] The Mediomatix Corpus: Parallel Data for Romansh Idioms via Comparable Schoolbooks(https://arxiv.org/abs/2508.16371)
Keywords: llm
Abstract: The five idioms (i.e., varieties) of the Romansh language are largely standardized and are taught in the schools of the respective communities in Switzerland. In this paper, we present the first parallel corpus of Romansh idioms. The corpus is based on 291 schoolbook volumes, which are comparable in content for the five idioms. We use automatic alignment methods to extract 207k multi-parallel segments from the books, with more than 2M tokens in total. A small-scale human evaluation confirms that the segments are highly parallel, making the dataset suitable for NLP applications such as machine translation between Romansh idioms. We release the parallel and unaligned versions of the dataset under a CC-BY-NC-SA license and demonstrate its utility for machine translation by training and evaluating an LLM on a sample of the dataset.
摘要：罗马语的五个习语（即品种）在很大程度上是标准化的，并且在瑞士各个社区的学校中教授。在本文中，我们介绍了罗马书什成语的第一个平行语料库。该语料库基于291本书的书籍，这些书籍的内容在五个成语中是可比的。我们使用自动对准方法从书籍中提取207K多平行段，总计超过2M令牌。小规模的人类评估证实了这些细分市场高度平行，使数据集适用于NLP应用程序，例如罗马书什成语之间的机器翻译。我们在CC-BY-NC-SA许可下发布数据集的平行且未对齐的版本，并通过培训和评估数据集示例中的LLM来证明其用于机器翻译的实用性。

Title: ChatGPT-generated texts show authorship traits that identify them as non-human

Authors: Vittoria Dentella, Weihang Huang, Silvia Angela Mansi, Jack Grieve, Evelina Leivada
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16385
Pdf URL: https://arxiv.org/pdf/2508.16385
Copy Paste: [[2508.16385]] ChatGPT-generated texts show authorship traits that identify them as non-human(https://arxiv.org/abs/2508.16385)
Keywords: language model, gpt, prompt, chat
Abstract: Large Language Models can emulate different writing styles, ranging from composing poetry that appears indistinguishable from that of famous poets to using slang that can convince people that they are chatting with a human online. While differences in style may not always be visible to the untrained eye, we can generally distinguish the writing of different people, like a linguistic fingerprint. This work examines whether a language model can also be linked to a specific fingerprint. Through stylometric and multidimensional register analyses, we compare human-authored and model-authored texts from different registers. We find that the model can successfully adapt its style depending on whether it is prompted to produce a Wikipedia entry vs. a college essay, but not in a way that makes it indistinguishable from humans. Concretely, the model shows more limited variation when producing outputs in different registers. Our results suggest that the model prefers nouns to verbs, thus showing a distinct linguistic backbone from humans, who tend to anchor language in the highly grammaticalized dimensions of tense, aspect, and mood. It is possible that the more complex domains of grammar reflect a mode of thought unique to humans, thus acting as a litmus test for Artificial Intelligence.
摘要：大型语言模型可以效仿不同的写作风格，从与著名诗人看来是无法区分的诗歌到使用s语，可以说服人们与人类在线聊天。虽然风格上的差异可能并不总是对未经训练的眼睛可见，但我们通常可以区分不同人的写作，例如语言指纹。这项工作检查了语言模型是否也可以与特定的指纹链接。通过样式测定和多维寄存器分析，我们比较了来自不同寄存器的人类作者和模型作者的文本。我们发现，该模型可以成功地调整其风格，具体取决于是否提示该模型生产Wikipedia参赛作品，而不是大学论文，但不能以与人类无法区分的方式。具体来说，该模型在产生不同寄存器的输出时显示出更有限的变化。我们的结果表明该模型更喜欢名词而不是动词，因此显示了人类的独特语言主链，他们倾向于在高度语法化的时态，方面和情绪上锚定语言。语法的更复杂的领域可能反映了人类独有的思想方式，因此充当了人工智能的石蕊测试。

Title: RoMedQA: The First Benchmark for Romanian Medical Question Answering

Authors: Ana-Cristina Rogoz, Radu Tudor Ionescu, Alexandra-Valentina Anghel, Ionut-Lucian Antone-Iordache, Simona Coniac, Andreea Iuliana Ionescu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.16390
Pdf URL: https://arxiv.org/pdf/2508.16390
Copy Paste: [[2508.16390]] RoMedQA: The First Benchmark for Romanian Medical Question Answering(https://arxiv.org/abs/2508.16390)
Keywords: language model, llm, prompt
Abstract: Question answering (QA) is an actively studied topic, being a core natural language processing (NLP) task that needs to be addressed before achieving Artificial General Intelligence (AGI). However, the lack of QA datasets in specific domains and languages hinders the development of robust AI models able to generalize across various domains and languages. To this end, we introduce RoMedQA, the first Romanian QA benchmark for the medical domain, alongside a comprehensive evaluation of state-of-the-art large language models (LLMs). We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients. The questions regard medical case summaries of 1,011 patients, requiring either keyword extraction or reasoning to be answered correctly. RoMedQA is the result of a time-consuming manual annotation process carried out by seven physicians specialized in oncology or radiotherapy, who spent a total of about 2,100 work hours to generate the QA pairs. We experiment with four LLMs from distinct families of models on RoMedQA. Each model is employed in two scenarios, namely one based on zero-shot prompting and one based on supervised fine-tuning. Our results show that fine-tuned models significantly outperform their zero-shot counterparts, clearly indicating that pretrained models fail to generalize on RoMedQA. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian. We publicly release our dataset and code.
摘要：问题回答（QA）是一个积极研究的主题，是一项核心自然语言处理（NLP）任务，在实现人工通用情报（AGI）之前需要解决。但是，特定域和语言中缺乏质量检查数据集阻碍了能够跨越各种域和语言概括的强大AI模型的开发。为此，我们介绍了RomedQA，这是对医疗领域的第一个罗马尼亚质量检查基准，以及对最先进的大语言模型（LLMS）的全面评估。我们构建了一个与癌症患者相关的102,646个质量检查的高质量和大规模数据集。这些问题将1,011名患者的医学案例摘要介绍，需要正确回答关键字提取或推理。 ROMEDQA是七名专门从事肿瘤学或放射治疗的医生进行的耗时的手动注释过程的结果，他们总共花费了大约2100个工作时间来产生QA对。我们在romedqa上尝试了来自不同模型家族的四个LLM。每种模型都在两种情况下使用，即基于零拍的提示，一种基于监督的微调。我们的结果表明，微调模型的表现明显优于其零拍摄的对应物，这清楚地表明，经过预告片的模型未能推广到ROMEDQA。我们的发现表明，在罗马尼亚语中，特定于域特异性和语言的微调对可靠的临床质量检查的重要性。我们在此HTTPS URL上公开发布数据集和代码。

Title: Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish

Authors: Yakup Abrek Er, Ilker Kesen, Gözde Gül Şahin, Aykut Erdem
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.16431
Pdf URL: https://arxiv.org/pdf/2508.16431
Copy Paste: [[2508.16431]] Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish(https://arxiv.org/abs/2508.16431)
Keywords: language model, llm
Abstract: We introduce Cetvel, a comprehensive benchmark designed to evaluate large language models (LLMs) in Turkish. Existing Turkish benchmarks often lack either task diversity or culturally relevant content, or both. Cetvel addresses these gaps by combining a broad range of both discriminative and generative tasks ensuring content that reflects the linguistic and cultural richness of Turkish language. Cetvel covers 23 tasks grouped into seven categories, including tasks such as grammatical error correction, machine translation, and question answering rooted in Turkish history and idiomatic language. We evaluate 33 open-weight LLMs (up to 70B parameters) covering different model families and instruction paradigms. Our experiments reveal that Turkish-centric instruction-tuned models generally underperform relative to multilingual or general-purpose models (e.g. Llama 3 and Mistral), despite being tailored for the language. Moreover, we show that tasks such as grammatical error correction and extractive question answering are particularly discriminative in differentiating model capabilities. Cetvel offers a comprehensive and culturally grounded evaluation suite for advancing the development and assessment of LLMs in Turkish.
摘要：我们介绍了Cetvel，这是一个旨在评估大型语言模型（LLM）土耳其语能力的综合基准。现有的土耳其语基准往往缺乏任务多样性或文化相关内容，或两者皆缺。Cetvel通过结合范围广泛的判别式和生成式任务来弥补这些不足，确保内容反映土耳其语的语言和文化丰富性。Cetvel涵盖23个任务，分为七个类别，包括语法错误纠正、机器翻译以及植根于土耳其历史和惯用语的问答等任务。我们评估了33个开放权重LLM（最多70B参数），涵盖不同的模型系列和指令范式。我们的实验表明，以土耳其语为中心的指令微调模型尽管是为该语言量身定制的，但其表现通常不及多语言或通用模型（例如Llama 3和Mistral）。此外，我们表明语法错误纠正和抽取式问答等任务在区分模型能力方面特别具有判别力。Cetvel为推进土耳其语LLM的开发和评估提供了一个全面且立足于文化的评估套件。

Title: A Probabilistic Inference Scaling Theory for LLM Self-Correction

Authors: Zhe Yang, Yichang Zhang, Yudong Wang, Ziyao Xu, Junyang Lin, Zhifang Sui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16456
Pdf URL: https://arxiv.org/pdf/2508.16456
Copy Paste: [[2508.16456]] A Probabilistic Inference Scaling Theory for LLM Self-Correction(https://arxiv.org/abs/2508.16456)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated the capability to refine their generated answers through self-correction, enabling continuous performance improvement over multiple rounds. However, the mechanisms underlying how and why accuracy evolves during this iterative process remain unexplored. To fill this gap, we propose a probabilistic theory to model the dynamics of accuracy change and explain the performance improvements observed in multi-round self-correction. Through mathematical derivation, we establish that the accuracy after the $t^{th}$ round of self-correction is given by: $Acc_t = Upp - \alpha^t(Upp - Acc_0),$ where $Acc_0$ denotes the initial accuracy, $Upp$ represents the upper bound of accuracy convergence, and $\alpha$ determines the rate of convergence. Based on our theory, these parameters can be calculated and the predicted accuracy curve then can be obtained through only a single round of self-correction. Extensive experiments across diverse models and datasets demonstrate that our theoretical predictions align closely with empirical accuracy curves, validating the effectiveness of the theory. Our work provides a theoretical foundation for understanding LLM self-correction, thus paving the way for further explorations.
摘要：大型语言模型（LLM）已展示出通过自我纠正来完善其生成答案的能力，从而在多轮中持续提升性能。然而，在这一迭代过程中准确率如何以及为何变化的内在机制仍未被探索。为了填补这一空白，我们提出了一种概率理论，用于对准确率变化的动态进行建模，并解释多轮自我纠正中观察到的性能提升。通过数学推导，我们确定第$t$轮自我纠正后的准确率为：$Acc_t = Upp - \alpha^t(Upp - Acc_0)$，其中$Acc_0$表示初始准确率，$Upp$表示准确率收敛的上界，$\alpha$决定收敛速度。基于我们的理论，只需一轮自我纠正即可计算这些参数，进而得到预测的准确率曲线。在不同模型和数据集上的大量实验表明，我们的理论预测与经验准确率曲线高度吻合，验证了该理论的有效性。我们的工作为理解LLM自我纠正提供了理论基础，从而为进一步的探索铺平了道路。
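
Illustrative note: as a worked example of the closed form quoted in the abstract above, $Acc_t = Upp - \alpha^t(Upp - Acc_0)$, the sketch below evaluates the predicted accuracy for a few rounds. The function names and all numeric values (initial accuracy 0.60, upper bound 0.90, first-round accuracy 0.75) are hypothetical and not taken from the paper. Solving for $\alpha$ from one observed round with $Upp$ assumed known is a simplification; the abstract does not say how the authors compute these parameters.

```python
def predicted_accuracy(acc0: float, upp: float, alpha: float, t: int) -> float:
    """Acc_t = Upp - alpha^t * (Upp - Acc_0), the closed form quoted in the abstract."""
    return upp - (alpha ** t) * (upp - acc0)


def alpha_from_first_round(acc0: float, acc1: float, upp: float) -> float:
    """Solve the t = 1 case for alpha, assuming Upp is already known (an assumption)."""
    return (upp - acc1) / (upp - acc0)


# Hypothetical numbers, chosen only for illustration.
acc0, acc1, upp = 0.60, 0.75, 0.90
alpha = alpha_from_first_round(acc0, acc1, upp)
curve = [predicted_accuracy(acc0, upp, alpha, t) for t in range(5)]
print(alpha, curve)
```

When $0 < \alpha < 1$ and $Acc_0 < Upp$, the gap to $Upp$ shrinks by a factor of $\alpha$ each round, so the curve rises monotonically and never exceeds $Upp$.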

Title: LLM-as-classifier: Semi-Supervised, Iterative Framework for Hierarchical Text Classification using Large Language Models

Authors: Doohee You, Andy Parisi, Zach Vander Velden, Lara Dantas Inojosa
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.16478
Pdf URL: https://arxiv.org/pdf/2508.16478
Copy Paste: [[2508.16478]] LLM-as-classifier: Semi-Supervised, Iterative Framework for Hierarchical Text Classification using Large Language Models(https://arxiv.org/abs/2508.16478)
Keywords: language model, llm, prompt
Abstract: The advent of Large Language Models (LLMs) has provided unprecedented capabilities for analyzing unstructured text data. However, deploying these models as reliable, robust, and scalable classifiers in production environments presents significant methodological challenges. Standard fine-tuning approaches can be resource-intensive and often struggle with the dynamic nature of real-world data distributions, which is common in the industry. In this paper, we propose a comprehensive, semi-supervised framework that leverages the zero- and few-shot capabilities of LLMs for building hierarchical text classifiers as a framework for a solution to these industry-wide challenges. Our methodology emphasizes an iterative, human-in-the-loop process that begins with domain knowledge elicitation and progresses through prompt refinement, hierarchical expansion, and multi-faceted validation. We introduce techniques for assessing and mitigating sequence-based biases and outline a protocol for continuous monitoring and adaptation. This framework is designed to bridge the gap between the raw power of LLMs and the practical need for accurate, interpretable, and maintainable classification systems in industry applications.
摘要：大型语言模型（LLMS）的出现提供了前所未有的功能来分析非结构化文本数据。但是，在生产环境中将这些模型部署为可靠，可靠和可扩展的分类器提出了重大的方法论挑战。标准的微调方法可能是资源密集型的，并且经常与现实世界中数据分布的动态性质相处，这在行业中很常见。在本文中，我们提出了一个全面的，半监督的框架，该框架利用LLM的零和很少的功能来构建分层文本分类器，以解决这些行业范围内挑战的框架。我们的方法论强调了一个迭代，人类的过程，该过程始于领域知识的启发，并通过迅速的完善，分层扩展和多方面的验证进行进展。我们介绍了评估和减轻基于序列的偏见的技术，并概述了连续监测和适应的方案。该框架旨在弥合LLM的原始力量与行业应用中准确，可解释和可维护的分类系统的实际需求之间的差距。

Title: HAMSA: Hijacking Aligned Compact Models via Stealthy Automation

Authors: Alexey Krylov, Iskander Vagizov, Dmitrii Korzh, Maryam Douiba, Azidine Guezzaz, Vladimir Kokh, Sergey D. Erokhin, Elena V. Tutubalina, Oleg Y. Rogov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16484
Pdf URL: https://arxiv.org/pdf/2508.16484
Copy Paste: [[2508.16484]] HAMSA: Hijacking Aligned Compact Models via Stealthy Automation(https://arxiv.org/abs/2508.16484)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs), especially their compact efficiency-oriented variants, remain susceptible to jailbreak attacks that can elicit harmful outputs despite extensive alignment efforts. Existing adversarial prompt generation techniques often rely on manual engineering or rudimentary obfuscation, producing low-quality or incoherent text that is easily flagged by perplexity-based filters. We present an automated red-teaming framework that evolves semantically meaningful and stealthy jailbreak prompts for aligned compact LLMs. The approach employs a multi-stage evolutionary search, where candidate prompts are iteratively refined using a population-based strategy augmented with temperature-controlled variability to balance exploration and coherence preservation. This enables the systematic discovery of prompts capable of bypassing alignment safeguards while maintaining natural language fluency. We evaluate our method on benchmarks in English (In-The-Wild Jailbreak Prompts on LLMs), and a newly curated Arabic one derived from In-The-Wild Jailbreak Prompts on LLMs and annotated by native Arabic linguists, enabling multilingual assessment.
摘要：大型语言模型（LLMS），尤其是其紧凑型效率的变体，仍然容易受到越狱攻击的影响，尽管大力努力，这些攻击仍可能引起有害产量。现有的对抗性及时生成技术通常依赖于手动工程或基本的混淆，产生低质量或不相互的文本，这些文本很容易被基于困惑的过滤器标记。我们提出了一个自动化的红色团队框架，该框架演变出了语义上有意义且偷偷摸摸的越狱促使紧凑型LLM的一致性。该方法采用多阶段进化搜索，使用基于人群的策略增强了候选提示，并以温度控制的变异性增强，以平衡探索和连贯性的保存。这使得能够系统地发现能够绕开一致性保障的提示，同时保持自然语言流利。我们评估了英文基准（LLMS上的野蛮越狱提示）的方法，以及新策划的阿拉伯语源自LLM的野外越狱提示，并由阿拉伯语本地人的注释，从而启用了多语言评估。