2025-05-09

Title: How Social is It? A Benchmark for LLMs' Capabilities in Multi-user Multi-turn Social Agent Tasks

Authors: Yusen Wu, Junwu Xiong, Xiaotie Deng
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2505.04628
Pdf URL: https://arxiv.org/pdf/2505.04628
Copy Paste: [[2505.04628]] How Social is It? A Benchmark for LLMs' Capabilities in Multi-user Multi-turn Social Agent Tasks(https://arxiv.org/abs/2505.04628)
Keywords: language model, llm, agent
Abstract: Expanding the application of large language models (LLMs) to societal life, rather than having them function primarily as auxiliary assistants that communicate with only one person at a time, requires LLMs to independently play roles in multi-user, multi-turn social agent tasks within complex social settings. However, this capability has not yet been systematically measured by available benchmarks. To address this gap, we first introduce an agent task leveling framework grounded in sociological principles. Concurrently, we propose a novel benchmark, How Social Is It (HSII below), designed to assess LLMs' social capabilities in comprehensive social agent tasks, and we benchmark representative models. HSII comprises four stages: format parsing, target selection, target switching conversation, and stable conversation, which collectively evaluate the communication and task completion capabilities of LLMs on a dataset of realistic social interaction scenarios, HSII-Dataset. The dataset is derived step by step from a news dataset. We perform an ablation study by clustering the dataset. Additionally, we investigate the impact of the chain-of-thought (COT) method on enhancing LLMs' social performance. Since COT costs more computation, we further introduce a new statistical metric, COT-complexity, to quantify the efficiency of certain LLMs with COT on specific social tasks and to strike a better trade-off between measuring correctness and efficiency. The results of our experiments demonstrate that our benchmark is well-suited for evaluating social skills in LLMs.

Title: Adaptive Token Boundaries: Integrating Human Chunking Mechanisms into Multimodal LLMs

Authors: Dongxing Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04637
Pdf URL: https://arxiv.org/pdf/2505.04637
Copy Paste: [[2505.04637]] Adaptive Token Boundaries: Integrating Human Chunking Mechanisms into Multimodal LLMs(https://arxiv.org/abs/2505.04637)
Keywords: language model, llm
Abstract: Recent advancements in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in processing diverse data types, yet significant disparities persist between human cognitive processes and computational approaches to multimodal information integration. This research presents a systematic investigation into the parallels between human cross-modal chunking mechanisms and token representation methodologies in MLLMs. Through empirical studies comparing human performance patterns with model behaviors across visual-linguistic tasks, we demonstrate that conventional static tokenization schemes fundamentally constrain current models' capacity to simulate the dynamic, context-sensitive nature of human information processing. We propose a novel framework for dynamic cross-modal tokenization that incorporates adaptive boundaries, hierarchical representations, and alignment mechanisms grounded in cognitive science principles. Quantitative evaluations demonstrate that our approach yields statistically significant improvements over state-of-the-art models on benchmark tasks (+7.8% on Visual Question Answering, +5.3% on Complex Scene Description) while exhibiting more human-aligned error patterns and attention distributions. These findings contribute to the theoretical understanding of the relationship between human cognition and artificial intelligence, while providing empirical evidence for developing more cognitively plausible AI systems.

Title: A Comparative Benchmark of a Moroccan Darija Toxicity Detection Model (Typica.ai) and Major LLM-Based Moderation APIs (OpenAI, Mistral, Anthropic)

Authors: Hicham Assoudi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.04640
Pdf URL: https://arxiv.org/pdf/2505.04640
Copy Paste: [[2505.04640]] A Comparative Benchmark of a Moroccan Darija Toxicity Detection Model (Typica.ai) and Major LLM-Based Moderation APIs (OpenAI, Mistral, Anthropic)(https://arxiv.org/abs/2505.04640)
Keywords: llm
Abstract: This paper presents a comparative benchmark evaluating the performance of Typica.ai's custom Moroccan Darija toxicity detection model against major LLM-based moderation APIs: OpenAI (omni-moderation-latest), Mistral (mistral-moderation-latest), and Anthropic Claude (claude-3-haiku-20240307). We focus on culturally grounded toxic content, including implicit insults, sarcasm, and culturally specific aggression often overlooked by general-purpose systems. Using a balanced test set derived from the OMCD_Typica.ai_Mix dataset, we report precision, recall, F1-score, and accuracy, offering insights into challenges and opportunities for moderation in underrepresented languages. Our results highlight Typica.ai's superior performance, underlining the importance of culturally adapted models for reliable content moderation.
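
Note: the abstract above reports precision, recall, F1-score, and accuracy on a balanced test set. As a minimal sketch of how those four numbers relate, assuming binary labels where 1 means "toxic", the helper below computes them from two label lists. The function name and the toy labels are illustrative only and do not come from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and accuracy for binary labels (1 = toxic, 0 = not toxic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # toxic, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # clean, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # toxic, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # clean, passed

    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy labels, invented for illustration (not data from the benchmark).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On a balanced set such as the one described, accuracy is a reasonable summary; precision and recall separate the cost of false alarms from the cost of missed toxic content.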

Title: ChatGPT for automated grading of short answer questions in mechanical ventilation

Authors: Tejas Jade, Alex Yartsev
Subjects: cs.CL, cs.LG, stat.CO
Abstract URL: https://arxiv.org/abs/2505.04645
Pdf URL: https://arxiv.org/pdf/2505.04645
Copy Paste: [[2505.04645]] ChatGPT for automated grading of short answer questions in mechanical ventilation(https://arxiv.org/abs/2505.04645)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Standardised tests using short answer questions (SAQs) are common in postgraduate education. Large language models (LLMs) simulate conversational language and interpret unstructured free-text responses in ways that align with the application of SAQ grading rubrics, making them attractive for automated grading. We evaluated ChatGPT 4o for grading SAQs in a postgraduate medical setting using data from 215 students (557 short-answer responses) enrolled in an online course on mechanical ventilation (2020-2024). Deidentified responses to three case-based scenarios were presented to ChatGPT with a standardised grading prompt and rubric. Outputs were analysed using mixed-effects modelling, variance component analysis, intraclass correlation coefficients (ICCs), Cohen's kappa, Kendall's W, and Bland-Altman statistics. ChatGPT awarded systematically lower marks than human graders, with a mean difference (bias) of -1.34 on a 10-point scale. ICC values indicated poor individual-level agreement (ICC1 = 0.086), and Cohen's kappa (-0.0786) suggested no meaningful agreement. Variance component analysis showed minimal variability among the five ChatGPT sessions (G-value = 0.87), indicating internal consistency but divergence from the human grader. The poorest agreement was observed for evaluative and analytic items, whereas checklist and prescriptive rubric items had less disagreement. We caution against the use of LLMs in grading postgraduate coursework. Over 60% of ChatGPT-assigned grades differed from human grades by more than the acceptable boundaries for high-stakes assessments.
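
Note: two of the agreement statistics named in the abstract above, the Bland-Altman bias and Cohen's kappa, are simple enough to sketch with the standard library. The code below is a minimal illustration under stated assumptions: marks are treated as unweighted categories for kappa, the limits of agreement use the usual bias ± 1.96 SD convention, and the marks themselves are invented for the example. It is not the authors' analysis pipeline.

```python
from collections import Counter
from statistics import mean, stdev


def bland_altman(a, b):
    """Return the mean difference (bias) of a - b and the 95% limits of agreement."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two raters assigning categorical marks."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two raters' marginal proportions per category.
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    # Undefined (division by zero) if both raters always give the same single mark.
    return (observed - expected) / (1 - expected)


# Hypothetical marks on a 10-point scale, invented for illustration only.
llm_marks = [5, 6, 4, 7, 5]
human_marks = [7, 7, 6, 8, 6]

bias, limits = bland_altman(llm_marks, human_marks)
kappa = cohen_kappa(llm_marks, human_marks)
print(f"bias={bias:.2f}, limits={limits}, kappa={kappa:.3f}")
```

A negative bias means the first rater marks lower on average, and a kappa at or below zero means agreement no better than chance, which is the pattern the abstract reports for ChatGPT against the human grader.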

Title: FRAME: Feedback-Refined Agent Methodology for Enhancing Medical Research Insights

Authors: Chengzhang Yu, Yiming Zhang, Zhixin Liu, Zenghui Ding, Yining Sun, Zhanpeng Jin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.04649
Pdf URL: https://arxiv.org/pdf/2505.04649
Copy Paste: [[2505.04649]] FRAME: Feedback-Refined Agent Methodology for Enhancing Medical Research Insights(https://arxiv.org/abs/2505.04649)
Keywords: language model, gpt, llm, agent
Abstract: The automation of scientific research through large language models (LLMs) presents significant opportunities but faces critical challenges in knowledge synthesis and quality assurance. We introduce Feedback-Refined Agent Methodology (FRAME), a novel framework that enhances medical paper generation through iterative refinement and structured feedback. Our approach comprises three key innovations: (1) A structured dataset construction method that decomposes 4,287 medical papers into essential research components through iterative refinement; (2) A tripartite architecture integrating Generator, Evaluator, and Reflector agents that progressively improve content quality through metric-driven feedback; and (3) A comprehensive evaluation framework that combines statistical metrics with human-grounded benchmarks. Experimental results demonstrate FRAME's effectiveness, achieving significant improvements over conventional approaches across multiple models (9.91% average gain with DeepSeek V3, comparable improvements with GPT-4o Mini) and evaluation dimensions. Human evaluation confirms that FRAME-generated papers achieve quality comparable to human-authored works, with particular strength in synthesizing future research directions. The results demonstrated our work could efficiently assist medical research by building a robust foundation for automated medical research paper generation while maintaining rigorous academic standards.

Title: Scientific Hypothesis Generation and Validation: Methods, Datasets, and Future Directions

Authors: Adithya Kulkarni, Fatimah Alotaibi, Xinyue Zeng, Longfeng Wu, Tong Zeng, Barry Menglong Yao, Minqian Liu, Shuaicheng Zhang, Lifu Huang, Dawei Zhou
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.04651
Pdf URL: https://arxiv.org/pdf/2505.04651
Copy Paste: [[2505.04651]] Scientific Hypothesis Generation and Validation: Methods, Datasets, and Future Directions(https://arxiv.org/abs/2505.04651)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Large Language Models (LLMs) are transforming scientific hypothesis generation and validation by enabling information synthesis, latent relationship discovery, and reasoning augmentation. This survey provides a structured overview of LLM-driven approaches, including symbolic frameworks, generative models, hybrid systems, and multi-agent architectures. We examine techniques such as retrieval-augmented generation, knowledge-graph completion, simulation, causal inference, and tool-assisted reasoning, highlighting trade-offs in interpretability, novelty, and domain alignment. We contrast early symbolic discovery systems (e.g., BACON, KEKADA) with modern LLM pipelines that leverage in-context learning and domain adaptation via fine-tuning, retrieval, and symbolic grounding. For validation, we review simulation, human-AI collaboration, causal modeling, and uncertainty quantification, emphasizing iterative assessment in open-world contexts. The survey maps datasets across biomedicine, materials science, environmental science, and social science, introducing new resources like AHTech and CSKG-600. Finally, we outline a roadmap emphasizing novelty-aware generation, multimodal-symbolic integration, human-in-the-loop systems, and ethical safeguards, positioning LLMs as agents for principled, scalable scientific discovery.

Title: Advancing Conversational Diagnostic AI with Multimodal Reasoning

Authors: Khaled Saab, Jan Freyberg, Chunjong Park, Tim Strother, Yong Cheng, Wei-Hung Weng, David G.T. Barrett, David Stutz, Nenad Tomasev, Anil Palepu, Valentin Liévin, Yash Sharma, Roma Ruparel, Abdullah Ahmed, Elahe Vedadi, Kimberly Kanada, Cian Hughes, Yun Liu, Geoff Brown, Yang Gao, Sean Li, S. Sara Mahdavi, James Manyika, Katherine Chou, Yossi Matias, Avinatan Hassidim, Dale R. Webster, Pushmeet Kohli, S.M. Ali Eslami, Joëlle Barral, Adam Rodman, Vivek Natarajan, Mike Schaekermann, Tao Tu, Alan Karthikesalingam, Ryutaro Tanno
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.04653
Pdf URL: https://arxiv.org/pdf/2505.04653
Copy Paste: [[2505.04653]] Advancing Conversational Diagnostic AI with Multimodal Reasoning(https://arxiv.org/abs/2505.04653)
Keywords: language model, llm, chat
Abstract: Large Language Models (LLMs) have demonstrated great potential for conducting diagnostic conversations but evaluation has been largely limited to language-only interactions, deviating from the real-world requirements of remote care delivery. Instant messaging platforms permit clinicians and patients to upload and discuss multimodal medical artifacts seamlessly in medical consultation, but the ability of LLMs to reason over such data while preserving other attributes of competent diagnostic conversation remains unknown. Here we advance the conversational diagnosis and management performance of the Articulate Medical Intelligence Explorer (AMIE) through a new capability to gather and interpret multimodal data, and reason about this precisely during consultations. Leveraging Gemini 2.0 Flash, our system implements a state-aware dialogue framework, where conversation flow is dynamically controlled by intermediate model outputs reflecting patient states and evolving diagnoses. Follow-up questions are strategically directed by uncertainty in such patient states, leading to a more structured multimodal history-taking process that emulates experienced clinicians. We compared AMIE to primary care physicians (PCPs) in a randomized, blinded, OSCE-style study of chat-based consultations with patient actors. We constructed 105 evaluation scenarios using artifacts like smartphone skin photos, ECGs, and PDFs of clinical documents across diverse conditions and demographics. Our rubric assessed multimodal capabilities and other clinically meaningful axes like history-taking, diagnostic accuracy, management reasoning, communication, and empathy. Specialist evaluation showed AMIE to be superior to PCPs on 7/9 multimodal and 29/32 non-multimodal axes (including diagnostic accuracy). The results show clear progress in multimodal conversational diagnostic AI, but real-world translation needs further research.

Title: A Comparative Analysis of Ethical and Safety Gaps in LLMs using Relative Danger Coefficient

Authors: Yehor Tereshchenko, Mika Hämäläinen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.04654
Pdf URL: https://arxiv.org/pdf/2505.04654
Copy Paste: [[2505.04654]] A Comparative Analysis of Ethical and Safety Gaps in LLMs using Relative Danger Coefficient(https://arxiv.org/abs/2505.04654)
Keywords: language model, gpt, llm
Abstract: Artificial Intelligence (AI) and Large Language Models (LLMs) have rapidly evolved in recent years, showcasing remarkable capabilities in natural language understanding and generation. However, these advancements also raise critical ethical questions regarding safety, potential misuse, discrimination and overall societal impact. This article provides a comparative analysis of the ethical performance of various AI models, including the brand new DeepSeek-V3 (R1 with reasoning and without), various GPT variants (4o, 3.5 Turbo, 4 Turbo, o1/o3 mini) and Gemini (1.5 flash, 2.0 flash and 2.0 flash exp), and highlights the need for robust human oversight, especially in situations with high stakes. Furthermore, we present a new metric for calculating harm in LLMs called Relative Danger Coefficient (RDC).

Title: Integration of Large Language Models and Traditional Deep Learning for Social Determinants of Health Prediction

Authors: Paul Landes, Jimeng Sun, Adam Cross
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.04655
Pdf URL: https://arxiv.org/pdf/2505.04655
Copy Paste: [[2505.04655]] Integration of Large Language Models and Traditional Deep Learning for Social Determinants of Health Prediction(https://arxiv.org/abs/2505.04655)
Keywords: language model, llm
Abstract: Social Determinants of Health (SDoH) are economic, social and personal circumstances that affect or influence an individual's health status. SDoHs have been shown to be correlated with wellness outcomes and are therefore useful to physicians in diagnosing diseases and in decision-making. In this work, we automatically extract SDoHs from clinical text using traditional deep learning and Large Language Models (LLMs) to find the advantages and disadvantages of each on an existing publicly available dataset. Our models outperform a previous reference point on a multilabel SDoH classification by 10 points, and we present a method and model to drastically speed up classification (12X execution time) by eliminating expensive LLM processing. The method we present is a more nimble and efficient solution that leverages the power of the LLM for precision and traditional deep learning methods for efficiency. We also show highly performant results on a dataset supplemented with synthetic data and several traditional deep learning models that outperform LLMs. Our models and methods offer the next iteration of automatic prediction of SDoHs that impact at-risk patients.

Title: AI-Generated Fall Data: Assessing LLMs and Diffusion Model for Wearable Fall Detection

Authors: Sana Alamgeer, Yasine Souissi, Anne H. H. Ngu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.04660
Pdf URL: https://arxiv.org/pdf/2505.04660
Copy Paste: [[2505.04660]] AI-Generated Fall Data: Assessing LLMs and Diffusion Model for Wearable Fall Detection(https://arxiv.org/abs/2505.04660)
Keywords: language model, gpt, llm
Abstract: Training fall detection systems is challenging due to the scarcity of real-world fall data, particularly from elderly individuals. To address this, we explore the potential of Large Language Models (LLMs) for generating synthetic fall data. This study evaluates text-to-motion (T2M, SATO, ParCo) and text-to-text models (GPT4o, GPT4, Gemini) in simulating realistic fall scenarios. We generate synthetic datasets and integrate them with four real-world baseline datasets to assess their impact on fall detection performance using a Long Short-Term Memory (LSTM) model. Additionally, we compare LLM-generated synthetic data with a diffusion-based method to evaluate their alignment with real accelerometer distributions. Results indicate that dataset characteristics significantly influence the effectiveness of synthetic data, with LLM-generated data performing best in low-frequency settings (e.g., 20Hz) while showing instability in high-frequency datasets (e.g., 200Hz). While text-to-motion models produce more realistic biomechanical data than text-to-text models, their impact on fall detection varies. Diffusion-based synthetic data demonstrates the closest alignment to real data but does not consistently enhance model performance. An ablation study further confirms that the effectiveness of synthetic data depends on sensor placement and fall representation. These findings provide insights into optimizing synthetic data generation for fall detection models.

Title: Personalized Risks and Regulatory Strategies of Large Language Models in Digital Advertising

Authors: Haoyang Feng, Yanjun Dai, Yuan Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04665
Pdf URL: https://arxiv.org/pdf/2505.04665
Copy Paste: [[2505.04665]] Personalized Risks and Regulatory Strategies of Large Language Models in Digital Advertising(https://arxiv.org/abs/2505.04665)
Keywords: language model, llm
Abstract: Although large language models have demonstrated the potential for personalized advertising recommendations in experimental environments, in actual operations, how advertising recommendation systems can be combined with measures such as user privacy protection and data security is still an area worthy of in-depth discussion. To this end, this paper studies the personalized risks and regulatory strategies of large language models in digital advertising. This study first outlines the principles of Large Language Model (LLM), especially the self-attention mechanism based on the Transformer architecture, and how to enable the model to understand and generate natural language text. Then, the BERT (Bidirectional Encoder Representations from Transformers) model and the attention mechanism are combined to construct an algorithmic model for personalized advertising recommendations and user factor risk protection. The specific steps include: data collection and preprocessing, feature selection and construction, using large language models such as BERT for advertising semantic embedding, and ad recommendations based on user portraits. Then, local model training and data encryption are used to ensure the security of user privacy and avoid the leakage of personal data. This paper designs an experiment for personalized advertising recommendation based on a large language model of BERT and verifies it with real user data. The experimental results show that BERT-based advertising push can effectively improve the click-through rate and conversion rate of advertisements. At the same time, through local model training and privacy protection mechanisms, the risk of user privacy leakage can be reduced to a certain extent.

Title: Fine-Tuning Large Language Models and Evaluating Retrieval Methods for Improved Question Answering on Building Codes

Authors: Mohammad Aqib, Mohd Hamza, Qipei Mei, Ying Hei Chui
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.04666
Pdf URL: https://arxiv.org/pdf/2505.04666
Copy Paste: [[2505.04666]] Fine-Tuning Large Language Models and Evaluating Retrieval Methods for Improved Question Answering on Building Codes(https://arxiv.org/abs/2505.04666)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Building codes are regulations that establish standards for the design, construction, and safety of buildings to ensure structural integrity, fire protection, and accessibility. They are often extensive, complex, and subject to frequent updates, making manual querying challenging and time-consuming. Key difficulties include navigating large volumes of text, interpreting technical language, and identifying relevant clauses across different sections. A potential solution is to build a Question-Answering (QA) system that answers user queries based on building codes. Among the various methods for building a QA system, Retrieval-Augmented Generation (RAG) stands out in performance. RAG consists of two components: a retriever and a language model. This study focuses on identifying a suitable retriever method for building codes and optimizing the generational capability of the language model using fine-tuning techniques. We conducted a detailed evaluation of various retrieval methods by performing the retrieval on the National Building Code of Canada (NBCC) and explored the impact of domain-specific fine-tuning on several language models using the dataset derived from NBCC. Our analysis included a comparative assessment of different retrievers and the performance of both pre-trained and fine-tuned models to determine the efficacy and domain-specific adaptation of language models using fine-tuning on the NBCC dataset. Experimental results showed that Elasticsearch proved to be the most robust retriever among all. The findings also indicate that fine-tuning language models on an NBCC-specific dataset can enhance their ability to generate contextually relevant responses. When combined with context retrieved by a powerful retriever like Elasticsearch, this improvement in LLM performance can optimize the RAG system, enabling it to better navigate the complexities of the NBCC.
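
Note: the abstract above describes RAG as two components, a retriever and a language model. The sketch below illustrates only that split, under loud assumptions: the retriever is a toy term-overlap ranker (not Elasticsearch or any method evaluated in the paper), the passages are invented placeholder text (not NBCC content), and the language model call is left out so that only prompt assembly is shown.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, passages, k=2):
    """Rank passages by term overlap with the query; a toy stand-in for a real retriever."""
    query_terms = Counter(tokenize(query))
    scored = []
    for idx, passage in enumerate(passages):
        passage_terms = Counter(tokenize(passage))
        overlap = sum(min(query_terms[t], passage_terms[t]) for t in query_terms)
        # -idx makes ties resolve in favour of the earlier passage after the reverse sort.
        scored.append((overlap, -idx, passage))
    scored.sort(reverse=True)
    return [passage for overlap, _, passage in scored[:k] if overlap > 0]


def build_prompt(query, contexts):
    """Assemble the text that would be handed to the language model."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"


# Placeholder passages, invented for illustration (not taken from any building code).
passages = [
    "Placeholder clause one: stair width requirements for exits.",
    "Placeholder clause two: fire separation ratings for walls.",
    "Placeholder clause three: guard height on balconies.",
]
query = "What is the required stair width?"
top = retrieve(query, passages)
prompt = build_prompt(query, top)
print(prompt)
```

Swapping the retriever while holding the prompt and model fixed is the kind of comparison the abstract describes; the fine-tuning side of the study is not represented here.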

Title: Reward-SQL: Boosting Text-to-SQL via Stepwise Reasoning and Process-Supervised Rewards

Authors: Yuxin Zhang, Meihao Fan, Ju Fan, Mingyang Yi, Yuyu Luo, Jian Tan, Guoliang Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.04671
Pdf URL: https://arxiv.org/pdf/2505.04671
Copy Paste: [[2505.04671]] Reward-SQL: Boosting Text-to-SQL via Stepwise Reasoning and Process-Supervised Rewards(https://arxiv.org/abs/2505.04671)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) have significantly improved performance on the Text-to-SQL task by leveraging their powerful reasoning capabilities. To enhance accuracy during the reasoning process, external Process Reward Models (PRMs) can be introduced during training and inference to provide fine-grained supervision. However, if misused, PRMs may distort the reasoning trajectory and lead to suboptimal or incorrect SQL. To address this challenge, we propose Reward-SQL, a framework that systematically explores how to incorporate PRMs into the Text-to-SQL reasoning process effectively. Our approach follows a "cold start, then PRM supervision" paradigm. Specifically, we first train the model to decompose SQL queries into structured stepwise reasoning chains using common table expressions (Chain-of-CTEs), establishing a strong and interpretable reasoning baseline. Then, we investigate four strategies for integrating PRMs, and find that combining PRM as an online training signal (GRPO) with PRM-guided inference (e.g., best-of-N sampling) yields the best results. Empirically, on the BIRD benchmark, Reward-SQL enables models supervised by a 7B PRM to achieve a 13.1% performance gain across various guidance strategies. Notably, our GRPO-aligned policy model based on Qwen2.5-Coder-7B-Instruct achieves 68.9% accuracy on the BIRD development set, outperforming all baseline methods under the same model size. These results demonstrate the effectiveness of Reward-SQL in leveraging reward-based supervision for Text-to-SQL reasoning. Our code is publicly available.
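
Note: the abstract above mentions PRM-guided inference via best-of-N sampling over stepwise reasoning chains. A minimal sketch of that selection step follows. Everything in it is an assumption made for illustration: the step scorer is a trivial hypothetical heuristic standing in for a trained process reward model, chains are represented as lists of SQL step strings, and aggregating step scores with `min` is one plausible choice rather than the paper's stated rule.

```python
from typing import Callable, List, Sequence

Chain = List[str]  # one candidate: an ordered list of reasoning steps (e.g. CTEs)


def score_chain(steps: Chain,
                step_reward: Callable[[str], float],
                aggregate: Callable = min) -> float:
    """Score a chain from its per-step rewards; min penalises any single bad step."""
    return aggregate(step_reward(step) for step in steps)


def best_of_n(chains: Sequence[Chain],
              step_reward: Callable[[str], float]) -> Chain:
    """Return the highest-scoring candidate (raises ValueError if chains is empty)."""
    return max(chains, key=lambda chain: score_chain(chain, step_reward))


def toy_step_reward(step: str) -> float:
    """Hypothetical stand-in for a PRM: rewards steps that contain a SELECT."""
    return 1.0 if "SELECT" in step.upper() else 0.0


# Two invented candidates for the same question; the second has a malformed CTE.
chain_a = ["WITH t AS (SELECT id FROM users)", "SELECT COUNT(*) FROM t"]
chain_b = ["WITH t AS (users)", "SELECT COUNT(*) FROM t"]

chosen = best_of_n([chain_a, chain_b], toy_step_reward)
print(chosen)
```

In practice the N candidates would be sampled from the policy model and scored by the trained PRM; only the final arg-max selection is shown here.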

Title: REVEAL: Multi-turn Evaluation of Image-Input Harms for Vision LLM

Authors: Madhur Jindal, Saurabh Deshpande
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04673
Pdf URL: https://arxiv.org/pdf/2505.04673
Copy Paste: [[2505.04673]] REVEAL: Multi-turn Evaluation of Image-Input Harms for Vision LLM(https://arxiv.org/abs/2505.04673)
Keywords: language model, gpt, llm
Abstract: Vision Large Language Models (VLLMs) represent a significant advancement in artificial intelligence by integrating image-processing capabilities with textual understanding, thereby enhancing user interactions and expanding application domains. However, their increased complexity introduces novel safety and ethical challenges, particularly in multi-modal and multi-turn conversations. Traditional safety evaluation frameworks, designed for text-based, single-turn interactions, are inadequate for addressing these complexities. To bridge this gap, we introduce the REVEAL (Responsible Evaluation of Vision-Enabled AI LLMs) Framework, a scalable and automated pipeline for evaluating image-input harms in VLLMs. REVEAL includes automated image mining, synthetic adversarial data generation, multi-turn conversational expansion using crescendo attack strategies, and comprehensive harm assessment through evaluators like GPT-4o. We extensively evaluated five state-of-the-art VLLMs, GPT-4o, Llama-3.2, Qwen2-VL, Phi3.5V, and Pixtral, across three important harm categories: sexual harm, violence, and misinformation. Our findings reveal that multi-turn interactions result in significantly higher defect rates compared to single-turn evaluations, highlighting deeper vulnerabilities in VLLMs. Notably, GPT-4o demonstrated the most balanced performance as measured by our Safety-Usability Index (SUI) followed closely by Pixtral. Additionally, misinformation emerged as a critical area requiring enhanced contextual defenses. Llama-3.2 exhibited the highest MT defect rate ($16.55 \%$) while Qwen2-VL showed the highest MT refusal rate ($19.1 \%$).

Title: SOAEsV2-7B/72B: Full-Pipeline Optimization for State-Owned Enterprise LLMs via Continual Pre-Training, Domain-Progressive SFT and Distillation-Enhanced Speculative Decoding

Authors: Jingyang Deng, Ran Chen, Jo-Ku Cheng, Jinwen Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.04723
Pdf URL: https://arxiv.org/pdf/2505.04723
Copy Paste: [[2505.04723]] SOAEsV2-7B/72B: Full-Pipeline Optimization for State-Owned Enterprise LLMs via Continual Pre-Training, Domain-Progressive SFT and Distillation-Enhanced Speculative Decoding(https://arxiv.org/abs/2505.04723)
Keywords: language model, llm, long context
Abstract: This study addresses key challenges in developing domain-specific large language models (LLMs) for Chinese state-owned assets and enterprises (SOAEs), where current approaches face three limitations: 1) constrained model capacity that limits knowledge integration and cross-task adaptability; 2) excessive reliance on domain-specific supervised fine-tuning (SFT) data, which neglects the broader applicability of general language patterns; and 3) inefficient inference acceleration for large models processing long contexts. In this work, we propose SOAEsV2-7B/72B, a specialized LLM series developed via a three-phase framework: 1) continual pre-training integrates domain knowledge while retaining base capabilities; 2) domain-progressive SFT employs a curriculum-based learning strategy, transitioning from weakly relevant conversational data to expert-annotated SOAEs datasets to optimize domain-specific tasks; 3) distillation-enhanced speculative decoding accelerates inference via logit distillation between 72B target and 7B draft models, achieving 1.39-1.52$\times$ speedup without quality loss. Experimental results demonstrate that our domain-specific pre-training phase maintains 99.8% of original general language capabilities while significantly improving domain performance, resulting in a 1.08$\times$ improvement in Rouge-1 score and a 1.17$\times$ enhancement in BLEU-4 score. Ablation studies further show that domain-progressive SFT outperforms single-stage training, achieving 1.02$\times$ improvement in Rouge-1 and 1.06$\times$ in BLEU-4. Our work introduces a comprehensive, full-pipeline approach for optimizing SOAEs LLMs, bridging the gap between general language capabilities and domain-specific expertise.
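The speculative decoding step can be illustrated with a toy greedy variant: a cheap draft model proposes a few tokens, and the target model verifies them and keeps the longest agreeing prefix. This is a sketch under strong assumptions. `target_next` and `draft_next` are made-up deterministic stand-ins for the 72B and 7B models, the block size `k` is arbitrary, and the paper's logit distillation and sampling-based acceptance are not modeled.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(prefix: List[int]) -> int:
    """Stand-in for the large target model: a deterministic toy rule."""
    return (sum(prefix) * 31 + len(prefix)) % 7


def draft_next(prefix: List[int]) -> int:
    """Stand-in for the small draft model.

    It agrees with the target except when the prefix length is a
    multiple of 4, where it is deliberately wrong.
    """
    guess = target_next(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 7


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive greedy decoding, one model call per token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verification rounds)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(out) < goal:
        # 1) The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft_next(out + proposal))
        # 2) The target model checks every proposed position. In a real
        #    system this is one batched forward pass, counted here as one round.
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # fix the first mismatch, drop the rest
                break
        out.extend(accepted)
    return out, rounds


tokens, rounds = speculative_decode([1, 2, 3], n_new=12)
```

The output matches target-only greedy decoding by construction, so any speedup comes purely from needing fewer target rounds than generated tokens. A better-aligned draft model (the stated purpose of the distillation step) would raise the acceptance rate and reduce the rounds further.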

Title: Osiris: A Lightweight Open-Source Hallucination Detection System

Authors: Alex Shan, John Bauer, Christopher D. Manning
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.04844
Pdf URL: https://arxiv.org/pdf/2505.04844
Copy Paste: [[2505.04844]] Osiris: A Lightweight Open-Source Hallucination Detection System(https://arxiv.org/abs/2505.04844)
Keywords: language model, gpt, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems have gained widespread adoption by application builders because they leverage sources of truth to enable Large Language Models (LLMs) to generate more factually sound responses. However, hallucinations, instances of LLM responses that are unfaithful to the provided context, often prevent these systems from being deployed in production environments. Current hallucination detection methods typically involve human evaluation or the use of closed-source models to review RAG system outputs for hallucinations. Both human evaluators and closed-source models suffer from scaling issues due to their high costs and slow inference speeds. In this work, we introduce a perturbed multi-hop QA dataset with induced hallucinations. Via supervised fine-tuning on our dataset, we achieve better recall with a 7B model than GPT-4o on the RAGTruth hallucination detection benchmark and offer competitive performance on precision and accuracy, all while using a fraction of the parameters. Code is released at our repository.

Title: Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards

Authors: Manveer Singh Tamber, Forrest Sheng Bao, Chenyu Xu, Ge Luo, Suleman Kazi, Minseok Bae, Miaoran Li, Ofer Mendelevitch, Renyi Qu, Jimmy Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04847
Pdf URL: https://arxiv.org/pdf/2505.04847
Copy Paste: [[2505.04847]] Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards(https://arxiv.org/abs/2505.04847)
Keywords: llm, hallucination
Abstract: Hallucinations remain a persistent challenge for LLMs. RAG aims to reduce hallucinations by grounding responses in contexts. However, even when provided context, LLMs still frequently introduce unsupported information or contradictions. This paper presents our efforts to measure LLM hallucinations with a focus on summarization tasks, assessing how often various LLMs introduce hallucinations when summarizing documents. We discuss Vectara's existing LLM hallucination leaderboard, based on the Hughes Hallucination Evaluation Model (HHEM). While HHEM and Vectara's Hallucination Leaderboard have garnered great research interest, we examine challenges faced by HHEM and current hallucination detection methods by analyzing the effectiveness of these methods on existing hallucination datasets. To address these limitations, we propose FaithJudge, an LLM-as-a-judge approach guided by few-shot human hallucination annotations, which substantially improves automated LLM hallucination evaluation over current methods. We introduce an enhanced hallucination leaderboard centered on FaithJudge, alongside our current hallucination leaderboard, enabling more reliable benchmarking of LLMs for hallucinations in RAG.

Title: An Open-Source Dual-Loss Embedding Model for Semantic Retrieval in Higher Education

Authors: Ramteja Sajja, Yusuf Sermet, Ibrahim Demir
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2505.04916
Pdf URL: https://arxiv.org/pdf/2505.04916
Copy Paste: [[2505.04916]] An Open-Source Dual-Loss Embedding Model for Semantic Retrieval in Higher Education(https://arxiv.org/abs/2505.04916)
Keywords: language model, llm, chat, retrieval-augmented generation
Abstract: Recent advances in AI have catalyzed the adoption of intelligent educational tools, yet many semantic retrieval systems remain ill-suited to the unique linguistic and structural characteristics of academic content. This study presents two open-source embedding models fine-tuned for educational question answering, particularly in the context of course syllabi. A synthetic dataset of 3,197 sentence pairs, spanning synonymous terminology, paraphrased questions, and implicit-explicit mappings, was constructed through a combination of manual curation and large language model (LLM)-assisted generation. Two training strategies were evaluated: (1) a baseline model fine-tuned using MultipleNegativesRankingLoss (MNRL), and (2) a dual-loss model that combines MNRL with CosineSimilarityLoss to improve both semantic ranking and similarity calibration. Evaluations were conducted on 28 university course syllabi using a fixed set of natural language questions categorized into course, faculty, and teaching assistant information. Results demonstrate that both fine-tuned models outperform strong open-source baselines, including all-MiniLM-L6-v2 and multi-qa-MiniLM-L6-cos-v1, and that the dual-loss model narrows the performance gap with high-performing proprietary embeddings such as OpenAI's text-embedding-3 series. This work contributes reusable, domain-aligned embedding models and provides a replicable framework for educational semantic retrieval, supporting downstream applications such as academic chatbots, retrieval-augmented generation (RAG) systems, and learning management system (LMS) integrations.
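The dual-loss idea can be sketched in plain Python as an in-batch ranking loss plus a similarity-regression loss. This is an illustration of the two loss families named in the abstract, not the authors' training code: the `scale` value, the equal `weight`, and summing the two terms are assumptions, since the abstract does not say how the losses are combined.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mnrl_loss(anchors: List[Vector], positives: List[Vector], scale: float = 20.0) -> float:
    """Multiple-negatives ranking loss.

    For anchor i, positives[i] is the correct match and every other
    positive in the batch acts as a negative. The loss is the mean
    cross-entropy over scaled cosine similarities.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


def cosine_similarity_loss(anchors: List[Vector], positives: List[Vector],
                           labels: List[float]) -> float:
    """Mean squared error between cosine similarity and a gold similarity label."""
    errors = [(cosine(a, p) - y) ** 2 for a, p, y in zip(anchors, positives, labels)]
    return sum(errors) / len(errors)


def dual_loss(anchors: List[Vector], positives: List[Vector],
              labels: List[float], weight: float = 1.0) -> float:
    """Assumed combination: ranking term plus a weighted calibration term."""
    return mnrl_loss(anchors, positives) + weight * cosine_similarity_loss(
        anchors, positives, labels)


# Toy 2-d "embeddings": pairs that line up should incur a lower loss.
anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_matched = dual_loss(anchors, matched, labels=[1.0, 1.0])
loss_swapped = dual_loss(anchors, swapped, labels=[1.0, 1.0])
```

The ranking term only cares that the right pair scores highest within the batch, while the regression term pulls the absolute similarity toward its label, which is the "similarity calibration" the abstract refers to.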

Title: Chain-of-Thought Tokens are Computer Program Variables

Authors: Fangwei Zhu, Peiyi Wang, Zhifang Sui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04955
Pdf URL: https://arxiv.org/pdf/2505.04955
Copy Paste: [[2505.04955]] Chain-of-Thought Tokens are Computer Program Variables(https://arxiv.org/abs/2505.04955)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-thought (CoT) prompting requires large language models (LLMs) to generate intermediate steps before reaching the final answer, and has been proven effective in helping LLMs solve complex reasoning tasks. However, the inner mechanism of CoT still remains largely unclear. In this paper, we empirically study the role of CoT tokens in LLMs on two compositional tasks: multi-digit multiplication and dynamic programming. While CoT is essential for solving these problems, we find that preserving only tokens that store intermediate results would achieve comparable performance. Furthermore, we observe that storing intermediate results in an alternative latent form will not affect model performance. We also randomly intervene on some values in CoT, and notice that subsequent CoT tokens and the final answer would change correspondingly. These findings suggest that CoT tokens may function like variables in computer programs but with potential drawbacks like unintended shortcuts and computational complexity limits between tokens. The code and data are available at this https URL.

Title: Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes

Authors: Zhuocheng Gong, Jian Guan, Wei Wu, Huishuai Zhang, Dongyan Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.04993
Pdf URL: https://arxiv.org/pdf/2505.04993
Copy Paste: [[2505.04993]] Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes(https://arxiv.org/abs/2505.04993)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved remarkable success, yet aligning their generations with human preferences remains a critical challenge. Existing approaches to preference modeling often rely on an explicit or implicit reward function, overlooking the intricate and multifaceted nature of human preferences that may encompass conflicting factors across diverse tasks and populations. To address this limitation, we introduce Latent Preference Coding (LPC), a novel framework that models the implicit factors as well as their combinations behind holistic preferences using discrete latent codes. LPC seamlessly integrates with various offline alignment algorithms, automatically inferring the underlying factors and their importance from data without relying on pre-defined reward functions and hand-crafted combination weights. Extensive experiments on multiple benchmarks demonstrate that LPC consistently improves upon three alignment algorithms (DPO, SimPO, and IPO) using three base models (Mistral-7B, Llama3-8B, and Llama3-8B-Instruct). Furthermore, deeper analysis reveals that the learned latent codes effectively capture the differences in the distribution of human preferences and significantly enhance the robustness of alignment against noise in data. By providing a unified representation for the multifarious preference factors, LPC paves the way towards developing more robust and versatile alignment techniques for the responsible deployment of powerful LLMs.

Title: Rethinking Invariance in In-context Learning

Authors: Lizhe Fang, Yifei Wang, Khashayar Gatmiry, Lei Fang, Yisen Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04994
Pdf URL: https://arxiv.org/pdf/2505.04994
Copy Paste: [[2505.04994]] Rethinking Invariance in In-context Learning(https://arxiv.org/abs/2505.04994)
Keywords: language model
Abstract: In-Context Learning (ICL) has emerged as a pivotal capability of auto-regressive large language models, yet it is hindered by a notable sensitivity to the ordering of context examples regardless of their mutual independence. To address this issue, recent studies have introduced several variant algorithms of ICL that achieve permutation invariance. However, many of these do not exhibit comparable performance with the standard auto-regressive ICL algorithm. In this work, we identify two crucial elements in the design of an invariant ICL algorithm: information non-leakage and context interdependence, which are not simultaneously achieved by any of the existing methods. These investigations lead us to the proposed Invariant ICL (InvICL), a methodology designed to achieve invariance in ICL while ensuring the two properties. Empirically, our findings reveal that InvICL surpasses previous models, both invariant and non-invariant, in most benchmark datasets, showcasing superior generalization capabilities across varying input lengths. Code is available at this https URL.

Title: The Pitfalls of Growing Group Complexity: LLMs and Social Choice-Based Aggregation for Group Recommendations

Authors: Cedric Waterschoot, Nava Tintarev, Francesco Barile
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.05016
Pdf URL: https://arxiv.org/pdf/2505.05016
Copy Paste: [[2505.05016]] The Pitfalls of Growing Group Complexity: LLMs and Social Choice-Based Aggregation for Group Recommendations(https://arxiv.org/abs/2505.05016)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly applied in recommender systems aimed at both individuals and groups. Previously, Group Recommender Systems (GRS) often used social choice-based aggregation strategies to derive a single recommendation based on the preferences of multiple people. In this paper, we investigate under which conditions language models can perform these strategies correctly based on zero-shot learning and analyse whether the formatting of the group scenario in the prompt affects accuracy. We specifically focused on the impact of group complexity (number of users and items), different LLMs, different prompting conditions, including In-Context learning or generating explanations, and the formatting of group preferences. Our results show that performance starts to deteriorate when considering more than 100 ratings. However, not all language models were equally sensitive to growing group complexity. Additionally, we showed that In-Context Learning (ICL) can significantly increase the performance at higher degrees of group complexity, while adding other prompt modifications, specifying domain cues or prompting for explanations, did not impact accuracy. We conclude that future research should include group complexity as a factor in GRS evaluation due to its effect on LLM performance. Furthermore, we showed that formatting the group scenarios differently, such as rating lists per user or per item, affected accuracy. All in all, our study implies that smaller LLMs are capable of generating group recommendations under the right conditions, making the case for using smaller models that require less computing power and costs.
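For readers unfamiliar with social choice-based aggregation, the sketch below implements three strategies that are common in the group recommendation literature (additive, least misery, most pleasure). The abstract does not list which strategies were tested, so treat these as representative examples. The users, items, and ratings are made up.

```python
from typing import Callable, Dict, List

Ratings = Dict[str, Dict[str, int]]  # user -> item -> rating

# Each strategy reduces one item's list of individual ratings to a group score.
STRATEGIES: Dict[str, Callable[[List[int]], int]] = {
    "additive": sum,       # maximize total satisfaction
    "least_misery": min,   # maximize the unhappiest member's rating
    "most_pleasure": max,  # maximize the happiest member's rating
}


def aggregate(ratings: Ratings, strategy: str) -> Dict[str, int]:
    """Compute the group score of every item under the given strategy."""
    combine = STRATEGIES[strategy]
    items = sorted(next(iter(ratings.values())))
    return {item: combine([user[item] for user in ratings.values()]) for item in items}


def recommend(ratings: Ratings, strategy: str) -> str:
    """Return the top-scoring item; ties go to the alphabetically first item."""
    scores = aggregate(ratings, strategy)
    return max(sorted(scores), key=lambda item: scores[item])


# Hypothetical group of three users rating three items on a 1-10 scale.
group: Ratings = {
    "ann": {"A": 10, "B": 4, "C": 7},
    "bob": {"A": 1, "B": 9, "C": 6},
    "cy": {"A": 10, "B": 5, "C": 6},
}
picks = {name: recommend(group, name) for name in STRATEGIES}
```

Even in this tiny group the strategies disagree: item A wins on total score, while C wins under least misery because bob strongly dislikes A. With 3 users and 3 items there are only 9 ratings; the study reports that LLM accuracy starts to degrade beyond 100.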

Title: Scalable Multi-Stage Influence Function for Large Language Models via Eigenvalue-Corrected Kronecker-Factored Parameterization

Authors: Yuntai Bao, Xuhong Zhang, Tianyu Du, Xinkui Zhao, Jiang Zong, Hao Peng, Jianwei Yin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05017
Pdf URL: https://arxiv.org/pdf/2505.05017
Copy Paste: [[2505.05017]] Scalable Multi-Stage Influence Function for Large Language Models via Eigenvalue-Corrected Kronecker-Factored Parameterization(https://arxiv.org/abs/2505.05017)
Keywords: language model, llm
Abstract: Pre-trained large language models (LLMs) are commonly fine-tuned to adapt to downstream tasks. Since the majority of knowledge is acquired during pre-training, attributing the predictions of fine-tuned LLMs to their pre-training data may provide valuable insights. Influence functions have been proposed as a means to explain model predictions based on training data. However, existing approaches fail to compute "multi-stage" influence and lack scalability to billion-scale LLMs. In this paper, we propose the multi-stage influence function to attribute the downstream predictions of fine-tuned LLMs to pre-training data under the full-parameter fine-tuning paradigm. To enhance the efficiency and practicality of our multi-stage influence function, we leverage Eigenvalue-corrected Kronecker-Factored (EK-FAC) parameterization for efficient approximation. Empirical results validate the superior scalability of EK-FAC approximation and the effectiveness of our multi-stage influence function. Additionally, case studies on a real-world LLM, dolly-v2-3b, demonstrate its interpretive power, with exemplars illustrating insights provided by multi-stage influence estimates. Our code is public at this https URL.

Title: G-FOCUS: Towards a Robust Method for Assessing UI Design Persuasiveness

Authors: Jaehyun Jeon, Janghan Yoon, Minsoo Kim, Sumin Shim, Yejin Choi, Hanbin Kim, Youngjae Yu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.05026
Pdf URL: https://arxiv.org/pdf/2505.05026
Copy Paste: [[2505.05026]] G-FOCUS: Towards a Robust Method for Assessing UI Design Persuasiveness(https://arxiv.org/abs/2505.05026)
Keywords: language model
Abstract: Evaluating user interface (UI) design effectiveness extends beyond aesthetics to influencing user behavior, a principle central to Design Persuasiveness. A/B testing is the predominant method for determining which UI variations drive higher user engagement, but it is costly and time-consuming. While recent Vision-Language Models (VLMs) can process automated UI analysis, current approaches focus on isolated design attributes rather than comparative persuasiveness, the key factor in optimizing user interactions. To address this, we introduce WiserUI-Bench, a benchmark designed for the Pairwise UI Design Persuasiveness Assessment task, featuring 300 real-world UI image pairs labeled with A/B test results and expert rationales. Additionally, we propose G-FOCUS, a novel inference-time reasoning strategy that enhances VLM-based persuasiveness assessment by reducing position bias and improving evaluation accuracy. Experimental results show that G-FOCUS surpasses existing inference strategies in consistency and accuracy for pairwise UI evaluation. Through promoting VLM-driven evaluation of UI persuasiveness, our work offers an approach to complement A/B testing, propelling progress in scalable UI preference modeling and design optimization. Code and data will be released publicly.

Title: Image-Text Relation Prediction for Multilingual Tweets

Authors: Matīss Rikters, Edison Marrese-Taylor
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.05040
Pdf URL: https://arxiv.org/pdf/2505.05040
Copy Paste: [[2505.05040]] Image-Text Relation Prediction for Multilingual Tweets(https://arxiv.org/abs/2505.05040)
Keywords: language model
Abstract: Various social networks have been allowing media uploads for over a decade now. Still, it has not always been clear what their relation to the posted text is, or even whether there is any at all. In this work, we explore how multilingual vision-language models tackle the task of image-text relation prediction in different languages, and construct a dedicated balanced benchmark data set from Twitter posts in Latvian along with their manual translations into English. We compare our results to previous work and show that the more recently released vision-language model checkpoints are becoming increasingly capable at this task, but there is still much room for further improvement.

Title: Performance Evaluation of Large Language Models in Bangla Consumer Health Query Summarization

Authors: Ajwad Abrar, Farzana Tabassum, Sabbir Ahmed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05070
Pdf URL: https://arxiv.org/pdf/2505.05070
Copy Paste: [[2505.05070]] Performance Evaluation of Large Language Models in Bangla Consumer Health Query Summarization(https://arxiv.org/abs/2505.05070)
Keywords: language model, gpt, llm
Abstract: Consumer Health Queries (CHQs) in Bengali (Bangla), a low-resource language, often contain extraneous details, complicating efficient medical responses. This study investigates the zero-shot performance of nine advanced large language models (LLMs): GPT-3.5-Turbo, GPT-4, Claude-3.5-Sonnet, Llama3-70b-Instruct, Mixtral-8x22b-Instruct, Gemini-1.5-Pro, Qwen2-72b-Instruct, Gemma-2-27b, and Athene-70B, in summarizing Bangla CHQs. Using the BanglaCHQ-Summ dataset comprising 2,350 annotated query-summary pairs, we benchmarked these LLMs using ROUGE metrics against Bangla T5, a fine-tuned state-of-the-art model. Mixtral-8x22b-Instruct emerged as the top performing model in ROUGE-1 and ROUGE-L, while Bangla T5 excelled in ROUGE-2. The results demonstrate that zero-shot LLMs can rival fine-tuned models, achieving high-quality summaries even without task-specific training. This work underscores the potential of LLMs in addressing challenges in low-resource languages, providing scalable solutions for healthcare query summarization.
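As a reminder of what the ROUGE-1 and ROUGE-2 numbers measure, here is a minimal n-gram overlap sketch. It assumes whitespace tokenization and no stemming; the abstract does not state which ROUGE implementation or tokenizer was used, and ROUGE-L (longest common subsequence) is not covered here. The example sentences are invented.

```python
from collections import Counter
from typing import Dict


def rouge_n(candidate: str, reference: str, n: int = 1) -> Dict[str, float]:
    """Precision, recall, and F1 over clipped n-gram overlap."""

    def ngrams(text: str) -> Counter:
        tokens = text.split()  # assumption: whitespace tokenization
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection clips each n-gram to its count in the other side.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


reference_summary = "the cat sat on the mat"
model_summary = "the cat sat"
rouge1 = rouge_n(model_summary, reference_summary, n=1)
rouge2 = rouge_n(model_summary, reference_summary, n=2)
```

Because `str.split` works on any whitespace-delimited text, the same function runs unchanged on Bangla strings, though a language-aware tokenizer may give different scores.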

Title: Reliably Bounding False Positives: A Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal Prediction

Authors: Xiaowei Zhu, Yubing Ren, Yanan Cao, Xixun Lin, Fang Fang, Yangxi Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05084
Pdf URL: https://arxiv.org/pdf/2505.05084
Copy Paste: [[2505.05084]] Reliably Bounding False Positives: A Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal Prediction(https://arxiv.org/abs/2505.05084)
Keywords: language model
Abstract: The rapid advancement of large language models has raised significant concerns regarding their potential misuse by malicious actors. As a result, developing effective detectors to mitigate these risks has become a critical priority. However, most existing detection methods focus excessively on detection accuracy, often neglecting the societal risks posed by high false positive rates (FPRs). This paper addresses this issue by leveraging Conformal Prediction (CP), which effectively constrains the upper bound of FPRs. While directly applying CP constrains FPRs, it also leads to a significant reduction in detection performance. To overcome this trade-off, this paper proposes a Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal Prediction (MCP), which both enforces the FPR constraint and improves detection performance. This paper also introduces RealDet, a high-quality dataset that spans a wide range of domains, ensuring realistic calibration and enabling superior detection performance when combined with MCP. Empirical evaluations demonstrate that MCP effectively constrains FPRs, significantly enhances detection performance, and increases robustness against adversarial attacks across multiple detectors and datasets.
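The way conformal prediction bounds the false positive rate can be shown with plain split conformal calibration, sketched below. This is the textbook baseline, not the paper's multiscaled method (MCP), whose details are not given in the abstract. The detector scores are hypothetical, with higher meaning "more machine-like".

```python
import math
from typing import List


def conformal_threshold(human_scores: List[float], alpha: float) -> float:
    """Pick a decision threshold from detector scores on human-written text.

    If a new human-written text is exchangeable with the calibration
    texts, the chance that its score exceeds the returned threshold is
    at most alpha, i.e. the FPR is bounded by alpha.
    """
    n = len(human_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # conformal quantile index (1-based)
    if rank > n:
        # Too few calibration points to certify this alpha: never flag.
        return math.inf
    return sorted(human_scores)[rank - 1]


def is_machine_generated(score: float, threshold: float) -> bool:
    """Flag a text only when its score is strictly above the threshold."""
    return score > threshold


# Hypothetical calibration scores from ten human-written texts.
calibration = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
threshold = conformal_threshold(calibration, alpha=0.25)
flags = [is_machine_generated(s, threshold) for s in (3.0, 9.5)]
```

The guarantee costs detection power: a small `alpha` pushes the threshold toward the top of the human score distribution, so more machine-generated text slips through. That is the trade-off the abstract says MCP is designed to ease.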

Title: Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders

Authors: Boyi Deng, Yu Wan, Yidan Zhang, Baosong Yang, Fuli Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05111
Pdf URL: https://arxiv.org/pdf/2505.05111
Copy Paste: [[2505.05111]] Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders(https://arxiv.org/abs/2505.05111)
Keywords: language model, llm
Abstract: The mechanisms behind multilingual capabilities in Large Language Models (LLMs) have been examined using neuron-based or internal-activation-based methods. However, these methods often face challenges such as superposition and layer-wise activation variance, which limit their reliability. Sparse Autoencoders (SAEs) offer a more nuanced analysis by decomposing the activations of LLMs into a sparse linear combination of SAE features. We introduce a novel metric to assess the monolinguality of features obtained from SAEs, discovering that some features are strongly related to specific languages. Additionally, we show that ablating these SAE features significantly reduces LLM abilities in only one language, leaving others almost unaffected. Interestingly, we find some languages have multiple synergistic SAE features, and ablating them together yields greater improvement than ablating individually. Moreover, we leverage these SAE-derived language-specific features to enhance steering vectors, achieving control over the language generated by LLMs.

Title: QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation

Authors: Mengze Hong, Wailing Ng, Di Jiang, Chen Jason Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05225
Pdf URL: https://arxiv.org/pdf/2505.05225
Copy Paste: [[2505.05225]] QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation(https://arxiv.org/abs/2505.05225)
Keywords: language model, gpt, llm
Abstract: The rapid advancement of Chinese large language models (LLMs) underscores the need for domain-specific evaluations to ensure reliable applications. However, existing benchmarks often lack coverage in vertical domains and offer limited insights into the Chinese working context. Leveraging qualification exams as a unified framework for human expertise evaluation, we introduce QualBench, the first multi-domain Chinese QA benchmark dedicated to localized assessment of Chinese LLMs. The dataset includes over 17,000 questions across six vertical domains, with data selections grounded in 24 Chinese qualifications to closely align with national policies and working standards. Through comprehensive evaluation, the Qwen2.5 model outperformed the more advanced GPT-4o, with Chinese LLMs consistently surpassing non-Chinese models, highlighting the importance of localized domain knowledge in meeting qualification requirements. The best performance of 75.26% reveals the current gaps in domain coverage within model capabilities. Furthermore, we present the failure of LLM collaboration with crowdsourcing mechanisms and suggest the opportunities for multi-domain RAG knowledge enhancement and vertical domain LLM training with Federated Learning.

Title: Toward Reasonable Parrots: Why Large Language Models Should Argue with Us by Design

Authors: Elena Musi, Nadin Kokciyan, Khalid Al-Khatib, Davide Ceolin, Emmanuelle Dietz, Klara Gutekunst, Annette Hautli-Janisz, Cristian Manuel Santibañez Yañez, Jodi Schneider, Jonas Scholz, Cor Steging, Jacky Visser, Henning Wachsmuth
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2505.05298
Pdf URL: https://arxiv.org/pdf/2505.05298
Copy Paste: [[2505.05298]] Toward Reasonable Parrots: Why Large Language Models Should Argue with Us by Design(https://arxiv.org/abs/2505.05298)
Keywords: language model, llm
Abstract: In this position paper, we advocate for the development of conversational technology that is inherently designed to support and facilitate argumentative processes. We argue that, at present, large language models (LLMs) are inadequate for this purpose, and we propose an ideal technology design aimed at enhancing argumentative skills. This involves re-framing LLMs as tools to exercise our critical thinking rather than replacing them. We introduce the concept of 'reasonable parrots' that embody the fundamental principles of relevance, responsibility, and freedom, and that interact through argumentative dialogical moves. These principles and moves arise out of millennia of work in argumentation theory and should serve as the starting point for LLM-based technology that incorporates basic principles of argumentation.

Title: ICon: In-Context Contribution for Automatic Data Selection

Authors: Yixin Yang, Qingxiu Dong, Linli Yao, Fangwei Zhu, Zhifang Sui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05327
Pdf URL: https://arxiv.org/pdf/2505.05327
Copy Paste: [[2505.05327]] ICon: In-Context Contribution for Automatic Data Selection(https://arxiv.org/abs/2505.05327)
Keywords: language model, llm
Abstract: Data selection for instruction tuning is essential for improving the performance of Large Language Models (LLMs) and reducing training cost. However, existing automated selection methods either depend on computationally expensive gradient-based measures or manually designed heuristics, which may fail to fully exploit the intrinsic attributes of data. In this paper, we propose In-context Learning for Contribution Measurement (ICon), a novel gradient-free method that takes advantage of the implicit fine-tuning nature of in-context learning (ICL) to measure sample contribution without gradient computation or manual indicator engineering. ICon offers a computationally efficient alternative to gradient-based methods and reduces human inductive bias inherent in heuristic-based approaches. ICon comprises three components and identifies high-contribution data by assessing performance shifts under implicit learning through ICL. Extensive experiments on three LLMs across 12 benchmarks and 5 pairwise evaluation sets demonstrate the effectiveness of ICon. Remarkably, on LLaMA3.1-8B, models trained on 15% of ICon-selected data outperform those trained on the full dataset by 5.42 percentage points and exceed the best performance of widely used selection methods by 2.06 percentage points. We further analyze high-contribution samples selected by ICon, which show both diverse tasks and appropriate difficulty levels, rather than just the hardest ones.

Title: Frame In, Frame Out: Do LLMs Generate More Biased News Headlines than Humans?

Authors: Valeria Pastorino, Nafise Sadat Moosavi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05406
Pdf URL: https://arxiv.org/pdf/2505.05406
Copy Paste: [[2505.05406]] Frame In, Frame Out: Do LLMs Generate More Biased News Headlines than Humans?(https://arxiv.org/abs/2505.05406)
Keywords: language model, llm
Abstract: Framing in media critically shapes public perception by selectively emphasizing some details while downplaying others. With the rise of large language models in automated news and content creation, there is growing concern that these systems may introduce or even amplify framing biases compared to human authors. In this paper, we explore how framing manifests in both out-of-the-box and fine-tuned LLM-generated news content. Our analysis reveals that, particularly in politically and socially sensitive contexts, LLMs tend to exhibit more pronounced framing than their human counterparts. In addition, we observe significant variation in framing tendencies across different model architectures, with some models displaying notably higher biases. These findings point to the need for effective post-training mitigation strategies and tighter evaluation frameworks to ensure that automated news content upholds the standards of balanced reporting.

Title: Crosslingual Reasoning through Test-Time Scaling

Authors: Zheng-Xin Yong, M. Farid Adilazuarda, Jonibek Mansurov, Ruochen Zhang, Niklas Muennighoff, Carsten Eickhoff, Genta Indra Winata, Julia Kreutzer, Stephen H. Bach, Alham Fikri Aji
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.05408
Pdf URL: https://arxiv.org/pdf/2505.05408
Copy Paste: [[2505.05408]] Crosslingual Reasoning through Test-Time Scaling(https://arxiv.org/abs/2505.05408)
Keywords: language model, chain-of-thought
Abstract: Reasoning capabilities of large language models are primarily studied for English, even when pretrained models are multilingual. In this work, we investigate to what extent English reasoning finetuning with long chain-of-thoughts (CoTs) can generalize across languages. First, we find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages including low-resource languages, to an extent where they outperform models twice their size. Second, we reveal that while English-centric RLMs' CoTs are naturally predominantly English, they consistently follow a quote-and-think pattern to reason about quoted non-English inputs. Third, we discover an effective strategy to control the language of long CoT reasoning, and we observe that models reason better and more efficiently in high-resource languages. Finally, we observe poor out-of-domain reasoning generalization, in particular from STEM to cultural commonsense knowledge, even for English. Overall, we demonstrate the potential, study the mechanisms, and outline the limitations of crosslingual generalization of English reasoning test-time scaling. We conclude that practitioners should let English-centric RLMs reason in high-resource languages, while further work is needed to improve reasoning in low-resource languages and out-of-domain contexts.

Title: Reasoning Models Don't Always Say What They Think

Authors: Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, Ethan Perez
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.05410
Pdf URL: https://arxiv.org/pdf/2505.05410
Copy Paste: [[2505.05410]] Reasoning Models Don't Always Say What They Think(https://arxiv.org/abs/2505.05410)
Keywords: prompt, chain-of-thought
Abstract: Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model's CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully representing models' actual reasoning processes. We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out. They also suggest that in settings like ours where CoT reasoning is not necessary, test-time monitoring of CoTs is unlikely to reliably catch rare and catastrophic unexpected behaviors.

Title: TransProQA: an LLM-based literary Translation evaluation metric with Professional Question Answering

Authors: Ran Zhang, Wei Zhao, Lieve Macken, Steffen Eger
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.05423
Pdf URL: https://arxiv.org/pdf/2505.05423
Copy Paste: [[2505.05423]] TransProQA: an LLM-based literary Translation evaluation metric with Professional Question Answering(https://arxiv.org/abs/2505.05423)
Keywords: language model, llm
Abstract: The impact of Large Language Models (LLMs) has extended into literary domains. However, existing evaluation metrics prioritize mechanical accuracy over artistic expression and tend to overrate machine translation (MT) as being superior to experienced professional human translation. In the long run, this bias could result in a permanent decline in translation quality and cultural authenticity. In response to the urgent need for a specialized literary evaluation metric, we introduce TransProQA, a novel, reference-free, LLM-based question-answering (QA) framework designed specifically for literary translation evaluation. TransProQA uniquely integrates insights from professional literary translators and researchers, focusing on critical elements in literary quality assessment such as literary devices, cultural understanding, and authorial voice. Our extensive evaluation shows that while literary-finetuned XCOMET-XL yields marginal gains, TransProQA substantially outperforms current metrics, achieving up to 0.07 gain in correlation (ACC-EQ and Kendall's tau) and surpassing the best state-of-the-art (SOTA) metrics by over 15 points in adequacy assessments. Incorporating professional translator insights as weights further improves performance, highlighting the value of translator inputs. Notably, TransProQA approaches human-level evaluation performance comparable to trained linguistic annotators. It demonstrates broad applicability to open-source models such as LLaMA3.3-70b and Qwen2.5-32b, indicating its potential as an accessible and training-free literary evaluation metric and a valuable tool for evaluating texts that require local processing due to copyright or ethical considerations.

Title: Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data

Authors: Yudong Wang, Zixuan Fu, Jie Cai, Peijun Tang, Hongya Lyu, Yewei Fang, Zhi Zheng, Jie Zhou, Guoyang Zeng, Chaojun Xiao, Xu Han, Zhiyuan Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05427
Pdf URL: https://arxiv.org/pdf/2505.05427
Copy Paste: [[2505.05427]] Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data(https://arxiv.org/abs/2505.05427)
Keywords: language model, llm
Abstract: Data quality has become a key factor in enhancing model performance with the rapid development of large language models (LLMs). Model-driven data filtering has increasingly become a primary approach for acquiring high-quality data. However, it still faces two main challenges: (1) the lack of an efficient data verification strategy makes it difficult to provide timely feedback on data quality; and (2) the selection of seed data for training classifiers lacks clear criteria and relies heavily on human expertise, introducing a degree of subjectivity. To address the first challenge, we introduce an efficient verification strategy that enables rapid evaluation of the impact of data on LLM training with minimal computational cost. To tackle the second challenge, we build upon the assumption that high-quality seed data is beneficial for LLM training, and by integrating the proposed verification strategy, we optimize the selection of positive and negative samples and propose an efficient data filtering pipeline. This pipeline not only improves filtering efficiency, classifier quality, and robustness, but also significantly reduces experimental and inference costs. In addition, to efficiently filter high-quality data, we employ a lightweight classifier based on fastText, and successfully apply the filtering pipeline to two widely-used pre-training corpora, FineWeb and Chinese FineWeb datasets, resulting in the creation of the higher-quality Ultra-FineWeb dataset. Ultra-FineWeb contains approximately 1 trillion English tokens and 120 billion Chinese tokens. Empirical results demonstrate that the LLMs trained on Ultra-FineWeb exhibit significant performance improvements across multiple benchmark tasks, validating the effectiveness of our pipeline in enhancing both data quality and training efficiency.

Title: clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations

Authors: Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05445
Pdf URL: https://arxiv.org/pdf/2505.05445
Copy Paste: [[2505.05445]] clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations(https://arxiv.org/abs/2505.05445)
Keywords: language model, llm, prompt, chat, agent
Abstract: The emergence of instruction-tuned large language models (LLMs) has advanced the field of dialogue systems, enabling both realistic user simulations and robust multi-turn conversational agents. However, existing research often evaluates these components in isolation, focusing on either a single user simulator or a specific system design, which limits the generalisability of insights across architectures and configurations. In this work, we propose clem:todd (chat-optimized LLMs for task-oriented dialogue systems development), a flexible framework for systematically evaluating dialogue systems under consistent conditions. clem:todd enables detailed benchmarking across combinations of user simulators and dialogue systems, whether existing models from literature or newly developed ones. It supports plug-and-play integration and ensures uniform datasets, evaluation metrics, and computational constraints. We showcase clem:todd's flexibility by re-evaluating existing task-oriented dialogue systems within this unified setup and integrating three newly proposed dialogue systems into the same evaluation pipeline. Our results provide actionable insights into how architecture, scale, and prompting strategies affect dialogue performance, offering practical guidance for building efficient and effective conversational AI systems.

Title: UKElectionNarratives: A Dataset of Misleading Narratives Surrounding Recent UK General Elections

Authors: Fatima Haouari, Carolina Scarton, Nicolò Faggiani, Nikolaos Nikolaidis, Bonka Kotseva, Ibrahim Abu Farha, Jens Linge, Kalina Bontcheva
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2505.05459
Pdf URL: https://arxiv.org/pdf/2505.05459
Copy Paste: [[2505.05459]] UKElectionNarratives: A Dataset of Misleading Narratives Surrounding Recent UK General Elections(https://arxiv.org/abs/2505.05459)
Keywords: language model, gpt
Abstract: Misleading narratives play a crucial role in shaping public opinion during elections, as they can influence how voters perceive candidates and political parties. This entails the need to detect these narratives accurately. To address this, we introduce the first taxonomy of common misleading narratives that circulated during recent elections in Europe. Based on this taxonomy, we construct and analyse UKElectionNarratives: the first dataset of human-annotated misleading narratives which circulated during the UK General Elections in 2019 and 2024. We also benchmark Pre-trained and Large Language Models (focusing on GPT-4o), studying their effectiveness in detecting election-related misleading narratives. Finally, we discuss potential use cases and make recommendations for future research directions using the proposed codebook and dataset.

Title: Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging

Authors: Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, Junxian He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.05464
Pdf URL: https://arxiv.org/pdf/2505.05464
Copy Paste: [[2505.05464]] Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging(https://arxiv.org/abs/2505.05464)
Keywords: language model, llm
Abstract: Vision-Language Models (VLMs) combine visual perception with the general capabilities, such as reasoning, of Large Language Models (LLMs). However, the mechanisms by which these two abilities can be combined and contribute remain poorly understood. In this work, we explore composing perception and reasoning through model merging that connects parameters of different models. Unlike previous works that often focus on merging models of the same kind, we propose merging models across modalities, enabling the incorporation of the reasoning capabilities of LLMs into VLMs. Through extensive experiments, we demonstrate that model merging offers a successful pathway to transfer reasoning abilities from LLMs to VLMs in a training-free manner. Moreover, we utilize the merged models to understand the internal mechanism of perception and reasoning and how merging affects it. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers. After merging, we observe that all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged. These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.

Title: ComPO: Preference Alignment via Comparison Oracles

Authors: Peter Chen, Xi Chen, Wotao Yin, Tianyi Lin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.05465
Pdf URL: https://arxiv.org/pdf/2505.05465
Copy Paste: [[2505.05465]] ComPO: Preference Alignment via Comparison Oracles(https://arxiv.org/abs/2505.05465)
Keywords: language model, llm
Abstract: Direct alignment methods are increasingly used for aligning large language models (LLMs) with human preferences. However, these methods suffer from the issues of verbosity and likelihood displacement, which can be driven by the noisy preference pairs that induce similar likelihood for preferred and dispreferred responses. The contributions of this paper are two-fold. First, we propose a new preference alignment method based on comparison oracles and provide the convergence guarantee for its basic scheme. Second, we improve our method using some heuristics and conduct experiments to demonstrate the flexibility and compatibility of the practical scheme in improving the performance of LLMs using noisy preference pairs. Evaluations are conducted across multiple base and instruction-tuned models (Mistral-7B, Llama-3-8B and Gemma-2-9B) with benchmarks (AlpacaEval 2, MT-Bench and Arena-Hard). Experimental results show the effectiveness of our method as an alternative to addressing the limitations of existing direct alignment methods. A highlight of our work is that we provide evidence for the importance of designing specialized methods for preference pairs with distinct likelihood margin, which complements the recent findings of Razin et al. (2025).