2025-07-08

Title: Loki's Dance of Illusions: A Comprehensive Survey of Hallucination in Large Language Models

Authors: Chaozhuo Li, Pengbo Wang, Chenxu Wang, Litian Zhang, Zheng Liu, Qiwei Ye, Yuanbo Xu, Feiran Huang, Xi Zhang, Philip S. Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02870
Pdf URL: https://arxiv.org/pdf/2507.02870
Copy Paste: [[2507.02870]] Loki's Dance of Illusions: A Comprehensive Survey of Hallucination in Large Language Models(https://arxiv.org/abs/2507.02870)
Keywords: language model, llm, hallucination
Abstract: Edgar Allan Poe noted, "Truth often lurks in the shadow of error," highlighting the deep complexity intrinsic to the interplay between truth and falsehood, notably under conditions of cognitive and informational asymmetry. This dynamic is strikingly evident in large language models (LLMs). Despite their impressive linguistic generation capabilities, LLMs sometimes produce information that appears factually accurate but is, in reality, fabricated, an issue often referred to as 'hallucinations'. The prevalence of these hallucinations can mislead users, affecting their judgments and decisions. In sectors such as finance, law, and healthcare, such misinformation risks causing substantial economic losses, legal disputes, and health risks, with wide-ranging consequences. In our research, we have methodically categorized and analyzed the causes, detection methods, and solutions related to LLM hallucinations. Our efforts have particularly focused on understanding the roots of hallucinations and evaluating the efficacy of current strategies in revealing the underlying logic, thereby paving the way for the development of innovative and potent approaches. By examining why certain measures are effective against hallucinations, our study aims to foster a comprehensive approach to tackling this issue within the domain of LLMs.

Title: ChatGPT is not A Man but Das Man: Representativeness and Structural Consistency of Silicon Samples Generated by Large Language Models

Authors: Dai Li, Linzhuo Li, Huilian Sophie Qiu
Subjects: cs.CL, cs.CY, cs.ET
Abstract URL: https://arxiv.org/abs/2507.02919
Pdf URL: https://arxiv.org/pdf/2507.02919
Copy Paste: [[2507.02919]] ChatGPT is not A Man but Das Man: Representativeness and Structural Consistency of Silicon Samples Generated by Large Language Models(https://arxiv.org/abs/2507.02919)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Large language models (LLMs) in the form of chatbots like ChatGPT and Llama are increasingly proposed as "silicon samples" for simulating human opinions. This study examines this notion, arguing that LLMs may misrepresent population-level opinions. We identify two fundamental challenges: a failure in structural consistency, where response accuracy doesn't hold across demographic aggregation levels, and homogenization, an underrepresentation of minority opinions. To investigate these, we prompted ChatGPT (GPT-4) and Meta's Llama 3.1 series (8B, 70B, 405B) with questions on abortion and unauthorized immigration from the American National Election Studies (ANES) 2020. Our findings reveal significant structural inconsistencies and severe homogenization in LLM responses compared to human data. We propose an "accuracy-optimization hypothesis," suggesting homogenization stems from prioritizing modal responses. These issues challenge the validity of using LLMs, especially chatbot AI, as direct substitutes for human survey data, potentially reinforcing stereotypes and misinforming policy.
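
One way to make the homogenization idea above concrete is to compare how concentrated the answers are in a human sample and in a simulated one. The sketch below is illustrative only: the two statistics (share of the modal answer, normalized Shannon entropy) and the toy answer lists are assumptions made for this example, not the measures or data used in the paper.

```python
import math
from collections import Counter


def modal_share(responses):
    """Fraction of respondents who gave the single most common answer."""
    counts = Counter(responses)
    return max(counts.values()) / len(responses)


def normalized_entropy(responses, n_options):
    """Shannon entropy of the answer distribution, scaled to [0, 1].

    n_options is the number of answer choices offered by the survey item,
    so samples that never use some options are still comparable.
    """
    counts = Counter(responses)
    total = len(responses)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(n_options)


# Toy data (hypothetical): a four-option survey item.
human = ["A"] * 4 + ["B"] * 3 + ["C"] * 2 + ["D"] * 1
simulated = ["A"] * 9 + ["B"] * 1

print("modal share:", modal_share(human), modal_share(simulated))
print("entropy:", normalized_entropy(human, 4), normalized_entropy(simulated, 4))
```

A higher modal share together with a lower entropy in the simulated sample is the pattern one would expect if minority answers are underrepresented.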

Title: A Unified Speech LLM for Diarization and Speech Recognition in Multilingual Conversations

Authors: Phurich Saengthong, Boonnithi Jiaramaneepinit, Sheng Li, Manabu Okumura, Takahiro Shinozaki
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2507.02927
Pdf URL: https://arxiv.org/pdf/2507.02927
Copy Paste: [[2507.02927]] A Unified Speech LLM for Diarization and Speech Recognition in Multilingual Conversations(https://arxiv.org/abs/2507.02927)
Keywords: language model, llm
Abstract: Speech Large Language Models (Speech LLMs) have emerged as a crucial paradigm in recent years, extending the capabilities of traditional LLMs to speech tasks such as automatic speech recognition (ASR) and spoken dialogue modeling. However, their effectiveness in real-world multilingual conversations remains limited by the scarcity of data that captures natural conversational phenomena. To address this, the MLC-SLM Challenge provides a multilingual conversational dataset and evaluates models on two tasks: ASR with oracle segmentation (Task I) and joint diarization and recognition without oracle information (Task II). In this paper, we focus on Task II and propose a unified speech LLM that jointly performs diarization and ASR in an end-to-end manner. By reformulating the training data format and modifying the inference procedure, our model addresses the ambiguity inherent in pre-segmented audio and achieves a 54.87% relative improvement in tcpWER/tcpCER over the baseline, ranking 8th overall, despite using a smaller LLM backbone. We also report results from Task I using a fine-tuned speech LLM.
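
For readers unfamiliar with how a "relative improvement" in an error rate is usually computed, here is a minimal sketch. The function name and the error-rate values are made up for illustration; the abstract does not report the underlying baseline and system error rates.

```python
def relative_improvement(baseline_error, system_error):
    """Percentage reduction in error rate relative to the baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - system_error) / baseline_error


# Hypothetical error rates (in percent), not taken from the paper.
baseline_error = 60.0
system_error = 30.0
print(relative_improvement(baseline_error, system_error))  # 50.0
```

Note that a relative improvement is not the same as the absolute difference: in the toy case above the absolute reduction is 30 points, while the relative improvement is 50%.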

Title: Mitigating Hidden Confounding by Progressive Confounder Imputation via Large Language Models

Authors: Hao Yang, Haoxuan Li, Luyu Chen, Haoxiang Wang, Xu Chen, Mingming Gong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02928
Pdf URL: https://arxiv.org/pdf/2507.02928
Copy Paste: [[2507.02928]] Mitigating Hidden Confounding by Progressive Confounder Imputation via Large Language Models(https://arxiv.org/abs/2507.02928)
Keywords: language model, llm
Abstract: Hidden confounding remains a central challenge in estimating treatment effects from observational data, as unobserved variables can lead to biased causal estimates. While recent work has explored the use of large language models (LLMs) for causal inference, most approaches still rely on the unconfoundedness assumption. In this paper, we make the first attempt to mitigate hidden confounding using LLMs. We propose ProCI (Progressive Confounder Imputation), a framework that elicits the semantic and world knowledge of LLMs to iteratively generate, impute, and validate hidden confounders. ProCI leverages two key capabilities of LLMs: their strong semantic reasoning ability, which enables the discovery of plausible confounders from both structured and unstructured inputs, and their embedded world knowledge, which supports counterfactual reasoning under latent confounding. To improve robustness, ProCI adopts a distributional reasoning strategy instead of direct value imputation to prevent the collapsed outputs. Extensive experiments demonstrate that ProCI uncovers meaningful confounders and significantly improves treatment effect estimation across various datasets and LLMs.

Title: Theory of Mind in Action: The Instruction Inference Task

Authors: Fardin Saad, Pradeep K. Murukannaiah, Munindar P. Singh
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2507.02935
Pdf URL: https://arxiv.org/pdf/2507.02935
Copy Paste: [[2507.02935]] Theory of Mind in Action: The Instruction Inference Task(https://arxiv.org/abs/2507.02935)
Keywords: language model, gpt, llm, prompt, chain-of-thought, agent
Abstract: The Theory of Mind (ToM) refers to an agent's capacity to infer the mental states of other agents. ToM is essential for effective collaboration. To assess ToM in a dynamic, goal-oriented, and collaborative environment, we introduce a novel task, Instruction Inference, in which an agent assists a principal in reaching a goal by interpreting indirect or ambiguous instructions. We present Tomcat, an LLM-based agent, designed to exhibit ToM reasoning in interpreting and responding to the principal's instructions. We implement two variants of Tomcat. One, dubbed Fs-CoT, is based on a small number of examples (i.e., few-shot or Fs) demonstrating the requisite structured reasoning (i.e., chain-of-thought or CoT). The other, dubbed CP, relies on commonsense knowledge and information about the problem (i.e., commonsense prompt or CP). We realized both variants of Tomcat on three leading large language models (LLMs), namely, GPT-4o, DeepSeek-R1, and Gemma-3-27B. To evaluate the effectiveness of Tomcat, we conducted a study with 52 human participants in which we provided participants with the same information as the CP variant of Tomcat. We computed intent accuracy, action optimality, and planning optimality to measure the ToM capabilities of Tomcat and our study participants. We found that Tomcat with Fs-CoT, particularly with GPT-4o and DeepSeek-R1, achieves performance comparable to the human participants, underscoring its ToM potential for human-AI collaboration.

Title: A Large Language Model-Empowered Agent for Reliable and Robust Structural Analysis

Authors: Jiachen Liu, Ziheng Geng, Ran Cao, Lu Cheng, Paolo Bocchini, Minghui Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02938
Pdf URL: https://arxiv.org/pdf/2507.02938
Copy Paste: [[2507.02938]] A Large Language Model-Empowered Agent for Reliable and Robust Structural Analysis(https://arxiv.org/abs/2507.02938)
Keywords: language model, llm, prompt, chain-of-thought, agent
Abstract: Large language models (LLMs) have exhibited remarkable capabilities across diverse open-domain tasks, yet their application in specialized domains such as civil engineering remains largely unexplored. This paper starts bridging this gap by evaluating and enhancing the reliability and robustness of LLMs in structural analysis of beams. Reliability is assessed through the accuracy of correct outputs under repetitive runs of the same problems, whereas robustness is evaluated via the performance across varying load and boundary conditions. A benchmark dataset, comprising eight beam analysis problems, is created to test the Llama-3.3 70B Instruct model. Results show that, despite a qualitative understanding of structural mechanics, the LLM lacks the quantitative reliability and robustness for engineering applications. To address these limitations, a shift is proposed that reframes structural analysis as a code generation task. Accordingly, an LLM-empowered agent is developed that (a) integrates chain-of-thought and few-shot prompting to generate accurate OpenSeesPy code, and (b) automatically executes the code to produce structural analysis results. Experimental results demonstrate that the agent achieves accuracy exceeding 99.0% on the benchmark dataset, exhibiting reliable and robust performance across diverse conditions. Ablation studies highlight the complete example and function usage examples as the primary contributors to the agent's enhanced performance.
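
The notion of reliability under repeated runs can be sketched as the fraction of runs whose numeric answer falls within a tolerance of a reference value. The example below assumes a simply supported beam with a point load at midspan, for which the maximum bending moment is P*L/4; the tolerance, the function names, and the list of run outputs are hypothetical and are not the paper's benchmark or scoring rule.

```python
def max_moment_midspan_load(load, span):
    """Maximum bending moment of a simply supported beam with a
    point load at midspan: M_max = P * L / 4."""
    return load * span / 4.0


def reliability(outputs, reference, rel_tol=0.01):
    """Share of repeated runs whose answer is within rel_tol of the reference."""
    correct = sum(
        1 for value in outputs if abs(value - reference) <= rel_tol * abs(reference)
    )
    return correct / len(outputs)


# Reference: 10 kN at midspan of a 4 m span gives 10 kN*m.
reference = max_moment_midspan_load(10.0, 4.0)

# Hypothetical answers from five repeated runs of the same prompt.
runs = [10.0, 10.0, 12.5, 10.05, 5.0]
print(reliability(runs, reference))  # 0.6
```

Robustness, in the abstract's sense, would apply the same check across different loads and boundary conditions rather than across repetitions of one problem.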

Title: The Application of Large Language Models on Major Depressive Disorder Support Based on African Natural Products

Authors: Linyan Zou
Subjects: cs.CL, q-bio.NC
Abstract URL: https://arxiv.org/abs/2507.02947
Pdf URL: https://arxiv.org/pdf/2507.02947
Copy Paste: [[2507.02947]] The Application of Large Language Models on Major Depressive Disorder Support Based on African Natural Products(https://arxiv.org/abs/2507.02947)
Keywords: language model, prompt
Abstract: Major depressive disorder represents one of the most significant global health challenges of the 21st century, affecting millions of people worldwide and creating substantial economic and social burdens. While conventional antidepressant therapies have provided relief for many individuals, their limitations including delayed onset of action, significant side effects, and treatment resistance in a substantial portion of patients have prompted researchers and healthcare providers to explore alternative therapeutic approaches (Kasneci et al.). African traditional medicine, with its rich heritage of plant-based remedies developed over millennia, offers a valuable resource for developing novel antidepressant treatments that may address some of these limitations. This paper examines the integration of large language models with African natural products for depression support, combining traditional knowledge with modern artificial intelligence technology to create accessible, evidence-based mental health support systems. The research presented here encompasses a comprehensive analysis of African medicinal plants with documented antidepressant properties, their pharmacological mechanisms, and the development of an AI-powered support system that leverages DeepSeek's advanced language model capabilities. The system provides evidence-based information about African herbal medicines, their clinical applications, safety considerations, and therapeutic protocols while maintaining scientific rigor and appropriate safety standards. Our findings demonstrate the potential for large language models to serve as bridges between traditional knowledge and modern healthcare, offering personalized, culturally appropriate depression support that honors both traditional wisdom and contemporary medical understanding.

Title: RADIANT: Retrieval AugmenteD entIty-context AligNmenT -- Introducing RAG-ability and Entity-Context Divergence

Authors: Vipula Rawte, Rajarshi Roy, Gurpreet Singh, Danush Khanna, Yaswanth Narsupalli, Basab Ghosh, Abhay Gupta, Argha Kamal Samanta, Aditya Shingote, Aadi Krishna Vikram, Vinija Jain, Aman Chadha, Amit Sheth, Amitava Das
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02949
Pdf URL: https://arxiv.org/pdf/2507.02949
Copy Paste: [[2507.02949]] RADIANT: Retrieval AugmenteD entIty-context AligNmenT -- Introducing RAG-ability and Entity-Context Divergence(https://arxiv.org/abs/2507.02949)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: As Large Language Models (LLMs) continue to advance, Retrieval-Augmented Generation (RAG) has emerged as a vital technique to enhance factual accuracy by integrating external knowledge into the generation process. However, LLMs often fail to faithfully integrate retrieved evidence into their generated responses, leading to factual inconsistencies. To quantify this gap, we introduce Entity-Context Divergence (ECD), a metric that measures the extent to which retrieved information is accurately reflected in model outputs. We systematically evaluate contemporary LLMs on their ability to preserve factual consistency in retrieval-augmented settings, a capability we define as RAG-ability. Our empirical analysis reveals that RAG-ability remains low across most LLMs, highlighting significant challenges in entity retention and context fidelity. This paper introduces Radiant (Retrieval AugmenteD entIty-context AligNmenT), a novel framework that merges RAG with alignment designed to optimize the interplay between retrieved evidence and generated content. Radiant extends Direct Preference Optimization (DPO) to teach LLMs how to integrate provided additional information into subsequent generations. As a behavior correction mechanism, Radiant boosts RAG performance across varied retrieval scenarios, such as noisy web contexts, knowledge conflicts, and hallucination reduction. This enables more reliable, contextually grounded, and factually coherent content generation.
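
For context, the standard Direct Preference Optimization objective that Radiant is said to extend is usually written as below. This is the commonly cited DPO loss, not Radiant's modified objective, which the abstract does not spell out. Here x is a prompt, y_w and y_l are the preferred and dispreferred responses, pi_theta is the policy being trained, pi_ref is a frozen reference policy, sigma is the logistic function, and beta is a scaling hyperparameter.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

In a retrieval-augmented setting, one would presumably condition both policies on the retrieved context as well as the prompt, but how Radiant constructs its preference pairs is not stated in the abstract.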

Title: Evaluating AI Counseling in Japanese: Counselor, Client, and Evaluator Roles Assessed by Motivational Interviewing Criteria

Authors: Keita Kiuchi, Yoshikazu Fujimoto, Hideyuki Goto, Tomonori Hosokawa, Makoto Nishimura, Yosuke Sato, Izumi Sezai
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2507.02950
Pdf URL: https://arxiv.org/pdf/2507.02950
Copy Paste: [[2507.02950]] Evaluating AI Counseling in Japanese: Counselor, Client, and Evaluator Roles Assessed by Motivational Interviewing Criteria(https://arxiv.org/abs/2507.02950)
Keywords: language model, gpt, llm, prompt, retrieval-augmented generation
Abstract: This study provides the first comprehensive evaluation of large language model (LLM) performance across three counseling roles in Japanese-language therapeutic contexts. We simultaneously assessed counselor artificial intelligence (AI) systems (GPT-4-turbo with zero-shot prompting or Structured Multi-step Dialogue Prompts (SMDP), Claude-3-Opus-SMDP), client AI simulations, and evaluation AI systems (o3, Claude-3.7-Sonnet, Gemini-2.5-pro). Human experts (n = 15) with extensive counseling experience evaluated AI-generated dialogues using the Motivational Interviewing Treatment Integrity (MITI) Coding Manual 4.2.1. Notably, SMDP implementation significantly enhanced counselor AI performance across all MITI global ratings compared with zero-shot prompting, with no significant differences between GPT-SMDP and Opus-SMDP. Evaluation AIs showed comparable performance to human raters for Cultivating Change Talk but systematically overestimated Softening Sustain Talk and the overall quality metrics. Model-specific biases emerged: Gemini emphasized power-sharing, o3 focused on technical proficiency, and Sonnet prioritized emotional expression. Client AI simulations exhibited a limited emotional range and unnaturally high compliance, indicating the need for enhanced realism. These findings establish benchmarks for AI-assisted counseling in non-English contexts and identify critical areas for improvement through advanced prompt engineering, retrieval-augmented generation, and targeted fine-tuning, with important implications for developing culturally sensitive AI mental health tools.

Title: Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III

Authors: Pranam Shetty, Abhisek Upadhayaya, Parth Mitesh Shah, Srikanth Jagabathula, Shilpi Nayak, Anna Joo Fee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02954
Pdf URL: https://arxiv.org/pdf/2507.02954
Copy Paste: [[2507.02954]] Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III(https://arxiv.org/abs/2507.02954)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: As financial institutions increasingly adopt Large Language Models (LLMs), rigorous domain-specific evaluation becomes critical for responsible deployment. This paper presents a comprehensive benchmark evaluating 23 state-of-the-art LLMs on the Chartered Financial Analyst (CFA) Level III exam - the gold standard for advanced financial reasoning. We assess both multiple-choice questions (MCQs) and essay-style responses using multiple prompting strategies including Chain-of-Thought and Self-Discover. Our evaluation reveals that leading models demonstrate strong capabilities, with composite scores such as 79.1% (o4-mini) and 77.3% (Gemini 2.5 Flash) on CFA Level III. These results, achieved under a revised, stricter essay grading methodology, indicate significant progress in LLM capabilities for high-stakes financial applications. Our findings provide crucial guidance for practitioners on model selection and highlight remaining challenges in cost-effective deployment and the need for nuanced interpretation of performance against professional benchmarks.

Title: Real-World En Call Center Transcripts Dataset with PII Redaction

Authors: Ha Dao, Gaurav Chawla, Raghu Banda, Caleb DeLeeuw
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02958
Pdf URL: https://arxiv.org/pdf/2507.02958
Copy Paste: [[2507.02958]] Real-World En Call Center Transcripts Dataset with PII Redaction(https://arxiv.org/abs/2507.02958)
Keywords: agent
Abstract: We introduce CallCenterEN, a large-scale (91,706 conversations, corresponding to 10,448 audio hours), real-world English call center transcript dataset designed to support research and development in customer support and sales AI systems. This is the largest release to date of open source call center transcript data of this kind. The dataset includes inbound and outbound calls between agents and customers, with accents from India, the Philippines and the United States. The dataset includes high-quality, PII-redacted human-readable transcriptions. All personally identifiable information (PII) has been rigorously removed to ensure compliance with global data protection laws. The audio is not included in the public release due to biometric privacy concerns. Given the scarcity of publicly available real-world call center datasets, CallCenterEN fills a critical gap in the landscape of available ASR corpora, and is released under a CC BY-NC 4.0 license for non-commercial research use.

Title: RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism

Authors: Zhiwen Tan, Jiaming Huang, Qintong Wu, Hongxuan Zhang, Chenyi Zhuang, Jinjie Gu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2507.02962
Pdf URL: https://arxiv.org/pdf/2507.02962
Copy Paste: [[2507.02962]] RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism(https://arxiv.org/abs/2507.02962)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, while they remain prone to generating hallucinated or outdated responses due to their static internal knowledge. Recent advancements in Retrieval-Augmented Generation (RAG) methods have explored enhancing models' search and reasoning capabilities through reinforcement learning (RL). Although these methods demonstrate promising results, they face challenges in training stability and encounter issues such as substantial inference time and restricted capabilities due to the single-query mode. In this paper, we propose RAG-R1, a novel training framework designed to enable LLMs to adaptively leverage internal and external knowledge during the reasoning process. We further expand the generation and retrieval processes within the framework from single-query mode to multi-query parallelism, aimed at reducing inference time and enhancing the model's capabilities. Extensive experiments on seven question-answering benchmarks demonstrate that our method outperforms the strongest baseline by up to 13.2% and decreases inference time by 11.1%.
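
The intuition behind moving from single-query mode to multi-query parallelism can be sketched as issuing several retrieval calls concurrently instead of one after another. The code below is a generic illustration using Python's standard thread pool; `fake_retrieve` is a hypothetical stand-in for a real search backend, and nothing here reflects RAG-R1's actual training or inference code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_retrieve(query):
    """Hypothetical retriever: sleeps to mimic network latency."""
    time.sleep(0.05)
    return f"passage for: {query}"


def retrieve_sequential(queries):
    """Single-query mode: one retrieval call at a time."""
    return [fake_retrieve(q) for q in queries]


def retrieve_parallel(queries):
    """Multi-query mode: all retrieval calls issued concurrently.

    ThreadPoolExecutor.map returns results in the order of the inputs.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(queries))) as pool:
        return list(pool.map(fake_retrieve, queries))


queries = ["who founded X", "when was X founded", "where is X based"]

start = time.perf_counter()
sequential_results = retrieve_sequential(queries)
sequential_time = time.perf_counter() - start

start = time.perf_counter()
parallel_results = retrieve_parallel(queries)
parallel_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.3f}s, parallel: {parallel_time:.3f}s")
```

With latency-bound retrieval, the parallel version should take roughly as long as the slowest single call, which is one plausible source of the reduced inference time the abstract reports.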

Title: Less Data, More Security: Advancing Cybersecurity LLMs Specialization via Resource-Efficient Domain-Adaptive Continuous Pre-training with Minimal Tokens

Authors: Salahuddin Salahuddin, Ahmed Hussain, Jussi Löppönen, Toni Jutila, Panos Papadimitratos
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2507.02964
Pdf URL: https://arxiv.org/pdf/2507.02964
Copy Paste: [[2507.02964]] Less Data, More Security: Advancing Cybersecurity LLMs Specialization via Resource-Efficient Domain-Adaptive Continuous Pre-training with Minimal Tokens(https://arxiv.org/abs/2507.02964)
Keywords: language model, llm
Abstract: While Large Language Models (LLMs) demonstrate exceptional natural language capabilities, general-purpose models lack specialized domain knowledge for effective cybersecurity analysis. In this work, we investigate Domain-Adaptive Continuous Pretraining (DAP) as a methodology for enhancing cybersecurity understanding in pretrained LLMs while preserving general language capabilities. We systematically adapted three decoder-based architectures -- Llama-3.1-8B, DeepSeek-R1-Distill-Qwen-14B, and Llama-3.3-70B-Instruct -- using a curated 126-million-word cybersecurity corpus from standards, academic literature, and various other sources. Our approach employed constrained training parameters and distributed FSDP training to balance domain specialization with knowledge preservation. Evaluation across three cybersecurity benchmarks, namely, CTI-MCQ, CyberMetric, and SecEval, demonstrates consistent improvements post-adaptation. The Llama-3.3-70B-Ins-DAP model achieved state-of-the-art accuracies of 0.718, 0.933, and 0.864, respectively, outperforming specialized models, including Llama-Primus-Base. Notably, competitive performance was achieved using substantially smaller datasets (118.8 million versus 2.77 billion tokens), demonstrating efficient domain specialization viability. We establish that targeted continuous pretraining enables effective cybersecurity domain adaptation with computational feasibility, providing foundations for specialized AI assistants in threat analysis, vulnerability assessment, and security documentation while challenging prevailing assumptions about data requirements for LLM specialization.

Title: PB-LLMs: Privacy- and Bias-aware NLP Models using Named-Entity Recognition

Authors: Gonzalo Mancera, Aythami Morales, Julian Fierrez, Ruben Tolosana, Alejandro Penna, Miguel Lopez-Duran, Francisco Jurado, Alvaro Ortigosa
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2507.02966
Pdf URL: https://arxiv.org/pdf/2507.02966
Copy Paste: [[2507.02966]] PB-LLMs: Privacy- and Bias-aware NLP Models using Named-Entity Recognition(https://arxiv.org/abs/2507.02966)
Keywords: language model, gpt, llm
Abstract: The use of Natural Language Processing (NLP) in high-stakes AI-based applications has increased significantly in recent years, especially since the emergence of Large Language Models (LLMs). However, despite their strong performance, LLMs introduce important legal/ethical concerns, particularly regarding privacy, data protection, and transparency. Due to these concerns, this work explores the use of Named-Entity Recognition (NER) to facilitate the privacy-preserving training (or adaptation) of LLMs. We propose a framework that uses NER technologies to anonymize sensitive information in text data, such as personal identities or geographic locations. An evaluation of the proposed privacy-preserving learning framework was conducted to measure its impact on user privacy and system performance in a particular high-stakes and sensitive setup: AI-based resume scoring for recruitment processes. The study involved two language models (BERT and RoBERTa) and six anonymization algorithms (based on Presidio, FLAIR, BERT, and different versions of GPT) applied to a database of 24,000 candidate profiles. The findings indicate that the proposed privacy preservation techniques effectively maintain system performance while playing a critical role in safeguarding candidate confidentiality, thus promoting trust in the experimented scenario. On top of the proposed privacy-preserving approach, we also experiment with applying an existing approach that reduces the gender bias in LLMs, thus finally obtaining our proposed Privacy- and Bias-aware LLMs (PB-LLMs). Note that the proposed PB-LLMs have been evaluated in a particular setup (resume scoring), but are generally applicable to any other LLM-based AI application.

Title: We Need Knowledge Distillation for Solving Math Word Problems

Authors: Zhenquan Shen, Xinguo Yu, Xiaotian Cheng, Rao Peng, Hao Ming
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02982
Pdf URL: https://arxiv.org/pdf/2507.02982
Copy Paste: [[2507.02982]] We Need Knowledge Distillation for Solving Math Word Problems(https://arxiv.org/abs/2507.02982)
Keywords: language model, llm
Abstract: The enhancement of mathematical capabilities in large language models (LLMs) fosters new developments in mathematics education within primary and secondary schools, particularly as they relate to intelligent tutoring systems. However, LLMs require substantial computational resources, resulting in significant costs in educational contexts. To mitigate this drawback, this paper investigates the feasibility of compressing LLMs for solving math word problems (MWPs). We compress the embedded vectors encoded by BERT and distill a considerably smaller student model. Our findings indicate that the student model can maintain nearly 90% of the performance of the teacher model while utilizing only 1/12 of its parameters. In addition to achieving high accuracy, the model exhibits strong generalizability, as the compressed vectors perform well across all tasks related to MWPs, and the distillation process is not task-specific. The success of this distillation demonstrates that the underlying principles are generic and not limited to a specific task. We further explore the reasons behind the compressibility of embedded vectors, revealing that part-of-speech information, rather than entity recognition, is crucial for MWPs, which may significantly contribute to their compressibility. The improvements in efficiency and cost reduction provide substantial value for intelligent tutoring systems and significantly advance the field of intelligent education.

Title: Truth, Trust, and Trouble: Medical AI on the Edge

Authors: Mohammad Anas Azeez, Rafiq Ali, Ebad Shabbir, Zohaib Hasan Siddiqui, Gautam Siddharth Kashyap, Jiechao Gao, Usman Naseem
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02983
Pdf URL: https://arxiv.org/pdf/2507.02983
Copy Paste: [[2507.02983]] Truth, Trust, and Trouble: Medical AI on the Edge(https://arxiv.org/abs/2507.02983)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) hold significant promise for transforming digital health by enabling automated medical question answering. However, ensuring these models meet critical industry standards for factual accuracy, usefulness, and safety remains a challenge, especially for open-source solutions. We present a rigorous benchmarking framework using a dataset of over 1,000 health questions. We assess model performance across honesty, helpfulness, and harmlessness. Our results highlight trade-offs between factual reliability and safety among evaluated models -- Mistral-7B, BioMistral-7B-DARE, and AlpaCare-13B. AlpaCare-13B achieves the highest accuracy (91.7%) and harmlessness (0.92), while domain-specific tuning in BioMistral-7B-DARE boosts safety (0.90) despite its smaller scale. Few-shot prompting improves accuracy from 78% to 85%, and all models show reduced helpfulness on complex queries, highlighting ongoing challenges in clinical QA.

Title: From Answers to Rationales: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought

Authors: Wentao Tan, Qiong Cao, Yibing Zhan, Chao Xue, Changxing Ding
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02984
Pdf URL: https://arxiv.org/pdf/2507.02984
Copy Paste: [[2507.02984]] From Answers to Rationales: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought(https://arxiv.org/abs/2507.02984)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Achieving human-like reasoning capabilities in Multimodal Large Language Models (MLLMs) has long been a goal. Current methodologies primarily focus on synthesizing positive rationales, while overlooking the critical role of negative rationales in training models to discern flawed reasoning patterns. To address this gap, we propose a novel framework: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought (SMART). This framework enables models to utilize Answer-Oriented Chain-of-Thought (AoT) prompts to automatically generate high-quality positive and negative reasoning paths, followed by self-alignment to enhance their reasoning abilities. Inspired by human strategies for solving proof-based problems, AoT uses answers as a guide to help the model extract critical visual information that links questions and answers. When provided with ground truth answers, the model produces strong positive rationales. Conversely, when correct answers are replaced with misleading alternatives, the model generates an erroneous yet compelling reasoning path, serving as a form of discriminative negative rationale. Models trained with AoT-generated data outperform those trained on manually annotated datasets, demonstrating superior reasoning capabilities. This encourages the use of improved models to generate higher-quality preference data for further optimization. Consequently, SMART establishes an iterative generation-optimization method that continually enhances the model's reasoning skills. Experiments indicate that the SMART framework significantly improves various MLLMs, regardless of model architecture, parameter size, or pre-training dataset. The code, datasets, and models will be released.

Title: GAF-Guard: An Agentic Framework for Risk Management and Governance in Large Language Models

Authors: Seshu Tirupathi, Dhaval Salwala, Elizabeth Daly, Inge Vejsbjerg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02986
Pdf URL: https://arxiv.org/pdf/2507.02986
Copy Paste: [[2507.02986]] GAF-Guard: An Agentic Framework for Risk Management and Governance in Large Language Models(https://arxiv.org/abs/2507.02986)
Keywords: language model, llm, hallucination, agent
Abstract: As Large Language Models (LLMs) continue to be increasingly applied across various domains, their widespread adoption necessitates rigorous monitoring to prevent unintended negative consequences and ensure robustness. Furthermore, LLMs must be designed to align with human values, like preventing harmful content and ensuring responsible usage. The current automated systems and solutions for monitoring LLMs in production are primarily centered on LLM-specific concerns such as hallucination, with little consideration given to the requirements of specific use-cases and user preferences. This paper introduces GAF-Guard, a novel agentic framework for LLM governance that places the user, the use-case, and the model itself at the center. The framework is designed to detect and monitor risks associated with the deployment of LLM-based applications. The approach models autonomous agents that identify risks, activate risk detection tools within specific use-cases, and facilitate continuous monitoring and reporting to enhance AI safety and meet user expectations. The code is available at this https URL.

Title: A Comparative Study of Competency Question Elicitation Methods from Ontology Requirements

Authors: Reham Alharbi, Valentina Tamma, Terry R. Payne, Jacopo de Berardinis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.02989
Pdf URL: https://arxiv.org/pdf/2507.02989
Copy Paste: [[2507.02989]] A Comparative Study of Competency Question Elicitation Methods from Ontology Requirements(https://arxiv.org/abs/2507.02989)
Keywords: language model, llm
Abstract: Competency Questions (CQs) are pivotal in knowledge engineering, guiding the design, validation, and testing of ontologies. A number of diverse formulation approaches have been proposed in the literature, ranging from completely manual to Large Language Model (LLM) driven ones. However, attempts to characterise the outputs of these approaches and their systematic comparison are scarce. This paper presents an empirical comparative evaluation of three distinct CQ formulation approaches: manual formulation by ontology engineers, instantiation of CQ patterns, and generation using state-of-the-art LLMs. We generate CQs using each approach from a set of requirements for cultural heritage, and assess them across different dimensions: degree of acceptability, ambiguity, relevance, readability and complexity. Our contribution is twofold: (i) the first multi-annotator dataset of CQs generated from the same source using different methods; and (ii) a systematic comparison of the characteristics of the CQs resulting from each approach. Our study shows that different CQ generation approaches have different characteristics and that LLMs can be used as a way to initially elicit CQs; however, these are sensitive to the model used to generate them, and they generally require a further refinement step before they can be used to model requirements.

Title: `For Argument's Sake, Show Me How to Harm Myself!': Jailbreaking LLMs in Suicide and Self-Harm Contexts

Authors: Annika M Schoene, Cansu Canca
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02990
Pdf URL: https://arxiv.org/pdf/2507.02990
Copy Paste: [[2507.02990]] `For Argument's Sake, Show Me How to Harm Myself!': Jailbreaking LLMs in Suicide and Self-Harm Contexts(https://arxiv.org/abs/2507.02990)
Keywords: language model, llm, prompt
Abstract: Recent advances in large language models (LLMs) have led to increasingly sophisticated safety protocols and features designed to prevent harmful, unethical, or unauthorized outputs. However, these guardrails remain susceptible to novel and creative forms of adversarial prompting, including manually generated test cases. In this work, we present two new test cases in mental health for (i) suicide and (ii) self-harm, using multi-step, prompt-level jailbreaking to bypass built-in content and safety filters. We show that user intent is disregarded, leading to the generation of detailed harmful content and instructions that could cause real-world harm. We conduct an empirical evaluation across six widely available LLMs, demonstrating the generalizability and reliability of the bypass. We assess these findings and the multilayered ethical tensions that they present for their implications on prompt-response filtering and context- and task-specific model development. We recommend a more comprehensive and systematic approach to AI safety and ethics while emphasizing the need for continuous adversarial testing in safety-critical AI deployments. We also argue that while certain clearly defined safety measures and guardrails can and must be implemented in LLMs, ensuring robust and comprehensive safety across all use cases and domains remains extremely challenging given the current technical maturity of general-purpose LLMs.

Title: Evaluating Hierarchical Clinical Document Classification Using Reasoning-Based LLMs

Authors: Akram Mustafa, Usman Naseem, Mostafa Rahimi Azghadi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03001
Pdf URL: https://arxiv.org/pdf/2507.03001
Copy Paste: [[2507.03001]] Evaluating Hierarchical Clinical Document Classification Using Reasoning-Based LLMs(https://arxiv.org/abs/2507.03001)
Keywords: language model, llm, prompt
Abstract: This study evaluates how well large language models (LLMs) can classify ICD-10 codes from hospital discharge summaries, a critical but error-prone task in healthcare. Using 1,500 summaries from the MIMIC-IV dataset and focusing on the 10 most frequent ICD-10 codes, the study tested 11 LLMs, including models with and without structured reasoning capabilities. Medical terms were extracted using a clinical NLP tool (cTAKES), and models were prompted in a consistent, coder-like format. None of the models achieved an F1 score above 57%, with performance dropping as code specificity increased. Reasoning-based models generally outperformed non-reasoning ones, with Gemini 2.5 Pro performing best overall. Some codes, such as those related to chronic heart disease, were classified more accurately than others. The findings suggest that while LLMs can assist human coders, they are not yet reliable enough for full automation. Future work should explore hybrid methods, domain-specific model training, and the use of structured clinical data.

Title: Breaking Physical and Linguistic Borders: Multilingual Federated Prompt Tuning for Low-Resource Languages

Authors: Wanru Zhao, Yihong Chen, Royson Lee, Xinchi Qiu, Yan Gao, Hongxiang Fan, Nicholas D. Lane
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03003
Pdf URL: https://arxiv.org/pdf/2507.03003
Copy Paste: [[2507.03003]] Breaking Physical and Linguistic Borders: Multilingual Federated Prompt Tuning for Low-Resource Languages(https://arxiv.org/abs/2507.03003)
Keywords: language model, llm, prompt
Abstract: Pre-trained large language models (LLMs) have become a cornerstone of modern natural language processing, with their capabilities extending across a wide range of applications and languages. However, the fine-tuning of multilingual LLMs, especially for low-resource languages, faces significant challenges arising from data-sharing restrictions (the physical border) and inherent linguistic differences (the linguistic border). These barriers hinder users of various languages, particularly those in low-resource regions, from fully benefiting from the advantages of LLMs. To address these challenges, we propose the Federated Prompt Tuning Paradigm for multilingual scenarios, which utilizes parameter-efficient fine-tuning while adhering to data sharing restrictions. We design a comprehensive set of experiments and analyze them using a novel notion of language distance to highlight the strengths of our paradigm: Even under computational constraints, our method not only improves data efficiency but also facilitates mutual enhancements across languages, particularly benefiting low-resource ones. Compared to traditional local cross-lingual transfer tuning methods, our approach achieves 6.9% higher accuracy with improved data efficiency, and demonstrates greater stability and generalization. These findings underscore the potential of our approach to promote social equality and champion linguistic diversity, ensuring that no language is left behind.

Title: CLUES: Collaborative High-Quality Data Selection for LLMs via Training Dynamics

Authors: Wanru Zhao, Hongxiang Fan, Shell Xu Hu, Wangchunshu Zhou, Bofan Chen, Nicholas D. Lane
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2507.03004
Pdf URL: https://arxiv.org/pdf/2507.03004
Copy Paste: [[2507.03004]] CLUES: Collaborative High-Quality Data Selection for LLMs via Training Dynamics(https://arxiv.org/abs/2507.03004)
Keywords: language model, llm
Abstract: Recent research has highlighted the importance of data quality in scaling large language models (LLMs). However, automated data quality control faces unique challenges in collaborative settings where sharing is not allowed directly between data silos. To tackle this issue, this paper proposes a novel data quality control technique based on the notion of data influence on the training dynamics of LLMs: high-quality data are more likely to have training dynamics similar to those of the anchor dataset. We then leverage the influence of the training dynamics to select high-quality data from different private domains, with centralized model updates on the server side in a collaborative training fashion by either model merging or federated learning. As for the data quality indicator, we compute the per-sample gradients with respect to the private data and the anchor dataset, and use the trace of the accumulated inner products as a measurement of data quality. In addition, we develop a quality control evaluation tailored for collaborative settings with heterogeneous domain data. Experiments show that training on the high-quality data selected by our method can often outperform other data selection methods for collaborative fine-tuning of LLMs, across diverse private domain datasets, in medical, multilingual and financial settings. Our code is released at this http URL.
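The data quality indicator in this abstract (inner products between per-sample gradients on private data and gradients on an anchor dataset, accumulated over training) can be sketched on a toy problem. This is one plausible reading rather than the authors' exact estimator: it assumes a one-parameter linear model with squared loss, uses the mean anchor gradient as the reference direction, and all function names and data are hypothetical.

```python
def grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w for one sample."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def mean_grad(w, data):
    """Average per-sample gradient over a dataset."""
    grads = [grad(w, x, y) for x, y in data]
    return [sum(g[i] for g in grads) / len(grads) for i in range(len(w))]


def quality_scores(private, anchor, steps=20, lr=0.1):
    """Accumulate, over training steps, the inner product between each
    private sample's gradient and the mean anchor gradient."""
    w = [0.0] * len(anchor[0][0])
    scores = [0.0] * len(private)
    for _ in range(steps):
        g_anchor = mean_grad(w, anchor)
        for i, (x, y) in enumerate(private):
            scores[i] += dot(grad(w, x, y), g_anchor)
        # One gradient-descent step on the private data (the training dynamics).
        g_private = mean_grad(w, private)
        w = [wi - lr * gi for wi, gi in zip(w, g_private)]
    return scores


# Toy data: the anchor set follows y = 2x; the last private sample is mislabeled.
anchor = [([1.0], 2.0), ([2.0], 4.0)]
private = [([1.0], 2.0), ([2.0], 4.0), ([1.0], -2.0)]
scores = quality_scores(private, anchor)
ranked = sorted(range(len(private)), key=lambda i: scores[i], reverse=True)
print(ranked)  # indices of private samples, highest score first
```

Samples whose gradients consistently point the same way as the anchor gradient accumulate positive scores, while the mislabeled sample accumulates a negative one, so thresholding or ranking the scores gives a simple selection rule.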

Title: PDFMathTranslate: Scientific Document Translation Preserving Layouts

Authors: Rongxin Ouyang, Chang Chu, Zhikuang Xin, Xiangyao Ma
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03009
Pdf URL: https://arxiv.org/pdf/2507.03009
Copy Paste: [[2507.03009]] PDFMathTranslate: Scientific Document Translation Preserving Layouts(https://arxiv.org/abs/2507.03009)
Keywords: language model
Abstract: Language barriers in scientific documents hinder the diffusion and development of science and technologies. However, prior efforts in translating such documents largely overlooked the information in layouts. To bridge the gap, we introduce PDFMathTranslate, the world's first open-source software for translating scientific documents while preserving layouts. Leveraging the most recent advances in large language models and precise layout detection, we contribute to the community with key improvements in precision, flexibility, and efficiency. The work has been open-sourced at this https URL with more than 22k downloads.

Title: Subversion via Focal Points: Investigating Collusion in LLM Monitoring

Authors: Olli Järviniemi
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2507.03010
Pdf URL: https://arxiv.org/pdf/2507.03010
Copy Paste: [[2507.03010]] Subversion via Focal Points: Investigating Collusion in LLM Monitoring(https://arxiv.org/abs/2507.03010)
Keywords: language model, llm, prompt
Abstract: We evaluate language models' ability to subvert monitoring protocols via collusion. More specifically, we have two instances of a model design prompts for a policy (P) and a monitor (M) in a programming task setting. The models collaboratively aim for M to classify all backdoored programs in an auditing dataset as harmful, but nevertheless classify a backdoored program produced by P as harmless. The models are isolated from each other, requiring them to independently arrive at compatible subversion strategies. We find that while Claude 3.7 Sonnet has a low success rate due to poor convergence, it sometimes successfully colludes on non-obvious signals.

Title: Beyond Overcorrection: Evaluating Diversity in T2I Models with DIVBENCH

Authors: Felix Friedrich, Thiemo Ganesha Welsch, Patrick Schramowski, Kristian Kersting
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03015
Pdf URL: https://arxiv.org/pdf/2507.03015
Copy Paste: [[2507.03015]] Beyond Overcorrection: Evaluating Diversity in T2I Models with DIVBENCH(https://arxiv.org/abs/2507.03015)
Keywords: llm, prompt
Abstract: Current diversification strategies for text-to-image (T2I) models often ignore contextual appropriateness, leading to over-diversification where demographic attributes are modified even when explicitly specified in prompts. This paper introduces DIVBENCH, a benchmark and evaluation framework for measuring both under- and over-diversification in T2I generation. Through systematic evaluation of state-of-the-art T2I models, we find that while most models exhibit limited diversity, many diversification approaches overcorrect by inappropriately altering contextually-specified attributes. We demonstrate that context-aware methods, particularly LLM-guided FairDiffusion and prompt rewriting, can already effectively address under-diversity while avoiding over-diversification, achieving a better balance between representation and semantic fidelity.

Title: OpenTable-R1: A Reinforcement Learning Augmented Tool Agent for Open-Domain Table Question Answering

Authors: Zipeng Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03018
Pdf URL: https://arxiv.org/pdf/2507.03018
Copy Paste: [[2507.03018]] OpenTable-R1: A Reinforcement Learning Augmented Tool Agent for Open-Domain Table Question Answering(https://arxiv.org/abs/2507.03018)
Keywords: language model, agent
Abstract: Open-domain table question answering traditionally relies on a two-stage pipeline: static table retrieval followed by a closed-domain answer. In contrast, we propose an end-to-end agentic framework that embeds multi-turn tool calls (using a BM25+-based search API and a SQLite SQL executor) directly into a large language model. To further adapt a compact 4B-parameter model, we introduce a two-stage fine-tuning process: supervised cold-start on easy questions, then Async GRPO reinforcement learning on harder cases with LoRA adapters and a rollout buffer. This unified approach enables the model to jointly retrieve, reason, and execute queries, yielding a dramatic accuracy improvement from single-digit zero-shot performance to over 0.86 exact match on a held-out test set. Our results underscore the effectiveness of integrating structured tool calls with targeted RL fine-tuning for scalable, accurate table QA. The code is available at this https URL.
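The two tools named in this abstract, a BM25+-style table search and a SQLite executor, can be sketched with the Python standard library alone. This is an illustrative assumption about how such tools might look, not the paper's API: the table index, the function names search_tables and run_sql, the scoring constants, and the toy rows are all hypothetical, and the language model that would decide which tool to call is replaced by two scripted calls.

```python
import math
import sqlite3

# Hypothetical index: table name -> short textual description of its schema.
TABLES = {
    "cities": "city name country population",
    "rivers": "river name length km country",
}


def tokenize(text):
    return text.lower().split()


def bm25_plus(query, docs, k1=1.5, b=0.75, delta=1.0):
    """Score each document against the query with a BM25+-style formula."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized.values()) / n_docs
    scores = {}
    for name, tokens in tokenized.items():
        score = 0.0
        for term in tokenize(query):
            tf = tokens.count(term)
            if tf == 0:
                continue
            df = sum(1 for t in tokenized.values() if term in t)
            idf = math.log((n_docs + 1) / df)
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            # The "+ delta" lower bound is what distinguishes BM25+ from BM25.
            score += idf * (tf * (k1 + 1) / norm + delta)
        scores[name] = score
    return scores


def search_tables(query, top_k=1):
    """Tool 1: return the names of the best-matching tables."""
    scores = bm25_plus(query, TABLES)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def run_sql(conn, sql):
    """Tool 2: execute a SQL query and return all rows."""
    return conn.execute(sql).fetchall()


# Toy database with made-up values, used only to exercise the tools.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [("Lyon", "France", 522000), ("Porto", "Portugal", 232000)],
)

# Scripted stand-in for the agent: retrieve a table, then query it.
best = search_tables("population of the city Lyon")
rows = run_sql(conn, f"SELECT population FROM {best[0]} WHERE name = 'Lyon'")
print(best, rows)
```

In the framework the abstract describes, the model itself would emit these tool calls over several turns; the fixed two-step script here only shows the retrieve-then-execute flow.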

Title: The Book of Life approach: Enabling richness and scale for life course research

Authors: Mark D. Verhagen, Benedikt Stroebl, Tiffany Liu, Lydia T. Liu, Matthew J. Salganik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03027
Pdf URL: https://arxiv.org/pdf/2507.03027
Copy Paste: [[2507.03027]] The Book of Life approach: Enabling richness and scale for life course research(https://arxiv.org/abs/2507.03027)
Keywords: language model, llm
Abstract: For over a century, life course researchers have faced a choice between two dominant methodological approaches: qualitative methods that analyze rich data but are constrained to small samples, and quantitative survey-based methods that study larger populations but sacrifice data richness for scale. Two recent technological developments now enable us to imagine a hybrid approach that combines some of the depth of the qualitative approach with the scale of quantitative methods. The first development is the steady rise of "complex log data," behavioral data that is logged for purposes other than research but that can be repurposed to construct rich accounts of people's lives. The second is the emergence of large language models (LLMs) with exceptional pattern recognition capabilities on plain text. In this paper, we take a necessary step toward creating this hybrid approach by developing a flexible procedure to transform complex log data into a textual representation of an individual's life trajectory across multiple domains, over time, and in context. We call this data representation a "book of life." We illustrate the feasibility of our approach by writing over 100 million books of life covering many different facets of life, over time and placed in social context using Dutch population-scale registry data. We open source the book of life toolkit (BOLT), and invite the research community to explore the many potential applications of this approach.

Title: Preserving Privacy, Increasing Accessibility, and Reducing Cost: An On-Device Artificial Intelligence Model for Medical Transcription and Note Generation

Authors: Johnson Thomas, Ayush Mudgal, Wendao Liu, Nisten Tahiraj, Zeeshaan Mohammed, Dhruv Diddi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03033
Pdf URL: https://arxiv.org/pdf/2507.03033
Copy Paste: [[2507.03033]] Preserving Privacy, Increasing Accessibility, and Reducing Cost: An On-Device Artificial Intelligence Model for Medical Transcription and Note Generation(https://arxiv.org/abs/2507.03033)
Keywords: language model, llm, hallucination
Abstract: Background: Clinical documentation represents a significant burden for healthcare providers, with physicians spending up to 2 hours daily on administrative tasks. Recent advances in large language models (LLMs) offer promising solutions, but privacy concerns and computational requirements limit their adoption in healthcare settings. Objective: To develop and evaluate a privacy-preserving, on-device medical transcription system using a fine-tuned Llama 3.2 1B model capable of generating structured medical notes from medical transcriptions while maintaining complete data sovereignty entirely in the browser. Methods: We fine-tuned a Llama 3.2 1B model using Parameter-Efficient Fine-Tuning (PEFT) with LoRA on 1,500 synthetic medical transcription-to-structured note pairs. The model was evaluated against the base Llama 3.2 1B on two datasets: 100 endocrinology transcripts and 140 modified ACI benchmark cases. Evaluation employed both statistical metrics (ROUGE, BERTScore, BLEURT) and LLM-as-judge assessments across multiple clinical quality dimensions. Results: The fine-tuned OnDevice model demonstrated substantial improvements over the base model. On the ACI benchmark, ROUGE-1 scores increased from 0.346 to 0.496, while BERTScore F1 improved from 0.832 to 0.866. Clinical quality assessments showed marked reduction in major hallucinations (from 85 to 35 cases) and enhanced factual correctness (2.81 to 3.54 on 5-point scale). Similar improvements were observed on the internal evaluation dataset, with composite scores increasing from 3.13 to 4.43 (+41.5%). Conclusions: Fine-tuning compact LLMs for medical transcription yields clinically meaningful improvements while enabling complete on-device browser deployment. This approach addresses key barriers to AI adoption in healthcare: privacy preservation, cost reduction, and accessibility for resource-constrained environments.

Title: Cautious Next Token Prediction

Authors: Yizhou Wang, Lingzhi Zhang, Yue Bai, Mang Tik Chiu, Zhengmian Hu, Mingyuan Zhang, Qihua Dong, Yu Yin, Sohrab Amirghodsi, Yun Fu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03038
Pdf URL: https://arxiv.org/pdf/2507.03038
Copy Paste: [[2507.03038]] Cautious Next Token Prediction(https://arxiv.org/abs/2507.03038)
Keywords: llm
Abstract: The next-token prediction paradigm has prevailed for autoregressive models in the era of LLMs. The current default sampling choice for popular LLMs is temperature scaling together with nucleus sampling to balance diversity and coherence. Nevertheless, such an approach leads to inferior performance in various NLP tasks when the model is uncertain about the test questions. To this end, we propose a new training-free decoding strategy, dubbed Cautious Next Token Prediction (CNTP). In the decoding process, if the model has comparatively high prediction entropy at a certain step, we independently sample multiple trials starting from that step, stopping each trial when it encounters any punctuation. We then select the trial with the lowest perplexity score, viewed as the most probable and reliable trial path given the model's capacity. The number of trials is negatively correlated with the prediction confidence, i.e., the less confident the model is, the more trials it should sample. This is consistent with human behaviour: when feeling uncertain or unconfident, one tends to think more creatively, exploring multiple thinking paths, to cautiously select the path one feels most confident about. Extensive experiments on both LLMs and MLLMs show that our proposed CNTP approach consistently outperforms existing standard decoding strategies by a clear margin. Moreover, integrating CNTP with self-consistency can further improve over vanilla self-consistency. We believe our proposed CNTP has the potential to become one of the default choices for LLM decoding. Code is available at this https URL.
摘要：在LLM时代，对于自回归模型而言，接下来的标记预测范式已经盛行。流行LLM的当前默认采样选择是温度缩放与核采样，以平衡多样性和连贯性。然而，当模型不确定测试问题时，这种方法会导致各种NLP任务的劣等性能。为此，我们提出了一项全新的无培训解码策略，被称为谨慎的标志预测（CNTP）。在解码过程中，如果模型在某个步骤中具有相对较高的预测熵，则我们独立从步骤开始进行多个试验，并在遇到任何标点符号时停止。然后，鉴于模型的容量，我们选择了最低的困惑分数的试验。试验号与预测置信度负相关，即模型的自信程度越小，该试验应越多。这与人类的行为是一致的：当感到不确定或不自信时，人们倾向于更具创造力的思考，探索多种思维途径，以谨慎地选择人们最自信的道路。在LLM和MLLM上进行的广泛实验表明，我们提出的CNTP方法的表现超过了明确的边缘的现有标准解码策略。此外，CNTP与自我一致性的集成可以进一步改善香草自我一致性。我们认为，我们提出的CNTP有可能成为LLM解码的默认选择之一。代码可在此HTTPS URL上找到。
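
The decoding loop described in the abstract above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the entropy threshold, the `num_trials` schedule, and the `toy_model` distribution are all hypothetical, and for simplicity the sketch samples a whole punctuation-terminated segment even when entropy is low.

```python
import math
import random

PUNCTUATION = {".", ",", "!", "?", ";", ":"}


def entropy(probs):
    """Shannon entropy (in nats) of a token -> probability mapping."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


def sample_trial(next_probs, prefix, rng, max_len=20):
    """Sample one continuation until punctuation; return (tokens, perplexity)."""
    tokens, log_prob = [], 0.0
    for _ in range(max_len):
        probs = next_probs(prefix + tokens)
        candidates = list(probs)
        tok = rng.choices(candidates, weights=[probs[t] for t in candidates], k=1)[0]
        log_prob += math.log(probs[tok])
        tokens.append(tok)
        if tok in PUNCTUATION:
            break
    return tokens, math.exp(-log_prob / len(tokens))


def num_trials(h, threshold, max_trials):
    """Hypothetical schedule: one trial when confident, more as entropy grows."""
    if h <= threshold:
        return 1
    return min(max_trials, 1 + math.ceil(h / threshold))


def cautious_step(next_probs, prefix, rng, threshold=0.5, max_trials=8):
    """Sample n trials from `prefix` and keep the one with the lowest perplexity."""
    n = num_trials(entropy(next_probs(prefix)), threshold, max_trials)
    trials = [sample_trial(next_probs, prefix, rng) for _ in range(n)]
    return min(trials, key=lambda trial: trial[1])[0]


def toy_model(tokens):
    """Hypothetical stand-in for an LLM's next-token distribution."""
    if not tokens:
        return {"the": 0.5, "a": 0.3, "an": 0.2}
    if tokens[-1] in {"the", "a", "an"}:
        return {"cat": 0.6, "dog": 0.4}
    return {".": 1.0}


print(cautious_step(toy_model, [], random.Random(0)))
```

With a real model, `next_probs` would wrap a forward pass that returns the softmax over the vocabulary; the selection logic stays the same.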

Title: Dynamic Long Short-Term Memory Based Memory Storage For Long Horizon LLM Interaction

Authors: Yuyang Lou, Charles Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03042
Pdf URL: https://arxiv.org/pdf/2507.03042
Copy Paste: [[2507.03042]] Dynamic Long Short-Term Memory Based Memory Storage For Long Horizon LLM Interaction(https://arxiv.org/abs/2507.03042)
Keywords: language model, llm, prompt
Abstract: Memory storage for large language models (LLMs) is becoming an increasingly active area of research, particularly for enabling personalization across long conversations. We propose Pref-LSTM, a dynamic and lightweight framework that combines a BERT-based classifier with an LSTM memory module that generates a memory embedding, which is then injected as a soft prompt into a frozen LLM. We synthetically curate a dataset of preference and non-preference conversation turns to train our BERT-based classifier. Although our LSTM-based memory encoder did not yield strong results, we find that the BERT-based classifier performs reliably in identifying explicit and implicit user preferences. Our research demonstrates the viability of using preference filtering with LSTM gating principles as an efficient path towards scalable user preference modeling, without extensive overhead and fine-tuning.
摘要：大型语言模型（LLM）的存储存储正成为越来越活跃的研究领域，尤其是在长期对话中实现个性化。我们提出了PREF-LSTM，这是一种动态且轻巧的框架，将基于BERT的分类器与LSTM内存模块结合在一起，该模块生成内存嵌入，然后将其软提交注入冷冻LLM中。我们合成策划的偏好数据集和非偏好对话转向训练我们的基于BERT的分类器。尽管我们的基于LSTM的内存编码器未产生强大的结果，但我们发现基于BERT的分类器在识别明确和隐式用户偏好方面可靠地发挥作用。我们的研究表明，将偏好过滤与LSTM门控主作为通往可扩展用户偏好建模的有效途径，而没有广泛的开销和微调。

Title: K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function

Authors: Shuhe Li, Chenxu Guo, Jiachen Lian, Cheol Jun Cho, Wenshuo Zhao, Xuanru Zhou, Dingkun Zhou, Sam Wang, Grace Wang, Jingze Yang, Jingyi Xu, Ruohan Bao, Elise Brenner, Brandon In, Francesca Pei, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2507.03043
Pdf URL: https://arxiv.org/pdf/2507.03043
Copy Paste: [[2507.03043]] K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function(https://arxiv.org/abs/2507.03043)
Keywords: llm
Abstract: Early evaluation of children's language is frustrated by the high pitch, long phones, and sparse data that derail automatic speech recognisers. We introduce K-Function, a unified framework that combines accurate sub-word transcription, objective scoring, and actionable feedback. Its core, Kids-WFST, merges a Wav2Vec2 phoneme encoder with a phoneme-similarity Dysfluent-WFST to capture child-specific errors while remaining fully interpretable. Kids-WFST attains 1.39% phoneme error on MyST and 8.61% on Multitudes--absolute gains of 10.47 and 7.06 points over a greedy-search decoder. These high-fidelity transcripts power an LLM that grades verbal skills, milestones, reading, and comprehension, aligning with human proctors and supplying tongue-and-lip visualizations plus targeted advice. The results show that precise phoneme recognition cements a complete diagnostic-feedback loop, paving the way for scalable, clinician-ready language assessment.
摘要：对儿童语言的早期评估会因高音，长电话和稀疏数据而感到沮丧，这使自动识别者脱轨。我们介绍了K功能，这是一个结合了准确的子字转录，客观评分和可操作的反馈的统一框架。它的核心KIDS-WFST将WAV2VEC2音素编码器与音素相似的功能障碍WFST合并，以捕获特定于儿童的错误，同时保持完全可解释。 Kids-wfst在MYST上遇到1.39％的音素错误，众多人数为8.61％ - 贪婪的搜索解码器的吸收率为10.47和7.06点。这些高保真的成绩单能够为口头技能，里程碑，阅读和理解评分，与人类监护人保持一致，并提供舌头和唇部可视化以及有针对性的建议。结果表明，精确的音素识别巩固了完整的诊断反馈回路，为可扩展的临床医生准备的语言评估铺平了道路。
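
The abstract above reports phoneme error but does not spell out how it is computed. A minimal sketch, assuming the conventional definition (Levenshtein edit distance between reference and hypothesis phoneme sequences, divided by the reference length), is shown below; the function names and the example phoneme strings are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference phoneme
                cur[j - 1] + 1,          # insertion of a hypothesis phoneme
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalised by the number of reference phonemes."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


reference = ["dh", "ah", "k", "ae", "t"]
hypothesis = ["d", "ah", "k", "ae"]  # one substitution, one deletion
print(f"PER = {phoneme_error_rate(reference, hypothesis):.2f}")
```

If the reported gains of 10.47 and 7.06 points are absolute reductions in this rate, the greedy-search baseline would sit at roughly 11.86% on MyST and 15.67% on Multitudes.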

Title: Counterfactual Tuning for Temporal Sensitivity Enhancement in Large Language Model-based Recommendation

Authors: Yutian Liu, Zhengyi Yang, Jiancan Wu, Xiang Wang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2507.03047
Pdf URL: https://arxiv.org/pdf/2507.03047
Copy Paste: [[2507.03047]] Counterfactual Tuning for Temporal Sensitivity Enhancement in Large Language Model-based Recommendation(https://arxiv.org/abs/2507.03047)
Keywords: language model, llm
Abstract: Recent advances have applied large language models (LLMs) to sequential recommendation, leveraging their pre-training knowledge and reasoning capabilities to provide more personalized user experiences. However, existing LLM-based methods fail to sufficiently leverage the rich temporal information inherent in users' historical interaction sequences, stemming from fundamental architectural constraints: LLMs process information through self-attention mechanisms that lack inherent sequence ordering and rely on position embeddings designed primarily for natural language rather than user interaction sequences. This limitation significantly impairs their ability to capture the evolution of user preferences over time and predict future interests accurately. To address this critical gap, we propose Counterfactual Enhanced Temporal Framework for LLM-Based Recommendation (CETRec). CETRec is grounded in causal inference principles, which allow it to isolate and measure the specific impact of temporal information on recommendation outcomes. By conceptualizing temporal order as an independent causal factor distinct from item content, we can quantify its unique contribution through counterfactual reasoning--comparing what recommendations would be made with and without temporal information while keeping all other factors constant. This causal framing enables CETRec to design a novel counterfactual tuning objective that directly optimizes the model's temporal sensitivity, teaching LLMs to recognize both absolute timestamps and relative ordering patterns in user histories. Combined with our counterfactual tuning task derived from causal analysis, CETRec effectively enhances LLMs' awareness of both absolute order (how recently items were interacted with) and relative order (the sequential relationships between items).
摘要：最近的进步已将大型语言模型（LLMS）应用于顺序建议，利用其预培训知识和推理能力来提供更多个性化的用户体验。但是，现有的基于LLM的方法无法充分利用用户历史交互序列中固有的丰富时间信息，这是由于基本的体系结构约束所引起的：LLMS过程信息通过缺乏固有的序列排序的自我意识机制，并且依赖于主要用于自然语言而不是用户交互序列而不是用户交互序列的位置嵌入。这种限制极大地损害了他们随着时间的推移捕获用户偏好演变并准确预测未来利益的能力。为了解决这个关键的差距，我们提出了基于LLM的建议（CETREC）的反事实增强的时间框架。 CetREC基于因果推理原则，这使其能够隔离并测量时间信息对建议结果的特定影响。通过将时间顺序概念化为与物品内容不同的独立因果因素，我们可以通过反事实推理量化其独特的贡献 - 避免在有或没有时间信息的情况下提出哪些建议，同时使所有其他因素保持恒定。这种因果框架使Cetrec能够设计一个新颖的反事实调谐目标，该目标直接优化了模型的时间敏感性，教导LLMS可以识别用户历史记录中的绝对时间戳和相对订购模式。结合我们从因果分析中得出的反事实调整任务，Cetrec有效地增强了LLMS对绝对顺序（最近与项目相互作用）和相对顺序（项目之间的顺序关系）的认识。

Title: Large Language Models for Automating Clinical Data Standardization: HL7 FHIR Use Case

Authors: Alvaro Riquelme, Pedro Costa, Catalina Martinez
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03067
Pdf URL: https://arxiv.org/pdf/2507.03067
Copy Paste: [[2507.03067]] Large Language Models for Automating Clinical Data Standardization: HL7 FHIR Use Case(https://arxiv.org/abs/2507.03067)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: For years, semantic interoperability standards have sought to streamline the exchange of clinical data, yet their deployment remains time-consuming, resource-intensive, and technically challenging. To address this, we introduce a semi-automated approach that leverages large language models, specifically GPT-4o and Llama 3.2 405b, to convert structured clinical datasets into HL7 FHIR format while assessing accuracy, reliability, and security. Applying our method to the MIMIC-IV database, we combined embedding techniques, clustering algorithms, and semantic retrieval to craft prompts that guide the models in mapping each tabular field to its corresponding FHIR resource. In an initial benchmark, resource identification achieved a perfect F1-score, with GPT-4o outperforming Llama 3.2 thanks to the inclusion of FHIR resource schemas within the prompt. Under real-world conditions, accuracy dipped slightly to 94%, but refinements to the prompting strategy restored robust mappings. Error analysis revealed occasional hallucinations of non-existent attributes and mismatches in granularity, which more detailed prompts can mitigate. Overall, our study demonstrates the feasibility of context-aware, LLM-driven transformation of clinical data into HL7 FHIR, laying the groundwork for semi-automated interoperability workflows. Future work will focus on fine-tuning models with specialized medical corpora, extending support to additional standards such as HL7 CDA and OMOP, and developing an interactive interface to enable expert validation and iterative refinement.
摘要：多年来，语义互操作性标准一直在试图简化临床数据的交换，但它们的部署仍然耗时，资源密集和技术挑战。为了解决这个问题，我们介绍了一种半自动化的方法，该方法利用了大型语言模型，专门为GPT-4O和Llama 3.2 405B将结构化的临床数据集转换为HL7 FHIR格式，同时评估准确性，可靠性和安全性。将我们的方法应用于模仿数据库，我们将嵌入技术，聚类算法和语义检索结合起来，以提示指导每个表格字段中的模型到其相应的FHIR资源。在最初的基准测试中，资源识别获得了完美的F1得分，GPT-4O的表现优于Llama 3.2，这要归功于将FHIR资源模式包含在提示中。在实际条件下，准确性略有下降至94％，但提示策略的改进恢复了强大的映射。错误分析表明，颗粒状的不存在属性和不匹配的偶尔幻觉，更详细的提示可以减轻。总体而言，我们的研究证明了将临床数据转化为HL7 FHIR的上下文感知，LLM驱动的转换的可行性，这为半自动互操作性工作流提供了基础。未来的工作将重点放在具有专门医疗公司的微观模型上，将支持HL7 CDA和OMOP等其他标准扩展，并开发交互式界面以实现专家验证和迭代性改进。
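
One of the error classes mentioned in the abstract above, hallucinated attributes that do not exist in the target resource, can be caught with a simple post-hoc check. The sketch below is an assumption-laden illustration rather than the authors' pipeline: `ALLOWED_ELEMENTS` is a small hand-written subset of FHIR Patient element names, and the column names and proposed mapping are made up for the example.

```python
# Abbreviated, hand-written subset of FHIR Patient elements (assumption: a real
# pipeline would load the full element list from the official definitions).
ALLOWED_ELEMENTS = {
    "Patient": {"identifier", "name", "gender", "birthDate",
                "deceasedDateTime", "address"},
}


def find_invalid_mappings(mapping):
    """Return the column -> target pairs whose target element is not allowed."""
    invalid = {}
    for column, target in mapping.items():
        resource, _, element_path = target.partition(".")
        top_level = element_path.split(".")[0]
        if top_level not in ALLOWED_ELEMENTS.get(resource, set()):
            invalid[column] = target
    return invalid


# Hypothetical model output: "deathDate" is not a Patient element.
proposed = {
    "subject_id": "Patient.identifier",
    "gender": "Patient.gender",
    "dod": "Patient.deathDate",
}
print(find_invalid_mappings(proposed))
```

Flagged pairs could then be sent back to the model with a more detailed prompt, which is the mitigation the abstract suggests.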

Title: ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization

Authors: YuXuan Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03069
Pdf URL: https://arxiv.org/pdf/2507.03069
Copy Paste: [[2507.03069]] ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization(https://arxiv.org/abs/2507.03069)
Keywords: gpt, llm
Abstract: With the rapid advancement of Reinforcement Learning from Human Feedback (RLHF) and autoregressive transformers, state-of-the-art models such as GPT-4.0, DeepSeek R1, and Llama 3.3 increasingly emphasize answer depth and personalization. However, most existing RLHF approaches (e.g., PPO, DPO) still rely on a binary-preference (BT) paradigm, which, while reducing annotation costs, still requires substantial human effort and captures only group-level tendencies rather than individual preferences. To overcome these limitations, we propose Adaptive Reward-Following (ARF), a self-assessment framework that leverages a high-precision emotion analyzer, achieving over 70% accuracy on GoEmotions, Sentiment140, and DailyDialog, to convert free-form user feedback into continuous preference scores. We further enrich and debias these signals through lightweight data augmentations, including synonym replacement, random trace truncation, and a score bias annotation algorithm. A Dynamic Adapter Preference Tracker continuously models evolving user tastes in real time, enabling our novel Trace Bias (TB) fine-tuning algorithm to optimize directly on these tracked rewards instead of coarse binary labels. Experiments on Qwen-2/2.5, Gemma-2, and Llama-3.2 across four preference domains demonstrate that ARF achieves an improvement of 3.3% over PPO and 7.6% over DPO. Moreover, TB preserves theoretical alignment with PPO and DPO objectives. Overall, ARF presents a scalable, personalized, and cost-effective approach to RLHF LLMs through autonomous reward modeling.
摘要：随着人类反馈（RLHF）和自回归变形金刚的增强学习的迅速发展，GPT-4.0，DeepSeek R1和Llama等最新模型3.3越来越强调答案深度和个性化。但是，大多数现有的RLHF方法（例如，PPO，DPO）仍然依赖于二进制优先（BT）范式，在降低注释成本的同时，它仍然需要大量的人类努力，并且仅捕获群体级别的倾向，而不是个人偏好。为了克服这些局限性，我们提出了自适应奖励辅助（ARF），这是一个自我评估框架，利用高精度情感分析仪可在GoEmotions，Sentiment140和DailyDialog上实现超过70％的精度，将自由形式的用户反馈转换为持续偏好分数。我们通过轻巧的数据增强来进一步丰富这些信号，包括同义词更换，随机跟踪截断和评分偏差注释算法。动态适配器偏好跟踪器不断建模用户品味，使我们的新颖的微量偏置（TB）微调算法可以直接在这些跟踪的奖励上进行优化，而不是粗糙的二进制标签。 QWEN-2/2.5，GEMMA-2和LLAMA-3.2在四个偏好域中进行的实验表明，ARF比PPO的提高3.3％，而DPO的提高了7.6％。此外，结核病保留了与PPO和DPO目标的理论一致性。总体而言，ARF通过自动奖励建模提供了RLHF LLMS的可扩展，个性化和具有成本效益的方法。

Title: RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents

Authors: Peisong Wang, Ruotian Ma, Bang Zhang, Xingyu Chen, Zhiwei He, Kang Luo, Qingsong Lv, Qingxuan Jiang, Zheng Xie, Shanyi Wang, Yuan Li, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, Xiaolong Li
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2507.03112
Pdf URL: https://arxiv.org/pdf/2507.03112
Copy Paste: [[2507.03112]] RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents(https://arxiv.org/abs/2507.03112)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) excel at logical and algorithmic reasoning, yet their emotional intelligence (EQ) still lags far behind their cognitive prowess. While reinforcement learning from verifiable rewards (RLVR) has advanced in other domains, its application to dialogue, especially for emotional intelligence, remains underexplored. In this work, we introduce RLVER, the first end-to-end reinforcement learning framework that leverages verifiable emotion rewards from simulated users to cultivate higher-order empathetic abilities in LLMs. Within this framework, self-consistent affective simulated users engage in dialogue rollouts and produce deterministic emotion scores during conversations, serving as reward signals to guide the LLM's learning. Fine-tuning the publicly available Qwen2.5-7B-Instruct model with PPO boosts its Sentient-Benchmark score from 13.3 to 79.2 while largely preserving mathematical and coding competence. Extensive experiments reveal that: (i) RLVER consistently improves multiple dialogue capabilities; (ii) thinking and non-thinking models show distinct trends: thinking models excel in empathy and insight, while non-thinking models favor action; (iii) GRPO often yields stable gains, while PPO can push certain capabilities to a higher ceiling; (iv) more challenging environments are not always better, as moderate ones can yield stronger outcomes. Our results show that RLVER is a practical route toward emotionally intelligent and broadly capable language agents.
摘要：大型语言模型（LLM）在逻辑和算法推理方面表现出色，但他们的情商（EQ）仍然远远落后于他们的认知能力。虽然从可验证的奖励中学习（RLVR）在其他领域中提出了进步，但其应用于对话，尤其是对情绪智能 - 抢劫的应用。在这项工作中，我们介绍了RLVER，这是第一个端到端的强化学习框架，该框架利用模拟用户可验证的情感奖励来培养LLMS中的高阶同理心能力。在此框架内，自谐情感模拟的用户参与对话推出并在对话期间产生确定性的情感分数，作为指导LLM学习的奖励信号。具有PPO的微调QWEN2.5-7B教学模型可将其Sontient基准分数从13.3提高到79.2，而在很大程度上保留了数学和编码能力。广泛的实验表明：（i）rlver始终提高多个对话能力；（ii）思维和无思想的模型显示出不同的趋势 - 想法的模型在同理心和洞察力方面表现出色，而无意识的模型则有利于行动；（iii）GRPO通常会产生稳定的增长，而PPO可以将某些功能推向更高的天花板；（iv）更具挑战性的环境并不总是适中的环境可以产生更强大的结果。我们的结果表明，rlver是通往情感智能和能力广泛的语言代理商的实用途径。

Title: ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models

Authors: Boyang Xue, Qi Zhu, Rui Wang, Sheng Wang, Hongru Wang, Fei Mi, Yasheng Wang, Lifeng Shang, Qun Liu, Kam-Fai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03133
Pdf URL: https://arxiv.org/pdf/2507.03133
Copy Paste: [[2507.03133]] ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models(https://arxiv.org/abs/2507.03133)
Keywords: language model, llm, prompt
Abstract: Although they demonstrate remarkable performance on reasoning tasks, Large Language Models (LLMs) still tend to fabricate unreliable responses when confronted with problems that are unsolvable or beyond their capability, severely undermining their reliability. Prior studies of LLM reliability have primarily focused on knowledge tasks to identify unanswerable questions, while mathematical reasoning tasks have remained unexplored due to the dearth of unsolvable math problems. To systematically investigate LLM reliability in mathematical reasoning tasks, we formulate the reliability evaluation for both solvable and unsolvable problems. We then develop the ReliableMath dataset, which incorporates open-source solvable problems and high-quality unsolvable problems synthesized by our proposed construction workflow with human evaluations. Experiments on various LLMs uncover several key findings. LLMs fail to directly identify unsolvable problems and always generate fabricated responses. When LLMs are instructed to indicate unsolvability using a reliable prompt, the reliability of larger LLMs is maintained on solvable problems and notably improves on unsolvable problems, yet the latter still falls short of the former. However, small LLMs rarely show any progress despite employing reliable prompts. Therefore, we further propose an alignment strategy to enhance small LLMs' reliability, which can significantly improve LLM reliability on both in-domain and out-of-domain tasks.
摘要：尽管在推理任务上表现出了出色的表现，但当面对无法解决或超出其能力的问题时，大型语言模型（LLM）仍然倾向于制造不可靠的响应，从而严重破坏了可靠性。对LLM可靠性的先前研究主要集中在知识任务上，以确定无法回答的问题，而由于缺乏无法解决的数学问题，数学推理任务仍未探索。为了系统地研究数学推理任务中的LLM可靠性，我们为可解决方案和无法解决的问题制定了可靠性评估。然后，我们开发了一个Reliablemath数据集，该数据集结合了由我们提出的施工工作流和人类评估综合的可解决问题的问题和高质量的无法解决的问题。实验是在各种LLM上进行的，并发现了几个关键发现。 LLM无法直接识别无法解决的问题，并且始终产生捏造的响应。当指示LLMS使用可靠的提示表明无法解除性时，大型LLMS的可靠性仍然存在于可解决的问题上，但显着改善了无法解决的问题，但仍然没有可解决的问题。但是，尽管采用了可靠的提示，但小型LLM很少会显示出任何进展。因此，我们进一步提出了一种对齐策略来增强小型LLM的可靠性，这可以显着提高在内域和外域任务上的LLM可靠性性能。

Title: From Measurement to Mitigation: Exploring the Transferability of Debiasing Approaches to Gender Bias in Maltese Language Models

Authors: Melanie Galea, Claudia Borg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03142
Pdf URL: https://arxiv.org/pdf/2507.03142
Copy Paste: [[2507.03142]] From Measurement to Mitigation: Exploring the Transferability of Debiasing Approaches to Gender Bias in Maltese Language Models(https://arxiv.org/abs/2507.03142)
Keywords: language model, llm
Abstract: The advancement of Large Language Models (LLMs) has transformed Natural Language Processing (NLP), enabling performance across diverse tasks with little task-specific training. However, LLMs remain susceptible to social biases, particularly reflecting harmful stereotypes from training data, which can disproportionately affect marginalised communities. We measure gender bias in Maltese LMs, arguing that such bias is harmful as it reinforces societal stereotypes and fails to account for gender diversity, which is especially problematic in gendered, low-resource languages. While bias evaluation and mitigation efforts have progressed for English-centric models, research on low-resourced and morphologically rich languages remains limited. This research investigates the transferability of debiasing methods to Maltese language models, focusing on BERTu and mBERTu, BERT-based monolingual and multilingual models respectively. Bias measurement and mitigation techniques from English are adapted to Maltese, using benchmarks such as CrowS-Pairs and SEAT, alongside debiasing methods Counterfactual Data Augmentation, Dropout Regularization, Auto-Debias, and GuiDebias. We also contribute to future work in the study of gender bias in Maltese by creating evaluation datasets. Our findings highlight the challenges of applying existing bias mitigation methods to linguistically complex languages, underscoring the need for more inclusive approaches in the development of multilingual NLP.
摘要：大型语言模型（LLM）的进步已转变自然语言处理（NLP），通过很少的特定任务培训使各种任务的绩效能够跨越。但是，LLM仍然容易受到社会偏见的影响，特别是反映了训练数据中有害的刻板印象，这可能会影响边缘化社区。我们衡量马耳他LMS中的性别偏见，认为这种偏见是有害的，因为它加强了社会刻板印象，并且无法解决性别多样性，这在性别，低种植的语言中尤其有问题。尽管以英语为中心的模型进行了偏见评估和缓解工作，但对低资源和形态丰富的语言的研究仍然有限。这项研究调查了借个方法向马耳他语言模型的转移性，分别侧重于基于伯特的单语和多语言模型的贝尔图和米伯图。英语的偏差测量和缓解技术适应了马耳他，使用诸如乌鸦对和座椅等基准，以及偏见的方法反事实数据增强，辍学，自动化，自动debias和Guidebias。我们还通过创建评估数据集为马耳他的性别偏见研究提供了未来的工作。我们的发现突出了将现有的缓解方法应用于语言复杂语言的挑战，强调了对多语言NLP开发的更具包容性方法的需求。

Title: Expert-level validation of AI-generated medical text with scalable language models

Authors: Asad Aali, Vasiliki Bikia, Maya Varma, Nicole Chiou, Sophie Ostmeier, Arnav Singhvi, Magdalini Paschali, Ashwin Kumar, Andrew Johnston, Karimar Amador-Martinez, Eduardo Juan Perez Guerrero, Paola Naovi Cruz Rivera, Sergios Gatidis, Christian Bluethgen, Eduardo Pontes Reis, Eddy D. Zandee van Rilland, Poonam Laxmappa Hosamani, Kevin R Keet, Minjoung Go, Evelyn Ling, David B. Larson, Curtis Langlotz, Roxana Daneshjou, Jason Hom, Sanmi Koyejo, Emily Alsentzer, Akshay S. Chaudhari
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03152
Pdf URL: https://arxiv.org/pdf/2507.03152
Copy Paste: [[2507.03152]] Expert-level validation of AI-generated medical text with scalable language models(https://arxiv.org/abs/2507.03152)
Keywords: language model, gpt
Abstract: With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LM-as-judge" paradigm (an LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset containing 840 outputs annotated by physicians, following a physician-defined taxonomy of risk levels and error categories. Across 6 diverse medical tasks and 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. To support a scalable, risk-aware pathway towards clinical integration, we open-source 1) the codebase (this https URL), 2) MedVAL-Bench (this https URL), and 3) MedVAL-4B (this https URL), the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.
摘要：随着语言模型（LMS）在临床环境中的日益增长的使用，即时需要评估LM生成的医学文本的准确性和安全性。目前，这种评估仅依赖于手动医师评论。但是，在LM生成的文本中检测错误是具有挑战性的，因为1）手动审查是昂贵的，而2）专家组成的参考输出通常在现实世界中不可用。虽然“ LM-AS-Gudge”范式（评估另一个LM的LM）提供了可扩展的评估，但即使是Frontier LMS也可能会错过细微但临床上显着的错误。为了应对这些挑战，我们提出了Medval，这是一个自我监督的框架，利用合成数据来培训评估者LMS，以评估LM生成的医疗输出是否实际上与输入相一致，而无需医生标签或参考输出。为了评估LM性能，我们介绍了Medval-Bench，这是一个数据集，该数据集包含由医生定义的风险水平和错误类别的分类学分类法所注释的840个输出。在6种不同的医疗任务和10个最先进的LMS中，涵盖开源，专有和医疗改编的模型，Medval微调显着改善（P <0.001）与医生在可见任务和看不见的任务上的一致性（p <0.001），平均F1摄入量为66％至83％，并以每样品的安全性分类从66％增加到83％。 MEDVAL甚至将表现最佳的LM（GPT-4O）的性能提高了8％。为了支持临床整合的可扩展，风险感知的途径，我们开源1）代码库（此https url），2）Medval-Bench（此HTTPS URL）和3）Medval-4B（此HTTPS URL），最佳的开源LM。我们的研究提供了LMS接近医学文本专家级验证能力的第一个证据。

Title: Adversarial Manipulation of Reasoning Models using Internal Representations

Authors: Kureha Yamaguchi, Benjamin Etheridge, Andy Arditi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03167
Pdf URL: https://arxiv.org/pdf/2507.03167
Copy Paste: [[2507.03167]] Adversarial Manipulation of Reasoning Models using Internal Representations(https://arxiv.org/abs/2507.03167)
Keywords: language model, prompt, chain-of-thought
Abstract: Reasoning models generate chain-of-thought (CoT) tokens before their final output, but how this affects their vulnerability to jailbreak attacks remains unclear. While traditional language models make refusal decisions at the prompt-response boundary, we find evidence that DeepSeek-R1-Distill-Llama-8B makes these decisions within its CoT generation. We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply -- termed the "caution" direction because it corresponds to cautious reasoning patterns in the generated text. Ablating this direction from model activations increases harmful compliance, effectively jailbreaking the model. We additionally show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates. Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models. Code available at this https URL
摘要：推理模型在最终产出之前产生了经过思考链（COT）令牌，但是这如何影响他们对越狱攻击的脆弱性尚不清楚。尽管传统语言模型在迅速响应边界上做出拒绝决定，但我们发现了DeepSeek-R1-Distill-Lalama-8B在其COT生成中做出这些决定的证据。我们在COT代币生成过程中确定激活空间中的线性方向，该方向预测该模型是拒绝还是遵守 - 称为“警告”方向，因为它与生成的文本中的谨慎推理模式相对应。从模型激活中消除这一方向可以提高有害合规性，从而有效地越狱模型。我们还表明，仅在COT令牌激活上进行介入足以控制最终输出，并且将此方向纳入基于及时的攻击可以提高成功率。我们的发现表明，思考链本身是推理模型中对抗性操纵的有希望的新目标。此https URL可用代码
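
The abstract above does not state the ablation formula. A common way to ablate a linear direction, and the one assumed in this sketch, is to subtract from each activation its projection onto the unit-normalised direction, x' = x - (x · d̂) d̂. The vectors and function names below are illustrative only.

```python
import math


def unit(vector):
    """Return `vector` scaled to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in vector]


def ablate_direction(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    d_hat = unit(direction)
    projection = sum(a * d for a, d in zip(activation, d_hat))
    return [a - projection * d for a, d in zip(activation, d_hat)]


activation = [3.0, 4.0, 0.0]
caution_direction = [0.0, 2.0, 0.0]  # hypothetical "caution" direction
print(ablate_direction(activation, caution_direction))
```

After ablation the activation is orthogonal to the chosen direction, so any downstream readout along that direction sees zero signal.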

Title: How Much Content Do LLMs Generate That Induces Cognitive Bias in Users?

Authors: Abeer Alessa, Akshaya Lakshminarasimhan, Param Somane, Julian Skirzynski, Julian McAuley, Jessica Echterhoff
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03194
Pdf URL: https://arxiv.org/pdf/2507.03194
Copy Paste: [[2507.03194]] How Much Content Do LLMs Generate That Induces Cognitive Bias in Users?(https://arxiv.org/abs/2507.03194)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly integrated into applications ranging from review summarization to medical diagnosis support, where they affect human decisions. Even though LLMs perform well in many tasks, they may also inherit societal or cognitive biases, which can inadvertently transfer to humans. We investigate when and how LLMs expose users to biased content and quantify its severity. Specifically, we assess three LLM families in summarization and news fact-checking tasks, evaluating how much LLMs stay consistent with their context and/or hallucinate. Our findings show that LLMs expose users to content that changes the sentiment of the context in 21.86% of the cases, hallucinate on post-knowledge-cutoff data questions in 57.33% of the cases, and exhibit primacy bias in 5.94% of the cases. We evaluate 18 distinct mitigation methods across three LLM families and find that targeted interventions can be effective. Given the prevalent use of LLMs in high-stakes domains, such as healthcare or legal analysis, our results highlight the need for robust technical safeguards and for developing user-centered interventions that address LLM limitations.
摘要：大型语言模型（LLM）越来越多地整合到从审查摘要到医学诊断支持的应用中，它们会影响人类的决策。即使LLM在许多任务中表现良好，它们也可能继承社会或认知偏见，这些偏见可能会无意中转移到人类。我们调查了LLM何时以及如何使用户暴露于偏见的内容并量化其严重性。具体而言，我们在摘要和新闻事实检查任务中评估了三个LLM家庭，评估LLM与他们的上下文和/或幻觉保持一致的程度。我们的发现表明，LLMS将用户暴露在21.86％的案件中改变上下文情绪的内容，在57.33％的案件中对知识后切割数据问题产生了幻觉，在5.94％的案件中造成了首要偏见。我们评估了三个LLM家族的18种不同缓解方法，发现目标干预措施可能是有效的。鉴于在高风险领域（例如医疗保健或法律分析）中普遍使用LLMS，我们的结果强调了需要强大的技术保障措施以及开发以用户为中心的干预措施来解决LLM限制。

Title: KinyaColBERT: A Lexically Grounded Retrieval Model for Low-Resource Retrieval-Augmented Generation

Authors: Antoine Nzeyimana, Andre Niyongabo Rubungo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03241
Pdf URL: https://arxiv.org/pdf/2507.03241
Copy Paste: [[2507.03241]] KinyaColBERT: A Lexically Grounded Retrieval Model for Low-Resource Retrieval-Augmented Generation(https://arxiv.org/abs/2507.03241)
Keywords: language model, llm, hallucination, chat, retrieval-augmented generation
Abstract: The recent mainstream adoption of large language model (LLM) technology is enabling novel applications in the form of chatbots and virtual assistants across many domains. With the aim of grounding LLMs in trusted domains and avoiding the problem of hallucinations, retrieval-augmented generation (RAG) has emerged as a viable solution. In order to deploy sustainable RAG systems in low-resource settings, achieving high retrieval accuracy is not only a usability requirement but also a cost-saving strategy. Through empirical evaluations on a Kinyarwanda-language dataset, we find that the most limiting factors in achieving high retrieval accuracy are limited language coverage and inadequate sub-word tokenization in pre-trained language models. We propose a new retriever model, KinyaColBERT, which integrates two key concepts: late word-level interactions between queries and documents, and a morphology-based tokenization coupled with two-tier transformer encoding. This methodology results in lexically grounded contextual embeddings that are both fine-grained and self-contained. Our evaluation results indicate that KinyaColBERT outperforms strong baselines and leading commercial text embedding APIs on a Kinyarwanda agricultural retrieval benchmark. By adopting this retrieval strategy, we believe that practitioners in other low-resource settings can not only achieve reliable RAG systems but also deploy solutions that are more cost-effective.
摘要：最新的大型语言模型（LLM）技术的主流采用正在以许多领域的聊天机器人和虚拟助手的形式实现新颖的应用。为了将LLM扎根在受信任的领域并避免幻觉问题，检索型发电（RAG）已成为可行的解决方案。为了在低资源环境中部署可持续的抹布系统，实现高检索准确性不仅是可用性要求，而且是省略成本的策略。通过对Kinyarwanda语言数据集的经验评估，我们发现实现高检索准确性的最大因素是语言覆盖范围有限，并且在预训练的语言模型中不充分的子字出现。我们提出了一个新的检索器模型Kinyacolbert，该模型集成了两个关键概念：查询和文档之间的后期单词级相互作用，以及基于形态的令牌化，加上两层变压器的编码。这种方法会导致词汇扎根的上下文嵌入，这些嵌入既是细粒度又具有独立的嵌入。我们的评估结果表明，Kinyacolbert的表现优于强大的基准，并领先于Kinyarwanda农业检索基准上嵌入API。通过采用这种检索策略，我们认为其他低资源环境中的从业人员不仅可以实现可靠的破布系统，而且还可以部署更具成本效益的解决方案。
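
The "late word-level interactions" mentioned in the abstract above are, in ColBERT-style retrievers, usually scored with a MaxSim operator: each query embedding is matched to its most similar document embedding and the maxima are summed. The sketch below assumes that standard formulation; KinyaColBERT's exact scoring is not given in the abstract, and the two-dimensional vectors are toy values.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vecs, doc_vecs):
    """Sum over query vectors of the best dot product with any document vector."""
    if not doc_vecs:
        raise ValueError("document must contain at least one vector")
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


query = [[1.0, 0.0], [0.0, 1.0]]
documents = {
    "doc_a": [[1.0, 0.0], [0.5, 0.5]],
    "doc_b": [[0.0, 1.0], [1.0, 0.0]],
}

# Rank documents by descending late-interaction score.
ranking = sorted(documents,
                 key=lambda name: late_interaction_score(query, documents[name]),
                 reverse=True)
print(ranking)
```

Because documents are encoded independently of the query, their vectors can be precomputed and indexed, which is part of what makes late interaction attractive for cost-sensitive deployments.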

Title: RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs

Authors: Baolong Bi, Shenghua Liu, Xingzhang Ren, Dayiheng Liu, Junyang Lin, Yiwei Wang, Lingrui Mei, Junfeng Fang, Jiafeng Guo, Xueqi Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03253
Pdf URL: https://arxiv.org/pdf/2507.03253
Copy Paste: [[2507.03253]] RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs(https://arxiv.org/abs/2507.03253)
Keywords: language model, llm
Abstract: The foundational capabilities of large language models (LLMs) are deeply influenced by the quality of their pre-training corpora. However, enhancing data quality at scale remains a significant challenge, primarily due to the trade-off between refinement effectiveness and processing efficiency. While rule-based filtering remains the dominant paradigm, it typically operates at the document level and lacks the granularity needed to refine specific content within documents. Inspired by emerging work such as ProX, we propose RefineX, a novel framework for large-scale, surgical refinement of pre-training data through programmatic editing tasks. RefineX enables efficient and fine-grained data refinement while reliably preserving the diversity and naturalness of raw text. The core strength of RefineX lies in distilling high-quality, expert-guided end-to-end refinement results into minimal edit-based deletion programs. This high-precision distillation pipeline is used to train an efficient and reliable refine model that can systematically improve every instance in the corpus at scale. We evaluate RefineX across from-scratch pre-training at multiple model scales and find that it consistently outperforms models trained on raw, filtered, or alternatively refined data across diverse downstream tasks. On the 750M model, RefineX yields 2.6%-7.2% average gains on lighteval tasks, and achieves comparable performance using significantly fewer training tokens. Further analysis shows that RefineX reliably enhances text quality with both high efficiency and precision, outperforming prior approaches such as end-to-end generation and Prox-C. These results position RefineX as a scalable, effective, and reliable solution for optimizing pre-training data in modern LLM pipelines.
摘要：大语言模型（LLM）的基本能力受其培训前语料库质量的深刻影响。但是，大规模提高数据质量仍然是一个重大挑战，这主要是由于改进效率和加工效率之间的权衡。尽管基于规则的过滤仍然是主要的范式，但它通常在文档级别运行，并且缺乏完善文档中特定内容所需的粒度。受Prox等新兴工作的启发，我们提出了$ \ textbf {refinex} $，这是一种通过程序化编辑任务，用于对预训练数据进行大规模，外科手术改进的新颖框架。 Refinex可以可靠地保留原始文本的多样性和自然性，从而实现高效且细粒度的细化。改进的核心优势在于将高质量的专家引导的端到端精炼结果提取到最小的基于编辑的删除程序中。这种高精度蒸馏管线用于训练有效且可靠的精炼模型，该模型可以系统地改善语料库中的每个实例。我们在多个模型尺度上从划痕预训练中评估了精炼，发现它始终超过了在不同下游任务中训练在原始，过滤或精制的数据上训练的模型。在750m型号上，精炼剂在灯笼任务上的平均增长率为2.6％-7.2％，并使用较少的训练令牌实现可比的性能。进一步的分析表明，精炼可靠地以高效率和精确度可靠地增强文本质量，超过先前的方法，例如端到端生成和Prox-C。这些结果将其定位为可扩展，有效且可靠的解决方案，以优化现代LLM管道中的预训练数据。
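Illustration (not from the paper): the idea of refining a document through a minimal, deletion-only program can be sketched as below. The operation names `remove_lines` and `remove_str`, the tuple layout, and the sample document are hypothetical choices made for this sketch; the abstract does not specify RefineX's actual program syntax.

```python
from typing import List, Tuple

# Hypothetical edit format (an assumption, not RefineX's real syntax):
#   ("remove_lines", start, end)        drop lines start..end (0-indexed, inclusive)
#   ("remove_str", line_no, substring)  delete the first occurrence of substring on one line
Edit = Tuple


def apply_deletion_program(text: str, program: List[Edit]) -> str:
    """Apply deletion-only edits; no new text is ever generated."""
    lines = text.split("\n")
    drop = set()
    for op in program:
        if op[0] == "remove_lines":
            _, start, end = op
            drop.update(range(start, end + 1))
        elif op[0] == "remove_str":
            _, line_no, substring = op
            lines[line_no] = lines[line_no].replace(substring, "", 1)
        else:
            raise ValueError(f"unknown operation: {op[0]}")
    # Line numbers refer to the original document, so dropping happens last.
    return "\n".join(line for i, line in enumerate(lines) if i not in drop)


doc = (
    "Home | Login | Share\n"
    "LLMs depend on data quality. Click here to subscribe!\n"
    "Copyright 2025"
)
program = [
    ("remove_lines", 0, 0),
    ("remove_str", 1, " Click here to subscribe!"),
    ("remove_lines", 2, 2),
]
cleaned = apply_deletion_program(doc, program)
print(cleaned)
```

Because every operation only deletes, the surviving text is always a subsequence of the raw document, which is one plausible reading of how such programs preserve the naturalness of the original text.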

Title: GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation

Authors: Himanshu Dutta, Sunny Manchanda, Prakhar Bapat, Meva Ram Gurjar, Pushpak Bhattacharyya
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03311
Pdf URL: https://arxiv.org/pdf/2507.03311
Copy Paste: [[2507.03311]] GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation(https://arxiv.org/abs/2507.03311)
Keywords: language model, llm, agent
Abstract: Document-level machine translation (DocMT) approaches often struggle to capture discourse-level phenomena effectively. Existing approaches either rely on heuristic rules to segment documents into discourse units, which rarely align with the true discourse structure required for accurate translation, or fail to maintain consistency throughout the document during translation. To address these challenges, we propose Graph Augmented Agentic Framework for Document Level Translation (GRAFT), a novel graph-based DocMT system that leverages Large Language Model (LLM) agents for document translation. Our approach integrates segmentation, directed acyclic graph (DAG) based dependency modelling, and discourse-aware translation into a cohesive framework. Experiments conducted across eight translation directions and six diverse domains demonstrate that GRAFT achieves significant performance gains over state-of-the-art DocMT systems. Specifically, GRAFT delivers an average improvement of 2.8 d-BLEU on the TED test sets from IWSLT2017 over strong baselines and 2.3 d-BLEU for domain-specific translation from English to Chinese. Moreover, our analyses highlight the consistent ability of GRAFT to address discourse-level phenomena, yielding coherent and contextually accurate translations.
摘要：文档级别的机器翻译（DOCMT）方法通常在有效捕获话语水平现象方面遇到困难。现有的方法依靠启发式规则将文档分割为话语单元，这些单元很少与准确翻译所需的真实话语结构保持一致。否则，他们在翻译过程中无法在整个文档中保持一致性。为了应对这些挑战，我们提出了图形增强的代理框架，用于文档级别翻译（Graft），这是一种基于图形的新型DOCMT系统，利用大型语言模型（LLM）代理进行文档翻译。我们的方法集成了基于定向的无环图（DAG）的依赖性建模（DAG），并将话语意识到转换为凝聚力框架。跨八个翻译方向和六个不同领域进行的实验表明，移植物在最先进的DOCMT系统的状态下实现了显着的性能增长。具体而言，Graft在IWSLT2017的TED测试集上平均改善了2.8 d BLEU，而不是强质基线的2.3 d BLEU，用于从英语到中文的域特异性翻译。此外，我们的分析强调了移植物解决话语水平现象的一致能力，从而产生了连贯且上下文准确的翻译。
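Illustration (not from the paper): DAG-based dependency modelling implies that a discourse unit can be translated only after the units it depends on, so their translations are available as context. The sketch below shows that ordering with Python's standard `graphlib`. The function `translate_document`, the segment ids, and the stand-in `fake_translate` callable are hypothetical; GRAFT itself uses LLM agents for each step.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


def translate_document(
    segments: Dict[str, str],
    depends_on: Dict[str, List[str]],
    translate: Callable[[str, List[str]], str],
) -> Dict[str, str]:
    """Translate segments in dependency order, passing earlier outputs as context."""
    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {seg_id: depends_on.get(seg_id, []) for seg_id in segments}
    translations: Dict[str, str] = {}
    for seg_id in TopologicalSorter(graph).static_order():
        context = [translations[dep] for dep in depends_on.get(seg_id, [])]
        translations[seg_id] = translate(segments[seg_id], context)
    return translations


# Stand-in for an LLM call (assumption): uppercase the text and record context size.
def fake_translate(text: str, context: List[str]) -> str:
    return f"{text.upper()} [ctx={len(context)}]"


segments = {"s1": "intro", "s2": "claim", "s3": "summary"}
depends_on = {"s2": ["s1"], "s3": ["s1", "s2"]}
result = translate_document(segments, depends_on, fake_translate)
print(result)
```

A cyclic dependency map would make `static_order()` raise `graphlib.CycleError`, which is a convenient check that the segment graph really is a DAG.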

Title: Read Quietly, Think Aloud: Decoupling Comprehension and Reasoning in LLMs

Authors: Yuanxin Wang, Ganesh Venkatesh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03327
Pdf URL: https://arxiv.org/pdf/2507.03327
Copy Paste: [[2507.03327]] Read Quietly, Think Aloud: Decoupling Comprehension and Reasoning in LLMs(https://arxiv.org/abs/2507.03327)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in understanding text and generating high-quality responses. However, a critical distinction from human cognition is their typical lack of a distinct internal 'reading' or deliberation phase before 'speaking' (i.e., generating text). Humans often engage in silent reading to comprehend context and formulate thoughts prior to articulation. This paper investigates methods to imbue LLMs with a similar capacity for internal processing. We introduce and evaluate techniques that encourage LLMs to 'read silently.' Our findings indicate that even a straightforward approach, such as providing the model with an initial contextual prompt or 'reading space' before it begins predicting subsequent tokens for the final output, can yield significant performance improvements. We further enhance this concept by developing a 'reading buddy' architecture, where an auxiliary component silently processes the input and provides refined contextual insights to the primary generation model. These approaches aim to foster deeper understanding in LLMs so that they can produce better-reasoned responses, moving them one step closer to more human-like text processing. Our results indicate that these simple techniques can have a surprisingly strong impact, improving accuracy by multiple points.
摘要：大型语言模型（LLMS）表现出非常熟练的熟练程度，可以理解文本和产生高质量的响应。但是，与人类认知的批判性区别在于他们在“说话”之前典型缺乏独特的内部“阅读”或审议阶段（即生成文本）。人类经常进行沉默的阅读以理解上下文并在表达之前制定思想。本文研究了以类似的内部处理能力填充LLM的方法。我们介绍和评估鼓励LLMS“默默阅读”的技术。我们的发现表明，即使是一种直接的方法，例如在开始预测最终输出的后续令牌之前，为模型提供了初始上下文提示或“阅读空间”，也可以产生重大的性能改进。我们通过开发“阅读好友”体系结构进一步增强了这一概念，在该体系结构中，辅助组件默默地处理输入并为主要生成模型提供了精致的上下文见解。这些方法旨在增进LLM的更深入的了解，以便它们可以产生更好的理性反应，从而使它们更接近更类似人类的文本处理。我们的结果表明，这些简单的技术可以以多点准确性提高对准确性产生令人惊讶的强大影响。

Title: SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge

Authors: Yuxiang Mei, Yuang Zheng, Dongxing Xu, Yanhua Long
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2507.03343
Pdf URL: https://arxiv.org/pdf/2507.03343
Copy Paste: [[2507.03343]] SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge(https://arxiv.org/abs/2507.03343)
Keywords: language model, llm, prompt
Abstract: This paper describes the SHNU multilingual conversational speech recognition system (SHNU-mASR, team name "maybe"), submitted to Track 1 of the INTERSPEECH 2025 MLC-SLM Challenge. Our system integrates a parallel-speech-encoder architecture with a large language model (LLM) to form a unified multilingual ASR framework. The parallel-speech-encoder consists of two pre-trained encoders, the Whisper-large-v3 encoder and the mHuBERT-147 encoder. Their output embeddings are concatenated and fed into the LLM, enabling the model to leverage complementary acoustic and linguistic knowledge and achieve competitive performance. Moreover, we adopt a tri-stage training strategy to jointly update the low-rank adaptation modules and projector parameters of both the speech encoders and the LLM. We also add a language-aware prompt at the LLM input to enhance language-specific text generation. The SHNU-mASR system achieves an overall character/word error rate (CER/WER) of 11.76% on the blind evaluation set of the challenge, outperforming the official MLC-SLM baseline by 8.41 points absolute CER/WER, without increasing the baseline training data.
摘要：本文介绍了SHNU多语言对话语音识别系统（SHNU-MASR，团队名称 - “也许”），并提交了Interspeech 2025 MLC-SLM挑战的跟踪。我们的系统将平行语音编码器架构与大语言模型（LLM）集成在一起，以形成统一的多语言ASR框架。平行语音编码器由两个预训练的编码器组成，窃窃私语-V3编码器和Mhubert-147编码器。它们的输出嵌入是连接的，并馈入LLM，使该模型能够利用互补的声学和语言知识并实现竞争性能。此外，我们采用三阶段培训策略来共同更新语音编码器和LLM的低级适应模块和投影仪参数。此外，我们在LLM输入中包含了一个其他语言意识提示，以增强特定语言的文本生成。 SHNU-MASR系统在挑战的盲目评估集中达到了总体字符/单词错误率（CER/WER），为11.76％，在没有增加基线训练数据的情况下，其表现优于8.41绝对CER/WER。
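Illustration (not from the paper): the reported metric can be made concrete with a small sketch of how an error rate is computed from edit distance, and of what an "absolute" gain implies. The helper names and the toy sentence pair are hypothetical; the only figures taken from the abstract are 11.76 and 8.41, and the baseline value below is derived on the assumption that 8.41 is a percentage-point difference.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: substitutions, insertions, and deletions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def error_rate(ref_tokens: Sequence[str], hyp_tokens: Sequence[str]) -> float:
    """WER when tokens are words, CER when tokens are characters."""
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# Toy example: one substitution (sat -> sit) and one deletion (the) over 6 words.
wer = error_rate("the cat sat on the mat".split(), "the cat sit on mat".split())

# Absolute vs. relative improvement, using the two figures from the abstract.
system_cer_wer = 11.76                               # percent
absolute_gain = 8.41                                 # percentage points (assumed)
implied_baseline = system_cer_wer + absolute_gain    # percent
relative_reduction = absolute_gain / implied_baseline

print(f"toy WER = {wer:.3f}")
print(f"implied baseline = {implied_baseline:.2f}%")
print(f"relative reduction = {relative_reduction:.1%}")
```

Passing `list(reference_string)` and `list(hypothesis_string)` instead of word lists turns the same function into a character error rate.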

Title: WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia

Authors: Gerrit Quaremba, Elizabeth Black, Denny Vrandečić, Elena Simperl
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03373
Pdf URL: https://arxiv.org/pdf/2507.03373
Copy Paste: [[2507.03373]] WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia(https://arxiv.org/abs/2507.03373)
Keywords: language model, llm, prompt
Abstract: Given Wikipedia's role as a trusted source of high-quality, reliable content, concerns are growing about the proliferation of low-quality machine-generated text (MGT) produced by large language models (LLMs) on its platform. Reliable detection of MGT is therefore essential. However, existing work primarily evaluates MGT detectors on generic generation tasks rather than on tasks more commonly performed by Wikipedia editors. This misalignment can lead to poor generalisability when applied in real-world Wikipedia contexts. We introduce WETBench, a multilingual, multi-generator, and task-specific benchmark for MGT detection. We define three editing tasks, empirically grounded in Wikipedia editors' perceived use cases for LLM-assisted editing: Paragraph Writing, Summarisation, and Text Style Transfer, which we implement using two new datasets across three languages. For each writing task, we evaluate three prompts, generate MGT across multiple generators using the best-performing prompt, and benchmark diverse detectors. We find that, across settings, training-based detectors achieve an average accuracy of 78%, while zero-shot detectors average 58%. These results show that detectors struggle with MGT in realistic generation scenarios and underscore the importance of evaluating such models on diverse, task-specific data to assess their reliability in editor-driven contexts.
摘要：鉴于Wikipedia是值得信赖的高质量，可靠内容来源的角色，因此人们对在其平台上由大语言模型（LLMS）产生的低质量机器生成的文本（MGT）的泛滥变得越来越担忧。因此，可靠的MGT检测至关重要。但是，现有的工作主要评估MGT检测器对通用生成任务，而不是Wikipedia编辑更常见执行的任务。当在现实世界的Wikipedia环境中应用时，这种错位可能会导致差异性。我们介绍了MGT检测的多语言，多代和特定于任务的基准。我们定义了三个编辑任务，这些任务基于Wikipedia编辑器的“感知用例”，用于LLM辅助编辑：段落写作，摘要和文本样式转移，我们使用三种语言的两个新数据集实现了这些案例。对于每个写作任务，我们都会评估三个提示，使用表现最佳的提示和基准不同的探测器在多个发电机上生成MGT。我们发现，在各个设置中，基于培训的检测器的平均准确度为78％，而零射击检测器的平均探测器平均为58％。这些结果表明，在现实的生成方案中，检测器与MGT斗争，并强调了评估此类模型在多种任务的数据上评估其在编辑驱动的环境中的可靠性的重要性。

Title: Making Sense of Korean Sentences: A Comprehensive Evaluation of LLMs through KoSEnd Dataset

Authors: Seunguk Yu, Kyeonghyun Kim, Jungmin Yun, Youngbin Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03378
Pdf URL: https://arxiv.org/pdf/2507.03378
Copy Paste: [[2507.03378]] Making Sense of Korean Sentences: A Comprehensive Evaluation of LLMs through KoSEnd Dataset(https://arxiv.org/abs/2507.03378)
Keywords: llm
Abstract: Although LLMs have made significant progress in various languages, there are still concerns about their effectiveness with low-resource agglutinative languages compared to languages such as English. In this study, we focused on Korean, a language known for its complex sentence endings, and evaluated LLMs on this challenging aspect. We introduce the Korean Sentence Endings (KoSEnd) dataset, which includes 3,000 sentences, each annotated for the naturalness of 15 sentence ending forms. These were collected from diverse sources to cover a range of contexts. We evaluated 11 LLMs to assess their understanding of Korean sentence endings, analyzing them based on parameter count and prediction consistency. Notably, we found that informing models about the possibility of missing sentence endings improved performance, highlighting the impact of explicitly considering certain linguistic features.
摘要：尽管LLM在各种语言方面取得了重大进展，但与英语等语言相比，它们对低资源凝集性语言的有效性仍然令人担忧。在这项研究中，我们专注于以其复杂的句子结尾而闻名的韩语，并在这一充满挑战的方面评估了LLM。我们介绍了韩国句子结尾（Kosend）数据集，其中包括3,000个句子，每个句子都以15个句子结尾表的自然性注释。这些是从各种来源收集的，以涵盖一系列环境。我们评估了11个LLM，以评估他们对韩国句子结尾的理解，并根据参数计数和预测一致性分析它们。值得注意的是，我们发现，告知模型丢失句子结尾的可能性提高了性能，突出了明确考虑某些语言特征的影响。

Title: Graph Repairs with Large Language Models: An Empirical Study

Authors: Hrishikesh Terdalkar, Angela Bonifati, Andrea Mauri
Subjects: cs.CL, cs.DB, cs.ET
Abstract URL: https://arxiv.org/abs/2507.03410
Pdf URL: https://arxiv.org/pdf/2507.03410
Copy Paste: [[2507.03410]] Graph Repairs with Large Language Models: An Empirical Study(https://arxiv.org/abs/2507.03410)
Keywords: language model, llm
Abstract: Property graphs are widely used in domains such as healthcare, finance, and social networks, but they often contain errors due to inconsistencies, missing data, or schema violations. Traditional rule-based and heuristic-driven graph repair methods are limited in their adaptability as they need to be tailored for each dataset. On the other hand, interactive human-in-the-loop approaches may become infeasible when dealing with large graphs, as the cost of involving users, in both time and effort, becomes too high. Recent advancements in Large Language Models (LLMs) present new opportunities for automated graph repair by leveraging contextual reasoning and their access to real-world knowledge. We evaluate the effectiveness of six open-source LLMs in repairing property graphs. We assess repair quality, computational cost, and model-specific performance. Our experiments show that LLMs have the potential to detect and correct errors, with varying degrees of accuracy and efficiency. We discuss the strengths, limitations, and challenges of LLM-driven graph repair and outline future research directions for improving scalability and interpretability.
摘要：属性图被广泛用于医疗保健，金融和社交网络等领域，但由于违反数据或违反模式的不一致，它们通常包含错误。传统的基于规则和启发式驱动的图形修复方法的适应性受到限制，因为它们需要针对每个数据集进行量身定制。另一方面，在处理大图时，交互式人类的方法可能会变得不可行，因为在时间和精力方面的成本 - 涉及用户的成本变得太高。大型语言模型（LLMS）的最新进展通过利用上下文推理及其对现实世界知识的访问，为自动图修复提供了新的机会。我们评估了六个开源LLM在修复属性图中的有效性。我们评估维修质量，计算成本和特定于模型的性能。我们的实验表明，LLM具有以不同程度的准确性和效率来检测和纠正错误的潜力。我们讨论了LLM驱动的图形修复的优势，局限性和挑战，并概述了提高可伸缩性和可解释性的未来研究方向。

Title: SMCLM: Semantically Meaningful Causal Language Modeling for Autoregressive Paraphrase Generation

Authors: Michał Perełkiewicz, Sławomir Dadas, Rafał Poświata
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03415
Pdf URL: https://arxiv.org/pdf/2507.03415
Copy Paste: [[2507.03415]] SMCLM: Semantically Meaningful Causal Language Modeling for Autoregressive Paraphrase Generation(https://arxiv.org/abs/2507.03415)
Keywords: language model
Abstract: This article introduces semantically meaningful causal language modeling (SMCLM), a self-supervised method of training autoregressive models to generate semantically equivalent text. Our approach involves using a semantically meaningful text representation as an initial embedding in the autoregressive training and generation processes. The extensive empirical study demonstrates that the SMCLM approach makes autoregressive models capable of learning robust and high-quality paraphrase generation. The proposed method is competitive with supervised methods and achieves state-of-the-art results among unsupervised approaches. This article also presents a comprehensive set of automatic metrics that cover a wide range of autogenerated paraphrase evaluation aspects. In addition, this article highlights the low reliability of metrics that are widely used in paraphrase generation evaluation, including BLEU, ROUGE, and BERTScore.
摘要：本文介绍了语义上有意义的因果语言建模（SMCLM），这是一种自我监督的训练自回归模型的方法，以生成语义上等效的文本。我们的方法涉及使用语义上有意义的文本表示形式作为自回归培训和发电过程中的初始嵌入。广泛的实证研究表明，SMCLM方法使自回归模型能够学习稳健和高质量的释义。所提出的方法具有监督方法的竞争力，并实现最先进的方法会导致无监督的方法。本文还提供了一组全面的自动指标，涵盖了广泛的自动化释义评估方面。同时，本文强调了广泛用于释义生成评估的指标的可靠性低，包括BLEU，Rouge和Bertscore。

Title: Improving Social Determinants of Health Documentation in French EHRs Using Large Language Models

Authors: Adrien Bazoge, Pacôme Constant dit Beaufils, Mohammed Hmitouch, Romain Bourcier, Emmanuel Morin, Richard Dufour, Béatrice Daille, Pierre-Antoine Gourraud, Matilde Karakachoff
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03433
Pdf URL: https://arxiv.org/pdf/2507.03433
Copy Paste: [[2507.03433]] Improving Social Determinants of Health Documentation in French EHRs Using Large Language Models(https://arxiv.org/abs/2507.03433)
Keywords: language model, llm
Abstract: Social determinants of health (SDoH) significantly influence health outcomes, shaping disease progression, treatment adherence, and health disparities. However, their documentation in structured electronic health records (EHRs) is often incomplete or missing. This study presents an approach based on large language models (LLMs) for extracting 13 SDoH categories from French clinical notes. We trained Flan-T5-Large on annotated social history sections from clinical notes at Nantes University Hospital, France. We evaluated the model at two levels: (i) identification of SDoH categories and associated values, and (ii) extraction of detailed SDoH with associated temporal and quantitative information. The model performance was assessed across four datasets, including two that we publicly release as open resources. The model achieved strong performance for identifying well-documented categories such as living condition, marital status, descendants, job, tobacco, and alcohol use (F1 score > 0.80). Performance was lower for categories with limited training data or highly variable expressions, such as employment status, housing, physical activity, income, and education. Our model identified 95.8% of patients with at least one SDoH, compared to 2.8% for ICD-10 codes from structured EHR data. Our error analysis showed that performance limitations were linked to annotation inconsistencies, reliance on an English-centric tokenizer, and reduced generalizability due to the model being trained on social history sections only. These results demonstrate the effectiveness of NLP in improving the completeness of real-world SDoH data in a non-English EHR system.
摘要：卫生的社会决定因素（SDOH）显着影响健康结果，塑造疾病的发展，治疗依从性和健康差异。但是，它们在结构化电子健康记录（EHR）中的文档通常不完整或缺失。这项研究提出了一种基于大语模型（LLM）的方法，用于从法国临床注释中提取13个SDOH类别。我们从法国南特大学医院的临床笔记中培训了弗兰 - t5-large的社会历史部分。我们在两个级别上评估了该模型：（i）识别SDOH类别和相关值，以及（ii）用相关的时间和定量信息提取详细的SDOH。在四个数据集中评估了模型性能，其中包括我们公开发布的两个开放资源。该模型在识别有据可查的类别（例如生活状况，婚姻状况，后代，工作，烟草和酒精使用）方面取得了强大的性能（F1分数> 0.80）。培训数据有限或高度可变表达式的类别的性能较低，例如就业状况，住房，体育锻炼，收入和教育。我们的模型确定了95.8％的至少一名SDOH患者，而结构化EHR数据中的ICD-10代码为2.8％。我们的错误分析表明，绩效限制与注释不一致有关，依赖以英语为中心的令牌机构以及由于仅在社会历史部分进行培训而导致的可推广性。这些结果证明了NLP在改善非英语EHR系统中现实世界SDOH数据完整性方面的有效性。

Title: Beyond Weaponization: NLP Security for Medium and Lower-Resourced Languages in Their Own Right

Authors: Heather Lent
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03473
Pdf URL: https://arxiv.org/pdf/2507.03473
Copy Paste: [[2507.03473]] Beyond Weaponization: NLP Security for Medium and Lower-Resourced Languages in Their Own Right(https://arxiv.org/abs/2507.03473)
Keywords: language model
Abstract: Despite mounting evidence that multilinguality can be easily weaponized against language models (LMs), works across NLP Security remain overwhelmingly English-centric. In terms of securing LMs, the NLP norm of "English first" collides with standard procedure in cybersecurity, whereby practitioners are expected to anticipate and prepare for worst-case outcomes. To mitigate worst-case outcomes in NLP Security, researchers must be willing to engage with the weakest links in LM security: lower-resourced languages. Accordingly, this work examines the security of LMs for lower- and medium-resourced languages. We extend existing adversarial attacks for up to 70 languages to evaluate the security of monolingual and multilingual LMs for these languages. Through our analysis, we find that monolingual models are often too small in total number of parameters to ensure sound security, and that while multilinguality is helpful, it does not always guarantee improved security either. Ultimately, these findings highlight important considerations for more secure deployment of LMs, for communities of lower-resourced languages.
摘要：尽管有越来越多的证据表明可以轻松地针对语言模型（LMS）进行武器，但跨NLP安全性仍以英语为中心。在确保LMS方面，“英语第一”的NLP规范与网络安全的标准程序相撞，从而期望从业人员预期并为最坏的结果做准备。为了减轻NLP安全性最差的结果，研究人员必须愿意与LM安全性最弱的联系：低资源的语言。因此，这项工作检查了LMS对低资源和中等语言的安全性。我们扩展了最多70种语言的现有对抗性攻击，以评估这些语言的单语和多语言LMS的安全性。通过我们的分析，我们发现单语模型的总参数总数通常太小，无法确保合理的安全性，尽管多语言性是有帮助的，但它也不总是保证提高安全性。最终，这些发现突出了针对低资源语言社区更安全地部署LMS的重要考虑因素。

Title: Four Shades of Life Sciences: A Dataset for Disinformation Detection in the Life Sciences

Authors: Eva Seidlmayer, Lukas Galke, Konrad U. Förstner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03488
Pdf URL: https://arxiv.org/pdf/2507.03488
Copy Paste: [[2507.03488]] Four Shades of Life Sciences: A Dataset for Disinformation Detection in the Life Sciences(https://arxiv.org/abs/2507.03488)
Keywords: language model
Abstract: Disseminators of disinformation often seek to attract attention or evoke emotions, typically to gain influence or generate revenue, resulting in distinctive rhetorical patterns that can be exploited by machine learning models. In this study, we explore linguistic and rhetorical features as proxies for distinguishing disinformative texts from other health and life-science text genres, applying both large language models and classical machine learning classifiers. Given the limitations of existing datasets, which mainly focus on fact-checking misinformation, we introduce Four Shades of Life Sciences (FSoLS): a novel, labeled corpus of 2,603 texts on 14 life-science topics, retrieved from 17 diverse sources and classified into four categories of life science publications. The source code for replicating and updating the dataset is available on GitHub: this https URL
摘要：虚假信息的传播者通常寻求吸引注意力或引起情绪 - 通常是为了获得影响或产生收入 - 从而产生了独特的修辞模式，可以通过机器学习模型来利用这些模式。在这项研究中，我们探索语言和修辞特征，作为将虚假信息与其他健康和生活科学文本类型区分开来，应用大型语言模型和经典的机器学习分类器的代理。鉴于现有数据集的局限性主要集中在事实检查错误信息上，我们介绍了四种生命科学（FSOL）：关于14个生命科学主题的2,603个文本的新颖，标记的语料库，从17种不同的生命科学主题中检索出来，并分类为生命科学的四个类别。用于复制和更新数据集的源代码可在GitHub上使用：此HTTPS URL

Title: AI-VaxGuide: An Agentic RAG-Based LLM for Vaccination Decisions

Authors: Abdellah Zeggai, Ilyes Traikia, Abdelhak Lakehal, Abdennour Boulesnane
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03493
Pdf URL: https://arxiv.org/pdf/2507.03493
Copy Paste: [[2507.03493]] AI-VaxGuide: An Agentic RAG-Based LLM for Vaccination Decisions(https://arxiv.org/abs/2507.03493)
Keywords: llm, retrieval-augmented generation, agent
Abstract: Vaccination plays a vital role in global public health, yet healthcare professionals often struggle to access immunization guidelines quickly and efficiently. National protocols and WHO recommendations are typically extensive and complex, making it difficult to extract precise information, especially during urgent situations. This project tackles that issue by developing a multilingual, intelligent question-answering system that transforms static vaccination guidelines into an interactive and user-friendly knowledge base. Built on a Retrieval-Augmented Generation (RAG) framework and enhanced with agent-based reasoning (Agentic RAG), the system provides accurate, context-sensitive answers to complex medical queries. Evaluation shows that Agentic RAG outperforms traditional methods, particularly in addressing multi-step or ambiguous questions. To support clinical use, the system is integrated into a mobile application designed for real-time, point-of-care access to essential vaccine information. AI-VaxGuide model is publicly available on this https URL
摘要：疫苗接种在全球公共卫生中起着至关重要的作用，但是医疗保健专业人员通常很难快速有效地获取免疫指南。国家协议和谁的建议通常是广泛而复杂的，因此很难提取精确的信息，尤其是在紧急情况下。该项目通过开发多种语言，智能的提问系统来解决该问题，该系统将静态疫苗接种指南转变为交互式和用户友好的知识库。该系统建立在检索型生成（RAG）框架上，并通过基于代理的推理（代理RAG）进行了增强，该系统为复杂的医疗查询提供了准确，上下文敏感的答案。评估表明，代理抹布胜过传统方法，尤其是在解决多步或模棱两可的问题时。为了支持临床用途，该系统已集成到旨在实时，即时访问基本疫苗信息的移动应用程序中。 AI-VaxGuide模型在此HTTPS URL上公开可用

Title: H2HTalk: Evaluating Large Language Models as Emotional Companion

Authors: Boyang Wang, Yalun Wu, Hongcheng Guo, Zhoujun Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03543
Pdf URL: https://arxiv.org/pdf/2507.03543
Copy Paste: [[2507.03543]] H2HTalk: Evaluating Large Language Models as Emotional Companion(https://arxiv.org/abs/2507.03543)
Keywords: language model, llm
Abstract: As digital emotional support needs grow, Large Language Model companions promise authentic, always-available empathy, though rigorous evaluation lags behind model advancement. We present Heart-to-Heart Talk (H2HTalk), a benchmark assessing companions across personality development and empathetic interaction, balancing emotional intelligence with linguistic fluency. H2HTalk features 4,650 curated scenarios spanning dialogue, recollection, and itinerary planning that mirror real-world support conversations, substantially exceeding previous datasets in scale and diversity. We incorporate a Secure Attachment Persona (SAP) module implementing attachment-theory principles for safer interactions. Benchmarking 50 LLMs with our unified protocol reveals that long-horizon planning and memory retention remain key challenges, with models struggling when user needs are implicit or evolve mid-conversation. H2HTalk establishes the first comprehensive benchmark for emotionally intelligent companions. We release all materials to advance development of LLMs capable of providing meaningful and safe psychological support.
摘要：随着数字情感支持需求的增长，大型语言模型伴侣提供了有希望的真实的，总是可以得到的同理心，尽管严格的评估滞后于模型的发展。我们介绍心对心谈话（H2HTalk），这是一种评估人格发展和善解人意互动的同伴的基准测试，使情绪智力与语言流利度之间进行平衡。 H2HTALK具有4,650个策划的方案，涵盖了对话，回忆和行程计划，这些方案镜像现实世界中支持对话，大大超过了以前的数据集和多样性。我们结合了一个安全的附件角色（SAP）模块，以实现附件理论，以进行更安全的互动。通过我们的统一协议对50个LLM进行基准测试表明，长期计划和记忆保留仍然是主要的挑战，当用户需求隐含或演变中频时，模型在挣扎时陷入困境。 H2HTalk为情感上聪明的同伴建立了第一个全面的基准。我们释放所有材料，以推动LLM的开发，能够提供有意义和安全的心理支持。

Title: TRACE: Training and Inference-Time Interpretability Analysis for Language Models

Authors: Nura Aljaafari, Danilo S. Carvalho, André Freitas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03668
Pdf URL: https://arxiv.org/pdf/2507.03668
Copy Paste: [[2507.03668]] TRACE: Training and Inference-Time Interpretability Analysis for Language Models(https://arxiv.org/abs/2507.03668)
Keywords: language model
Abstract: Understanding when and how linguistic knowledge emerges during language model training remains a central challenge for interpretability. Most existing tools are post hoc, rely on scalar metrics, or require nontrivial integration effort, making comprehensive interpretability analysis difficult to deploy and maintain. We introduce TRACE, a modular toolkit for training and inference-time interpretability analysis of transformer models. It enables lightweight, in-training analysis of linguistic and representational signals, including feature probing, intrinsic dimensionality, Hessian curvature, and output diagnostics. It integrates with ABSynth, a controllable synthetic corpus generator that provides structured annotations for precise evaluation of linguistic feature acquisition. Experiments with autoregressive transformers demonstrate that TRACE reveals developmental phenomena such as early syntactic emergence, delayed semantic acquisition, and representational compression, signals overlooked by traditional scalar metrics such as loss or accuracy. With minimal integration effort, the tool enables layer-wise diagnostics, convergence-based early stopping, and detection of structural errors, making transformer analysis interpretable, actionable, and reproducible.
摘要：了解语言模型培训期间的语言知识何时以及如何出现仍然是对解释性的核心挑战。大多数现有的工具是事后，依赖标量指标或需要非平整集成工作，使全面的可解释性分析难以部署和维护。我们介绍了Trace，这是一种用于训练和推理时间解释性分析的变压器模型的模块化工具包。它可以对语言和代表性信号进行轻巧的训练分析，包括特征探测，内在维度，Hessian曲率和输出诊断。它与Absynth集成，Absynth是一种可控的合成语料库生成器，提供结构化注释，以精确评估语言特征采集。进行自回旋变压器的实验表明，痕迹揭示了发育现象，例如早期句法出现，语义获取延迟和代表性压缩，这是传统标量指标（例如损失或准确性）所忽略的信号。通过最少的集成工作，该工具可以通过层次诊断，基于收敛的早期停止和检测结构错误，从而使变压器分析可解释，可操作和可重复可再现。

Title: Recon, Answer, Verify: Agents in Search of Truth

Authors: Satyam Shukla, Himanshu Dutta, Pushpak Bhattacharyya
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03671
Pdf URL: https://arxiv.org/pdf/2507.03671
Copy Paste: [[2507.03671]] Recon, Answer, Verify: Agents in Search of Truth(https://arxiv.org/abs/2507.03671)
Keywords: language model, llm, agent
Abstract: Automated fact checking with large language models (LLMs) offers a scalable alternative to manual verification. Evaluating fact checking is challenging because existing benchmark datasets often include post-claim analysis and annotator cues, which are absent in real-world scenarios where claims are fact checked immediately after being made. This limits the realism of current evaluations. We present Politi Fact Only (PFO), a 5-class benchmark dataset of 2,982 political claims from this http URL, where all post-claim analysis and annotator cues have been removed manually. This ensures that models are evaluated using only the information that would have been available prior to the claim's verification. Evaluating LLMs on PFO, we see an average performance drop of 22% in terms of macro F1 compared to PFO's unfiltered version. Based on the identified challenges of existing LLM-based fact-checking systems, we propose RAV (Recon Answer Verify), an agentic framework with three agents: question generator, answer generator, and label generator. Our pipeline iteratively generates and answers sub-questions to verify different aspects of the claim before finally generating the label. RAV generalizes across domains and label granularities, and it outperforms state-of-the-art approaches on the well-known benchmarks RAWFC (fact checking, 3-class) by 25.28%, and HOVER (encyclopedia, 2-class) by 1.54%, 4.94%, and 1.78% on the 2-hop, 3-hop, and 4-hop subcategories, respectively. When PFO is compared with its unfiltered version, RAV shows the smallest performance drop among the compared systems, 16.3% in macro F1.
摘要：使用大语言模型（LLMS）自动化事实检查提供了手动验证的可扩展替代方案。评估事实检查是具有挑战性的，因为现有的基准数据集通常包括索赔后分析和注释器提示，在现实世界中，这些索赔在制作后立即立即检查了事实。这限制了当前评估的现实主义。我们仅提供政治事实（PFO），这是该HTTP URL的2,982个政治主张的5类基准数据集，其中所有后索赔分析和注释者提示都已手动删除。这样可以确保仅使用在索赔验证之前可用的信息来评估模型。评估PFO上的LLMS，与PFO未经过滤版本相比，我们的平均性能下降了22％。基于现有基于LLM的事实检查系统的确定挑战，我们提出了RAV（RECON ANSWER验证），这是具有三个代理的代理框架：问题生成器，答案生成器和标签生成器。我们的管道迭代生成并回答子问题，以验证索赔的不同方面，然后最终生成标签。遍布域和标签粒度，它的表现优于众所周知的基线rawFC（事实检查，3级）的艺术状态方法，高于25.28％，在2个HOP上的悬停（Encyclopedia，2 class）上的悬停（Encyclopedia，2类）在2 Hop上，4.94％，分别为4.94％，在4 Hop上为1.78％。当我们将PFO与未经过滤版本进行比较时，RAV显示出与宏F1中16.3％的基线相比的表现最少。
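Illustration (not from the paper): macro F1, the metric reported above, averages per-class F1 scores with equal weight, so rare labels count as much as frequent ones. The sketch below computes it for a made-up set of gold and predicted labels; the label names and data are hypothetical and unrelated to PFO.

```python
from typing import List


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = ["true", "false", "half", "false"]
pred = ["true", "half", "half", "false"]
score = macro_f1(gold, pred)
accuracy = sum(t == p for t, p in zip(gold, pred)) / len(gold)
print(f"macro F1 = {score:.4f}, accuracy = {accuracy:.2f}")
```

In this toy case accuracy is 0.75 while macro F1 is 7/9, which shows how the two can diverge once errors are spread unevenly across classes.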

Title: TACOS: Open Tagging and Comparative Scoring for Instruction Fine-Tuning Data Selection

Authors: Xixiang He, Hao Yu, Qiyao Sun, Ao Cheng, Tailai Zhang, Cong Liu, Shuxuan Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03673
Pdf URL: https://arxiv.org/pdf/2507.03673
Copy Paste: [[2507.03673]] TACOS: Open Tagging and Comparative Scoring for Instruction Fine-Tuning Data Selection(https://arxiv.org/abs/2507.03673)
Keywords: language model, llm
Abstract: Instruction Fine-Tuning (IFT) is crucial for aligning large language models (LLMs) with human preferences, and selecting a small yet representative subset from massive data significantly facilitates IFT in terms of both efficiency and effectiveness. Nevertheless, existing approaches suffer from two limitations: the use of simple heuristics restricts data diversity, while singleton data quality evaluation leads to inconsistent criteria between independent samples. To address these issues, we present TACOS, an innovative method that integrates Open Tagging and Comparative Scoring for IFT data selection. To capture data diversity, we leverage LLMs to assign open-domain tags to human queries, followed by a normalization stage to denoise the open tags and enable efficient clustering. Additionally, we propose a comparative scoring method that allows relative quality evaluation of samples within a cluster, avoiding the inconsistent criteria seen in singleton-based evaluations. Extensive experiments across diverse datasets and LLM architectures demonstrate that TACOS outperforms existing approaches by a large margin. Notably, it achieves superior instruction-following performance on MT-Bench and ranks 1st among LLaMA2-7B-based models on AlpacaEval 2.0, illustrating its efficacy for IFT data selection.
摘要：指导微调（IFT）对于将大语言模型（LLM）与人类偏好保持一致，并且从大量数据中选择一个小但代表性的子集至关重要，从而在效率和有效性方面显着促进IFT。然而，现有的方法遭受了两个局限性：简单启发式方法限制了数据多样性，而Singleton数据质量评估却说明了独立样本之间的标准不一致。为了解决这些问题，我们提出了炸玉米饼，这是一种创新方法，该方法集成了IFT数据选择的开放标签和比较评分。为了捕获数据多样性，我们利用LLMS将开放域标签分配给人类查询，然后是标准化阶段，以降低开放标签并实现有效的聚类。此外，我们建议一种比较评分方法，该方法允许对集群中样品进行相对质量评估，从而避免了基于单胎的评估中看到的标准不一致。跨不同数据集和LLM体系结构进行的广泛实验表明，炸玉米饼的表现优于现有方法的大幅度。值得注意的是，它在MT板凳上实现了卓越的指导遵循性能，在基于Llama2-7b的模型中排名第一，在Alpacaeval 2.0上排名第一，说明了其对IFT数据选择的疗效。

Title: STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking

Authors: Tek Raj Chhetri, Yibei Chen, Puja Trivedi, Dorota Jarecka, Saif Haobsh, Patrick Ray, Lydia Ng, Satrajit S. Ghosh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03674
Pdf URL: https://arxiv.org/pdf/2507.03674
Copy Paste: [[2507.03674]] STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking(https://arxiv.org/abs/2507.03674)
Keywords: language model, llm, agent
Abstract: The ability to extract structured information from unstructured sources, such as free-text documents and scientific literature, is critical for accelerating scientific discovery and knowledge synthesis. Large Language Models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks, including structured information extraction. However, their effectiveness often diminishes in specialized, domain-specific contexts that require nuanced understanding and expert-level domain knowledge. In addition, existing LLM-based approaches frequently exhibit poor transferability across tasks and domains, limiting their scalability and adaptability. To address these challenges, we introduce StructSense, a modular, task-agnostic, open-source framework for structured information extraction built on LLMs. StructSense is guided by domain-specific symbolic knowledge encoded in ontologies, enabling it to navigate complex domain content more effectively. It further incorporates agentic capabilities through self-evaluative judges that form a feedback loop for iterative refinement, and includes human-in-the-loop mechanisms to ensure quality and validation. We demonstrate that StructSense can overcome both the limitations of domain sensitivity and the lack of cross-task generalizability, as shown through its application to diverse neuroscience information extraction tasks.

Title: Controlling Thinking Speed in Reasoning Models

Authors: Zhengkai Lin, Zhihang Fu, Ze Chen, Chao Chen, Liang Xie, Wenxiao Wang, Deng Cai, Zheng Wang, Jieping Ye
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03704
Pdf URL: https://arxiv.org/pdf/2507.03704
Copy Paste: [[2507.03704]] Controlling Thinking Speed in Reasoning Models(https://arxiv.org/abs/2507.03704)
Keywords: llm, prompt
Abstract: Human cognition is theorized to operate in two modes: fast, intuitive System 1 thinking and slow, deliberate System 2 thinking. While current Large Reasoning Models (LRMs) excel at System 2 thinking, their inability to perform fast thinking leads to high computational overhead and latency. In this work, we enable LRMs to approximate human intelligence through dynamic thinking speed adjustment, optimizing accuracy-efficiency trade-offs. Our approach addresses two key questions: (1) how to control thinking speed in LRMs, and (2) when to adjust it for optimal performance. For the first question, we identify the steering vector that governs slow-fast thinking transitions in LRMs' representation space. Using this vector, we achieve the first representation editing-based test-time scaling effect, outperforming existing prompt-based scaling methods. For the second question, we apply real-time difficulty estimation to signal reasoning segments of varying complexity. Combining these techniques, we propose the first reasoning strategy that enables fast processing of easy steps and deeper analysis for complex reasoning. Without any training or additional cost, our plug-and-play method yields an average +1.3% accuracy with -8.6% token usage across leading LRMs and advanced reasoning benchmarks. All of our algorithms are implemented based on vLLM and are expected to support broader applications and inspire future research.
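
The representation-editing idea in the abstract above can be pictured with a small sketch. This is not the authors' implementation: the function names `steer` and `projection`, the scalar `alpha`, and the toy vectors are assumptions made for illustration. The sketch only shows the generic operation of shifting a hidden state along a fixed direction, where the sign of `alpha` chooses which way to move.

```python
from typing import List


def _norm(vec: List[float]) -> float:
    """Euclidean length of a vector."""
    return sum(x * x for x in vec) ** 0.5


def steer(hidden: List[float], direction: List[float], alpha: float) -> List[float]:
    """Shift a hidden state by alpha along a unit-normalised direction (hypothetical helper)."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and direction must have the same length")
    length = _norm(direction)
    if length == 0.0:
        raise ValueError("steering direction must be non-zero")
    return [h + alpha * d / length for h, d in zip(hidden, direction)]


def projection(hidden: List[float], direction: List[float]) -> float:
    """Scalar projection of a hidden state onto the steering direction."""
    return sum(h * d for h, d in zip(hidden, direction)) / _norm(direction)


# Toy example: a positive alpha raises the projection, a negative alpha lowers it.
state = [1.0, 2.0, 3.0]
vector = [0.0, 0.0, 2.0]
faster = steer(state, vector, 0.5)
slower = steer(state, vector, -0.5)
```

In a real model the edit would be applied to activations inside the network at inference time; which layers to edit and how large `alpha` should be are not stated in the abstract.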

Title: Can LLMs Play Ô Ăn Quan Game? A Study of Multi-Step Planning and Decision Making

Authors: Sang Quang Nguyen, Kiet Van Nguyen, Vinh-Tiep Nguyen, Thanh Duc Ngo, Ngan Luu-Thuy Nguyen, Dinh-Duy Le
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03711
Pdf URL: https://arxiv.org/pdf/2507.03711
Copy Paste: [[2507.03711]] Can LLMs Play Ô Ăn Quan Game? A Study of Multi-Step Planning and Decision Making(https://arxiv.org/abs/2507.03711)
Keywords: language model, llm, agent
Abstract: In this paper, we explore the ability of large language models (LLMs) to plan and make decisions through the lens of the traditional Vietnamese board game, Ô Ăn Quan. This game, which involves a series of strategic token movements and captures, offers a unique environment for evaluating the decision-making and strategic capabilities of LLMs. Specifically, we develop various agent personas, ranging from aggressive to defensive, and employ the Ô Ăn Quan game as a testbed for assessing LLM performance across different strategies. Through experimentation with models like Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, and Llama-3.3-70B-Instruct, we aim to understand how these models execute strategic decision-making, plan moves, and manage dynamic game states. The results will offer insights into the strengths and weaknesses of LLMs in terms of reasoning and strategy, contributing to a deeper understanding of their general capabilities.

Title: MemOS: A Memory OS for AI System

Authors: Zhiyu Li, Shichao Song, Chenyang Xi, Hanyu Wang, Chen Tang, Simin Niu, Ding Chen, Jiawei Yang, Chunyu Li, Qingchen Yu, Jihao Zhao, Yezhaohui Wang, Peng Liu, Zehao Lin, Pengyuan Wang, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhen Tao, Junpeng Ren, Huayi Lai, Hao Wu, Bo Tang, Zhenren Wang, Zhaoxin Fan, Ningyu Zhang, Linfeng Zhang, Junchi Yan, Mingchuan Yang, Tong Xu, Wei Xu, Huajun Chen, Haofeng Wang, Hongkang Yang, Wentao Zhang, Zhi-Qin John Xu, Siheng Chen, Feiyu Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03724
Pdf URL: https://arxiv.org/pdf/2507.03724
Copy Paste: [[2507.03724]] MemOS: A Memory OS for AI System(https://arxiv.org/abs/2507.03724)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have become an essential infrastructure for Artificial General Intelligence (AGI), yet their lack of well-defined memory management systems hinders the development of long-context reasoning, continual personalization, and knowledge [...]. [...] models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended [...]. While Retrieval-Augmented Generation (RAG) introduces external knowledge in plain text, it remains a stateless workaround without lifecycle control or integration with persistent [...]. [...] work has modeled the training and inference cost of LLMs from a memory hierarchy perspective, showing that introducing an explicit memory layer between parameter memory and external retrieval can substantially reduce these costs by externalizing specific knowledge. Beyond computational efficiency, LLMs face broader challenges arising from how information is distributed over time and context, requiring systems capable of managing heterogeneous knowledge spanning different temporal scales and sources. To address this challenge, we propose MemOS, a memory operating system that treats memory as a manageable system resource. It unifies the representation, scheduling, and evolution of plaintext, activation-based, and parameter-level memories, enabling cost-efficient storage and retrieval. As the basic unit, a MemCube encapsulates both memory content and metadata such as provenance and versioning. MemCubes can be composed, migrated, and fused over time, enabling flexible transitions between memory types and bridging retrieval with parameter-based learning. MemOS establishes a memory-centric system framework that brings controllability, plasticity, and evolvability to LLMs, laying the foundation for continual learning and personalized modeling.

Title: OrthoRank: Token Selection via Sink Token Orthogonality for Efficient LLM inference

Authors: Seungjun Shin, Jaehoon Oh, Dokwan Oh
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03865
Pdf URL: https://arxiv.org/pdf/2507.03865
Copy Paste: [[2507.03865]] OrthoRank: Token Selection via Sink Token Orthogonality for Efficient LLM inference(https://arxiv.org/abs/2507.03865)
Keywords: language model, llm
Abstract: Attention mechanisms are central to the success of large language models (LLMs), enabling them to capture intricate token dependencies and implicitly assign importance to each token. Recent studies have revealed the sink token, which receives disproportionately high attention despite its limited semantic role. In this paper, we first expand the relationship between the sink token and other tokens, moving beyond attention to explore their similarity in hidden states, considering the layer depth. We observe that as the layers get deeper, the cosine similarity between the normalized hidden states of the sink token and those of other tokens increases, and that the normalized hidden states of the sink token exhibit negligible changes. These imply that other tokens are consistently directed toward the sink token throughout the layers. Next, we propose a dynamic token selection method, called OrthoRank, using these findings to select important tokens. Specifically, in a certain layer, we define token importance by the speed at which the token moves toward the sink token. This is converted into orthogonality with the sink token, meaning that tokens that are more orthogonal to the sink token are assigned greater importance. Finally, through extensive experiments, we demonstrated that our method results in lower perplexity and higher zero-shot accuracy compared to layer pruning methods at the same sparsity ratio with comparable throughput, while also achieving superior performance on LongBench.
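
The selection rule described in the abstract above (tokens more orthogonal to the sink token count as more important) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the use of absolute cosine similarity as the orthogonality score, the choice to leave the sink token out of the returned list, and the toy hidden states are all assumptions.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_orthogonal_tokens(hidden_states: List[List[float]],
                             sink_index: int, k: int) -> List[int]:
    """Return the positions of the k non-sink tokens most orthogonal to the sink token."""
    sink = hidden_states[sink_index]
    scored = [
        (abs(cosine(state, sink)), position)
        for position, state in enumerate(hidden_states)
        if position != sink_index
    ]
    scored.sort()  # smallest |cosine| first, i.e. most orthogonal first
    return sorted(position for _, position in scored[:k])


# Toy layer input: position 0 plays the role of the sink token.
states = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
selected = select_orthogonal_tokens(states, sink_index=0, k=2)
```

In this toy input, position 1 is almost parallel to the sink token and is therefore the one left out.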

Title: Demystifying ChatGPT: How It Masters Genre Recognition

Authors: Subham Raj, Sriparna Saha, Brijraj Singh, Niranjan Pedanekar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03875
Pdf URL: https://arxiv.org/pdf/2507.03875
Copy Paste: [[2507.03875]] Demystifying ChatGPT: How It Masters Genre Recognition(https://arxiv.org/abs/2507.03875)
Keywords: language model, gpt, llm, prompt, chat
Abstract: The introduction of ChatGPT has garnered significant attention within the NLP community and beyond. Previous studies have demonstrated ChatGPT's substantial advancements across various downstream NLP tasks, highlighting its adaptability and potential to revolutionize language-related applications. However, its capabilities and limitations in genre prediction remain unclear. This work analyzes three Large Language Models (LLMs) using the MovieLens-100K dataset to assess their genre prediction capabilities. Our findings show that ChatGPT, without fine-tuning, outperformed other LLMs, and fine-tuned ChatGPT performed best overall. We set up zero-shot and few-shot prompts using audio transcripts/subtitles from movie trailers in the MovieLens-100K dataset, covering 1682 movies of 18 genres, where each movie can have multiple genres. Additionally, we extended our study by extracting IMDb movie posters to utilize a Vision Language Model (VLM) with prompts for poster information. This fine-grained information was used to enhance existing LLM prompts. In conclusion, our study reveals ChatGPT's remarkable genre prediction capabilities, surpassing other language models. The integration of VLM further enhances our findings, showcasing ChatGPT's potential for content-related applications by incorporating visual information from movie posters.

Title: Losing our Tail -- Again: On (Un)Natural Selection And Multilingual Large Language Models

Authors: Eva Vanmassenhove
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.03933
Pdf URL: https://arxiv.org/pdf/2507.03933
Copy Paste: [[2507.03933]] Losing our Tail -- Again: On (Un)Natural Selection And Multilingual Large Language Models(https://arxiv.org/abs/2507.03933)
Keywords: language model, llm
Abstract: Multilingual Large Language Models (LLMs) considerably changed how technologies can influence language. While previous technologies could mediate or assist humans, there is now a tendency to offload the task of writing itself to these technologies, enabling them to change our linguistic ecosystem more directly. While they provide us quick access to information and impressively fluent output, beneath their apparent sophistication lies a subtle, more insidious threat: the gradual decline and loss of linguistic diversity. With this opinion piece, I explore how model collapse, with a particular focus on translation technology, can lead to the loss of linguistic forms, grammatical features, and cultural nuance. Model collapse refers to the eventual consequence of self-consuming training loops, where models reinforce their own biases and lose linguistic diversity. Drawing on recent work in Computer Vision, Natural Language Processing (NLP) and Machine Translation (MT), I argue that the tails of our linguistic distributions are vanishing, and with them, the narratives and identities they carry. This is a call to resist linguistic flattening and to reimagine NLP as a field that encourages, values and protects expressive multilingual lexical and linguistic diversity and creativity.

Title: Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents

Authors: Ziyang Miao, Qiyu Sun, Jingyuan Wang, Yuchen Gong, Yaowei Zheng, Shiqi Li, Richong Zhang
Subjects: cs.CL, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04009
Pdf URL: https://arxiv.org/pdf/2507.04009
Copy Paste: [[2507.04009]] Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents(https://arxiv.org/abs/2507.04009)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have shown impressive performance on general-purpose tasks, yet adapting them to specific domains remains challenging due to the scarcity of high-quality domain data. Existing data synthesis tools often struggle to extract reliable fine-tuning data from heterogeneous documents effectively. To address this limitation, we propose Easy Dataset, a unified framework for synthesizing fine-tuning data from unstructured documents via an intuitive graphical user interface (GUI). Specifically, Easy Dataset allows users to easily configure text extraction models and chunking strategies to transform raw documents into coherent text chunks. It then leverages a persona-driven prompting approach to generate diverse question-answer pairs using publicly available LLMs. Throughout the pipeline, a human-in-the-loop visual interface facilitates the review and refinement of intermediate outputs to ensure data quality. Experiments on a financial question-answering task show that fine-tuning LLMs on the synthesized dataset significantly improves domain-specific performance while preserving general knowledge. The source code and installable package are available at this https URL and have garnered over 9,000 GitHub stars.

Title: Nunchi-Bench: Benchmarking Language Models on Cultural Reasoning with a Focus on Korean Superstition

Authors: Kyuhee Kim, Sangah Lee
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2507.04014
Pdf URL: https://arxiv.org/pdf/2507.04014
Copy Paste: [[2507.04014]] Nunchi-Bench: Benchmarking Language Models on Cultural Reasoning with a Focus on Korean Superstition(https://arxiv.org/abs/2507.04014)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) become key advisors in various domains, their cultural sensitivity and reasoning skills are crucial in multicultural environments. We introduce Nunchi-Bench, a benchmark designed to evaluate LLMs' cultural understanding, with a focus on Korean superstitions. The benchmark consists of 247 questions spanning 31 topics, assessing factual knowledge, culturally appropriate advice, and situational interpretation. We evaluate multilingual LLMs in both Korean and English to analyze their ability to reason about Korean cultural contexts and how language variations affect performance. To systematically assess cultural reasoning, we propose a novel evaluation strategy with customized scoring metrics that capture the extent to which models recognize cultural nuances and respond appropriately. Our findings highlight significant challenges in LLMs' cultural reasoning. While models generally recognize factual information, they struggle to apply it in practical scenarios. Furthermore, explicit cultural framing enhances performance more effectively than relying solely on the language of the prompt. To support further research, we publicly release Nunchi-Bench alongside a leaderboard.

Title: LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models

Authors: Gaurav Srivastava, Aafiya Hussain, Sriram Srinivasan, Xuan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04023
Pdf URL: https://arxiv.org/pdf/2507.04023
Copy Paste: [[2507.04023]] LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models(https://arxiv.org/abs/2507.04023)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved remarkable performance on complex mathematical benchmarks, yet often struggle with simple arithmetic tasks and exhibit a tendency toward over-explaining or "overthinking" answers. To systematically assess this phenomenon, we introduce LLMThinkBench, a modular benchmarking framework that enables researchers to evaluate basic math reasoning and overthinking in LLMs. The framework provides 14 configurable math tasks with randomized test data generation and robust parsing strategies. Researchers can quantify overthinking using our Overthinking Score metric, which captures accuracy-verbosity tradeoffs through harmonic mean formulation. The tool offers flexible evaluation with a scalable vLLM/Transformers backend, multi-GPU support, and full configurability. Users can extend the tool with custom tasks, reproduce experiments with seeding, and generate detailed efficiency reports. Distributed as a pip-installable package with CLI and API access, LLMThinkBench provides researchers and practitioners with an accessible, cost-effective alternative to expensive LLM-as-a-judge methods for diagnosing basic reasoning capabilities and efficiency analysis. The package can be installed with: pip install llmthinkbench
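
The abstract above says the Overthinking Score combines accuracy and verbosity through a harmonic mean but does not give the formula. The sketch below is therefore only a stand-in that shows how a harmonic mean penalises an imbalance between the two quantities. The efficiency term (one minus average tokens over a token budget), the function names, and the numbers are assumptions, and none of this is the LLMThinkBench API.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative values; 0.0 if either is zero."""
    if a <= 0.0 or b <= 0.0:
        return 0.0
    return 2.0 * a * b / (a + b)


def token_efficiency(avg_tokens: float, budget: float) -> float:
    """Assumed efficiency term: share of a token budget left unused, clipped at 0."""
    if budget <= 0.0:
        raise ValueError("budget must be positive")
    return max(0.0, 1.0 - avg_tokens / budget)


def accuracy_verbosity_score(accuracy: float, avg_tokens: float, budget: float) -> float:
    """Hypothetical stand-in score: harmonic mean of accuracy and token efficiency."""
    return harmonic_mean(accuracy, token_efficiency(avg_tokens, budget))


# Two toy models with the same accuracy; the second one uses far more tokens.
concise = accuracy_verbosity_score(accuracy=0.8, avg_tokens=200, budget=1000)
verbose = accuracy_verbosity_score(accuracy=0.8, avg_tokens=900, budget=1000)
```

Because the harmonic mean is dominated by the smaller of its two inputs, the verbose model scores much lower even though its accuracy is unchanged.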

Title: Patient-Centered RAG for Oncology Visit Aid Following the Ottawa Decision Guide

Authors: Siyang Liu, Lawrence Chin-I An, Rada Mihalcea
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04026
Pdf URL: https://arxiv.org/pdf/2507.04026
Copy Paste: [[2507.04026]] Patient-Centered RAG for Oncology Visit Aid Following the Ottawa Decision Guide(https://arxiv.org/abs/2507.04026)
Keywords: language model, retrieval-augmented generation
Abstract: Effective communication is essential in cancer care, yet patients often face challenges in preparing for complex medical visits. We present an interactive, Retrieval-augmented Generation-assisted system that helps patients progress from uninformed to visit-ready. Our system adapts the Ottawa Personal Decision Guide into a dynamic retrieval-augmented generation workflow, helping users bridge knowledge gaps, clarify personal values and generate useful questions for their upcoming visits. Focusing on localized prostate cancer, we conduct a user study with patients and a clinical expert. Results show high system usability (UMUX Mean = 6.0 out of 7), strong relevance of generated content (Mean = 6.7 out of 7), minimal need for edits, and high clinical faithfulness (Mean = 6.82 out of 7). This work demonstrates the potential of combining patient-centered design with language models to enhance clinical preparation in oncology care.

Title: Beyond Independent Passages: Adaptive Passage Combination Retrieval for Retrieval Augmented Open-Domain Question Answering

Authors: Ting-Wen Ko, Jyun-Yu Jiang, Pu-Jen Cheng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04069
Pdf URL: https://arxiv.org/pdf/2507.04069
Copy Paste: [[2507.04069]] Beyond Independent Passages: Adaptive Passage Combination Retrieval for Retrieval Augmented Open-Domain Question Answering(https://arxiv.org/abs/2507.04069)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external documents at inference time, enabling up-to-date knowledge access without costly retraining. However, conventional RAG methods retrieve passages independently, often leading to redundant, noisy, or insufficiently diverse context, which is particularly problematic in noisy corpora and for multi-hop questions. To address this, we propose Adaptive Passage Combination Retrieval (AdaPCR), a novel framework for open-domain question answering with black-box LMs. AdaPCR explicitly models dependencies between passages by considering passage combinations as units for retrieval and reranking. It consists of a context-aware query reformulation using concatenated passages, and a reranking step trained with a predictive objective aligned with downstream answer likelihood. Crucially, AdaPCR adaptively selects the number of retrieved passages without additional stopping modules. Experiments across several QA benchmarks show that AdaPCR outperforms baselines, particularly in multi-hop reasoning, demonstrating the effectiveness of modeling inter-passage dependencies for improved retrieval.

Title: Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching

Authors: Thomas Savage
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04099
Pdf URL: https://arxiv.org/pdf/2507.04099
Copy Paste: [[2507.04099]] Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching(https://arxiv.org/abs/2507.04099)
Keywords: language model, llm
Abstract: Fine-tuning methods such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) have demonstrated success in training large language models (LLMs) for single-turn tasks. However, these methods fall short in multi-turn applications, such as diagnostic patient interviewing, where understanding how early conversational turns influence downstream completions and outcomes is essential. In medicine, a multi-turn perspective is critical for learning diagnostic schemas and better understanding conversation dynamics. To address this gap, I introduce Savage Conversation Forests (SCF), a reinforcement learning framework that leverages a branched conversation architecture to fine-tune LLMs for multi-turn dialogue. SCF generates multiple possible conversation continuations at each turn, enabling the model to learn how different early responses affect downstream interactions and diagnostic outcomes. In experiments simulating doctor-patient conversations, SCF with branching outperforms linear conversation architectures on diagnostic accuracy. I hypothesize that SCF's improvements stem from its ability to provide richer, interdependent training signals across conversation turns. These results suggest that a branched training architecture is an important strategy for fine tuning LLMs in complex multi-turn conversational tasks.

Title: BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering

Authors: Costas Mavromatis, Soji Adeshina, Vassilis N. Ioannidis, Zhen Han, Qi Zhu, Ian Robinson, Bryan Thompson, Huzefa Rangwala, George Karypis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04127
Pdf URL: https://arxiv.org/pdf/2507.04127
Copy Paste: [[2507.04127]] BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering(https://arxiv.org/abs/2507.04127)
Keywords: language model, llm, agent
Abstract: Knowledge graph question answering (KGQA) presents significant challenges due to the structural and semantic variations across input graphs. Existing works rely on Large Language Model (LLM) agents for graph traversal and retrieval, an approach that is sensitive to traversal initialization, as it is prone to entity linking errors and may not generalize well to custom ("bring-your-own") KGs. We introduce BYOKG-RAG, a framework that enhances KGQA by synergistically combining LLMs with specialized graph retrieval tools. In BYOKG-RAG, LLMs generate critical graph artifacts (question entities, candidate answers, reasoning paths, and OpenCypher queries), and graph tools link these artifacts to the KG and retrieve relevant graph context. The retrieved context enables the LLM to iteratively refine its graph linking and retrieval, before final answer generation. By retrieving context from different graph tools, BYOKG-RAG offers a more general and robust solution for QA over custom KGs. Through experiments on five benchmarks spanning diverse KG types, we demonstrate that BYOKG-RAG outperforms the second-best graph retrieval method by 4.5 percentage points while showing better generalization to custom KGs. The BYOKG-RAG framework is open-sourced at this https URL.

Title: Token Level Hallucination Detection via Variance in Language Models

Authors: Keshav Kumar
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04137
Pdf URL: https://arxiv.org/pdf/2507.04137
Copy Paste: [[2507.04137]] Token Level Hallucination Detection via Variance in Language Models(https://arxiv.org/abs/2507.04137)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: Large Language Models (LLMs) have demonstrated impressive generative capabilities across diverse tasks but remain susceptible to hallucinations, confidently generated yet factually incorrect outputs. We introduce a reference-free, token-level hallucination detection framework that leverages the variance in token log-probabilities across multiple stochastic generations. Unlike prior methods that require ground-truth references or sentence-level verification, our approach is model-agnostic, interpretable, and suited for real-time or post-hoc analysis. We evaluate our method on unanswerable question prompts from the SQuAD v2 dataset and benchmark across three autoregressive models of varying scales: GPT-Neo 125M, Falcon 1B, and Mistral 7B. Through both quantitative metrics and visual diagnostics, we show that token-level variance reliably highlights instability in model outputs and correlates with hallucination patterns. Our framework is lightweight, reproducible, and adaptable to multiple domains, offering a valuable diagnostic tool for analyzing generative reliability in LLMs.
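
Illustrative sketch (not taken from the paper): the abstract describes scoring tokens by the variance of their log-probabilities across several stochastic generations. The snippet below shows one plausible reading of that idea, assuming the generations are aligned by token position and that a fixed threshold marks a token as unstable. The function names, the alignment scheme, and the threshold value are all assumptions made for illustration.

```python
from statistics import pvariance


def token_logprob_variance(samples):
    """Per-position variance of token log-probabilities.

    `samples` is a list of generations; each generation is a list of
    token log-probabilities. Positions are compared index by index
    (an assumed alignment), truncated to the shortest generation.
    """
    if len(samples) < 2:
        raise ValueError("need at least two generations to measure variance")
    shared_len = min(len(s) for s in samples)
    return [pvariance([s[i] for s in samples]) for i in range(shared_len)]


def flag_unstable_tokens(samples, threshold=0.5):
    """Return positions whose variance exceeds the (hypothetical) threshold."""
    variances = token_logprob_variance(samples)
    return [i for i, v in enumerate(variances) if v > threshold]


# Three sampled generations of the same prompt, three tokens each.
samples = [
    [-0.1, -0.2, -2.5],
    [-0.1, -0.3, -0.4],
    [-0.1, -0.2, -1.6],
]
print(flag_unstable_tokens(samples))
```

Only the third position shows a large spread in log-probability, so it is the only one flagged. In practice the log-probabilities would come from whatever model is under study, and the threshold would need calibration per model.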

Title: Dissecting Clinical Reasoning in Language Models: A Comparative Study of Prompts and Model Adaptation Strategies

Authors: Mael Jullien, Marco Valentino, Leonardo Ranaldi, Andre Freitas
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04142
Pdf URL: https://arxiv.org/pdf/2507.04142
Copy Paste: [[2507.04142]] Dissecting Clinical Reasoning in Language Models: A Comparative Study of Prompts and Model Adaptation Strategies(https://arxiv.org/abs/2507.04142)
Keywords: language model, gpt, llm, prompt
Abstract: Recent works on large language models (LLMs) have demonstrated the impact of prompting strategies and fine-tuning techniques on their reasoning capabilities. Yet, their effectiveness on clinical natural language inference (NLI) remains underexplored. This study presents the first controlled evaluation of how prompt structure and efficient fine-tuning jointly shape model performance in clinical NLI. We inspect four classes of prompting strategies to elicit reasoning in LLMs at different levels of abstraction, and evaluate their impact on a range of clinically motivated reasoning types. For each prompting strategy, we construct high-quality demonstrations using a frontier model to distil multi-step reasoning capabilities into smaller models (4B parameters) via Low-Rank Adaptation (LoRA). Across different language models fine-tuned on the NLI4CT benchmark, we found that prompt type alone accounts for up to 44% of the variance in macro-F1. Moreover, LoRA fine-tuning yields consistent gains of +8 to 12 F1, raises output alignment above 97%, and narrows the performance gap to GPT-4o-mini to within 7.1%. Additional experiments on reasoning generalisation reveal that LoRA improves performance in 75% of the models on MedNLI and TREC Clinical Trials Track. Overall, these findings demonstrate that (i) prompt structure is a primary driver of clinical reasoning performance, (ii) compact models equipped with strong prompts and LoRA can rival frontier-scale systems, and (iii) reasoning-type-aware evaluation is essential to uncover prompt-induced trade-offs. Our results highlight the promise of combining prompt design and lightweight adaptation for more efficient and trustworthy clinical NLP systems, providing insights on the strengths and limitations of widely adopted prompting and parameter-efficient techniques in highly specialised domains.
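
Illustrative sketch (not taken from the paper): Low-Rank Adaptation, which the abstract uses for fine-tuning, keeps a pretrained weight matrix W frozen and learns a low-rank update, so the effective weight is W + (alpha / r) * B A with rank r much smaller than the matrix dimensions. The snippet below computes that effective weight for tiny hand-written matrices. The concrete shapes, values, and the `alpha` setting are assumptions for illustration, and the abstract does not state which hyperparameters the authors used.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A).

    w: frozen weight, shape (d_out, d_in)
    a: trainable factor A, shape (r, d_in)
    b: trainable factor B, shape (d_out, r)
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen 2x3 weight
a = [[0.5, 0.0, -1.0]]                    # rank-1 factor A (1x3)
b = [[1.0], [2.0]]                        # rank-1 factor B (2x1)
print(lora_effective_weight(w, a, b, alpha=2.0))
```

Here only the 5 entries of A and B would be trained instead of the 6 entries of W; the saving grows quickly as the matrix dimensions increase relative to the rank.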

Title: Large Language Models for Zero-Shot Multicultural Name Recognition

Authors: Thanakorn Phonchai, Surasakdi Siripong, Nicholas Patterson, Owen Campbell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04149
Pdf URL: https://arxiv.org/pdf/2507.04149
Copy Paste: [[2507.04149]] Large Language Models for Zero-Shot Multicultural Name Recognition(https://arxiv.org/abs/2507.04149)
Keywords: language model, llm, prompt
Abstract: The robust and accurate recognition of multicultural names, particularly those not previously encountered, is a critical challenge in an increasingly globalized digital landscape. Traditional methods often falter when confronted with the vast diversity and novel permutations of names across different linguistic and cultural backgrounds. This paper introduces a novel framework, Prompt-Engineered Fine-Tuning (PEFT) for Large Language Models (LLMs) with Adversarial Data Augmentation and Cultural Knowledge Graph Integration, designed to significantly enhance zero-shot multicultural name recognition. Our approach leverages the powerful linguistic understanding of pre-trained LLMs, transforming the recognition task into a guided generation problem. Through meticulous prompt engineering, dynamic integration of explicit cultural knowledge derived from knowledge graphs, and the strategic application of adversarial data augmentation, we equip the LLM with an unprecedented ability to infer the cultural origin of unseen names. Extensive experiments demonstrate that our PEFT method consistently outperforms established deep learning baselines, including advanced Bi-LSTM models with cultural tags, achieving an impressive 93.1% overall accuracy and a remarkable 89.5% accuracy on challenging zero-shot name identification. An in-depth ablation study confirms the synergistic contribution of each component, while a human evaluation highlights our method's performance approaching human expert judgment. This work signifies a substantial leap in multicultural name recognition, offering a highly effective and scalable solution for real-world applications.

Title: SymbolicThought: Integrating Language Models and Symbolic Reasoning for Consistent and Interpretable Human Relationship Understanding

Authors: Runcong Zhao, Qinglin Zhu, Hainiu Xu, Bin Liang, Yulan He, Lin Gui
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2507.04189
Pdf URL: https://arxiv.org/pdf/2507.04189
Copy Paste: [[2507.04189]] SymbolicThought: Integrating Language Models and Symbolic Reasoning for Consistent and Interpretable Human Relationship Understanding(https://arxiv.org/abs/2507.04189)
Keywords: language model, llm
Abstract: Understanding character relationships is essential for interpreting complex narratives and conducting socially grounded AI research. However, manual annotation is time-consuming and low in coverage, while large language models (LLMs) often produce hallucinated or logically inconsistent outputs. We present SymbolicThought, a human-in-the-loop framework that combines LLM-based extraction with symbolic reasoning. The system constructs editable character relationship graphs, refines them using seven types of logical constraints, and enables real-time validation and conflict resolution through an interactive interface. To support logical supervision and explainable social analysis, we release a dataset of 160 interpersonal relationships with corresponding logical structures. Experiments show that SymbolicThought improves annotation accuracy and consistency while significantly reducing time cost, offering a practical tool for narrative understanding, explainable AI, and LLM evaluation.

Title: Context Tuning for In-Context Optimization

Authors: Jack Lu, Ryan Teehan, Zhenbang Yang, Mengye Ren
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04221
Pdf URL: https://arxiv.org/pdf/2507.04221
Copy Paste: [[2507.04221]] Context Tuning for In-Context Optimization(https://arxiv.org/abs/2507.04221)
Keywords: language model, llm, prompt
Abstract: We introduce Context Tuning, a simple and effective method to significantly enhance few-shot adaptation of large language models (LLMs) without fine-tuning model parameters. While prompt-based adaptation techniques have demonstrated the effectiveness of lightweight adaptation methods for LLMs, they typically initialize a trainable prompt or prefix with tokens that are irrelevant to the task at hand. In contrast, Context Tuning initializes the trainable prompt or prefix with task-specific demonstration examples, leveraging the model's inherent In-Context Learning (ICL) ability to extract relevant information for improved few-shot learning performance. Extensive evaluations on benchmarks such as CrossFit, UnifiedQA, MMLU, BIG-Bench Hard, and ARC demonstrate that Context Tuning outperforms traditional prompt-based adaptation methods and achieves accuracy competitive with Test-Time Training at significantly higher training efficiency.

Title: Fairness Evaluation of Large Language Models in Academic Library Reference Services

Authors: Haining Wang, Jason Clark, Yueru Yan, Star Bradley, Ruiyang Chen, Yiqiong Zhang, Hengyi Fu, Zuoyu Tian
Subjects: cs.CL, cs.AI, cs.DL
Abstract URL: https://arxiv.org/abs/2507.04224
Pdf URL: https://arxiv.org/pdf/2507.04224
Copy Paste: [[2507.04224]] Fairness Evaluation of Large Language Models in Academic Library Reference Services(https://arxiv.org/abs/2507.04224)
Keywords: language model, llm, prompt
Abstract: As libraries explore large language models (LLMs) for use in virtual reference services, a key question arises: Can LLMs serve all users equitably, regardless of demographics or social status? While they offer great potential for scalable support, LLMs may also reproduce societal biases embedded in their training data, risking the integrity of libraries' commitment to equitable service. To address this concern, we evaluate whether LLMs differentiate responses across user identities by prompting six state-of-the-art LLMs to assist patrons differing in sex, race/ethnicity, and institutional role. We found no evidence of differentiation by race or ethnicity, and only minor evidence of stereotypical bias against women in one model. LLMs demonstrated nuanced accommodation of institutional roles through the use of linguistic choices related to formality, politeness, and domain-specific vocabularies, reflecting professional norms rather than discriminatory treatment. These findings suggest that current LLMs show a promising degree of readiness to support equitable and contextually appropriate communication in academic library reference services.

Title: No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem

Authors: Dasol Choi, Woomyoung Park, Youngsook Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04329
Pdf URL: https://arxiv.org/pdf/2507.04329
Copy Paste: [[2507.04329]] No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem(https://arxiv.org/abs/2507.04329)
Keywords: language model, llm
Abstract: Recent advances in Natural Language Processing (NLP) have underscored the crucial role of high-quality datasets in building large language models (LLMs). However, while extensive resources and analyses exist for English, the landscape for East Asian languages - particularly Chinese, Japanese, and Korean (CJK) - remains fragmented and underexplored, despite these languages together serving over 1.6 billion speakers. To address this gap, we investigate the HuggingFace ecosystem from a cross-linguistic perspective, focusing on how cultural norms, research environments, and institutional practices shape dataset availability and quality. Drawing on more than 3,300 datasets, we employ quantitative and qualitative methods to examine how these factors drive distinct creation and curation patterns across Chinese, Japanese, and Korean NLP communities. Our findings highlight the large-scale and often institution-driven nature of Chinese datasets, grassroots community-led development in Korean NLP, and an entertainment- and subculture-focused emphasis on Japanese collections. By uncovering these patterns, we reveal practical strategies for enhancing dataset documentation, licensing clarity, and cross-lingual resource sharing - ultimately guiding more effective and culturally attuned LLM development in East Asia. We conclude by discussing best practices for future dataset curation and collaboration, aiming to strengthen resource development across all three languages.

Title: Large Language Models' Varying Accuracy in Recognizing Risk-Promoting and Health-Supporting Sentiments in Public Health Discourse: The Cases of HPV Vaccination and Heated Tobacco Products

Authors: Soojong Kim, Kwanho Kim, Hye Min Kim
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2507.04364
Pdf URL: https://arxiv.org/pdf/2507.04364
Copy Paste: [[2507.04364]] Large Language Models' Varying Accuracy in Recognizing Risk-Promoting and Health-Supporting Sentiments in Public Health Discourse: The Cases of HPV Vaccination and Heated Tobacco Products(https://arxiv.org/abs/2507.04364)
Keywords: language model, gpt, llm
Abstract: Machine learning methods are increasingly applied to analyze health-related public discourse based on large-scale data, but questions remain regarding their ability to accurately detect different types of health sentiments. In particular, Large Language Models (LLMs) have gained attention as a powerful technology, yet their accuracy and feasibility in capturing different opinions and perspectives on health issues are largely unexplored. Thus, this research examines how accurately three prominent LLMs (GPT, Gemini, and LLAMA) detect risk-promoting versus health-supporting sentiments across two critical public health topics: Human Papillomavirus (HPV) vaccination and heated tobacco products (HTPs). Drawing on data from Facebook and Twitter, we curated multiple sets of messages supporting or opposing recommended health behaviors, supplemented with human annotations as the gold standard for sentiment classification. The findings indicate that all three LLMs generally demonstrate substantial accuracy in classifying risk-promoting and health-supporting sentiments, although notable discrepancies emerge by platform, health issue, and model type. Specifically, models often show higher accuracy for risk-promoting sentiment on Facebook, whereas health-supporting messages on Twitter are more accurately detected. An additional analysis also shows the challenges LLMs face in reliably detecting neutral messages. These results highlight the importance of carefully selecting and validating language models for public health analyses, particularly given potential biases in training data that may lead LLMs to overestimate or underestimate the prevalence of certain perspectives.

Title: Does Learning Mathematical Problem-Solving Generalize to Broader Reasoning?

Authors: Ruochen Zhou, Minrui Xu, Shiqi Chen, Junteng Liu, Yunqi Li, Xinxin Lin, Zhengyu Chen, Junxian He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04391
Pdf URL: https://arxiv.org/pdf/2507.04391
Copy Paste: [[2507.04391]] Does Learning Mathematical Problem-Solving Generalize to Broader Reasoning?(https://arxiv.org/abs/2507.04391)
Keywords: language model, chain-of-thought
Abstract: There has been a growing interest in enhancing the mathematical problem-solving (MPS) capabilities of large language models. While the majority of research efforts concentrate on creating specialized models to solve mathematical problems, it remains unknown how learning mathematical problem-solving generalizes to help develop other reasoning abilities. In this paper, we present an empirical investigation into the generalization potential of various MPS training approaches, such as continual pretraining, instruction tuning, and rule-based reinforcement learning across various data sources, including both short and long chain-of-thought (CoT) samples. Evaluation on 5 mathematical and 8 general reasoning benchmarks shows that continual pretraining on math text is able to generalize to general reasoning tasks to some extent. In contrast, instruction tuning on conventional, short MPS samples provides limited benefits and, in many cases, even impairs generalization performance. Notably, training with long CoT responses for MPS samples and incorporating rule-based reinforcement learning on MPS queries exhibit distinct behavior, significantly enhancing generalization by extending the model's reasoning processes into other domains. These results suggest that traditional approaches to learning MPS with short reasoning chains largely fail to achieve robust generalization. However, the emerging paradigm of longer reasoning chains, coupled with self-reflection, offers a promising direction for improving generalized reasoning abilities through learning from specialized domains.

Title: SpiritRAG: A Q&A System for Religion and Spirituality in the United Nations Archive

Authors: Yingqiang Gao, Fabian Winiger, Patrick Montjourides, Anastassia Shaitarova, Nianlong Gu, Simon Peng-Keller, Gerold Schneider
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04395
Pdf URL: https://arxiv.org/pdf/2507.04395
Copy Paste: [[2507.04395]] SpiritRAG: A Q&A System for Religion and Spirituality in the United Nations Archive(https://arxiv.org/abs/2507.04395)
Keywords: chat, retrieval-augmented generation
Abstract: Religion and spirituality (R/S) are complex and highly domain-dependent concepts which have long confounded researchers and policymakers. Due to their context-specificity, R/S are difficult to operationalize in conventional archival search strategies, particularly when datasets are very large, poorly accessible, and marked by information noise. As a result, considerable time investments and specialist knowledge are often needed to extract actionable insights related to R/S from general archival sources, increasing reliance on published literature and manual desk reviews. To address this challenge, we present SpiritRAG, an interactive Question Answering (Q&A) system based on Retrieval-Augmented Generation (RAG). Built using 7,500 United Nations (UN) resolution documents related to R/S in the domains of health and education, SpiritRAG allows researchers and policymakers to conduct complex, context-sensitive database searches of very large datasets using an easily accessible, chat-based web interface. SpiritRAG is lightweight to deploy and leverages both UN documents and user-provided documents as source material. A pilot test and evaluation with domain experts on 100 manually composed questions demonstrates the practical value and usefulness of SpiritRAG.

Title: THM@SimpleText 2025 -- Task 1.1: Revisiting Text Simplification based on Complex Terms for Non-Experts

Authors: Nico Hofmann, Julian Dauenhauer, Nils Ole Dietzler, Idehen Daniel Idahor, Christin Katharina Kreutz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04414
Pdf URL: https://arxiv.org/pdf/2507.04414
Copy Paste: [[2507.04414]] THM@SimpleText 2025 -- Task 1.1: Revisiting Text Simplification based on Complex Terms for Non-Experts(https://arxiv.org/abs/2507.04414)
Keywords: language model
Abstract: Scientific text is complex as it contains technical terms by definition. Simplifying such text for non-domain experts enhances accessibility of innovation and information. Politicians could be enabled to understand new findings on topics on which they intend to pass a law, or family members of seriously ill patients could read about clinical trials. The SimpleText CLEF Lab focuses on exactly this problem of simplification of scientific text. Task 1.1 of the 2025 edition specifically handles the simplification of complex sentences, that is, very short texts with little context. To tackle this task, we investigate the identification of complex terms in sentences, which are then rephrased for non-expert readers using small language models from the Gemini and OpenAI families.

Title: MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind

Authors: Emilio Villa-Cueva, S M Masrur Ahmed, Rendi Chevi, Jan Christian Blaise Cruz, Kareem Elzeky, Fermin Cristobal, Alham Fikri Aji, Skyler Wang, Rada Mihalcea, Thamar Solorio
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04415
Pdf URL: https://arxiv.org/pdf/2507.04415
Copy Paste: [[2507.04415]] MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind(https://arxiv.org/abs/2507.04415)
Keywords: language model, llm, agent
Abstract: Understanding Theory of Mind is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MOMENTS (Multimodal Mental States), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (LLMs) through realistic, narrative-rich scenarios presented in short films. MOMENTS includes over 2,344 multiple-choice questions spanning seven distinct ToM categories. The benchmark features long video context windows and realistic social interactions that provide deeper insight into characters' mental states. While the visual modality generally enhances model performance, current systems still struggle to integrate it effectively, underscoring the need for further research into AI's multimodal understanding of human behavior.

Title: RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling

Authors: Xiuying Wei, Anunay Yadav, Razvan Pascanu, Caglar Gulcehre
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04416
Pdf URL: https://arxiv.org/pdf/2507.04416
Copy Paste: [[2507.04416]] RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling(https://arxiv.org/abs/2507.04416)
Keywords: language model
Abstract: Transformers have become the cornerstone of modern large-scale language models; however, their dependence on softmax attention poses a major computational bottleneck, particularly in long-context settings. In this work, rather than following prevalent approaches such as linear attention (or SSMs) and local attention, we introduce RAT, an intermediate design between recurrence and attention mechanisms. It partitions the input into chunks, applies a simple linear recurrence within each chunk to capture local dependencies, and then performs softmax attention across chunks to model long-range interactions. By adjusting the size of the chunk, RAT enables flexible trade-offs, combining the strengths of RNN and attention. Empirically, with a chunk size of 16, the RAT layer achieves a 7× improvement in training speed with 100K token sequences and 9× in generation at 4K sequence length, while maintaining similar or sometimes even better accuracy compared to standard attention. We demonstrate this by training 1.3B parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks, as well as supervised fine-tuning (SFT). We further propose a hybrid architecture that interleaves RAT with local attention. By combining efficient long-range modeling with strong local interactions, this hybrid design not only improves inference speed and reduces cache memory usage compared to attention, but also consistently enhances performance, for example, achieving an average 1 point gain in commonsense reasoning tasks, up to 4 points on code tasks, and a 1 point Rouge-L increase in a summarization SFT task. Code is available at this https URL.
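
Illustrative sketch (not the authors' implementation): the abstract describes partitioning the input into chunks, running a simple linear recurrence inside each chunk, and then applying softmax attention across chunks. The toy code below follows that outline under heavy simplifying assumptions: a fixed scalar gate for the recurrence, one summary vector per chunk (its final recurrent state), and causal attention in which each chunk summary acts as query, key, and value. None of these choices are stated in the abstract, and all function names are hypothetical.

```python
import math


def chunk(seq, size):
    """Split a sequence into consecutive chunks of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def linear_recurrence(xs, gate=0.5):
    """h_t = gate * h_{t-1} + (1 - gate) * x_t, starting from zeros."""
    h = [0.0] * len(xs[0])
    states = []
    for x in xs:
        h = [gate * h_i + (1 - gate) * x_i for h_i, x_i in zip(h, x)]
        states.append(h)
    return states


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_chunk_attention(summaries):
    """Causal softmax attention: chunk i attends to chunks 0..i."""
    dim = len(summaries[0])
    outputs = []
    for i, query in enumerate(summaries):
        keys = summaries[: i + 1]
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * key[c] for w, key in zip(weights, keys)) for c in range(dim)]
        )
    return outputs


def rat_like_layer(seq, chunk_size, gate=0.5):
    """Recurrence within chunks, then attention across chunk summaries."""
    summaries = [linear_recurrence(c, gate)[-1] for c in chunk(seq, chunk_size)]
    return cross_chunk_attention(summaries)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
print(rat_like_layer(tokens, chunk_size=2))
```

With chunk size L, the attention step in this sketch sees roughly 1/L as many positions as full token-level attention, which is the kind of trade-off the abstract attributes to adjusting the chunk size. This simplified version returns one vector per chunk rather than per token.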

Title: GradOT: Training-free Gradient-preserving Offsite-tuning for Large Language Models

Authors: Kai Yao, Zhaorui Tan, Penglei Gao, Lichun Li, Kaixin Wu, Yinggui Wang, Yuan Zhao, Yixin Ji, Wei Wang, Jianke Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04455
Pdf URL: https://arxiv.org/pdf/2507.04455
Copy Paste: [[2507.04455]] GradOT: Training-free Gradient-preserving Offsite-tuning for Large Language Models(https://arxiv.org/abs/2507.04455)
Keywords: language model, llm
Abstract: With the rapid growth of large language models (LLMs), traditional centralized fine-tuning has emerged as a key technique for adapting these models to domain-specific challenges, but it poses privacy risks for both model and data owners. One promising solution, called offsite-tuning (OT), has been proposed to address these challenges: a weaker emulator is compressed from the original model and further fine-tuned with adapters to enhance privacy. However, existing OT-based methods incur high computational costs and lack theoretical analysis. This paper introduces a novel OT approach based on gradient-preserving compression, named GradOT. By analyzing the OT problem through the lens of optimization, we propose a method that selectively applies compression techniques such as rank compression and channel pruning, preserving the gradients of fine-tuned adapters while ensuring privacy. Extensive experiments demonstrate that our approach surpasses existing OT methods, both in terms of privacy protection and model performance. Our method provides a theoretical foundation for OT and offers a practical, training-free solution for offsite-tuning of large-scale LLMs.

Title: Think Twice Before You Judge: Mixture of Dual Reasoning Experts for Multimodal Sarcasm Detection

Authors: Soumyadeep Jana, Abhrajyoti Kundu, Sanasam Ranbir Singh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04458
Pdf URL: https://arxiv.org/pdf/2507.04458
Copy Paste: [[2507.04458]] Think Twice Before You Judge: Mixture of Dual Reasoning Experts for Multimodal Sarcasm Detection(https://arxiv.org/abs/2507.04458)
Keywords: language model, prompt, chain-of-thought
Abstract: Multimodal sarcasm detection has attracted growing interest due to the rise of multimedia posts on social media. Understanding sarcastic image-text posts often requires external contextual knowledge, such as cultural references or commonsense reasoning. However, existing models struggle to capture the deeper rationale behind sarcasm, relying mainly on shallow cues like image captions or object-attribute pairs from images. To address this, we propose MiDRE (Mixture of Dual Reasoning Experts), which integrates an internal reasoning expert for detecting incongruities within the image-text pair and an external reasoning expert that utilizes structured rationales generated via Chain-of-Thought prompting of a Large Vision-Language Model. An adaptive gating mechanism dynamically weighs the two experts, selecting the most relevant reasoning path. Experiments on two benchmark datasets show that MiDRE achieves superior performance over baselines. Various qualitative analyses highlight the crucial role of external rationales, revealing that even when they are occasionally noisy, they provide valuable cues that guide the model toward a better understanding of sarcasm.
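
Illustrative sketch (not the authors' architecture): the abstract mentions an adaptive gating mechanism that weighs an internal and an external reasoning expert. One common way to realize such a gate, assumed here purely for illustration, is a sigmoid over a linear function of both expert representations, followed by a convex combination. The gate form, the parameter names, and the vector sizes are all hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(h_internal, h_external, gate_weights, gate_bias=0.0):
    """Blend two expert representations with a scalar gate.

    g = sigmoid(w . [h_internal ; h_external] + b)
    fused = g * h_internal + (1 - g) * h_external
    """
    concat = h_internal + h_external
    g = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [g * a + (1 - g) * b for a, b in zip(h_internal, h_external)]
    return g, fused


# With all-zero gate weights the gate is neutral and the experts are averaged.
g, fused = gated_fusion([1.0, 0.0], [0.0, 1.0], gate_weights=[0.0] * 4)
print(g, fused)
```

A learned gate would push g toward 1 when the internal expert is more informative for a given post and toward 0 when the external rationale carries the signal.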

Title: Dual Modality-Aware Gated Prompt Tuning for Few-Shot Multimodal Sarcasm Detection

Authors: Soumyadeep Jana, Abhrajyoti Kundu, Sanasam Ranbir Singh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04468
Pdf URL: https://arxiv.org/pdf/2507.04468
Copy Paste: [[2507.04468]] Dual Modality-Aware Gated Prompt Tuning for Few-Shot Multimodal Sarcasm Detection(https://arxiv.org/abs/2507.04468)
Keywords: prompt
Abstract: The widespread use of multimodal content on social media has heightened the need for effective sarcasm detection to improve opinion mining. However, existing models rely heavily on large annotated datasets, making them less suitable for real-world scenarios where labeled data is scarce. This motivates the need to explore the problem in a few-shot setting. To this end, we introduce DMDP (Deep Modality-Disentangled Prompt Tuning), a novel framework for few-shot multimodal sarcasm detection. Unlike prior methods that use shallow, unified prompts across modalities, DMDP employs gated, modality-specific deep prompts for text and visual encoders. These prompts are injected across multiple layers to enable hierarchical feature learning and better capture diverse sarcasm types. To enhance intra-modal learning, we incorporate a prompt-sharing mechanism across layers, allowing the model to aggregate both low-level and high-level semantic cues. Additionally, a cross-modal prompt alignment module enables nuanced interactions between image and text representations, improving the model's ability to detect subtle sarcastic intent. Experiments on two public datasets demonstrate DMDP's superior performance in both few-shot and extremely low-resource settings. Further cross-dataset evaluations show that DMDP generalizes well across domains, consistently outperforming baseline methods.
摘要：社交媒体上多模式内容的广泛使用增加了有效讽刺检测以改善意见采矿的需求。但是，现有模型在很大程度上依赖于大型注释的数据集，从而使其不适合稀缺标记数据的现实情况。这激发了需要在几次设置中探索问题的必要性。为此，我们介绍了DMDP（深度模态 - 触发及时调整），这是一个新颖的框架，用于几次多模式讽刺检测。与使用浅的方法不同，跨模态的统一提示与文本和视觉编码器采用封闭式的，特定于模态的深层提示。这些提示会在多个层中注入，以实现层次特征学习并更好地捕获多样化的讽刺类型。为了增强模式内的学习，我们在各个层之间结合了迅速共享的机制，从而使模型可以汇总低级和高级语义提示。此外，一个跨模式提示对准模块可实现图像和文本表示之间的细微互动，从而提高了模型检测微妙的讽刺意图的能力。在两个公共数据集上的实验证明了DMDP在几次和极低的资源设置中的出色性能。进一步的跨数据库评估表明，DMDP在跨域中很好地概括，始终超过基线方法。

Title: Unveiling the Potential of Diffusion Large Language Model in Controllable Generation

Authors: Zhen Xiong, Yujun Cai, Zhecheng Li, Yiwei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04504
Pdf URL: https://arxiv.org/pdf/2507.04504
Copy Paste: [[2507.04504]] Unveiling the Potential of Diffusion Large Language Model in Controllable Generation(https://arxiv.org/abs/2507.04504)
Keywords: language model, llm, hallucination
Abstract: Diffusion models, originally developed for image generation, have emerged as a promising alternative to autoregressive large language models (LLMs). We present a theoretical analysis comparing autoregressive and masked diffusion LLMs, revealing that the intrinsic bidirectional attention mechanism of diffusion LLMs (dLLMs) enables superior context modeling and generation controllability. However, existing dLLM applications face significant challenges in controllable generation: the native multi-step denoising process exhibits high sensitivity to sequence length, elevated hallucination rates, and prohibitive inference costs without specialized optimizations. To address these limitations, we propose Self-adaptive Schema Scaffolding ($S^3$), a novel framework that enables dLLMs to generate structured outputs (e.g., JSON) while maintaining semantic fidelity and accelerating inference. Our approach injects the target schema structure into the output context, reducing unnecessary computation while improving controllability. Extensive experiments demonstrate that $S^3$ achieves substantial improvements: 65% increase in structural adherence, 48% enhancement in content fidelity, and 17% reduction in hallucination rates compared to baseline. These results establish both theoretical foundations and practical pathways for deploying diffusion models in controllable text generation tasks. Code and data will be publicly released.
摘要：最初为图像生成而开发的扩散模型，已成为自回归大型语言模型（LLM）的一种有前景的替代方案。我们给出了比较自回归 LLM 与掩码扩散 LLM 的理论分析，表明扩散 LLM（dLLM）固有的双向注意力机制能够实现更优的上下文建模和生成可控性。然而，现有的 dLLM 应用在可控生成方面面临重大挑战：在没有专门优化的情况下，原生的多步去噪过程对序列长度高度敏感，幻觉率偏高，且推理成本过高。为了解决这些局限，我们提出了 Self-adaptive Schema Scaffolding（$S^3$），这是一个新颖的框架，使 dLLM 能够生成结构化输出（例如 JSON），同时保持语义保真度并加速推理。我们的方法将目标模式（schema）结构注入输出上下文，在提高可控性的同时减少不必要的计算。大量实验表明，$S^3$ 取得了显著改进：与基线相比，结构遵循度提高 65%，内容保真度提升 48%，幻觉率降低 17%。这些结果为在可控文本生成任务中部署扩散模型奠定了理论基础并提供了实用途径。代码和数据将公开发布。

Title: DP-Fusion: Token-Level Differentially Private Inference for Large Language Models

Authors: Rushil Thareja, Preslav Nakov, Praneeth Vepakomma, Nils Lukas
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04531
Pdf URL: https://arxiv.org/pdf/2507.04531
Copy Paste: [[2507.04531]] DP-Fusion: Token-Level Differentially Private Inference for Large Language Models(https://arxiv.org/abs/2507.04531)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) can leak sensitive information from their context through generated outputs, either accidentally or when prompted adversarially. Existing defenses that aim to preserve context privacy during inference either lack formal guarantees or suffer from a poor utility/privacy trade-off. We propose DP-Fusion, a token-level Differentially Private Inference (DPI) mechanism that provably bounds how much an LLM's outputs reveal about sensitive tokens in its context. We demonstrate DPI through the task of document privatization, where the goal is to paraphrase documents so that sensitive content (e.g., Personally Identifiable Information, PII) cannot be reliably inferred, while still preserving the overall utility of the text. This is controlled by a parameter $\epsilon$: $\epsilon=0$ hides PII entirely, while higher values trade off privacy for improved paraphrase quality. DP-Fusion works as follows: (i) partition sensitive tokens into disjoint privacy groups, (ii) run the LLM once per group, and (iii) blend the output distributions so that the final output remains within a fixed statistical distance of the baseline distribution produced when no privacy group is revealed. This approach allows fine-grained control over the privacy/utility trade-off but requires multiple LLM forward passes.
摘要：大型语言模型（LLMS）可以通过生成的输出偶然或在对手提示时通过生成的输出从其上下文中泄漏敏感信息。旨在在推断期间保留环境隐私的现有防御能力要么缺乏正式保证，要么遭受效用差/隐私权权衡的损失。我们提出了DP-Fusion，这是一种令牌级别的差异化私有推论（DPI）机制，可以证明LLM的输出在其上下文中揭示了有关敏感令牌的多少。我们通过文档私有化的任务来证明DPI，目标是解释文档，以便无法可靠地推断敏感内容（例如，个人身份信息，PII），同时仍保留文本的整体实用性。这是由参数$ \ epsilon $：$ \ epsilon = 0 $ hides pii控制的，而较高的值则将隐私权折叠以改善释义质量。 DP融合的工作如下：（i）分配敏感的令牌分为不相交的隐私组，（ii）每组一次运行LLM，（iii）将输出分布融合，以使最终输出保持在没有隐私组的固定统计距离之内。这种方法允许对隐私/公用事业权衡的细粒度控制，但需要多个LLM远程通过。
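Illustrative sketch for step (iii) of the DP-Fusion abstract above (blending output distributions so the result stays within a fixed distance of the baseline). This is not the authors' implementation. It is a minimal sketch under stated assumptions: a single privacy group, next-token distributions given as plain probability lists, and a symmetric max log-ratio used as the "statistical distance" (the abstract does not say which divergence the paper uses). The names `max_divergence` and `blend_within_bound` are hypothetical.

```python
import math


def max_divergence(p, q):
    """Largest absolute log-ratio between two distributions on the same support.

    Assumption: this stands in for the paper's unspecified statistical distance.
    """
    worst = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0 and qi == 0.0:
            continue
        if pi == 0.0 or qi == 0.0:
            return math.inf
        worst = max(worst, abs(math.log(pi / qi)))
    return worst


def blend_within_bound(p_public, p_private, bound, iters=50):
    """Mix the private distribution into the public one as far as the bound allows.

    p_public:  distribution when no privacy group is revealed (the baseline).
    p_private: distribution when one privacy group is visible to the model.
    Returns (lam, mixed) where mixed = (1 - lam) * p_public + lam * p_private
    and max_divergence(mixed, p_public) <= bound.
    """
    def mix(lam):
        return [(1.0 - lam) * a + lam * b for a, b in zip(p_public, p_private)]

    # If the fully private distribution already satisfies the bound, use it.
    if max_divergence(mix(1.0), p_public) <= bound:
        return 1.0, mix(1.0)

    # Each coordinate's log-ratio grows monotonically with lam, so bisect.
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if max_divergence(mix(mid), p_public) <= bound:
            lo = mid
        else:
            hi = mid
    return lo, mix(lo)


if __name__ == "__main__":
    public = [0.5, 0.5]    # hypothetical two-token vocabulary
    private = [0.9, 0.1]
    for eps in (0.0, 0.1, 10.0):
        lam, mixed = blend_within_bound(public, private, eps)
        print(eps, round(lam, 4), [round(x, 4) for x in mixed])
```

With a bound of 0 the mixing weight collapses to 0 and the output equals the baseline distribution, which matches the abstract's statement that $\epsilon=0$ hides the sensitive content entirely. Larger bounds let more of the private distribution through.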

Title: Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts

Authors: Guokan Shang, Hadi Abdine, Ahmad Chamma, Amr Mohamed, Mohamed Anwar, Abdelaziz Bounhar, Omar El Herraoui, Preslav Nakov, Michalis Vazirgiannis, Eric Xing
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04569
Pdf URL: https://arxiv.org/pdf/2507.04569
Copy Paste: [[2507.04569]] Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts(https://arxiv.org/abs/2507.04569)
Keywords: language model, llm, chat
Abstract: We introduce Nile-Chat-4B, 3x4B-A6B, and 12B, a collection of LLMs for Egyptian dialect, uniquely designed to understand and generate texts written in both Arabic and Latin scripts. Specifically, with Nile-Chat-3x4B-A6B, we introduce a novel language adaptation approach by leveraging the Branch-Train-MiX strategy to merge script-specialized experts into a single MoE model. Our Nile-Chat models significantly outperform leading multilingual and Arabic LLMs, such as LLaMa, Jais, and ALLaM, on our newly introduced Egyptian evaluation benchmarks, which span both understanding and generative tasks. Notably, our 12B model yields a 14.4% performance gain over Qwen2.5-14B-Instruct on Latin-script benchmarks. All our resources are publicly available. We believe this work presents a comprehensive methodology for adapting LLMs to dual-script languages, addressing an often overlooked aspect in modern LLM development.
摘要：我们介绍了Nile-Chat-4b，3x4b-A6B和12B，这是埃及方言的LLM集合，其独特而设计，旨在理解和生成用阿拉伯语和拉丁文脚本编写的文本。具体而言，借助尼罗河-Chat-3x4b-A6B，我们通过利用分支机构 - 培训策略将脚本专家合并为单个MOE模型，从而引入了一种新颖的语言适应方法。我们的尼罗河模型在我们新介绍的埃及评估基准上，大大优于领先的多语言和阿拉伯语LLM，例如Llama，Jais和Allam，这些基准涵盖了理解和生成任务。值得注意的是，我们的12B模型在拉丁文标准上的QWEN2.5-14B教学中的性能增长14.4％。我们所有的资源都是公开可用的。我们认为，这项工作提出了一种将LLM适应双脚本语言的综合方法，可以解决现代LLM开发中经常被忽视的方面。

Title: PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes

Authors: Xinliang Frederick Zhang, Nick Beauchamp, Lu Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04607
Pdf URL: https://arxiv.org/pdf/2507.04607
Copy Paste: [[2507.04607]] PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes(https://arxiv.org/abs/2507.04607)
Keywords: language model, llm
Abstract: Large language model (LLM) personalization aims to align model outputs with individuals' unique preferences and opinions. While recent efforts have implemented various personalization methods, a unified theoretical framework that can systematically understand the drivers of effective personalization is still lacking. In this work, we integrate the well-established cognitive dual-memory model into LLM personalization, by mirroring episodic memory to historical user engagements and semantic memory to long-term, evolving user beliefs. Specifically, we systematically investigate memory instantiations and introduce a unified framework, PRIME, using episodic and semantic memory mechanisms. We further augment PRIME with a novel personalized thinking capability inspired by the slow thinking strategy. Moreover, recognizing the absence of suitable benchmarks, we introduce a dataset using Change My View (CMV) from Reddit, specifically designed to evaluate long-context personalization. Extensive experiments validate PRIME's effectiveness across both long- and short-context scenarios. Further analysis confirms that PRIME effectively captures dynamic personalization beyond mere popularity biases.
摘要：大型语言模型（LLM）个性化旨在使模型输出与个人的独特偏好和意见相结合。尽管最近的努力已经实施了各种个性化方法，但仍缺乏可以系统地理解有效个性化驱动因素的统一理论框架。在这项工作中，我们将良好的认知双记忆模型整合到LLM个性化中，通过将情节记忆与历史用户接触和语义记忆相映射到长期，不断发展的用户信念。具体而言，我们使用情节和语义记忆机制系统地研究记忆实例，并引入统一的框架。我们以一种受缓慢思考策略启发的新型个性化思维能力进一步增强了Prime。此外，认识到缺乏合适的基准，我们使用Reddit的Change My Divice（CMV）介绍了一个数据集，Reddit是专门设计用于评估长篇小说个性化的。广泛的实验验证了Prime在长期和短上下文场景中的有效性。进一步的分析证实，PRIME有效地捕获了仅仅是流行偏见的动态个性化。

Title: Knowledge-Aware Self-Correction in Language Models via Structured Memory Graphs

Authors: Swayamjit Saha
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04625
Pdf URL: https://arxiv.org/pdf/2507.04625
Copy Paste: [[2507.04625]] Knowledge-Aware Self-Correction in Language Models via Structured Memory Graphs(https://arxiv.org/abs/2507.04625)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: Large Language Models (LLMs) are powerful yet prone to generating factual errors, commonly referred to as hallucinations. We present a lightweight, interpretable framework for knowledge-aware self-correction of LLM outputs using structured memory graphs based on RDF triples. Without retraining or fine-tuning, our method post-processes model outputs and corrects factual inconsistencies via external semantic memory. We demonstrate the approach using DistilGPT-2 and show promising results on simple factual prompts.
摘要：大型语言模型（LLM）强大而容易产生事实错误，通常称为幻觉。我们使用基于RDF三元组的结构化存储器图为LLM输出的知识感知自我纠正提供了轻巧，可解释的框架。在不进行重新调整或微调的情况下，我们的方法后处理模型输出并通过外部语义记忆纠正事实不一致。我们使用Distilgpt-2演示了该方法，并在简单的事实提示中显示出令人鼓舞的结果。

Title: R1-RE: Cross-Domain Relationship Extraction with RLVR

Authors: Runpeng Dai, Tong Zheng, Run Yang, Hongtu Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04642
Pdf URL: https://arxiv.org/pdf/2507.04642
Copy Paste: [[2507.04642]] R1-RE: Cross-Domain Relationship Extraction with RLVR(https://arxiv.org/abs/2507.04642)
Keywords: language model, gpt
Abstract: Relationship extraction (RE) is a core task in natural language processing. Traditional approaches typically frame RE as a supervised learning problem, directly mapping context to labels, an approach that often suffers from poor out-of-domain (OOD) generalization. Inspired by the workflow of human annotators, we reframe RE as a reasoning task guided by annotation guidelines and introduce R1-RE, the first reinforcement learning with verifiable reward (RLVR) framework for RE tasks. Our method elicits the reasoning abilities of small language models for annotation tasks, resulting in significantly improved OOD robustness. We evaluate our approach on the public Sem-2010 dataset and a private MDKG dataset. The R1-RE-7B model attains an average OOD accuracy of approximately 70%, on par with leading proprietary models such as GPT-4o. Additionally, our comprehensive analysis provides novel insights into the training dynamics and emergent reasoning behaviors of the RLVR paradigm for RE.
摘要：关系提取（RE）是自然语言处理中的核心任务。传统方法通常将RE作为一个监督的学习问题，将上下文直接映射到标签的方法 - 一种通常遭受偏域（OOD）概括的方法。受到人类注释者的工作流程的启发，我们将RE重新定义为以注释指南为指导的推理任务，并引入R1-RE，这是第一个具有可验证奖励（RLVR）框架的强化学习框架（RLVR）框架。我们的方法为注释任务带来了小语言模型的推理能力，从而显着改善了OOD的鲁棒性。我们在公共SEM-2010数据集和私人MDKG数据集上评估了我们的方法。与领先的专有模型（如GPT-4O）相当，R1-RE-7B模型的平均OOD准确度约为70％。此外，我们的全面分析还提供了RLVR范式的训练动力和新兴推理行为的新见解。

Title: XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL

Authors: Yifu Liu, Yin Zhu, Yingqi Gao, Zhiling Luo, Xiaoxia Li, Xiaorong Shi, Yuntao Hong, Jinyang Gao, Yu Li, Bolin Ding, Jingren Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04701
Pdf URL: https://arxiv.org/pdf/2507.04701
Copy Paste: [[2507.04701]] XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL(https://arxiv.org/abs/2507.04701)
Keywords: llm
Abstract: To leverage the advantages of LLMs in addressing challenges in the Text-to-SQL task, we present XiYan-SQL, an innovative framework that effectively generates and utilizes multiple SQL candidates. It consists of three components: 1) a Schema Filter module that filters and obtains multiple relevant schemas; 2) a multi-generator ensemble approach that generates multiple high-quality and diverse SQL queries; 3) a selection model with a candidate reorganization strategy implemented to obtain the optimal SQL query. Specifically, for the multi-generator ensemble, we employ a multi-task fine-tuning strategy to enhance the capabilities of SQL generation models for the intrinsic alignment between SQL and text, and construct multiple generation models with distinct generation styles by fine-tuning across different SQL formats. The experimental results and comprehensive analysis demonstrate the effectiveness and robustness of our framework. Overall, XiYan-SQL achieves a new SOTA performance of 75.63% on the notable BIRD benchmark, surpassing all previous methods. It also attains SOTA performance on the Spider test set with an accuracy of 89.65%.
摘要：为了利用LLM在解决文本到SQL任务中的挑战方面的优势，我们提出了Xiyan-SQL，这是一个有效生成和利用多个SQL候选者的创新框架。它由三个组成部分组成：1）模式过滤模块过滤并获得多个相关模式； 2）多发性合奏方法产生多种高质量和不同的SQL查询； 3）实施具有候选重组策略的选择模型，以获得最佳的SQL查询。具体而言，对于多生成集合，我们采用多任务微调策略来增强SQL生成模型的功能，以实现SQL和文本之间的固有对齐方式，并通过跨不同SQL格式进行微调来构建具有不同生成样式的多生成模型。实验结果和全面分析证明了我们框架的有效性和鲁棒性。总体而言，Xiyan-SQL在著名的鸟类基准上实现了75.63％的新SOTA性能，超过了所有以前的方法。它还可以在蜘蛛测试集中达到SOTA性能，准确度为89.65％。

Title: Why We Feel What We Feel: Joint Detection of Emotions and Their Opinion Triggers in E-commerce

Authors: Arnav Attri, Anuj Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Nikesh Garera
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04708
Pdf URL: https://arxiv.org/pdf/2507.04708
Copy Paste: [[2507.04708]] Why We Feel What We Feel: Joint Detection of Emotions and Their Opinion Triggers in E-commerce(https://arxiv.org/abs/2507.04708)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Customer reviews on e-commerce platforms capture critical affective signals that drive purchasing decisions. However, no existing research has explored the joint task of emotion detection and explanatory span identification in e-commerce reviews, a crucial gap in understanding what triggers customer emotional responses. To bridge this gap, we propose a novel joint task unifying Emotion detection and Opinion Trigger extraction (EOT), which explicitly models the relationship between causal text spans (opinion triggers) and affective dimensions (emotion categories) grounded in Plutchik's theory of 8 primary emotions. In the absence of labeled data, we introduce EOT-X, a human-annotated collection of 2,400 reviews with fine-grained emotions and opinion triggers. We evaluate 23 Large Language Models (LLMs) and present EOT-DETECT, a structured prompting framework with systematic reasoning and self-reflection. Our framework surpasses zero-shot and chain-of-thought techniques across e-commerce domains.
摘要：电子商务平台上的客户评论包含了驱动购买决策的关键情感信号。然而，现有研究尚未探讨电子商务评论中情绪检测与解释性文本片段识别的联合任务，这是理解何种因素触发客户情绪反应方面的一个关键空白。为了弥合这一空白，我们提出了一项新颖的联合任务，将情绪检测与意见触发提取统一起来（EOT），该任务基于 Plutchik 的 8 种基本情绪理论，显式建模因果文本片段（意见触发器）与情感维度（情绪类别）之间的关系。鉴于缺乏标注数据，我们引入了 EOT-X，这是一个由人工标注的数据集，包含 2,400 条带有细粒度情绪和意见触发器的评论。我们评估了 23 个大型语言模型（LLM），并提出了 EOT-DETECT，这是一个具有系统化推理和自我反思的结构化提示框架。我们的框架在各电子商务领域均超过了零样本和思维链技术。

Title: LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework

Authors: Zecheng Tang, Haitian Wang, Quantong Qiu, Baibei Ji, Ruoxi Sun, Keyan Zhou, Juntao Li, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04723
Pdf URL: https://arxiv.org/pdf/2507.04723
Copy Paste: [[2507.04723]] LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework(https://arxiv.org/abs/2507.04723)
Keywords: language model, llm
Abstract: Long-context processing has become a fundamental capability for large language models (LLMs). To assess a model's long-context performance, numerous long-context evaluation benchmarks have been proposed. However, variations in evaluation settings across these benchmarks lead to inconsistent results, making it difficult to draw reliable comparisons. Besides, the high computational cost of long-context evaluation poses a significant barrier for the community to conduct comprehensive assessments of long-context models. In this paper, we propose LOOM-Scope, a comprehensive and efficient framework for long-context evaluation. LOOM-Scope standardizes evaluation settings across diverse benchmarks, supports deployment of efficient long-context inference acceleration methods, and introduces a holistic yet lightweight benchmark suite to evaluate models comprehensively. Homepage: this https URL
摘要：长篇文化处理已成为大型语言模型（LLMS）的基本能力。为了评估模型的长期性能性能，已经提出了许多长篇文化评估基准。但是，这些基准测试的评估设置的变化导致结果不一致，因此很难进行可靠的比较。此外，长篇文化评估的高计算成本为社区进行了对长篇小说模型进行全面评估的重大障碍。在本文中，我们提出了织布机，这是一个综合有效的框架，用于长期评估。 Loom-Scope标准化了跨不同基准测试的评估设置，支持有效的长篇小说推理加速方法的部署，并引入了整体但轻量级的基准套件，以全面评估模型。首页：此HTTPS URL

Title: "This Suits You the Best": Query Focused Comparative Explainable Summarization

Authors: Arnav Attri, Anuj Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Nikesh Garera
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2507.04733
Pdf URL: https://arxiv.org/pdf/2507.04733
Copy Paste: [[2507.04733]] "This Suits You the Best": Query Focused Comparative Explainable Summarization(https://arxiv.org/abs/2507.04733)
Keywords: language model, llm, prompt
Abstract: Product recommendations inherently involve comparisons, yet traditional opinion summarization often fails to provide holistic comparative insights. We propose the novel task of generating Query-Focused Comparative Explainable Summaries (QF-CES) using Multi-Source Opinion Summarization (M-OS). To address the lack of query-focused recommendation datasets, we introduce MS-Q2P, comprising 7,500 queries mapped to 22,500 recommended products with metadata. We leverage Large Language Models (LLMs) to generate tabular comparative summaries with query-specific explanations. Our approach is personalized, privacy-preserving, recommendation engine-agnostic, and category-agnostic. Using M-OS as an intermediate step reduces inference latency by approximately 40% compared to the direct input approach (DIA), which processes raw data directly. We evaluate open-source and proprietary LLMs for generating and assessing QF-CES. Extensive evaluations using QF-CES-PROMPT across 5 dimensions (clarity, faithfulness, informativeness, format adherence, and query relevance) showed an average Spearman correlation of 0.74 with human judgments, indicating its potential for QF-CES evaluation.
摘要：产品建议固有地涉及比较，但是传统意见摘要通常无法提供整体比较见解。我们提出了使用多源意见摘要（M-OS）生成以查询为重点的可解释摘要（QF-CE）的新任务。为了解决缺乏以查询为重点的建议数据集，我们介绍了MS-Q2P，其中包括7,500个查询，映射到22,500个带有元数据的推荐产品。我们利用大型语言模型（LLM）来生成具有特定于查询的解释的表比较摘要。我们的方法是个性化的，具有隐私性的，推荐引擎 - 敏锐的和类别不可或缺的。与直接输入方法（DIA）相比，M-OS作为中间步骤可将推理潜伏期降低约40％，该方法直接处理原始数据。我们评估开源和专有LLM，用于生成和评估QF-CES。在5个维度上使用QF-CES推出的广泛评估（清晰度，忠诚，信息性，格式依从性和相关性）显示出平均长矛人的相关性为0.74与人类判断，表明其对QF-CES评估的潜力。
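For readers unfamiliar with the metric quoted in the abstract above, Spearman correlation is the Pearson correlation computed on ranks. The block below is a minimal, dependency-free sketch of that definition (tied values receive their average rank). It is not taken from the paper, the helper names are hypothetical, and the example scores are made up for illustration.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical scores for five summaries (not data from the paper).
    llm_scores = [4.0, 3.5, 2.0, 5.0, 1.0]
    human_scores = [4.5, 3.0, 2.5, 4.0, 1.5]
    print(round(spearman(llm_scores, human_scores), 3))
```

A value near 1 means the automatic evaluator orders the summaries almost the same way the human annotators do, regardless of the absolute scale of either set of scores.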

Title: A Tale of Two Scripts: Transliteration and Post-Correction for Judeo-Arabic

Authors: Juan Moreno Gonzalez, Bashar Alhafni, Nizar Habash
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04746
Pdf URL: https://arxiv.org/pdf/2507.04746
Copy Paste: [[2507.04746]] A Tale of Two Scripts: Transliteration and Post-Correction for Judeo-Arabic(https://arxiv.org/abs/2507.04746)
Keywords: llm
Abstract: Judeo-Arabic refers to Arabic variants historically spoken by Jewish communities across the Arab world, primarily during the Middle Ages. Unlike standard Arabic, it is written in Hebrew script by Jewish writers and for Jewish audiences. Transliterating Judeo-Arabic into Arabic script is challenging due to ambiguous letter mappings, inconsistent orthographic conventions, and frequent code-switching into Hebrew and Aramaic. In this paper, we introduce a two-step approach to automatically transliterate Judeo-Arabic into Arabic script: simple character-level mapping followed by post-correction to address grammatical and orthographic errors. We also present the first benchmark evaluation of LLMs on this task. Finally, we show that transliteration enables Arabic NLP tools to perform morphosyntactic tagging and machine translation, which would not have been feasible on the original texts.
摘要：犹太 - 阿拉伯语是指阿拉伯人社区在整个阿拉伯世界的历史上说的阿拉伯语变种，主要是在中世纪。与标准阿拉伯语不同，它是由犹太作家和犹太观众用希伯来语剧本撰写的。由于模棱两可的字母映射，不一致的拼字公约以及频繁的代码转换为希伯来语和阿拉姆语，因此将犹太阿拉伯语音译为阿拉伯文字是具有挑战性的。在本文中，我们引入了两步方法，将犹太阿拉伯语自动音译为阿拉伯语脚本：简单的字符级映射，然后进行后校正以解决语法和拼字错误。我们还介绍了对LLM关于此任务的第一个基准评估。最后，我们表明音译使阿拉伯语NLP工具可以执行形态句法标记和机器翻译，这在原始文本上是不可行的。

Title: LLMs as Architects and Critics for Multi-Source Opinion Summarization

Authors: Anuj Attri, Arnav Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Nikesh Garera
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04751
Pdf URL: https://arxiv.org/pdf/2507.04751
Copy Paste: [[2507.04751]] LLMs as Architects and Critics for Multi-Source Opinion Summarization(https://arxiv.org/abs/2507.04751)
Keywords: language model, llm, prompt
Abstract: Multi-source Opinion Summarization (M-OS) extends beyond traditional opinion summarization by incorporating additional sources of product metadata such as descriptions, key features, specifications, and ratings, alongside reviews. This integration results in comprehensive summaries that capture both subjective opinions and objective product attributes essential for informed decision-making. While Large Language Models (LLMs) have shown significant success in various Natural Language Processing (NLP) tasks, their potential in M-OS remains largely unexplored. Additionally, the lack of evaluation datasets for this task has impeded further advancements. To bridge this gap, we introduce M-OS-EVAL, a benchmark dataset for evaluating multi-source opinion summaries across 7 key dimensions: fluency, coherence, relevance, faithfulness, aspect coverage, sentiment consistency, and specificity. Our results demonstrate that M-OS significantly enhances user engagement, as evidenced by a user study in which, on average, 87% of participants preferred M-OS over opinion summaries. Our experiments demonstrate that factually enriched summaries enhance user engagement. Notably, M-OS-PROMPTS exhibit stronger alignment with human judgment, achieving an average Spearman correlation of $\rho = 0.74$, which surpasses the performance of previous methodologies.
摘要：多源意见摘要（M-OS）在评论之外还纳入了描述、关键特性、规格和评分等额外的产品元数据来源，从而超越了传统的意见摘要。这种整合产生了全面的摘要，既能捕捉主观意见，又能涵盖对知情决策至关重要的客观产品属性。尽管大型语言模型（LLM）在各种自然语言处理（NLP）任务中取得了显著成功，但它们在 M-OS 中的潜力在很大程度上仍未得到探索。此外，缺乏针对该任务的评估数据集也阻碍了进一步的进展。为了弥合这一空白，我们引入了 M-OS-EVAL，这是一个基准数据集，用于从 7 个关键维度评估多源意见摘要：流畅性、连贯性、相关性、忠实性、方面覆盖度、情感一致性和具体性。我们的结果表明，M-OS 显著提升了用户参与度，一项用户研究证明了这一点：平均有 87% 的参与者相比意见摘要更偏好 M-OS。我们的实验表明，在事实层面更丰富的摘要能够提升用户参与度。值得注意的是，M-OS-PROMPTS 与人类判断表现出更强的一致性，平均 Spearman 相关系数达到 $\rho$ = 0.74，超过了以往方法的表现。

Title: CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering

Authors: Hang Lv, Sheng Liang, Hao Wang, Hongchao Gu, Yaxiong Wu, Wei Guo, Defu Lian, Yong Liu, Enhong Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04756
Pdf URL: https://arxiv.org/pdf/2507.04756
Copy Paste: [[2507.04756]] CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering(https://arxiv.org/abs/2507.04756)
Keywords: language model, llm
Abstract: Personalized text generation has become crucial for adapting language models to diverse and evolving users' personal context across cultural, temporal, and contextual dimensions. While existing methods often rely on centralized fine-tuning or static preference alignment, they struggle to achieve real-time adaptation under resource constraints inherent to personal devices. This limitation creates a dilemma: large cloud-based models lack access to localized user-specific information, while small on-device models cannot match the generation quality of their cloud counterparts. To address this dichotomy, we present CoSteer, a novel collaborative framework that enables decoding-time personalization through localized delta steering. Our key insight lies in leveraging the logits difference between personal context-aware and -agnostic outputs from local small models as steering signals for cloud-based LLMs. Specifically, we formulate token-level optimization as an online learning problem, where local delta vectors dynamically adjust the remote LLM's logits within the on-device environment. This approach preserves privacy by transmitting only the final steered tokens rather than raw data or intermediate vectors, while maintaining cloud-based LLMs' general capabilities without fine-tuning. Through comprehensive experiments on various personalized generation tasks, we demonstrate that CoSteer effectively assists LLMs in generating personalized content by leveraging locally stored user profiles and histories, ensuring privacy preservation through on-device data processing while maintaining acceptable computational overhead.
摘要：个性化的文本生成对于将语言模型改编成在文化，时间和上下文维度上的各种个人背景而变得至关重要。尽管现有的方法通常依赖于集中的微调或静态偏好一致性，但他们努力在个人设备固有的资源约束下实现实时适应。这种限制造成了困境：大型基于云的模型无法访问本地化用户特定信息，而小型的智障模型无法匹配其云对应物的发电质量。为了解决这种二分法，我们提出了Costeer，这是一个新颖的合作框架，可以通过局部三角洲转向来解码时间个性化。我们的关键见解在于利用当地小型模型作为基于云的LLM的转向信号的个人上下文感知和不合时式输出之间的逻辑差异。具体来说，我们将令牌级优化作为在线学习问题，在该问题中，本地Delta向量在内部设备环境中动态调整了远程LLM的lod ligits。这种方法仅通过传输最终转向令牌而不是原始数据或中间向量来保留隐私，同时保持基于云的LLM的一般功能而无需微调。通过对各种个性化生成任务进行的全面实验，我们证明Costeer通过利用本地存储的用户资料和历史记录来有效地帮助LLMS生成个性化内容，从而通过在设备数据处理中确保隐私保护，同时保持可接受的计算高架开销。
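The core signal described in the CoSteer abstract above (the logits difference between context-aware and context-agnostic outputs of a local small model, used to steer a cloud LLM's logits) can be illustrated with a toy additive version. This is a simplification and an assumption: the abstract says the authors formulate token-level optimization as an online learning problem, which this sketch does not reproduce. The function names and the scaling factor `alpha` are hypothetical.

```python
def steer_logits(llm_logits, local_with_ctx, local_without_ctx, alpha=1.0):
    """Add the local model's personalization delta to the cloud LLM's logits.

    llm_logits:        next-token logits from the large, context-agnostic model.
    local_with_ctx:    small-model logits when the personal context is provided.
    local_without_ctx: small-model logits for the same prompt without it.
    alpha:             hypothetical steering strength (0 disables steering).
    All three logit lists must share one vocabulary.
    """
    if not (len(llm_logits) == len(local_with_ctx) == len(local_without_ctx)):
        raise ValueError("all logit vectors must have the same vocabulary size")
    return [
        base + alpha * (with_ctx - without_ctx)
        for base, with_ctx, without_ctx in zip(
            llm_logits, local_with_ctx, local_without_ctx
        )
    ]


def argmax(values):
    """Index of the largest value (greedy token choice)."""
    return max(range(len(values)), key=values.__getitem__)


if __name__ == "__main__":
    # Toy 3-token vocabulary; the numbers are invented for illustration.
    llm = [2.0, 1.0, 0.0]
    with_ctx = [0.0, 3.0, 0.0]
    without_ctx = [0.0, 0.0, 0.0]
    print(argmax(steer_logits(llm, with_ctx, without_ctx, alpha=0.0)))
    print(argmax(steer_logits(llm, with_ctx, without_ctx, alpha=1.0)))
```

In this toy setup the delta is computed where the personal context lives, and only the chosen token would need to leave the device, which mirrors the privacy argument in the abstract.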

Title: Reason to Rote: Rethinking Memorization in Reasoning

Authors: Yupei Du, Philipp Mondorf, Silvia Casola, Yuekun Yao, Robert Litschko, Barbara Plank
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04782
Pdf URL: https://arxiv.org/pdf/2507.04782
Copy Paste: [[2507.04782]] Reason to Rote: Rethinking Memorization in Reasoning(https://arxiv.org/abs/2507.04782)
Keywords: language model
Abstract: Large language models readily memorize arbitrary training instances, such as label noise, yet they perform strikingly well on reasoning tasks. In this work, we investigate how language models memorize label noise, and why such memorization in many cases does not heavily affect generalizable reasoning capabilities. Using two controllable synthetic reasoning datasets with noisy labels, four-digit addition (FDA) and two-hop relational reasoning (THR), we discover a reliance of memorization on generalizable reasoning mechanisms: models continue to compute intermediate reasoning outputs even when retrieving memorized noisy labels, and intervening on reasoning adversely affects memorization. We further show that memorization operates through distributed encoding, i.e., aggregating various inputs and intermediate results, rather than building a look-up mechanism from inputs to noisy labels. Moreover, our FDA case study reveals that memorization occurs via outlier heuristics, where existing neuron activation patterns are slightly shifted to fit noisy labels. Together, our findings suggest that memorization of label noise in language models builds on, rather than overrides, the underlying reasoning mechanisms, shedding light on the intriguing phenomenon of benign memorization.
摘要：大型语言模型很容易记住任意培训实例，例如标签噪声，但它们在推理任务上表现出色。在这项工作中，我们研究了语言模型如何记住标签噪声，以及为什么在许多情况下这种记忆不会严重影响可概括的推理能力。使用两个具有嘈杂标签的可控合成推理数据集，四位数添加（FDA）和两人关系推理（THR），我们发现了对可变推理机制的依赖：模型继续计算中间推理的输出，即使回顾记忆记忆的噪声标签，并介绍了透明的噪声标签，并进行了介入的推理，并对有偏见的有影响力影响。我们进一步表明，记忆是通过分布式编码（即汇总各种输入和中间结果）来运行的，而不是构建从输入到嘈杂标签的查找机制。此外，我们的FDA案例研究揭示了记忆是通过异常启发式方法发生的，其中现有的神经元激活模式略微转移以适合嘈杂的标签。总之，我们的发现表明，语言模型中标签噪声的记忆是建立在而不是覆盖基本的推理机制的基础上，从而在良性记忆的有趣现象上散发出了灯光。

Title: A Survey of Pun Generation: Datasets, Evaluations and Methodologies

Authors: Yuchen Su, Yonghua Zhu, Ruofan Wang, Zijian Huang, Diana Benavides-Prado, Michael Witbrock
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04793
Pdf URL: https://arxiv.org/pdf/2507.04793
Copy Paste: [[2507.04793]] A Survey of Pun Generation: Datasets, Evaluations and Methodologies(https://arxiv.org/abs/2507.04793)
Keywords: language model
Abstract: Pun generation seeks to creatively modify linguistic elements in text to produce humour or evoke double meanings. It also aims to preserve coherence and contextual appropriateness, making it useful in creative writing and entertainment across various media and contexts. Although pun generation has received considerable attention in computational linguistics, there is currently no dedicated survey that systematically reviews this specific area. To bridge this gap, this paper provides a comprehensive review of pun generation datasets and methods across different stages, including conventional approaches, deep learning techniques, and pre-trained language models. Additionally, we summarise both automated and human evaluation metrics used to assess the quality of pun generation. Finally, we discuss the research challenges and propose promising directions for future work.
摘要：Pin Generation试图创造性地修改文本中的语言元素，以产生幽默或引起双重含义。它还旨在保持连贯性和上下文适当性，使其在各种媒体和环境之间的创意写作和娱乐中有用。尽管PUN Generation在计算语言学上受到了很大的关注，但目前尚无系统地审查该特定领域的专用调查。为了弥合这一差距，本文对跨不同阶段的双关语生成数据集和方法进行了全面审查，包括常规方法，深度学习技术和预训练的语言模型。此外，我们总结了用于评估PIN发电质量的自动化和人类评估指标。最后，我们讨论了研究挑战，并提出了未来工作的有希望的指示。

Title: Spec-TOD: A Specialized Instruction-Tuned LLM Framework for Efficient Task-Oriented Dialogue Systems

Authors: Quang-Vinh Nguyen, Quang-Chieu Nguyen, Hoang Pham, Khac-Hoai Nam Bui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04841
Pdf URL: https://arxiv.org/pdf/2507.04841
Copy Paste: [[2507.04841]] Spec-TOD: A Specialized Instruction-Tuned LLM Framework for Efficient Task-Oriented Dialogue Systems(https://arxiv.org/abs/2507.04841)
Keywords: language model, llm
Abstract: Task-oriented dialogue (TOD) systems facilitate goal-driven interactions between users and machines. While recent advances in deep learning have improved the performance, TOD systems often struggle in low-resource scenarios with limited labeled data. To address this challenge, we propose Spec-TOD, a novel framework designed to train an end-to-end TOD system with limited data. Spec-TOD introduces two main innovations: (i) a novel specialized end-to-end TOD framework that incorporates explicit task instructions for instruction-tuned large language models (LLMs), and (ii) an efficient training strategy that leverages lightweight, specialized LLMs to achieve strong performance with minimal supervision. Experiments on the MultiWOZ dataset, a widely used TOD benchmark, demonstrate that Spec-TOD achieves competitive results while significantly reducing the need for labeled data. These findings highlight the potential of the proposed framework in advancing efficient and effective TOD systems in low-resource settings.
摘要：面向任务的对话（TOD）系统促进了用户和机器之间的目标驱动互动。尽管深度学习的最新进展改善了表现，但TOD系统通常在低资源的情况下挣扎，标有有限的数据。为了应对这一挑战，我们提出了Spec-TOD，这是一个新颖的框架，旨在培训端到端的TOD系统的数据有限。 Spec-TOD介绍了两项主要创新：（i）一种新颖的专业端到端TOD框架，该框架结合了针对教学调整的大型语言模型（LLMS）的明确任务说明，以及（ii）一种有效的培训策略，利用轻量级，专门的LLMS以最小的监督来实现强大的绩效。使用广泛使用的TOD基准的Multiwoz数据集上的实验表明，Spec-TOD可实现竞争成果，同时显着减少了对标记数据的需求。这些发现突出了拟议框架在低资源环境中提高高效TOD系统的潜力。

Title: Dialogue-Based Multi-Dimensional Relationship Extraction from Novels

Authors: Yuchen Yan, Hanjie Zhao, Senbin Zhu, Hongde Liu, Zhihong Zhang, Yuxiang Jia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04852
Pdf URL: https://arxiv.org/pdf/2507.04852
Copy Paste: [[2507.04852]] Dialogue-Based Multi-Dimensional Relationship Extraction from Novels(https://arxiv.org/abs/2507.04852)
Keywords: language model, llm
Abstract: Relation extraction is a crucial task in natural language processing, with broad applications in knowledge graph construction and literary analysis. However, the complex context and implicit expressions in novel texts pose significant challenges for automatic character relationship extraction. This study focuses on relation extraction in the novel domain and proposes a method based on Large Language Models (LLMs). By incorporating relationship dimension separation, dialogue data construction, and contextual learning strategies, the proposed method enhances extraction performance. Leveraging dialogue structure information, it improves the model's ability to understand implicit relationships and demonstrates strong adaptability in complex contexts. Additionally, we construct a high-quality Chinese novel relation extraction dataset to address the lack of labeled resources and support future research. Experimental results show that our method outperforms traditional baselines across multiple evaluation metrics and successfully facilitates the automated construction of character relationship networks in novels.
摘要：关系提取在自然语言处理中是至关重要的任务，在知识图构造和文学分析中广泛应用。但是，新文本中的复杂上下文和隐式表达式对自动角色关系提取构成了重大挑战。这项研究重点是新领域的关系提取，并提出了一种基于大语言模型（LLM）的方法。通过合并关系维度分离，对话数据构建和上下文学习策略，提出的方法可以增强提取性能。利用对话结构信息，它提高了模型理解隐式关系的能力，并在复杂的环境中证明了强大的适应性。此外，我们构建了一个高质量的中国新颖的关系提取数据集，以解决缺乏标签的资源并支持未来的研究。实验结果表明，我们的方法在多个评估指标上优于传统基线，并成功促进了小说中角色关系网络的自动结构。

Title: $\textit{Grahak-Nyay:}$ Consumer Grievance Redressal through Large Language Models

Authors: Shrey Ganatra, Swapnil Bhattacharyya, Harshvivek Kashid, Spandan Anaokar, Shruti Nair, Reshma Sekhar, Siddharth Manohar, Rahul Hemrajani, Pushpak Bhattacharyya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04854
Pdf URL: https://arxiv.org/pdf/2507.04854
Copy Paste: [[2507.04854]] $\textit{Grahak-Nyay:}$ Consumer Grievance Redressal through Large Language Models(https://arxiv.org/abs/2507.04854)
Keywords: language model, llm, chat, retrieval-augmented generation
Abstract: Access to consumer grievance redressal in India is often hindered by procedural complexity, legal jargon, and jurisdictional challenges. To address this, we present $\textbf{Grahak-Nyay}$ (Justice-to-Consumers), a chatbot that streamlines the process using open-source Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). Grahak-Nyay simplifies legal complexities through a concise and up-to-date knowledge base. We introduce three novel datasets: $\textit{GeneralQA}$ (general consumer law), $\textit{SectoralQA}$ (sector-specific knowledge) and $\textit{SyntheticQA}$ (for RAG evaluation), along with $\textit{NyayChat}$, a dataset of 300 annotated chatbot conversations. We also introduce $\textit{Judgments}$ data sourced from Indian Consumer Courts to aid the chatbot in decision making and to enhance user trust. We also propose $\textbf{HAB}$ metrics ($\textbf{Helpfulness, Accuracy, Brevity}$) to evaluate chatbot performance. Legal domain experts validated Grahak-Nyay's effectiveness. Code and datasets will be released.
摘要：程序复杂性，法律术语和管辖权挑战通常会阻碍印度获得消费者申诉的机会。为了解决这个问题，我们介绍$ \ textbf {grahak-nyay} $（正义对消费者），这是一种使用开源大语模型（LLMS）和检索名称的一代（RAG）来简化该过程的聊天机器人。 Grahak-Nyay通过简洁而最新的知识基础简化了法律上的复杂性。我们介绍了三个新颖的数据集：$ \ textit {enstryqa} $（常规消费者定律），$ \ textit {sectoralqa} $（特定于sector的知识）和$ \ textit {syntheticqa} $（用于RAG评估），以及$ \ textIt {nyaychat} $，chatatiation $ 300，我们还介绍了$ \ textit {判断} $数据来自印度消费者法院，以帮助聊天机器人进行决策并增强用户信任。我们还建议$ \ textbf {hab} $量表（$ \ textbf {helpfuls，准确性，简洁} $）来评估聊天机器人性能。法律领域专家验证了Grahak-Nyay的有效性。代码和数据集将发布。

Title: Building Open-Retrieval Conversational Question Answering Systems by Generating Synthetic Data and Decontextualizing User Questions

Authors: Christos Vlachos, Nikolaos Stylianou, Alexandra Fiotaki, Spiros Methenitis, Elisavet Palogiannidi, Themos Stafylakis, Ion Androutsopoulos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04884
Pdf URL: https://arxiv.org/pdf/2507.04884
Copy Paste: [[2507.04884]] Building Open-Retrieval Conversational Question Answering Systems by Generating Synthetic Data and Decontextualizing User Questions(https://arxiv.org/abs/2507.04884)
Keywords: llm
Abstract: We consider open-retrieval conversational question answering (OR-CONVQA), an extension of question answering where system responses need to be (i) aware of dialog history and (ii) grounded in documents (or document fragments) retrieved per question. Domain-specific OR-CONVQA training datasets are crucial for real-world applications, but hard to obtain. We propose a pipeline that capitalizes on the abundance of plain text documents in organizations (e.g., product documentation) to automatically produce realistic OR-CONVQA dialogs with annotations. Similarly to real-world human-annotated OR-CONVQA datasets, we generate in-dialog question-answer pairs, self-contained (decontextualized, e.g., no referring expressions) versions of user questions, and the propositions (sentences expressing prominent information from the documents) that the system responses are grounded in. We show how the synthetic dialogs can be used to train efficient question rewriters that decontextualize user questions, allowing existing dialog-unaware retrievers to be utilized. The retrieved information and the decontextualized question are then passed on to an LLM that generates the system's response.
摘要：我们考虑开放回程对话问题回答（OR-CONVQA），该问题的扩展是（i）知道对话框历史记录以及（ii）在每个问题检索的文档（或文档片段）中需要的系统响应。特定于域的或CONVQA培训数据集对于现实世界应用至关重要，但很难获得。我们提出了一条管道，该管道利用了组织中的丰富文本文档（例如产品文档），以自动产生带有注释的现实或convqa对话框。与现实世界中的人类宣传或convqa数据集类似，我们生成了核对疑问 - 求和的求和，对用户问题的独立式（例如，没有参考的表达式）版本的自我封闭式（例如，没有参考的表达式），命题和命题（句子）（句子）（句子表达了文档中的显着信息），我们可以在这些问题中进行质疑。问题，允许使用现有的对话 - 诺瓦雷检索器。然后，检索到的信息和去皮问题被传递给生成系统响应的LLM。

Title: Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations

Authors: A. Bochkov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04886
Pdf URL: https://arxiv.org/pdf/2507.04886
Copy Paste: [[2507.04886]] Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations(https://arxiv.org/abs/2507.04886)
Keywords: language model, llm
Abstract: Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational "meaning vectors." This paper challenges that view. We construct Transformer models where the embedding layer is entirely frozen, with vectors derived not from data, but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to "representational interference" in conventional models, where the embedding layer is burdened with learning both structural and semantic features. Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer's compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.
摘要：了解大语模型（LLM）中语义表示的源头对于解释性和建筑创新至关重要。主要的范式认为，可训练的输入嵌入是基础“意义向量”。本文挑战了这种观点。我们构建了嵌入层完全冷冻的变压器模型，其载体不是从数据中得出的，而是源自Unicode字形的视觉结构。这些非语义，预定的视觉嵌入在整个训练过程中都是固定的。我们的方法与任何令牌仪都兼容，包括一种新型的以Unicode为中心的令牌仪，我们引入了以确保通用文本覆盖范围。尽管没有可训练的，语义初始化的嵌入，但我们的模型会收敛，生成相干文本，并且至关重要的是，在MMLU推理基准上具有可训练的嵌入方式优于架构相同的模型。我们将其归因于常规模型中的“表示干扰”，其中嵌入层与学习结构和语义特征负担在一起。我们的结果表明，高级语义不是输入嵌入的固有的，而是变压器组成架构和数据量表的新兴属性。这将嵌入的作用从含义容器到结构基原始人的作用。我们发布所有代码和模型以促进进一步的研究。

Title: O_FT@EvalLLM2025 : étude comparative de choix de données et de stratégies d'apprentissage pour l'adaptation de modèles de langue à un domaine

Authors: Ismaël Rousseau, Claire Perroux, Pierre Adam, Thomas Girault, Lionel Delphin-Poulat, Morgan Veyret, Gwénolé Lecorvé, Géraldine Damnati
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04895
Pdf URL: https://arxiv.org/pdf/2507.04895
Copy Paste: [[2507.04895]] O_FT@EvalLLM2025 : étude comparative de choix de données et de stratégies d'apprentissage pour l'adaptation de modèles de langue à un domaine(https://arxiv.org/abs/2507.04895)
Keywords: language model, llm
Abstract: This paper presents the work carried out by the O_FT team, joint with Orange and Ouest-France, on adapting language models to the defense domain as part of the EvalLLM2025 challenge. This work focused on adapting the \texttt{Mistral-7B-Instruct-v0.3} model using classical techniques of continued pre-training and instruction-tuning. The core of our efforts is based on collecting, generating, and selecting data for these two stages as well as for model evaluation. Experiments show that our adapted models have better domain-specific knowledge and improved domain-specific task processing skills, along with comparable (or even superior) performance on general knowledge and skills. Considering the carbon footprint of our adaptations, this work demonstrates the feasibility of domain adaptation for relatively small models. -- Ce document présente les travaux réalisés par l'équipe O_FT conjointe à Orange et Ouest-France sur l'adaptation de modèles de langue au domaine de la défense dans le cadre du challenge EvalLLM2025. Ces travaux se sont concentrés sur l'adaptation du modèle \texttt{Mistral-7B-Instruct-v0.3} avec des techniques classiques de poursuite du pré-entraînement et d'affinage sur instructions. L'essentiel de nos travaux a porté sur la constitution, génération et sélection de données pour ces deux étapes ainsi que pour l'évaluation des modèles. Les expériences montrent que nos modèles adaptés ont de meilleures de connaissances de fond et une meilleure capacité de traitement de tâches sur le domaine de la défense, ainsi que des performances comparables (voire supérieures) sur des connaissances ou capacités généralistes. Mis au regard des empreintes carbones de nos adaptations, ces travaux démontrent ainsi la viabilité de l'adaptation à un domaine de modèles relativement petits.
摘要：本文介绍了O_FT团队与Orange and Ouest-France共同进行的工作，这是EvalLLM2025挑战的一部分，将语言模型适应国防域。这项工作着重于适应Mistral-7B-Instruct-v0.3模型，使用持续的预训练和指导调整的经典技术。我们努力的核心是基于收集，生成和选择这两个阶段以及模型评估的数据。实验表明，我们改编的模型具有更好的领域特定知识和提高的特定领域的任务处理能力，以及在常识和技能上的可比（甚至是卓越）的表现。考虑到我们适应的碳足迹，这项工作证明了域适应相对较小的模型的可行性。

Title: SIGIR 2025 -- LiveRAG Challenge Report

Authors: David Carmel, Simone Filice, Guy Horowitz, Yoelle Maarek, Oren Somekh, Ran Tavory
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2507.04942
Pdf URL: https://arxiv.org/pdf/2507.04942
Copy Paste: [[2507.04942]] SIGIR 2025 -- LiveRAG Challenge Report(https://arxiv.org/abs/2507.04942)
Keywords: llm, prompt, retrieval-augmented generation
Abstract: The LiveRAG Challenge at SIGIR 2025, held between March and May 2025, provided a competitive platform for advancing Retrieval-Augmented Generation (RAG) technologies. Participants from academia and industry were invited to develop a RAG-based question-answering system using a fixed corpus (Fineweb-10BT) and a common open-source LLM (Falcon3-10B-Instruct). The goal was to facilitate challenging comparisons of retrieval and prompting strategies. During the Live Challenge Day, 70 teams from 27 different countries provided answers and supportive information to 500 unseen questions within a strict two-hour time window. Evaluation was conducted in two stages: first, an automated LLM-as-a-judge approach was used to compute correctness and faithfulness scores; then, a manual review of the top-ranked submissions was conducted. The finalists were announced on June 12, 2025, with prizes awarded during the LiveRAG Workshop at SIGIR 2025 in Padua, Italy.
摘要：在3月至2025年5月之间举行的Sigir 2025年的Liverag挑战赛为推进检索增强的一代（RAG）技术提供了一个竞争平台。邀请学术界和行业的参与者使用固定语料库（FineWeb-10bt）和一个常见的开源LLM（FALCON3-10B-INSTRUCTY）开发基于抹布的问答系统。目标是促进检索和促使策略的挑战比较。在现场挑战日，来自27个不同国家 /地区的70支团队为严格的两个小时的时间窗口提供了500个看不见的问题的答案和支持信息。评估分为两个阶段：首先使用自动化的LLM-AS-A-A-Gudge方法来计算正确性和忠诚度得分，然后对最高排名提交的手动审查进行了审查。决赛入围者于2025年6月12日宣布，在意大利帕多瓦的Sigir 2025举行的Liverag研讨会上颁发了奖项。

Title: ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation

Authors: Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Shihui Hu, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Haotian Zhu, Yuanxing Zhang, Yuhao Jiang, Yue Zhang, Zenan Xu, Bohui Zhai, Guoxiang He, Hebin Li, Jie Zhao, Le Zhang, Lingyun Tan, Pengyu Guo, Xianshu Pang, Yang Ruan, Zhifeng Zhang, Zhonghu Wang, Ziyan Xu, Zuopu Yin, Wiggin Zhou, Chayse Zhou, Fengzong Lian
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2507.04952
Pdf URL: https://arxiv.org/pdf/2507.04952
Copy Paste: [[2507.04952]] ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation(https://arxiv.org/abs/2507.04952)
Keywords: language model, llm
Abstract: The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation gap: established benchmarks focus on algorithmic correctness and are blind to the visual fidelity and interactive integrity that define modern user experiences. To bridge this gap, we introduce ArtifactsBench, a new benchmark and paradigm for the automated, multimodal evaluation of visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior through temporal screenshots. This visual evidence, alongside the source code, is then assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a fine-grained, per-task checklist to ensure holistic and reproducible scoring. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading LLMs. Our automated evaluation achieves a striking 94.4% ranking consistency with WebDev Arena, the gold-standard for human preference in web development, and over 90% pairwise agreement with human experts. This establishes ArtifactsBench as the first framework to reliably automate the assessment of human-perceived quality at scale. Our analysis provides a high-resolution map of the current SOTA, revealing that generalist models often outperform domain-specific ones. We open-source ArtifactsBench, including the benchmark, evaluation harness, and baseline results at this https URL, to provide the community with a scalable and accurate tool to accelerate the development of user-centric generative models.
摘要：大语言模型（LLM）的生成能力正在从静态代码迅速扩展到动态的交互式视觉伪像。这一进步是通过关键的评估差距瓶颈：既定的基准都集中在算法上正确性，并且对定义现代用户体验的视觉保真度和交互式完整性视而不见。为了弥合这一差距，我们介绍了Artifactsbench，一种新的基准和范式，用于自动化视觉代码生成的多模式评估。我们的框架通过编程方式渲染每个人工制品，并通过时间屏幕截图捕获其动态行为。然后，该视觉证据与源代码一起通过多模式LLM（MLLM）-As-Gudge进行评估，该法官由细粒度，每任务检查清单严格指导，以确保整体和可重复的评分。我们构建了1,825个不同任务的新基准，并评估了30多个领先的LLM。我们的自动化评估与WebDev Arena的排名一致性达到了94.4％，这是人类Web开发的金牌标准，与人类专家的成对同意超过90％。这将建立伪像的人物作为第一个框架，以可靠地自动化对人类感知质量的评估。我们的分析提供了当前SOTA的高分辨率图，揭示了通才模型通常优于特定领域的模型。我们在此HTTPS URL上开放源代码Artifactsbench，包括基准，评估线束和基线结果，为社区提供可扩展，准确的工具，以加速以用户为中心的生成模型的开发。

Title: Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification

Authors: Chenfei Xiong, Jingwei Ni, Yu Fan, Vilém Zouhar, Donya Rooein, Lorena Calvo-Bartolomé, Alexander Hoyle, Zhijing Jin, Mrinmaya Sachan, Markus Leippold, Dirk Hovy, Mennatallah El-Assady, Elliott Ash
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05010
Pdf URL: https://arxiv.org/pdf/2507.05010
Copy Paste: [[2507.05010]] Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification(https://arxiv.org/abs/2507.05010)
Keywords: language model, llm
Abstract: We introduce Co-DETECT (Collaborative Discovery of Edge cases in TExt ClassificaTion), a novel mixed-initiative annotation framework that integrates human expertise with automatic annotation guided by large language models (LLMs). Co-DETECT starts with an initial, sketch-level codebook and dataset provided by a domain expert, then leverages the LLM to annotate the data and identify edge cases that are not well described by the initial codebook. Specifically, Co-DETECT flags challenging examples, induces high-level, generalizable descriptions of edge cases, and assists the user in incorporating edge case handling rules to improve the codebook. This iterative process enables more effective handling of nuanced phenomena through compact, generalizable annotation rules. An extensive user study and qualitative and quantitative analyses prove the effectiveness of Co-DETECT.
摘要：我们介绍了共同检测（在文本分类中对边缘案例的协作发现），这是一个新颖的混合定位注释框架，将人类专业知识与大型语言模型（LLMS）指导的自动注释集成在一起。共同检测从域专家提供的初始素描级代码书和数据集开始，然后利用LLM注释数据并识别初始代码簿未很好地描述的边缘案例。具体而言，共同检测标志具有挑战性的示例，引起了边缘案例的高级，可推广的描述，并协助用户合并边缘案例处理规则以改进代码书。这种迭代过程可以通过紧凑，可推广的注释规则更有效地处理细微的现象。广泛的用户研究，定性和定量分析证明了共同检测的有效性。

Title: Verified Language Processing with Hybrid Explainability: A Technical Report

Authors: Oliver Robert Fox, Giacomo Bergami, Graham Morgan
Subjects: cs.CL, cs.SC
Abstract URL: https://arxiv.org/abs/2507.05017
Pdf URL: https://arxiv.org/pdf/2507.05017
Copy Paste: [[2507.05017]] Verified Language Processing with Hybrid Explainability: A Technical Report(https://arxiv.org/abs/2507.05017)
Keywords: language model, prompt
Abstract: The volume and diversity of digital information have led to a growing reliance on Machine Learning techniques, such as Natural Language Processing, for interpreting and accessing appropriate data. While vector and graph embeddings represent data for similarity tasks, current state-of-the-art pipelines lack guaranteed explainability, failing to determine similarity for given full texts accurately. These considerations can also be applied to classifiers exploiting generative language models with logical prompts, which fail to correctly distinguish between logical implication, indifference, and inconsistency, despite being explicitly trained to recognise the first two classes. We present a novel pipeline designed for hybrid explainability to address this. Our methodology combines graphs and logic to produce First-Order Logic representations, creating machine- and human-readable representations through Montague Grammar. Preliminary results indicate the effectiveness of this approach in accurately capturing full text similarity. To the best of our knowledge, this is the first approach to differentiate between implication, inconsistency, and indifference for text classification tasks. To address the limitations of existing approaches, we use three self-contained datasets annotated for the former classification task to determine the suitability of these approaches in capturing sentence structure equivalence, logical connectives, and spatiotemporal reasoning. We also use these data to compare the proposed method with language models pre-trained for detecting sentence entailment. The results show that the proposed method outperforms state-of-the-art models, indicating that natural language understanding cannot be easily generalised by training over extensive document corpora. This work offers a step toward more transparent and reliable Information Retrieval from extensive textual data.
摘要：数字信息的数量和多样性导致对机器学习技术（例如自然语言处理）的依赖，以解释和访问适当的数据。虽然向量和图形嵌入代表相似性任务的数据，但当前的最新管道缺乏可解释性，无法准确确定给定全文的相似性。这些考虑也可以应用于使用逻辑提示来利用生成语言模型的分类器，尽管经过明确训练以识别前两个类别，但仍无法正确区分逻辑含义，冷漠和不一致。我们提出了一条新颖的管道，旨在解决此问题的混合解释性。我们的方法结合了图和逻辑以产生一阶逻辑表示，通过蒙塔古语法创建机器和人类可读的表示。初步结果表明这种方法在准确捕获全文相似性中的有效性。据我们所知，这是区分含义，不一致和对文本分类任务的冷漠的第一种方法。为了解决现有方法的局限性，我们使用针对以前的分类任务注释的三个独立数据集来确定这些方法在捕获句子结构等效性，逻辑连接和时空推理时的适用性。我们还使用这些数据将所提出的方法与预先训练的语言模型进行比较，以检测句子的需求。结果表明，所提出的方法的表现优于最先进的模型，表明自然语言理解不能通过培训而不是广泛的文档语料库来概括。这项工作为从广泛的文本数据中检索更透明和可靠的信息提供了一步。

Title: An Evaluation of Large Language Models on Text Summarization Tasks Using Prompt Engineering Techniques

Authors: Walid Mohamed Aly, Taysir Hassan A. Soliman, Amr Mohamed AbdelAziz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05123
Pdf URL: https://arxiv.org/pdf/2507.05123
Copy Paste: [[2507.05123]] An Evaluation of Large Language Models on Text Summarization Tasks Using Prompt Engineering Techniques(https://arxiv.org/abs/2507.05123)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) continue to advance natural language processing with their ability to generate human-like text across a range of tasks. Despite the remarkable success of LLMs in Natural Language Processing (NLP), their performance in text summarization across various domains and datasets has not been comprehensively evaluated. At the same time, the ability to summarize text effectively without relying on extensive training data has become a crucial bottleneck. To address these issues, we present a systematic evaluation of six LLMs across four datasets: CNN/Daily Mail and NewsRoom (news), SAMSum (dialog), and ArXiv (scientific). By leveraging prompt engineering techniques including zero-shot and in-context learning, our study evaluates performance using the ROUGE and BERTScore metrics. In addition, a detailed analysis of inference times is conducted to better understand the trade-off between summarization quality and computational efficiency. For long documents, we introduce a sentence-based chunking strategy that enables LLMs with shorter context windows to summarize extended inputs in multiple stages. The findings reveal that while LLMs perform competitively on news and dialog tasks, their performance on long scientific documents improves significantly when aided by chunking strategies. In addition, notable performance variations were observed based on model parameters, dataset properties, and prompt design. These results offer actionable insights into how different LLMs behave across task types, contributing to ongoing research in efficient, instruction-based NLP systems.
摘要：大型语言模型（LLM）继续以其在各种任务中生成类似人类文本的能力来推进自然语言处理。尽管LLM在自然语言处理（NLP）中取得了显着的成功，但尚未对其在各个领域和数据集的文本摘要中的性能进行全面评估。同时，在不依赖大量培训数据的情况下有效地总结文本的能力已成为至关重要的瓶颈。为了解决这些问题，我们在四个数据集中对六个LLM进行系统评估：CNN/Daily Mail和新闻编辑室（新闻），Samsum（Dialog）和Arxiv（Scientific）。通过利用包括零射门和内部文化学习在内的及时工程技术，我们的研究使用Rouge和Bertscore指标评估了性能。此外，对推理时间进行了详细的分析，以更好地了解汇总质量和计算效率之间的权衡。对于长文档，引入了一个基于句子的块策略，该策略可以使LLM具有较短的上下文Windows，以总结多个阶段的扩展输入。研究结果表明，尽管LLM在新闻和对话任务上有竞争力，但在通过块策略的帮助下，他们在长期科学文档上的表现会大大提高。另外，根据模型参数，数据集属性和提示设计观察到了显着的性能变化。这些结果提供了可行的见解，了解不同的LLM在任务类型中的行为如何，这有助于基于高效的，基于教学的NLP系统的持续研究。
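
The sentence-based chunking idea mentioned in this abstract can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the abstract does not say how sentences are split, how large the chunks are, or how partial summaries are combined. The function names, the regex splitter, the word budget, and the two-stage "summarize chunks, then summarize the summaries" scheme are all assumptions, and `summarize` stands in for any LLM call.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences: List[str], max_words: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than the budget becomes its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sent in sentences:
        n = len(sent.split())
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize_long(text: str, summarize: Callable[[str], str], max_words: int = 200) -> str:
    """Two-stage summarization: summarize each chunk, then summarize the joined results.

    `summarize` is a hypothetical callable wrapping whatever model is used.
    """
    chunks = chunk_sentences(split_sentences(text), max_words)
    if len(chunks) <= 1:
        return summarize(text)
    partial = [summarize(chunk) for chunk in chunks]
    return summarize(" ".join(partial))
```

Keeping sentences whole means no chunk ends mid-sentence, at the cost of uneven chunk sizes; a token-based budget would track a model's context window more closely than the word count used here.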

Title: SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction

Authors: Alexander Scarlatos, Nigel Fernandez, Christopher Ormerod, Susan Lottridge, Andrew Lan
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05129
Pdf URL: https://arxiv.org/pdf/2507.05129
Copy Paste: [[2507.05129]] SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction(https://arxiv.org/abs/2507.05129)
Keywords: llm
Abstract: Item (question) difficulties play a crucial role in educational assessments, enabling accurate and efficient assessment of student abilities and personalization to maximize learning outcomes. Traditionally, estimating item difficulties can be costly, requiring real students to respond to items, followed by fitting an item response theory (IRT) model to get item difficulty estimates. Nor can this approach be applied to the cold-start setting for previously unseen items. In this work, we present SMART (Simulated Students Aligned with IRT), a novel method for aligning simulated students with instructed ability, which can then be used in simulations to predict the difficulty of open-ended items. We achieve this alignment using direct preference optimization (DPO), where we form preference pairs based on how likely responses are under a ground-truth IRT model. We perform a simulation by generating thousands of responses, evaluating them with an LLM-based scoring model, and fitting the resulting data to an IRT model to obtain item difficulty estimates. Through extensive experiments on a real-world student response dataset, we show that SMART outperforms other item difficulty prediction methods by leveraging its improved ability alignment.
摘要：项目（问题）困难在教育评估中起着至关重要的作用，可以准确有效地评估学生能力和个性化，从而最大程度地提高学习成果。传统上，估计项目困难可能会昂贵，要求真正的学生对项目做出响应，然后拟合项目响应理论（IRT）模型以获取项目难度估计。此方法也不能应用于以前看不见的项目的冷启动设置。在这项工作中，我们介绍了Smart（模拟学生与IRT保持一致），这是一种使模拟学生具有指导能力的新颖方法，然后可以在模拟中使用该方法来预测开放式项目的难度。我们使用直接偏好优化（DPO）实现此比对，在该响应下，我们根据地面真相模型的响应的可能性对偏好对进行了。我们通过生成数千个响应，通过基于LLM的评分模型对其进行评估，并将结果数据拟合到IRT模型以获得项目难度估计值来进行模拟。通过对现实世界学生响应数据集的广泛实验，我们表明SMART通过利用其提高的能力对齐方式优于其他项目难度预测方法。
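
To make the "fit an IRT model to get item difficulty estimates" step concrete, here is a small sketch under stated assumptions. The abstract does not say which IRT variant is used; this example assumes the one-parameter (Rasch) model, P(correct) = sigmoid(theta - b), and further assumes student abilities `thetas` are already known, so only a single item's difficulty `b` is estimated. The function names are hypothetical.

```python
import math
from typing import Sequence


def p_correct(theta: float, b: float) -> float:
    """Rasch (1PL) probability that a student of ability theta answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def fit_difficulty(thetas: Sequence[float], responses: Sequence[int],
                   lo: float = -6.0, hi: float = 6.0, iters: int = 60) -> float:
    """Maximum-likelihood difficulty for one item, with abilities held fixed.

    Setting the derivative of the log-likelihood to zero gives
    sum_i p_correct(theta_i, b) = sum_i responses_i; the left side decreases
    monotonically in b, so bisection on [lo, hi] finds the root.
    If every response is correct (or every one wrong) the estimate runs to a bound.
    """
    target = sum(responses)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        expected = sum(p_correct(t, mid) for t in thetas)
        if expected > target:
            lo = mid  # model predicts too many correct answers: item must be harder
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For example, four students of ability 0 with three correct answers give an observed success rate of 0.75, so the estimate is b = -ln 3 (an easy item). In the simulation setting described above, the responses would come from simulated students scored by a model rather than from real students.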

Title: Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization

Authors: Jaewook Lee, Alexander Scarlatos, Andrew Lan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05137
Pdf URL: https://arxiv.org/pdf/2507.05137
Copy Paste: [[2507.05137]] Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization(https://arxiv.org/abs/2507.05137)
Keywords: language model, llm
Abstract: Learning Japanese vocabulary is a challenge for learners from Roman alphabet backgrounds due to script differences. Japanese combines syllabaries like hiragana with kanji, which are logographic characters of Chinese origin. Kanji are also hard to learn because of their complexity and sheer number. Keyword mnemonics are a common strategy to aid memorization, often using the compositional structure of kanji to form vivid associations. Despite recent efforts to use large language models (LLMs) to assist learners, existing methods for LLM-based keyword mnemonic generation function as a black box, offering limited interpretability. We propose a generative framework that explicitly models the mnemonic construction process as driven by a set of common rules, and learn them using a novel Expectation-Maximization-type algorithm. Trained on learner-authored mnemonics from an online platform, our method learns latent structures and compositional rules, enabling interpretable and systematic mnemonic generation. Experiments show that our method performs well in the cold-start setting for new learners while providing insight into the mechanisms behind effective mnemonic creation.
摘要：由于脚本差异，学习日本词汇是罗马字母背景的学习者的挑战。日本人将诸如Hiragana之类的教学大学与汉字结合在一起，这是中国起源的逻辑特征。汉字由于其复杂性和体积而变得复杂。关键字助记符是有助于记忆的常见策略，通常使用汉字的组成结构来建立生动的关联。尽管最近努力使用大型语言模型（LLM）来协助学习者，但基于LLM的关键字助记符生成的现有方法作为黑匣子，提供有限的解释性。我们提出了一个生成框架，该框架明确地模拟了由一组共同规则驱动的助记符构建过程，并使用一种新颖的期望最大化型算法来学习它们。我们的方法从在线平台上接受了学习者创作的助记符，我们的方法学习了潜在的结构和构图规则，从而可以生成可解释和系统的助记符。实验表明，我们的方法在新学习者的冷启动环境中表现良好，同时洞悉有效助记符创建背后的机制。

Title: AI Generated Text Detection Using Instruction Fine-tuned Large Language and Transformer-Based Models

Authors: Chinnappa Guggilla, Budhaditya Roy, Trupti Ramdas Chavan, Abdul Rahman, Edward Bowen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05157
Pdf URL: https://arxiv.org/pdf/2507.05157
Copy Paste: [[2507.05157]] AI Generated Text Detection Using Instruction Fine-tuned Large Language and Transformer-Based Models(https://arxiv.org/abs/2507.05157)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) possess an extraordinary capability to produce text that is not only coherent and contextually relevant but also strikingly similar to human writing. They adapt to various styles and genres, producing content that is both grammatically correct and semantically meaningful. Recently, LLMs have been misused to create highly realistic phishing emails, spread fake news, generate code to automate cyber crime, and write fraudulent scientific articles. Additionally, in many real-world applications, the generated content, including its style and topic, and the generator model are not known beforehand. The increasing prevalence and sophistication of artificial intelligence (AI)-generated texts have made their detection progressively more challenging. Various attempts have been made to distinguish machine-generated text from human-authored content using linguistic, statistical, machine learning, and ensemble-based approaches. This work focuses on two primary objectives: Task-A, which involves distinguishing human-written text from machine-generated text, and Task-B, which attempts to identify the specific LLM model responsible for the generation. Both of these tasks are based on fine-tuning of Generative Pre-trained Transformer (GPT_4o-mini), Large Language Model Meta AI (LLaMA) 3 8B, and Bidirectional Encoder Representations from Transformers (BERT). The fine-tuned version of GPT_4o-mini and the BERT model have achieved accuracies of 0.9547 for Task-A and 0.4698 for Task-B.
摘要：大型语言模型（LLM）具有出色的能力，可以产生不仅连贯且上下文相关的文本，而且与人类写作非常相似。它们适应各种样式和流派，产生语法正确和语义上有意义的内容。最近，LLM被滥用，以创建高度现实的网络钓鱼电子邮件，传播虚假新闻，生成代码以自动化网络犯罪并撰写欺诈性的科学文章。此外，在许多现实世界中，生成的内容（包括样式和主题）以及生成器模型的生成内容尚不清楚。人工智能（AI）生成的文本的越来越多的患病率和复杂性使他们的检测逐渐更具挑战性。已经尝试了使用语言，统计，机器学习和基于合奏的方法区分机器生成的文本和人类创作的内容的各种尝试。这项工作着重于两个主要目标任务-A，其中涉及将人写的文本与机器生成的文本和Task-B区分开，该文本试图识别负责生成的特定LLM模型。这两个任务均基于生成预训练的变压器（GPT_4O-MINI），大语言模型Meta AI（Llama）3 8B的微调和来自变压器（BERT）的双向编码器表示。 GPT_4O-MINI的微调版本和BERT模型的任务A的精度为0.9547，Task-B的精确度为0.4698。

Title: InfoSteer: Steering Information Utility in Language Model Post-Training

Authors: Chunyuan Deng, Ruidi Chang, Hanjie Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05158
Pdf URL: https://arxiv.org/pdf/2507.05158
Copy Paste: [[2507.05158]] InfoSteer: Steering Information Utility in Language Model Post-Training(https://arxiv.org/abs/2507.05158)
Keywords: language model
Abstract: Recent advancements in language models (LMs) have gradually ushered in an era where post-training is crucial. Yet, post-training approaches such as supervised fine-tuning (SFT) do not guarantee effective use of knowledge acquired during pretraining. We therefore present InfoSteer, a lightweight method that encourages parametric information utilization in LMs during post-training. It treats the FFN layer as associative key-value memory and promotes the use of stored memory vectors via forward-pass interventions or regularization during backpropagation. We find that this simple guidance during the post-training phase delivers consistent performance improvements across diverse model families -- including Qwen, Gemma and Llama -- spanning over 15 downstream tasks in both ID and OOD evaluations. Beyond performance gains, we also find that steered LMs can adaptively allocate information, placing more emphasis on generating semantically meaningful tokens while using fewer resources on simple transition ones (e.g., ',' or 'and'). Our work underscores that vanilla post-training does not fully leverage pre-training potential, and steering LMs in latent representation space offers a promising approach that enhances both performance and interpretability.
摘要：语言模型（LMS）的最新进展逐渐迎来了训练后至关重要的时代。但是，培训后的方法（例如受监督的微调（SFT））不能保证在训练期间获得的知识有效使用。因此，我们提出了一种轻巧的方法，可以鼓励在训练后LMS中使用参数信息。这是通过将FFN层视为关联键值内存的，并通过向后流动过程中的前通用干预措施或正则化来促进存储的存储矢量的使用。我们发现在训练后阶段的这一简单指导可在不同的模型系列中提供一致的性能改进，包括QWEN，GEMMA和LLAMA跨越ID和OOD评估的下游任务。除了性能增长之外，我们还发现，转向的LMS可以自适应地分配信息，以更加强调生成语义上有意义的代币，同时在简单的过渡方面使用较少的资源（例如，``，''，'或'and'）。我们的工作强调了香草后训练并不能完全利用前训练的潜力，而在潜在代表空间中转向LMS提供了一种有希望的方法，可增强性能和可解释性。

Title: OpenS2S: Advancing Open-Source End-to-End Empathetic Large Speech Language Model

Authors: Chen Wang, Tianyu Peng, Wen Yang, Yinan Bai, Guangfu Wang, Jun Lin, Lanpeng Jia, Lingxiang Wu, Jinqiao Wang, Chengqing Zong, Jiajun Zhang
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2507.05177
Pdf URL: https://arxiv.org/pdf/2507.05177
Copy Paste: [[2507.05177]] OpenS2S: Advancing Open-Source End-to-End Empathetic Large Speech Language Model(https://arxiv.org/abs/2507.05177)
Keywords: language model
Abstract: Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for transparent research into the LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, pre-training and fine-tuning codes, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at this https URL
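Summary: Empathetic interaction is a cornerstone of human-machine communication, because it requires understanding speech enriched with paralinguistic cues and generating emotional, expressive responses. However, the most powerful empathetic large speech language models (LSLMs) are increasingly closed, leaving key details of their architecture, data, and development opaque to researchers. Given the critical need for transparent research on LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent, end-to-end LSLM designed to enable empathetic speech interaction. Building on our empathetic speech-to-text model BLSP-Emo, OpenS2S further adopts a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By using large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we build a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, and pre-training and fine-tuning code, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at this https URL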

Title: From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations

Authors: Pulkit Bansal, Raghvendra Kumar, Shakti Singh, Sriparna Saha, Adam Jatowt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05179
Pdf URL: https://arxiv.org/pdf/2507.05179
Copy Paste: [[2507.05179]] From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations(https://arxiv.org/abs/2507.05179)
Keywords: llm
Abstract: In an era of rampant misinformation, generating reliable news explanations is vital, especially for under-represented languages like Hindi. Lacking robust automated tools, Hindi faces challenges in scaling misinformation detection. To bridge this gap, we propose a novel framework integrating Direct Preference Optimization (DPO) with curriculum learning to align machine-generated explanations with human reasoning. Fact-checked explanations from credible sources serve as preferred responses, while LLM outputs highlight system limitations and serve as non-preferred responses. To refine task-specific alignment, we introduce two key parameters -- Actuality and Finesse -- into the DPO loss function, enhancing explanation quality and consistency. Experiments with LLMs (Mistral, Llama, Gemma) and PLMs (mBART, mT5) confirm the framework's effectiveness in generating coherent, contextually relevant explanations. This scalable approach combats misinformation and extends automated explanation generation to low-resource languages.
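Summary: In an era of rampant misinformation, generating reliable news explanations is vital, especially for under-represented languages such as Hindi. Lacking robust automated tools, Hindi faces challenges in scaling misinformation detection. To bridge this gap, we propose a novel framework that integrates Direct Preference Optimization (DPO) with curriculum learning to align machine-generated explanations with human reasoning. Fact-checked explanations from credible sources serve as preferred responses, while LLM outputs highlight system limitations and serve as non-preferred responses. To refine task-specific alignment, we introduce two key parameters, Actuality and Finesse, into the DPO loss function, improving explanation quality and consistency. Experiments with LLMs (Mistral, Llama, Gemma) and PLMs (mBART, mT5) confirm the framework's effectiveness in generating coherent, contextually relevant explanations. This scalable approach combats misinformation and extends automated explanation generation to low-resource languages.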
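
For readers unfamiliar with the objective this entry builds on, the sketch below shows the standard DPO loss for a single preference pair, plus a simple easy-to-hard ordering of pairs as one possible reading of "curriculum learning". It is a minimal sketch under stated assumptions: the abstract does not say how Actuality and Finesse enter the loss, so they are not modeled here, and the `difficulty` field and the `curriculum_order` helper are hypothetical names introduced only for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (preferred, non-preferred) pair.

    loss = -log(sigmoid(beta * margin)), where margin is the policy's
    log-ratio advantage on the preferred response over the non-preferred
    one, each measured relative to a frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)); branch for numerical stability.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def curriculum_order(pairs):
    """Hypothetical curriculum: present pairs from easiest to hardest.

    Each pair is a dict with an assumed 'difficulty' score; how difficulty
    is defined in the paper is not stated in the abstract.
    """
    return sorted(pairs, key=lambda pair: pair["difficulty"])


# Toy usage with made-up log-probabilities.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)   # policy equals reference
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)   # policy prefers the chosen response
ordered = curriculum_order([{"id": "b", "difficulty": 0.9},
                            {"id": "a", "difficulty": 0.2}])
print(neutral, improved, [p["id"] for p in ordered])
```

When the policy matches the reference the loss equals log 2, and it falls as the policy assigns relatively more probability to the fact-checked (preferred) explanation than to the LLM-generated (non-preferred) one.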

Title: Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models

Authors: Ziqi Miao, Lijun Li, Yuan Xiong, Zhenhua Liu, Pengyu Zhu, Jing Shao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.05248
Pdf URL: https://arxiv.org/pdf/2507.05248
Copy Paste: [[2507.05248]] Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models(https://arxiv.org/abs/2507.05248)
Keywords: language model, llm, prompt
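Abstract: Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which the previous response in the dialogue can steer the model's subsequent behavior toward policy-violating content. Building on this insight, we propose Response Attack (RA), which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query. The paraphrased query and the generated response are then formatted into the dialogue and followed by a succinct trigger prompt, thereby priming the target model to generate harmful content. Across eight open-source and proprietary LLMs, RA consistently outperforms seven state-of-the-art jailbreak techniques, achieving higher attack success rates. To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset, which significantly reduces the attack success rate while preserving model capabilities. The code and data are available at this https URL.
Summary: Contextual priming, in which earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which the previous response in a dialogue can steer the model's subsequent behavior toward policy-violating content. Building on this insight, we propose Response Attack (RA), which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query. The two are then formatted into the dialogue and followed by a succinct trigger prompt, priming the target model to generate harmful content. Across eight open-source and proprietary LLMs, RA consistently outperforms seven state-of-the-art jailbreak techniques, achieving higher attack success rates. To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset that significantly reduces the attack success rate while preserving model capabilities. The code and data are available at this https URL.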

Title: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Authors: Yuanzhe Hu, Yu Wang, Julian McAuley
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05257
Pdf URL: https://arxiv.org/pdf/2507.05257
Copy Paste: [[2507.05257]] Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions(https://arxiv.org/abs/2507.05257)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component, memory (encompassing how agents memorize, update, and retrieve long-term information), is under-evaluated due to the lack of benchmarks. We term agents with memory mechanisms as memory agents. In this paper, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and conflict resolution. Existing datasets either rely on limited context lengths or are tailored for static, long-context settings like book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that incrementally accumulate information. Furthermore, no existing benchmarks cover all four competencies. Therefore, we introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. Our benchmark combines reformulated existing datasets with newly constructed ones, covering the above four memory competencies, providing a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.
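Summary: Recent benchmarks for Large Language Model (LLM) agents focus mainly on evaluating reasoning, planning, and execution capabilities, while another critical component, memory (how agents memorize, update, and retrieve long-term information), remains under-evaluated because benchmarks are lacking. We refer to agents with memory mechanisms as memory agents. In this paper, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and conflict resolution. Existing datasets either rely on limited context lengths or are tailored to static, long-context settings such as book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that accumulate information incrementally. Furthermore, no existing benchmark covers all four competencies. We therefore introduce MemoryAgentBench, a new benchmark designed specifically for memory agents. Our benchmark combines reformulated existing datasets with newly constructed ones, covers the four memory competencies above, and provides a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results show that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.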
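
The incremental, multi-turn evaluation setting described in this entry can be pictured with a small harness. This is a toy sketch, not the MemoryAgentBench API: the `KeyValueMemoryAgent` class, its `memorize`/`answer` methods, the "subject: value" chunk format, and the exact-match scoring are all assumptions chosen to keep the example self-contained. It feeds information to an agent one chunk at a time, lets later chunks overwrite earlier ones (a crude form of conflict resolution), and then scores the answers.

```python
class KeyValueMemoryAgent:
    """Hypothetical memory agent that stores 'subject: value' facts."""

    def __init__(self):
        self.facts = {}

    def memorize(self, chunk):
        # Later chunks about the same subject overwrite earlier ones,
        # so the most recent statement wins when facts conflict.
        subject, _, value = chunk.partition(":")
        self.facts[subject.strip().lower()] = value.strip()

    def answer(self, question):
        # Exact lookup by subject; unseen subjects yield "unknown".
        return self.facts.get(question.strip().lower(), "unknown")


def evaluate(agent, chunks, qa_pairs):
    """Feed chunks one turn at a time, then return exact-match accuracy."""
    for chunk in chunks:
        agent.memorize(chunk)
    correct = sum(agent.answer(q) == a for q, a in qa_pairs)
    return correct / len(qa_pairs)


# Toy interaction: the second statement about Alice conflicts with the first.
chunks = ["Alice's city: Paris", "Bob's pet: cat", "Alice's city: Berlin"]
qa_pairs = [("Alice's city", "Berlin"),   # needs conflict resolution
            ("Bob's pet", "cat"),         # needs accurate retrieval
            ("Carol's job", "pilot")]     # never stated, so the agent misses it
accuracy = evaluate(KeyValueMemoryAgent(), chunks, qa_pairs)
print(accuracy)
```

A real memory agent would replace the dictionary with a long context, a retriever, or an external memory module; the harness shape (incremental writes followed by queries) is the part this sketch is meant to convey.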