2025-07-24

Title: A Unifying Scheme for Extractive Content Selection Tasks

Authors: Shmuel Amar, Ori Shapira, Aviv Slobodkin, Ido Dagan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16922
Pdf URL: https://arxiv.org/pdf/2507.16922
Copy Paste: [[2507.16922]] A Unifying Scheme for Extractive Content Selection Tasks(https://arxiv.org/abs/2507.16922)
Keywords: language model, llm
Abstract: A broad range of NLP tasks involve selecting relevant text spans from given source texts. Despite this shared objective, such content selection tasks have traditionally been studied in isolation, each with its own modeling approaches, datasets, and evaluation metrics. In this work, we propose instruction-guided content selection (IGCS) as a beneficial unified framework for such settings, where the task definition and any instance-specific request are encapsulated as instructions to a language model. To promote this framework, we introduce IGCSBench, the first unified benchmark covering diverse content selection tasks. Further, we create a large generic synthetic dataset that can be leveraged for diverse content selection tasks, and show that transfer learning with these datasets often boosts performance, whether dedicated training for the targeted task is available or not. Finally, we address generic inference-time issues that arise in LLM-based modeling of content selection, assess a generic evaluation metric, and overall propose the utility of our resources and methods for future content selection models. Models and datasets are available at this https URL.

Title: AI-based Clinical Decision Support for Primary Care: A Real-World Study

Authors: Robert Korom, Sarah Kiptinness, Najib Adan, Kassim Said, Catherine Ithuli, Oliver Rotich, Boniface Kimani, Irene King'ori, Stellah Kamau, Elizabeth Atemba, Muna Aden, Preston Bowman, Michael Sharman, Rebecca Soskin Hicks, Rebecca Distler, Johannes Heidecke, Rahul K. Arora, Karan Singhal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16947
Pdf URL: https://arxiv.org/pdf/2507.16947
Copy Paste: [[2507.16947]] AI-based Clinical Decision Support for Primary Care: A Real-World Study(https://arxiv.org/abs/2507.16947)
Keywords: language model, llm
Abstract: We evaluate the impact of large language model-based clinical decision support in live care. In partnership with Penda Health, a network of primary care clinics in Nairobi, Kenya, we studied AI Consult, a tool that serves as a safety net for clinicians by identifying potential documentation and clinical decision-making errors. AI Consult integrates into clinician workflows, activating only when needed and preserving clinician autonomy. We conducted a quality improvement study, comparing outcomes for 39,849 patient visits performed by clinicians with or without access to AI Consult across 15 clinics. Visits were rated by independent physicians to identify clinical errors. Clinicians with access to AI Consult made relatively fewer errors: 16% fewer diagnostic errors and 13% fewer treatment errors. In absolute terms, the introduction of AI Consult would avert diagnostic errors in 22,000 visits and treatment errors in 29,000 visits annually at Penda alone. In a survey of clinicians with AI Consult, all clinicians said that AI Consult improved the quality of care they delivered, with 75% saying the effect was "substantial". These results required a clinical workflow-aligned AI Consult implementation and active deployment to encourage clinician uptake. We hope this study demonstrates the potential for LLM-based clinical decision support tools to reduce errors in real-world settings and provides a practical framework for advancing responsible adoption.

Title: Harnessing RLHF for Robust Unanswerability Recognition and Trustworthy Response Generation in LLMs

Authors: Shuyuan Lin, Lei Duan, Philip Hughes, Yuxuan Sheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16951
Pdf URL: https://arxiv.org/pdf/2507.16951
Copy Paste: [[2507.16951]] Harnessing RLHF for Robust Unanswerability Recognition and Trustworthy Response Generation in LLMs(https://arxiv.org/abs/2507.16951)
Keywords: language model, llm, hallucination
Abstract: Conversational Information Retrieval (CIR) systems, while offering intuitive access to information, face a significant challenge: reliably handling unanswerable questions to prevent the generation of misleading or hallucinated content. Traditional approaches often rely on external classifiers, which can introduce inconsistencies with the core generative Large Language Models (LLMs). This paper introduces Self-Aware LLM for Unanswerability (SALU), a novel approach that integrates unanswerability detection directly within the LLM's generative process. SALU is trained using a multi-task learning framework for both standard Question Answering (QA) and explicit abstention generation for unanswerable queries. Crucially, it incorporates a confidence-score-guided reinforcement learning from human feedback (RLHF) phase, which explicitly penalizes hallucinated responses and rewards appropriate abstentions, fostering intrinsic self-awareness of knowledge boundaries. Through extensive experiments on our custom-built C-IR_Answerability dataset, SALU consistently outperforms strong baselines, including hybrid LLM-classifier systems, in overall accuracy for correctly answering or abstaining from questions. Human evaluation further confirms SALU's superior reliability, achieving high scores in factuality, appropriate abstention, and, most importantly, a dramatic reduction in hallucination, demonstrating its ability to robustly "know when to say 'I don't know'."

Title: Text-to-SPARQL Goes Beyond English: Multilingual Question Answering Over Knowledge Graphs through Human-Inspired Reasoning

Authors: Aleksandr Perevalov, Andreas Both
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2507.16971
Pdf URL: https://arxiv.org/pdf/2507.16971
Copy Paste: [[2507.16971]] Text-to-SPARQL Goes Beyond English: Multilingual Question Answering Over Knowledge Graphs through Human-Inspired Reasoning(https://arxiv.org/abs/2507.16971)
Keywords: llm, agent
Abstract: Accessing knowledge via multilingual natural-language interfaces is one of the emerging challenges in information retrieval and related fields. Structured knowledge stored in knowledge graphs can be queried via a specific query language (e.g., SPARQL). Therefore, one needs to transform natural-language input into a query to fulfill an information need. Prior approaches mostly focused on combining components (e.g., rule-based or neural-based) that solve downstream tasks and come up with an answer at the end. We introduce mKGQAgent, a human-inspired framework that breaks down the task of converting natural language questions into SPARQL queries into modular, interpretable subtasks. By leveraging a coordinated LLM agent workflow for planning, entity linking, and query refinement - guided by an experience pool for in-context learning - mKGQAgent efficiently handles multilingual KGQA. Evaluated on the DBpedia- and Corporate-based KGQA benchmarks within the Text2SPARQL challenge 2025, our approach took first place among the participants. This work opens new avenues for developing human-like reasoning systems in multilingual semantic parsing.

Title: Leveraging Synthetic Data for Question Answering with Multilingual LLMs in the Agricultural Domain

Authors: Rishemjit Kaur, Arshdeep Singh Bhankhar, Surangika Ranathunga, Jashanpreet Singh Salh, Sudhir Rajput, Vidhi, Kashish Mahendra, Bhavika Berwal, Ritesh Kumar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.16974
Pdf URL: https://arxiv.org/pdf/2507.16974
Copy Paste: [[2507.16974]] Leveraging Synthetic Data for Question Answering with Multilingual LLMs in the Agricultural Domain(https://arxiv.org/abs/2507.16974)
Keywords: language model, llm
Abstract: Enabling farmers to access accurate agriculture-related information in their native languages in a timely manner is crucial for the success of the agriculture field. Although large language models (LLMs) can be used to implement Question Answering (QA) systems, publicly available general-purpose LLMs typically offer only generic advisories in agriculture, lacking precision in local and multilingual contexts due to insufficient domain-specific training and a scarcity of high-quality, region-specific datasets. Our study addresses these limitations by generating multilingual synthetic agricultural datasets (English, Hindi, Punjabi) from agriculture-specific documents and fine-tuning language-specific LLMs. Our evaluation on curated multilingual datasets demonstrates significant improvements in factual accuracy, relevance, and agricultural consensus for the fine-tuned models compared to their baseline counterparts. These results highlight the efficacy of synthetic data-driven, language-specific fine-tuning as a strategy to improve the performance of LLMs in agriculture, especially in multilingual and low-resource settings. By enabling more accurate and localized agricultural advisory services, this study provides a meaningful step toward bridging the knowledge gap in AI-driven agricultural solutions for diverse linguistic communities.

Title: Obscured but Not Erased: Evaluating Nationality Bias in LLMs via Name-Based Bias Benchmarks

Authors: Giulio Pelosio, Devesh Batra, Noémie Bovey, Robert Hankache, Cristovao Iglesias, Greig Cowan, Raad Khraishi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16989
Pdf URL: https://arxiv.org/pdf/2507.16989
Copy Paste: [[2507.16989]] Obscured but Not Erased: Evaluating Nationality Bias in LLMs via Name-Based Bias Benchmarks(https://arxiv.org/abs/2507.16989)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) can exhibit latent biases towards specific nationalities even when explicit demographic markers are not present. In this work, we introduce a novel name-based benchmarking approach derived from the Bias Benchmark for QA (BBQ) dataset to investigate the impact of substituting explicit nationality labels with culturally indicative names, a scenario more reflective of real-world LLM applications. Our approach examines how this substitution affects both bias magnitude and accuracy across a spectrum of LLMs from industry leaders such as OpenAI, Google, and Anthropic. Our experiments show that small models are less accurate and exhibit more bias compared to their larger counterparts. For instance, on our name-based dataset and in the ambiguous context (where the correct choice is not revealed), Claude Haiku exhibited the worst stereotypical bias score of 9%, compared to only 3.5% for its larger counterpart, Claude Sonnet, which also outperformed it by 117.7% in accuracy. Additionally, we find that small models retain a larger portion of existing errors in these ambiguous contexts. For example, after substituting names for explicit nationality references, GPT-4o retains 68% of the error rate versus 76% for GPT-4o-mini, with similar findings for other model providers, in the ambiguous context. Our research highlights the stubborn resilience of biases in LLMs, underscoring their profound implications for the development and deployment of AI systems in diverse, global contexts.

Title: Multi-Label Classification with Generative AI Models in Healthcare: A Case Study of Suicidality and Risk Factors

Authors: Ming Huang, Zehan Li, Yan Hu, Wanjing Wang, Andrew Wen, Scott Lane, Salih Selek, Lokesh Shahani, Rodrigo Machado-Vieira, Jair Soares, Hua Xu, Hongfang Liu
Subjects: cs.CL, cs.IR, q-bio.QM
Abstract URL: https://arxiv.org/abs/2507.17009
Pdf URL: https://arxiv.org/pdf/2507.17009
Copy Paste: [[2507.17009]] Multi-Label Classification with Generative AI Models in Healthcare: A Case Study of Suicidality and Risk Factors(https://arxiv.org/abs/2507.17009)
Keywords: language model, gpt, llm, prompt
Abstract: Suicide remains a pressing global health crisis, with over 720,000 deaths annually and millions more affected by suicide ideation (SI) and suicide attempts (SA). Early identification of suicidality-related factors (SrFs), including SI, SA, exposure to suicide (ES), and non-suicidal self-injury (NSSI), is critical for timely intervention. While prior studies have applied AI to detect SrFs in clinical notes, most treat suicidality as a binary classification task, overlooking the complexity of co-occurring risk factors. This study explores the use of generative large language models (LLMs), specifically GPT-3.5 and GPT-4.5, for multi-label classification (MLC) of SrFs from psychiatric electronic health records (EHRs). We present a novel end-to-end generative MLC pipeline and introduce advanced evaluation methods, including label-set-level metrics and a multi-label confusion matrix for error analysis. Fine-tuned GPT-3.5 achieved top performance with 0.94 partial match accuracy and 0.91 F1 score, while GPT-4.5 with guided prompting showed superior performance across label sets, including rare or minority label sets, indicating a more balanced and robust performance. Our findings reveal systematic error patterns, such as the conflation of SI and SA, and highlight the models' tendency toward cautious over-labeling. This work not only demonstrates the feasibility of using generative AI for complex clinical classification tasks but also provides a blueprint for structuring unstructured EHR data to support large-scale clinical research and evidence-based medicine.
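The abstract above reports partial match accuracy and F1 for multi-label predictions without defining them. As a rough illustration only, the sketch below scores predicted label sets against gold label sets. The definition of "partial match" used here (the two sets share at least one label, or both are empty) and the choice of micro-averaged F1 are assumptions, not the paper's stated definitions, and the example notes are made up.

```python
from typing import List, Set

# The four suicidality-related factor labels named in the abstract.
LABELS = {"SI", "SA", "ES", "NSSI"}


def exact_match_accuracy(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Fraction of notes whose predicted label set equals the gold label set."""
    hits = sum(1 for g, p in zip(gold, pred) if g == p)
    return hits / len(gold)


def partial_match_accuracy(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """ASSUMED definition: a note counts as a hit if the predicted and gold
    sets share at least one label, or if both sets are empty."""
    hits = sum(1 for g, p in zip(gold, pred) if (g & p) or (not g and not p))
    return hits / len(gold)


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over all (note, label) decisions."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up example: four notes with gold and predicted label sets.
gold_sets = [{"SI"}, {"SI", "SA"}, set(), {"NSSI"}]
pred_sets = [{"SI"}, {"SI"}, set(), {"ES"}]

print(exact_match_accuracy(gold_sets, pred_sets))    # 0.5
print(partial_match_accuracy(gold_sets, pred_sets))  # 0.75
print(micro_f1(gold_sets, pred_sets))
```

Scoring whole label sets rather than a single binary flag is what lets an analysis separate confusions such as SI versus SA from outright misses.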

Title: Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?

Authors: Arduin Findeis, Floris Weers, Guoli Yin, Ke Ye, Ruoming Pang, Tom Gunter
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.17015
Pdf URL: https://arxiv.org/pdf/2507.17015
Copy Paste: [[2507.17015]] Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?(https://arxiv.org/abs/2507.17015)
Keywords: language model, llm, prompt, chat, agent
Abstract: Pairwise preferences over model responses are widely collected to evaluate and provide feedback to large language models (LLMs). Given two alternative model responses to the same input, a human or AI annotator selects the "better" response. This approach can provide feedback for domains where other hard-coded metrics are difficult to obtain (e.g., chat response quality), thereby helping model evaluation or training. However, for some domains high-quality pairwise comparisons can be tricky to obtain - from AI and humans. For example, for responses with many factual statements, annotators may disproportionately weigh writing quality rather than underlying facts. In this work, we explore augmenting standard AI annotator systems with additional tools to improve performance on three challenging response domains: long-form factual, math and code tasks. We propose a tool-using agentic system to provide higher quality feedback on these domains. Our system uses web-search and code execution to ground itself based on external validation, independent of the LLM's internal knowledge and biases. We provide extensive experimental results evaluating our method across the three targeted response domains as well as general annotation tasks, using RewardBench (incl. AlpacaEval and LLMBar), RewardMath, as well as three new datasets for domains with saturated pre-existing datasets. Our results indicate that external tools can indeed improve performance in many, but not all, cases. More generally, our experiments highlight the sensitivity of performance to simple parameters (e.g., prompt) and the need for improved (non-saturated) annotator benchmarks. We share our code at this https URL.

Title: CogDual: Enhancing Dual Cognition of LLMs via Reinforcement Learning with Implicit Rule-Based Rewards

Authors: Cheng Liu, Yifei Lu, Fanghua Ye, Jian Li, Xingyu Chen, Feiliang Ren, Zhaopeng Tu, Xiaolong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.17147
Pdf URL: https://arxiv.org/pdf/2507.17147
Copy Paste: [[2507.17147]] CogDual: Enhancing Dual Cognition of LLMs via Reinforcement Learning with Implicit Rule-Based Rewards(https://arxiv.org/abs/2507.17147)
Keywords: language model, llm, prompt, agent
Abstract: Role-Playing Language Agents (RPLAs) have emerged as a significant application direction for Large Language Models (LLMs). Existing approaches typically rely on prompt engineering or supervised fine-tuning to enable models to imitate character behaviors in specific scenarios, but often neglect the underlying cognitive mechanisms driving these behaviors. Inspired by cognitive psychology, we introduce CogDual, a novel RPLA adopting a cognize-then-respond reasoning paradigm. By jointly modeling external situational awareness and internal self-awareness, CogDual generates responses with improved character consistency and contextual alignment. To further optimize the performance, we employ reinforcement learning with two general-purpose reward schemes designed for open-domain text generation. Extensive experiments on the CoSER benchmark, as well as Cross-MR and LifeChoice, demonstrate that CogDual consistently outperforms existing baselines and generalizes effectively across diverse role-playing tasks.

Title: SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs

Authors: Zhiqiang Liu, Enpei Niu, Yin Hua, Mengshu Sun, Lei Liang, Huajun Chen, Wen Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.17178
Pdf URL: https://arxiv.org/pdf/2507.17178
Copy Paste: [[2507.17178]] SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs(https://arxiv.org/abs/2507.17178)
Keywords: language model, llm, hallucination
Abstract: Although large language models (LLMs) have made significant progress in understanding Structured Knowledge (SK) like KG and Table, existing evaluations for SK understanding are non-rigorous (i.e., lacking evaluations of specific capabilities) and focus on a single type of SK. Therefore, we aim to propose a more comprehensive and rigorous structured knowledge understanding benchmark to diagnose the shortcomings of LLMs. In this paper, we introduce SKA-Bench, a Structured Knowledge Augmented QA Benchmark that encompasses four widely used structured knowledge forms: KG, Table, KG+Text, and Table+Text. We utilize a three-stage pipeline to construct SKA-Bench instances, each of which includes a question, an answer, positive knowledge units, and noisy knowledge units. To evaluate the SK understanding capabilities of LLMs in a fine-grained manner, we expand the instances into four fundamental ability testbeds: Noise Robustness, Order Insensitivity, Information Integration, and Negative Rejection. Empirical evaluations on 8 representative LLMs, including the advanced DeepSeek-R1, indicate that existing LLMs still face significant challenges in understanding structured knowledge, and their performance is influenced by factors such as the amount of noise, the order of knowledge units, and the hallucination phenomenon. Our dataset and code are available at this https URL.
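To make the instance structure in the abstract above concrete, here is a minimal sketch of how one instance (question, answer, positive knowledge units, noisy knowledge units) might be expanded into three of the four testbeds. The class name, field names, and construction rules are hypothetical illustrations, not the SKA-Bench implementation; Information Integration is omitted because the abstract does not say how it is built.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    """Hypothetical container mirroring the four parts named in the abstract."""
    question: str
    answer: str
    positive_units: List[str]
    noisy_units: List[str]


def noise_robustness_context(inst: Instance, n_noise: int, rng: random.Random) -> List[str]:
    """Positive units plus n_noise sampled noisy units (assumed construction)."""
    return inst.positive_units + rng.sample(inst.noisy_units, n_noise)


def order_insensitivity_context(inst: Instance, rng: random.Random) -> List[str]:
    """Same units as the full context, in a shuffled order (assumed construction)."""
    context = inst.positive_units + inst.noisy_units
    rng.shuffle(context)
    return context


def negative_rejection_context(inst: Instance) -> List[str]:
    """Only noisy units; the desired behaviour is that the model declines to answer."""
    return list(inst.noisy_units)


# Made-up example instance using knowledge-graph style triples.
example = Instance(
    question="Where was Ada Lovelace born?",
    answer="London",
    positive_units=["(Ada Lovelace, birthplace, London)"],
    noisy_units=[
        "(Charles Babbage, birthplace, London)",
        "(Ada Lovelace, occupation, mathematician)",
        "(London, country, United Kingdom)",
    ],
)

rng = random.Random(0)
print(noise_robustness_context(example, 2, rng))
print(order_insensitivity_context(example, rng))
print(negative_rejection_context(example))
```

Varying `n_noise` or the shuffle seed while holding the question fixed is one way to isolate how much noise or ordering alone changes a model's answer.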

Title: FinGAIA: An End-to-End Benchmark for Evaluating AI Agents in Finance

Authors: Lingfeng Zeng, Fangqi Lou, Zixuan Wang, Jiajie Xu, Jinyi Niu, Mengping Li, Yifan Dong, Qi Qi, Wei Zhang, Ziwei Yang, Jun Han, Ruilun Feng, Ruiqi Hu, Lejie Zhang, Zhengbo Feng, Yicheng Ren, Xin Guo, Zhaowei Liu, Dongpo Cheng, Weige Cai, Liwen Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.17186
Pdf URL: https://arxiv.org/pdf/2507.17186
Copy Paste: [[2507.17186]] FinGAIA: An End-to-End Benchmark for Evaluating AI Agents in Finance(https://arxiv.org/abs/2507.17186)
Keywords: gpt, chat, agent
Abstract: The booming development of AI agents presents unprecedented opportunities for automating complex tasks across various domains. However, their multi-step, multi-tool collaboration capabilities in the financial sector remain underexplored. This paper introduces FinGAIA, an end-to-end benchmark designed to evaluate the practical abilities of AI agents in the financial domain. FinGAIA comprises 407 meticulously crafted tasks, spanning seven major financial sub-domains: securities, funds, banking, insurance, futures, trusts, and asset management. These tasks are organized into three hierarchical levels of scenario depth: basic business analysis, asset decision support, and strategic risk management. We evaluated 10 mainstream AI agents in a zero-shot setting. The best-performing agent, ChatGPT, achieved an overall accuracy of 48.9%, which, while superior to non-professionals, still lags financial experts by over 35 percentage points. Error analysis has revealed five recurring failure patterns: Cross-modal Alignment Deficiency, Financial Terminological Bias, Operational Process Awareness Barrier, among others. These patterns point to crucial directions for future research. Our work provides the first agent benchmark closely related to the financial domain, aiming to objectively assess and promote the development of agents in this crucial field. Partial data is available at this https URL.

Title: The Pluralistic Moral Gap: Understanding Judgment and Value Differences between Humans and Large Language Models

Authors: Giuseppe Russo, Debora Nozza, Paul Röttger, Dirk Hovy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.17216
Pdf URL: https://arxiv.org/pdf/2507.17216
Copy Paste: [[2507.17216]] The Pluralistic Moral Gap: Understanding Judgment and Value Differences between Humans and Large Language Models(https://arxiv.org/abs/2507.17216)
Keywords: language model, llm
Abstract: People increasingly rely on Large Language Models (LLMs) for moral advice, which may influence humans' decisions. Yet, little is known about how closely LLMs align with human moral judgments. To address this, we introduce the Moral Dilemma Dataset, a benchmark of 1,618 real-world moral dilemmas paired with a distribution of human moral judgments consisting of a binary evaluation and a free-text rationale. We treat this problem as a pluralistic distributional alignment task, comparing the distributions of LLM and human judgments across dilemmas. We find that models reproduce human judgments only under high consensus; alignment deteriorates sharply when human disagreement increases. In parallel, using a 60-value taxonomy built from 3,783 value expressions extracted from rationales, we show that LLMs rely on a narrower set of moral values than humans. These findings reveal a pluralistic moral gap: a mismatch in both the distribution and diversity of values expressed. To close this gap, we introduce Dynamic Moral Profiling (DMP), a Dirichlet-based sampling method that conditions model outputs on human-derived value profiles. DMP improves alignment by 64.3% and enhances value diversity, offering a step toward more pluralistic and human-aligned moral guidance from LLMs.
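The abstract above describes Dynamic Moral Profiling only as a Dirichlet-based sampling method that conditions model outputs on human-derived value profiles. The sketch below shows one way such a step could look: draw a value distribution from a Dirichlet whose concentration comes from how often each value appears in human rationales, then put the top values into the prompt. The value names, counts, smoothing constant, and prompt wording are all hypothetical; this is not the authors' procedure.

```python
import random
from typing import Dict, List


def sample_dirichlet(alpha: List[float], rng: random.Random) -> List[float]:
    """Draw one sample from Dirichlet(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_value_profile(
    value_counts: Dict[str, int],
    k: int,
    rng: random.Random,
    smoothing: float = 1.0,
) -> List[str]:
    """Sample value weights and return the k highest-weighted value names.

    value_counts: how often each value appears in human rationales (hypothetical data).
    smoothing: added to every count so each concentration parameter stays positive.
    """
    names = list(value_counts)
    alpha = [value_counts[name] + smoothing for name in names]
    weights = sample_dirichlet(alpha, rng)
    ranked = sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def build_prompt(dilemma: str, profile: List[str]) -> str:
    """Condition the model on the sampled values (prompt wording is an assumption)."""
    values = ", ".join(profile)
    return (
        f"When judging, weigh these values: {values}.\n"
        f"Dilemma: {dilemma}\n"
        "Give a binary judgment and a short rationale."
    )


# Hypothetical counts of values mentioned in human rationales for one dilemma.
counts = {"honesty": 12, "loyalty": 5, "fairness": 8, "autonomy": 1}
rng = random.Random(0)
profile = sample_value_profile(counts, k=2, rng=rng)
print(build_prompt("I told my friend's partner about the affair.", profile))
```

Because the profile is resampled on every call, repeated generations for the same dilemma can emphasise different values, which is one plausible route to the wider value diversity the abstract reports.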

Title: Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge

Authors: Miaomiao Gao, Xiaoxiao Xiang, Yiwen Guo
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2507.17288
Pdf URL: https://arxiv.org/pdf/2507.17288
Copy Paste: [[2507.17288]] Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge(https://arxiv.org/abs/2507.17288)
Keywords: language model, llm
Abstract: This paper describes our Triple X speech recognition system submitted to Task 1 of the Multi-Lingual Conversational Speech Language Modeling (MLC-SLM) Challenge. Our work focuses on optimizing speech recognition accuracy in multilingual conversational scenarios through an innovative encoder-adapter-LLM architecture. This framework harnesses the powerful reasoning capabilities of text-based large language models while incorporating domain-specific adaptations. To further enhance multilingual recognition performance, we adopted a meticulously designed multi-stage training strategy leveraging extensive multilingual audio datasets. Experimental results demonstrate that our approach achieves competitive Word Error Rate (WER) performance on both dev and test sets, obtaining second place in the challenge ranking.

Title: Millions of $\text{GeAR}$-s: Extending GraphRAG to Millions of Documents

Authors: Zhili Shen, Chenxin Diao, Pascual Merita, Pavlos Vougiouklis, Jeff Z. Pan
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2507.17399
Pdf URL: https://arxiv.org/pdf/2507.17399
Copy Paste: [[2507.17399]] Millions of $\text{GeAR}$-s: Extending GraphRAG to Millions of Documents(https://arxiv.org/abs/2507.17399)
Keywords: retrieval-augmented generation
Abstract: Recent studies have explored graph-based approaches to retrieval-augmented generation, leveraging structured or semi-structured information -- such as entities and their relations extracted from documents -- to enhance retrieval. However, these methods are typically designed to address specific tasks, such as multi-hop question answering and query-focused summarisation, and therefore, there is limited evidence of their general applicability across broader datasets. In this paper, we aim to adapt a state-of-the-art graph-based RAG solution, GeAR, and explore its performance and limitations on the SIGIR 2025 LiveRAG Challenge.

Title: Each to Their Own: Exploring the Optimal Embedding in RAG

Authors: Shiting Chen, Zijian Zhao, Jinsong Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.17442
Pdf URL: https://arxiv.org/pdf/2507.17442
Copy Paste: [[2507.17442]] Each to Their Own: Exploring the Optimal Embedding in RAG(https://arxiv.org/abs/2507.17442)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Recently, as Large Language Models (LLMs) have fundamentally impacted various fields, the methods for incorporating up-to-date information into LLMs or adding external knowledge to construct domain-specific models have garnered wide attention. Retrieval-Augmented Generation (RAG), serving as an inference-time scaling method, is notable for its low cost and minimal effort for parameter tuning. However, due to heterogeneous training data and model architecture, the variant embedding models used in RAG exhibit different benefits across various areas, often leading to different similarity calculation results and, consequently, varying response quality from LLMs. To address this problem, we propose and examine two approaches to enhance RAG by combining the benefits of multiple embedding models, named Mixture-Embedding RAG and Confident RAG. Mixture-Embedding RAG simply sorts and selects retrievals from multiple embedding models based on standardized similarity; however, it does not outperform vanilla RAG. In contrast, Confident RAG generates responses multiple times using different embedding models and then selects the responses with the highest confidence level, demonstrating average improvements of approximately 10% and 5% over vanilla LLMs and RAG, respectively. The consistent results across different LLMs and embedding models indicate that Confident RAG is an efficient plug-and-play approach for various domains. We will release our code upon publication.
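
The selection step described for Confident RAG (generate one response per embedding model, keep the most confident one) can be sketched as follows. This is only an illustration, not the authors' implementation: the abstract does not say how confidence is computed, so the geometric-mean token probability used here is an assumption, and `Candidate`, `confidence`, and `select_most_confident` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    embedder: str                 # hypothetical name of the embedding model used for retrieval
    answer: str                   # response the LLM generated from that retrieval
    token_logprobs: list[float]   # log-probabilities of the generated tokens


def confidence(candidate: Candidate) -> float:
    """Assumed confidence score: geometric mean of the token probabilities."""
    if not candidate.token_logprobs:
        return 0.0
    mean_logprob = sum(candidate.token_logprobs) / len(candidate.token_logprobs)
    return math.exp(mean_logprob)


def select_most_confident(candidates: list[Candidate]) -> Candidate:
    """Return the candidate whose response has the highest confidence score."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    return max(candidates, key=confidence)
```

In use, each `Candidate` would come from running the same question through a RAG pipeline built on a different embedding model; any other confidence measure can be swapped in by replacing `confidence`.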

Title: MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs

Authors: Alexander R. Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez, Dean Lee, Bing Liu, Chen Xing
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.17476
Pdf URL: https://arxiv.org/pdf/2507.17476
Copy Paste: [[2507.17476]] MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs(https://arxiv.org/abs/2507.17476)
Keywords: language model, llm
Abstract: Although recent Large Language Models (LLMs) have shown rapid improvement on reasoning benchmarks in English, the evaluation of such LLMs' multilingual reasoning capability across diverse languages and cultural contexts remains limited. Existing multilingual reasoning benchmarks are typically constructed by translating existing English reasoning benchmarks, biasing these benchmarks towards reasoning problems with context in English language/cultures. In this work, we introduce the Multilingual Native Reasoning Challenge (MultiNRC), a benchmark designed to assess LLMs on more than 1,000 native, linguistic and culturally grounded reasoning questions written by native speakers in French, Spanish, and Chinese. MultiNRC covers four core reasoning categories: language-specific linguistic reasoning, wordplay & riddles, cultural/tradition reasoning, and math reasoning with cultural relevance. For cultural/tradition reasoning and math reasoning with cultural relevance, we also provide English equivalent translations of the multilingual questions by manual translation from native speakers fluent in English. This set of English equivalents can provide a direct comparison of LLM reasoning capacity in other languages vs. English on the same reasoning questions. We systematically evaluate 14 current leading LLMs covering most LLM families on MultiNRC and its English equivalent set. The results show that (1) current LLMs are still not good at native multilingual reasoning, with none scoring above 50% on MultiNRC; (2) LLMs exhibit distinct strengths and weaknesses in handling linguistic, cultural, and logical reasoning tasks; (3) most models perform substantially better in math reasoning in English than in the original languages (+10%), indicating persistent challenges with culturally grounded knowledge.

Title: Synthetic Voice Data for Automatic Speech Recognition in African Languages

Authors: Brian DeRenzi, Anna Dixon, Mohamed Aymane Farhi, Christian Resch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.17578
Pdf URL: https://arxiv.org/pdf/2507.17578
Copy Paste: [[2507.17578]] Synthetic Voice Data for Automatic Speech Recognition in African Languages(https://arxiv.org/abs/2507.17578)
Keywords: llm
Abstract: Speech technology remains out of reach for most of the over 2300 languages in Africa. We present the first systematic assessment of large-scale synthetic voice corpora for African ASR. We apply a three-step process: LLM-driven text creation, TTS voice synthesis, and ASR fine-tuning. Eight out of ten languages for which we create synthetic text achieved readability scores above 5 out of 7. We evaluated ASR improvement for three (Hausa, Dholuo, Chichewa) and created more than 2,500 hours of synthetic voice data at below 1% of the cost of real data. Fine-tuned Wav2Vec-BERT-2.0 models trained on 250h real and 250h synthetic Hausa matched a 500h real-data-only baseline, while 579h real and 450h to 993h synthetic data created the best performance. We also present gender-disaggregated ASR performance evaluation. For very low-resource languages, gains varied: Chichewa WER improved about 6.5% relative with a 1:2 real-to-synthetic ratio; a 1:1 ratio for Dholuo showed similar improvements on some evaluation data, but not on others. Investigating intercoder reliability, ASR errors and evaluation datasets revealed the need for more robust reviewer protocols and more accurate evaluation data. All data and models are publicly released to invite further work to improve synthetic data for African languages.

Title: A Hybrid Early-Exit Algorithm for Large Language Models Based on Space Alignment Decoding (SPADE)

Authors: Bowen Zheng, Ming Ma, Zhongqiao Lin, Tianming Yang
Subjects: cs.CL, cs.PF
Abstract URL: https://arxiv.org/abs/2507.17618
Pdf URL: https://arxiv.org/pdf/2507.17618
Copy Paste: [[2507.17618]] A Hybrid Early-Exit Algorithm for Large Language Models Based on Space Alignment Decoding (SPADE)(https://arxiv.org/abs/2507.17618)
Keywords: language model
Abstract: Large language models are computationally expensive due to their deep structures. Prior research has shown that intermediate layers contain sufficient information to generate accurate answers, leading to the development of early-exit algorithms that reduce inference costs by terminating computation at earlier layers. However, these methods often suffer from poor performance because misalignment between intermediate and output layer representations leads to inaccurate decoding. To address these challenges, we propose SPADE (SPace Alignment DEcoding), a novel decoding method that aligns intermediate layer representations with the output layer by propagating a minimally reduced sequence consisting of only the start token and the answer token. We further optimize the early-exit decision-making process by training a linear approximation of SPADE that computes entropy-based confidence metrics. Putting them together, we create a hybrid early-exit algorithm that monitors confidence levels and stops inference at intermediate layers while using SPADE to generate high-quality outputs. This approach significantly reduces inference costs without compromising accuracy, offering a scalable and efficient solution for deploying large language models in real-world applications.
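
The idea of an entropy-based confidence check for early exit can be illustrated with a generic sketch. It is not SPADE itself: the abstract does not give the alignment procedure or the trained linear approximation, so the code below simply assumes that per-layer logits over the vocabulary are already available and exits at the first layer whose softmax entropy falls below a threshold. The names `first_confident_layer` and `threshold` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def first_confident_layer(layer_logits, threshold):
    """Index of the first layer whose prediction entropy is below `threshold`.

    `layer_logits` holds one logit vector per layer (assumed to be given).
    If no layer is confident enough, fall back to the final layer.
    """
    for index, logits in enumerate(layer_logits):
        if entropy(softmax(logits)) < threshold:
            return index
    return len(layer_logits) - 1
```

A lower threshold trades speed for safety: fewer tokens exit early, but those that do come from sharper distributions.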

Title: WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training

Authors: Changxin Tian, Jiapeng Wang, Qian Zhao, Kunlong Chen, Jia Liu, Ziqi Liu, Jiaxin Mao, Wayne Xin Zhao, Zhiqiang Zhang, Jun Zhou
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.17634
Pdf URL: https://arxiv.org/pdf/2507.17634
Copy Paste: [[2507.17634]] WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training(https://arxiv.org/abs/2507.17634)
Keywords: llm
Abstract: Recent advances in learning rate (LR) scheduling have demonstrated the effectiveness of decay-free approaches that eliminate the traditional decay phase while maintaining competitive performance. Model merging techniques have emerged as particularly promising solutions in this domain. We present Warmup-Stable and Merge (WSM), a general framework that establishes a formal connection between learning rate decay and model merging. WSM provides a unified theoretical foundation for emulating various decay strategies (including cosine decay, linear decay and inverse square root decay) as principled model averaging schemes, while remaining fully compatible with diverse optimization methods. Through extensive experiments, we identify merge duration (the training window for checkpoint aggregation) as the most critical factor influencing model performance, surpassing the importance of both checkpoint interval and merge quantity. Our framework consistently outperforms the widely adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks, achieving significant improvements of +3.5% on MATH, +2.9% on HumanEval, and +5.5% on MMLU-Pro. The performance advantages extend to supervised fine-tuning scenarios, highlighting WSM's potential for long-term model refinement.
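
Checkpoint merging as a form of model averaging can be sketched in a few lines. This is a generic weighted average under stated assumptions, not the WSM procedure: the abstract does not give the weights that emulate each decay schedule, so `weights` is left to the caller, and checkpoints are represented as plain dictionaries of parameter lists rather than real framework tensors. The name `merge_checkpoints` is hypothetical.

```python
def merge_checkpoints(checkpoints, weights=None):
    """Weighted average of checkpoints saved during the constant-LR phase.

    `checkpoints` is a list of dicts mapping parameter names to lists of floats
    (a stand-in for real tensors). `weights` defaults to a uniform average;
    other weightings are the caller's assumption.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")

    total = sum(weights)
    normalized = [w / total for w in weights]  # weights are rescaled to sum to 1

    merged = {}
    for name, reference in checkpoints[0].items():
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(normalized, checkpoints))
            for i in range(len(reference))
        ]
    return merged
```

The number of checkpoints passed in, and how far apart they were saved, correspond to the merge quantity and checkpoint interval mentioned in the abstract; the span of training they cover corresponds to the merge duration.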

Title: Who Attacks, and Why? Using LLMs to Identify Negative Campaigning in 18M Tweets across 19 Countries

Authors: Victor Hartman, Petter Törnberg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.17636
Pdf URL: https://arxiv.org/pdf/2507.17636
Copy Paste: [[2507.17636]] Who Attacks, and Why? Using LLMs to Identify Negative Campaigning in 18M Tweets across 19 Countries(https://arxiv.org/abs/2507.17636)
Keywords: language model, llm
Abstract: Negative campaigning is a central feature of political competition, yet empirical research has been limited by the high cost and limited scalability of existing classification methods. This study makes two key contributions. First, it introduces zero-shot Large Language Models (LLMs) as a novel approach for cross-lingual classification of negative campaigning. Using benchmark datasets in ten languages, we demonstrate that LLMs achieve performance on par with native-speaking human coders and outperform conventional supervised machine learning approaches. Second, we leverage this novel method to conduct the largest cross-national study of negative campaigning to date, analyzing 18 million tweets posted by parliamentarians in 19 European countries between 2017 and 2022. The results reveal consistent cross-national patterns: governing parties are less likely to use negative messaging, while ideologically extreme and populist parties -- particularly those on the radical right -- engage in significantly higher levels of negativity. These findings advance our understanding of how party-level characteristics shape strategic communication in multiparty systems. More broadly, the study demonstrates the potential of LLMs to enable scalable, transparent, and replicable research in political communication across linguistic and cultural contexts.

Title: Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models

Authors: Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.17702
Pdf URL: https://arxiv.org/pdf/2507.17702
Copy Paste: [[2507.17702]] Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models(https://arxiv.org/abs/2507.17702)
Keywords: language model, llm
Abstract: Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters from computational cost. However, this decoupling creates a critical challenge: predicting the model capacity of a given MoE configuration (e.g., expert activation ratio and granularity) remains an unresolved problem. To address this gap, we introduce Efficiency Leverage (EL), a metric quantifying the computational advantage of an MoE model over a dense equivalent. We conduct a large-scale empirical study, training over 300 models up to 28B parameters, to systematically investigate the relationship between MoE architectural configurations and EL. Our findings reveal that EL is primarily driven by the expert activation ratio and the total compute budget, both following predictable power laws, while expert granularity acts as a non-linear modulator with a clear optimal range. We integrate these discoveries into a unified scaling law that accurately predicts the EL of an MoE architecture based on its configuration. To validate our derived scaling laws, we designed and trained Ling-mini-beta, a pilot model for the Ling-2.0 series with only 0.85B active parameters, alongside a 6.1B dense model for comparison. When trained on an identical 1T high-quality token dataset, Ling-mini-beta matched the performance of the 6.1B dense model while consuming over 7x fewer computational resources, thereby confirming the accuracy of our scaling laws. This work provides a principled and empirically grounded foundation for the scaling of efficient MoE models.

Title: From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes

Authors: Karen Zhou, John Giorgi, Pranav Mani, Peng Xu, Davis Liang, Chenhao Tan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.17717
Pdf URL: https://arxiv.org/pdf/2507.17717
Copy Paste: [[2507.17717]] From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes(https://arxiv.org/abs/2507.17717)
Keywords: llm
Abstract: AI-generated clinical notes are increasingly used in healthcare, but evaluating their quality remains a challenge due to high subjectivity and limited scalability of expert review. Existing automated metrics often fail to align with real-world physician preferences. To address this, we propose a pipeline that systematically distills real user feedback into structured checklists for note evaluation. These checklists are designed to be interpretable, grounded in human feedback, and enforceable by LLM-based evaluators. Using deidentified data from over 21,000 clinical encounters, prepared in accordance with the HIPAA safe harbor standard, from a deployed AI medical scribe system, we show that our feedback-derived checklist outperforms baseline approaches in our offline evaluations in coverage, diversity, and predictive power for human ratings. Extensive experiments confirm the checklist's robustness to quality-degrading perturbations, significant alignment with clinician preferences, and practical value as an evaluation methodology. In offline research settings, the checklist can help identify notes likely to fall below our chosen quality thresholds.

Title: AI Telephone Surveying: Automating Quantitative Data Collection with an AI Interviewer

Authors: Danny D. Leybzon, Shreyas Tirumala, Nishant Jain, Summer Gillen, Michael Jackson, Cameron McPhee, Jennifer Schmidt
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2507.17718
Pdf URL: https://arxiv.org/pdf/2507.17718
Copy Paste: [[2507.17718]] AI Telephone Surveying: Automating Quantitative Data Collection with an AI Interviewer(https://arxiv.org/abs/2507.17718)
Keywords: language model, llm
Abstract: With the rise of voice-enabled artificial intelligence (AI) systems, quantitative survey researchers have access to a new data-collection mode: AI telephone surveying. By using AI to conduct phone interviews, researchers can scale quantitative studies while balancing the dual goals of human-like interactivity and methodological rigor. Unlike earlier efforts that used interactive voice response (IVR) technology to automate these surveys, voice AI enables a more natural and adaptive respondent experience as it is more robust to interruptions, corrections, and other idiosyncrasies of human speech. We built and tested an AI system to conduct quantitative surveys based on large language models (LLMs), automatic speech recognition (ASR), and speech synthesis technologies. The system was specifically designed for quantitative research, and strictly adhered to research best practices like question order randomization, answer order randomization, and exact wording. To validate the system's effectiveness, we deployed it to conduct two pilot surveys with the SSRS Opinion Panel and followed up with a separate human-administered survey to assess respondent experiences. We measured three key metrics: the survey completion rates, break-off rates, and respondent satisfaction scores. Our results suggest that shorter instruments and more responsive AI interviewers may contribute to improvements across all three metrics studied.

Title: Megrez2 Technical Report

Authors: Boxun Li, Yadong Li, Zhiyuan Li, Congyi Liu, Weilin Liu, Guowei Niu, Zheyue Tan, Haiyang Xu, Zhuyu Yao, Tao Yuan, Dong Zhou, Yueqing Zhuang, Bo Zhao, Guohao Dai, Yu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.17728
Pdf URL: https://arxiv.org/pdf/2507.17728
Copy Paste: [[2507.17728]] Megrez2 Technical Report(https://arxiv.org/abs/2507.17728)
Keywords: language model
Abstract: We present Megrez2, a novel lightweight and high-performance language model architecture optimized for device native deployment. Megrez2 introduces a novel cross-layer expert sharing mechanism, which significantly reduces total parameter count by reusing expert modules across adjacent transformer layers while maintaining most of the model's capacity. It also incorporates pre-gated routing, enabling memory-efficient expert loading and faster inference. As the first instantiation of the Megrez2 architecture, we introduce the Megrez2-Preview model, which is pre-trained on a 5-trillion-token corpus and further enhanced through supervised fine-tuning and reinforcement learning with verifiable rewards. With only 3B activated and 7.5B stored parameters, Megrez2-Preview demonstrates competitive or superior performance compared to larger models on a wide range of tasks, including language understanding, instruction following, mathematical reasoning, and code generation. These results highlight the effectiveness of the Megrez2 architecture to achieve a balance between accuracy, efficiency, and deployability, making it a strong candidate for real-world, resource-constrained applications.

Title: Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks

Authors: Linbo Cao, Jinman Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.17747
Pdf URL: https://arxiv.org/pdf/2507.17747
Copy Paste: [[2507.17747]] Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks(https://arxiv.org/abs/2507.17747)
Keywords: language model
Abstract: As frontier language models increasingly saturate standard QA benchmarks, concerns about data contamination, memorization, and escalating dataset creation costs persist. We propose a debate-driven evaluation paradigm that transforms any existing QA dataset into structured adversarial debates--where one model is given the official answer to defend, and another constructs and defends an alternative answer--adjudicated by a judge model blind to the correct solution. By forcing multi-round argumentation, this approach substantially increases difficulty while penalizing shallow memorization, yet reuses QA items to reduce curation overhead. We make two main contributions: (1) an evaluation pipeline to systematically convert QA tasks into debate-based assessments, and (2) a public benchmark that demonstrates our paradigm's effectiveness on a subset of MMLU-Pro questions, complete with standardized protocols and reference models. Empirical results validate the robustness of the method and its effectiveness against data contamination--a Llama 3.1 model fine-tuned on test questions showed dramatic accuracy improvements (50% -> 82%) but performed worse in debates. Results also show that even weaker judges can reliably differentiate stronger debaters, highlighting how debate-based evaluation can scale to future, more capable systems while maintaining a fraction of the cost of creating new benchmarks. Overall, our framework underscores that "pretraining on the test set is no longer all you need," offering a sustainable path for measuring the genuine reasoning ability of advanced language models.