2025-07-29

Title: Setting The Table with Intent: Intent-aware Schema Generation and Editing for Literature Review Tables

Authors: Vishakh Padmakumar, Joseph Chee Chang, Kyle Lo, Doug Downey, Aakanksha Naik
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.19521
Pdf URL: https://arxiv.org/pdf/2507.19521
Copy Paste: [[2507.19521]] Setting The Table with Intent: Intent-aware Schema Generation and Editing for Literature Review Tables(https://arxiv.org/abs/2507.19521)
Keywords: language model, llm, prompt
Abstract: The increasing volume of academic literature makes it essential for researchers to organize, compare, and contrast collections of documents. Large language models (LLMs) can support this process by generating schemas defining shared aspects along which to compare papers. However, progress on schema generation has been slow due to: (i) ambiguity in reference-based evaluations, and (ii) lack of editing/refinement methods. Our work is the first to address both issues. First, we present an approach for augmenting unannotated table corpora with synthesized intents and apply it to create a dataset for studying schema generation conditioned on a given information need, thus reducing ambiguity. With this dataset, we show how incorporating table intents significantly improves baseline performance in reconstructing reference schemas. Next, we propose several LLM-based schema editing techniques. We start by comprehensively benchmarking several single-shot schema generation methods, including prompted LLM workflows and fine-tuned models, showing that smaller, open-weight models can be fine-tuned to be competitive with state-of-the-art prompted LLMs. Then we demonstrate that our editing techniques can further improve schemas generated by these methods.
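Summary (translated from Chinese): The growing volume of academic literature makes it essential for researchers to organize, compare, and contrast collections of documents. Large language models (LLMs) can support this process by generating schemas that define the shared aspects along which papers are compared. However, progress on schema generation has been slow because of (i) ambiguity in reference-based evaluation and (ii) a lack of editing and refinement methods. Our work is the first to address both problems. First, we present an approach for augmenting unannotated table corpora with synthesized intents and use it to create a dataset for studying schema generation conditioned on a given information need, which reduces ambiguity. With this dataset, we show that incorporating table intents significantly improves baseline performance in reconstructing reference schemas. Next, we propose several LLM-based schema editing techniques. We begin by comprehensively benchmarking several single-shot schema generation methods, including prompted LLM workflows and fine-tuned models, and show that smaller open-weight models can be fine-tuned to be competitive with state-of-the-art prompted LLMs. We then show that our editing techniques can further improve the schemas these methods generate.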

Title: Mind the Language Gap in Digital Humanities: LLM-Aided Translation of SKOS Thesauri

Authors: Felix Kraus, Nicolas Blumenröhr, Danah Tonne, Achim Streit
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.19537
Pdf URL: https://arxiv.org/pdf/2507.19537
Copy Paste: [[2507.19537]] Mind the Language Gap in Digital Humanities: LLM-Aided Translation of SKOS Thesauri(https://arxiv.org/abs/2507.19537)
Keywords: language model, llm
Abstract: We introduce WOKIE, an open-source, modular, and ready-to-use pipeline for the automated translation of SKOS thesauri. This work addresses a critical need in the Digital Humanities (DH), where language diversity can limit access, reuse, and semantic interoperability of knowledge resources. WOKIE combines external translation services with targeted refinement using Large Language Models (LLMs), balancing translation quality, scalability, and cost. Designed to run on everyday hardware and be easily extended, the application requires no prior expertise in machine translation or LLMs. We evaluate WOKIE across several DH thesauri in 15 languages with different parameters, translation services and LLMs, systematically analysing translation quality, performance, and ontology matching improvements. Our results show that WOKIE is suitable to enhance the accessibility, reuse, and cross-lingual interoperability of thesauri by hurdle-free automated translation and improved ontology matching performance, supporting more inclusive and multilingual research infrastructures.
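Summary (translated from Chinese): We introduce WOKIE, an open-source, modular, ready-to-use pipeline for the automated translation of SKOS thesauri. The work addresses a critical need in the Digital Humanities (DH), where language diversity can limit access to, reuse of, and semantic interoperability of knowledge resources. WOKIE combines external translation services with targeted refinement by large language models (LLMs), balancing translation quality, scalability, and cost. Designed to run on everyday hardware and to be easily extended, the application requires no prior expertise in machine translation or LLMs. We evaluate WOKIE on several DH thesauri in 15 languages with different parameters, translation services, and LLMs, systematically analysing translation quality, performance, and improvements in ontology matching. Our results show that WOKIE is suited to improving the accessibility, reuse, and cross-lingual interoperability of thesauri through barrier-free automated translation and better ontology matching performance, supporting more inclusive and multilingual research infrastructures.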

Title: Mitigating Geospatial Knowledge Hallucination in Large Language Models: Benchmarking and Dynamic Factuality Aligning

Authors: Shengyuan Wang, Jie Feng, Tianhui Liu, Dan Pei, Yong Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.19586
Pdf URL: https://arxiv.org/pdf/2507.19586
Copy Paste: [[2507.19586]] Mitigating Geospatial Knowledge Hallucination in Large Language Models: Benchmarking and Dynamic Factuality Aligning(https://arxiv.org/abs/2507.19586)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) possess extensive world knowledge, including geospatial knowledge, which has been successfully applied to various geospatial tasks such as mobility prediction and social indicator prediction. However, LLMs often generate inaccurate geospatial knowledge, leading to geospatial hallucinations (incorrect or inconsistent representations of geospatial information) that compromise their reliability. While the phenomenon of general knowledge hallucination in LLMs has been widely studied, the systematic evaluation and mitigation of geospatial hallucinations remain largely unexplored. To address this gap, we propose a comprehensive evaluation framework for geospatial hallucinations, leveraging structured geospatial knowledge graphs for controlled assessment. Through extensive evaluation across 20 advanced LLMs, we uncover the hallucinations in their geospatial knowledge. Building on these insights, we introduce a dynamic factuality aligning method based on Kahneman-Tversky Optimization (KTO) to mitigate geospatial hallucinations in LLMs, leading to a performance improvement of over 29.6% on the proposed benchmark. Extensive experimental results demonstrate the effectiveness of our benchmark and learning algorithm in enhancing the trustworthiness of LLMs in geospatial knowledge and reasoning tasks.
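Summary (translated from Chinese): Large language models (LLMs) possess extensive world knowledge, including geospatial knowledge, which has been applied successfully to geospatial tasks such as mobility prediction and social indicator prediction. However, LLMs often produce inaccurate geospatial knowledge, leading to geospatial hallucinations (incorrect or inconsistent representations of geospatial information) that undermine their reliability. Although general knowledge hallucination in LLMs has been widely studied, the systematic evaluation and mitigation of geospatial hallucinations remain largely unexplored. To close this gap, we propose a comprehensive evaluation framework for geospatial hallucinations that uses structured geospatial knowledge graphs for controlled assessment. Through an extensive evaluation of 20 advanced LLMs, we uncover the hallucinations in their geospatial knowledge. Building on these insights, we introduce a dynamic factuality aligning method based on Kahneman-Tversky Optimization (KTO) to mitigate geospatial hallucinations in LLMs, yielding a performance improvement of over 29.6% on the proposed benchmark. Extensive experimental results demonstrate the effectiveness of our benchmark and learning algorithm in improving the trustworthiness of LLMs on geospatial knowledge and reasoning tasks.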

Title: Efficient Attention Mechanisms for Large Language Models: A Survey

Authors: Yutao Sun, Zhenyu Li, Yike Zhang, Tengyu Pan, Bowen Dong, Yuyi Guo, Jianyong Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.19595
Pdf URL: https://arxiv.org/pdf/2507.19595
Copy Paste: [[2507.19595]] Efficient Attention Mechanisms for Large Language Models: A Survey(https://arxiv.org/abs/2507.19595)
Keywords: language model
Abstract: Transformer-based architectures have become the prevailing backbone of large language models. However, the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling. To address this limitation, recent research has introduced two principal categories of efficient attention mechanisms. Linear attention methods achieve linear complexity through kernel approximations, recurrent formulations, or fast-weight dynamics, thereby enabling scalable inference with reduced computational overhead. Sparse attention techniques, in contrast, limit attention computation to selected subsets of tokens based on fixed patterns, block-wise routing, or clustering strategies, enhancing efficiency while preserving contextual coverage. This survey provides a systematic and comprehensive overview of these developments, integrating both algorithmic innovations and hardware-level considerations. In addition, we analyze the incorporation of efficient attention into large-scale pre-trained language models, including both architectures built entirely on efficient attention and hybrid designs that combine local and global components. By aligning theoretical foundations with practical deployment strategies, this work aims to serve as a foundational reference for advancing the design of scalable and efficient language models.
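Summary (translated from Chinese): Transformer-based architectures have become the dominant backbone of large language models. However, the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling. To address this limitation, recent research has introduced two main categories of efficient attention mechanisms. Linear attention methods achieve linear complexity through kernel approximations, recurrent formulations, or fast-weight dynamics, enabling scalable inference with reduced computational overhead. Sparse attention techniques, by contrast, restrict attention computation to selected subsets of tokens based on fixed patterns, block-wise routing, or clustering strategies, improving efficiency while preserving contextual coverage. This survey gives a systematic and comprehensive overview of these developments, covering both algorithmic innovations and hardware-level considerations. We also analyze how efficient attention is incorporated into large-scale pre-trained language models, including architectures built entirely on efficient attention and hybrid designs that combine local and global components. By aligning theoretical foundations with practical deployment strategies, this work aims to serve as a foundational reference for advancing the design of scalable and efficient language models.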
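The linear-complexity claim for kernel-based linear attention can be illustrated with a minimal pure-Python sketch. It is not taken from the survey: the feature map `phi` (elu(x) + 1), the non-causal setting, and all function names are assumptions chosen for illustration. The sketch replaces the softmax similarity with a kernel, so it approximates rather than reproduces softmax attention; what it shows is that the same kernelized output can be computed either by visiting every query-key pair or from a running summary of the keys and values.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1 (an illustrative choice, not the survey's)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quadratic_kernel_attention(Q, K, V):
    """Reference form: every query visits every key, O(n^2) in sequence length."""
    out = []
    for q in Q:
        weights = [dot(phi(q), phi(k)) for k in K]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, V)) / total
                    for d in range(len(V[0]))])
    return out


def linear_attention(Q, K, V):
    """Same result from a fixed-size summary of keys and values, O(n)."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k_j) v_j^T
    z = [0.0] * d_k                        # running sum of phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d_k):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = dot(fq, z)
        out.append([sum(fq[a] * S[a][b] for a in range(d_k)) / denom
                    for b in range(d_v)])
    return out


if __name__ == "__main__":
    Q = [[0.1, -0.2], [1.0, 0.5]]
    K = [[0.3, 0.4], [-0.5, 0.2], [0.0, 1.0]]
    V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    print(quadratic_kernel_attention(Q, K, V))
    print(linear_attention(Q, K, V))
```

The state `S` and normalizer `z` have sizes that do not depend on the number of tokens, which is the property that recurrent formulations build on.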

Title: MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?

Authors: Muntasir Wahed, Xiaona Zhou, Kiet A. Nguyen, Tianjiao Yu, Nirav Diwan, Gang Wang, Dilek Hakkani-Tür, Ismini Lourentzou
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2507.19598
Pdf URL: https://arxiv.org/pdf/2507.19598
Copy Paste: [[2507.19598]] MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?(https://arxiv.org/abs/2507.19598)
Keywords: language model, llm, prompt
Abstract: Recent advancements in Large Language Models (LLMs) have significantly enhanced their code generation capabilities. However, their robustness against adversarial misuse, particularly through multi-turn malicious coding prompts, remains underexplored. In this work, we introduce code decomposition attacks, where a malicious coding task is broken down into a series of seemingly benign subtasks across multiple conversational turns to evade safety filters. To facilitate systematic evaluation, we introduce MOCHA, a large-scale benchmark designed to evaluate the robustness of code LLMs against both single-turn and multi-turn malicious prompts. Empirical results across open- and closed-source models reveal persistent vulnerabilities, especially under multi-turn scenarios. Fine-tuning on MOCHA improves rejection rates while preserving coding ability, and importantly, enhances robustness on external adversarial datasets with up to 32.4% increase in rejection rates without any additional supervision.
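Summary (translated from Chinese): Recent advances in large language models (LLMs) have significantly improved their code generation capabilities. However, their robustness against adversarial misuse, particularly through multi-turn malicious coding prompts, remains underexplored. In this work we introduce code decomposition attacks, in which a malicious coding task is split into a series of seemingly benign subtasks across multiple conversational turns in order to evade safety filters. To support systematic evaluation, we introduce MOCHA, a large-scale benchmark designed to evaluate the robustness of code LLMs against both single-turn and multi-turn malicious prompts. Empirical results across open- and closed-source models reveal persistent vulnerabilities, especially in multi-turn scenarios. Fine-tuning on MOCHA improves rejection rates while preserving coding ability and, importantly, improves robustness on external adversarial datasets, raising rejection rates by up to 32.4% without any additional supervision.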

Title: HITSZ's End-To-End Speech Translation Systems Combining Sequence-to-Sequence Auto Speech Recognition Model and Indic Large Language Model for IWSLT 2025 in Indic Track

Authors: Xuchen Wei, Yangxin Wu, Yaoyin Zhang, Henglyu Liu, Kehai Chen, Xuefeng Bai, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.19616
Pdf URL: https://arxiv.org/pdf/2507.19616
Copy Paste: [[2507.19616]] HITSZ's End-To-End Speech Translation Systems Combining Sequence-to-Sequence Auto Speech Recognition Model and Indic Large Language Model for IWSLT 2025 in Indic Track(https://arxiv.org/abs/2507.19616)
Keywords: language model, llm, chain-of-thought
Abstract: This paper presents HITSZ's submission for the IWSLT 2025 Indic track, focusing on speech-to-text translation (ST) for English-to-Indic and Indic-to-English language pairs. To enhance translation quality in this low-resource scenario, we propose an end-to-end system integrating the pre-trained Whisper automated speech recognition (ASR) model with Krutrim, an Indic-specialized large language model (LLM). Experimental results demonstrate that our end-to-end system achieved average BLEU scores of 28.88 for English-to-Indic directions and 27.86 for Indic-to-English directions. Furthermore, we investigated the Chain-of-Thought (CoT) method. While this method showed potential for significant translation quality improvements on successfully parsed outputs (e.g., a 13.84 BLEU increase for Tamil-to-English), we observed challenges in ensuring the model consistently adheres to the required CoT output format.
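Summary (translated from Chinese): This paper presents HITSZ's submission to the IWSLT 2025 Indic track, focusing on speech-to-text translation (ST) for English-to-Indic and Indic-to-English language pairs. To improve translation quality in this low-resource scenario, we propose an end-to-end system that integrates the pre-trained Whisper automated speech recognition (ASR) model with Krutrim, an Indic-specialized large language model (LLM). Experimental results show that the end-to-end system achieved average BLEU scores of 28.88 for English-to-Indic directions and 27.86 for Indic-to-English directions. We also investigated the Chain-of-Thought (CoT) method. Although it showed potential for significant gains in translation quality on successfully parsed outputs (for example, a 13.84 BLEU increase for Tamil-to-English), we observed difficulties in getting the model to adhere consistently to the required CoT output format.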

Title: MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

Authors: Sara Papi, Maike Züfle, Marco Gaido, Beatrice Savoldi, Danni Liu, Ioannis Douros, Luisa Bentivogli, Jan Niehues
Subjects: cs.CL, cs.AI, cs.CV, cs.SD
Abstract URL: https://arxiv.org/abs/2507.19634
Pdf URL: https://arxiv.org/pdf/2507.19634
Copy Paste: [[2507.19634]] MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks(https://arxiv.org/abs/2507.19634)
Keywords: language model, llm
Abstract: Recent advances in large language models have catalyzed the development of multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems to general-purpose instruction-following models, a key frontier lies in evaluating their multilingual and multimodal capabilities over both long and short contexts. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on a single modality at a time, rely on short-form contexts, or lack human annotations--hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first multilingual human-annotated benchmark based on scientific talks that is designed to evaluate instruction-following in crosslingual, multimodal settings over both short- and long-form inputs. MCIF spans three core modalities--speech, vision, and text--and four diverse languages (English, German, Italian, and Chinese), enabling a comprehensive evaluation of MLLMs' abilities to interpret instructions across languages and combine them with multimodal contextual information. MCIF is released under a CC-BY 4.0 license to encourage open research and progress in MLLM development.
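Summary (translated from Chinese): Recent advances in large language models have driven the development of multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems into general-purpose instruction-following models, a key frontier is evaluating their multilingual and multimodal capabilities over both long and short contexts. Existing benchmarks, however, fall short of evaluating these dimensions jointly: they are often limited to English, mostly focus on a single modality at a time, rely on short-form contexts, or lack human annotations, which hinders comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first multilingual human-annotated benchmark based on scientific talks, designed to evaluate instruction following in crosslingual, multimodal settings over both short- and long-form inputs. MCIF spans three core modalities (speech, vision, and text) and four languages (English, German, Italian, and Chinese), enabling a comprehensive evaluation of MLLMs' ability to interpret instructions across languages and combine them with multimodal contextual information. MCIF is released under a CC-BY 4.0 license to encourage open research and progress in MLLM development.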

Title: RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams

Authors: Andrei Vlad Man, Răzvan-Alexandru Smădu, Cristian-George Craciun, Dumitru-Clementin Cercel, Florin Pop, Mihaela-Claudia Cercel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.19666
Pdf URL: https://arxiv.org/pdf/2507.19666
Copy Paste: [[2507.19666]] RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams(https://arxiv.org/abs/2507.19666)
Keywords: language model, llm, prompt, retrieval-augmented generation, chain-of-thought
Abstract: The intersection of AI and legal systems presents a growing need for tools that support legal education, particularly in under-resourced languages such as Romanian. In this work, we aim to evaluate the capabilities of Large Language Models (LLMs) and Vision-Language Models (VLMs) in understanding and reasoning about Romanian driving law through textual and visual question-answering tasks. To facilitate this, we introduce RoD-TAL, a novel multimodal dataset comprising Romanian driving test questions, text-based and image-based, alongside annotated legal references and human explanations. We implement and assess retrieval-augmented generation (RAG) pipelines, dense retrievers, and reasoning-optimized models across tasks including Information Retrieval (IR), Question Answering (QA), Visual IR, and Visual QA. Our experiments demonstrate that domain-specific fine-tuning significantly enhances retrieval performance. At the same time, chain-of-thought prompting and specialized reasoning models improve QA accuracy, surpassing the minimum grades required to pass driving exams. However, visual reasoning remains challenging, highlighting the potential and the limitations of applying LLMs and VLMs to legal education.
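Summary (translated from Chinese): The intersection of AI and legal systems presents a growing need for tools that support legal education, particularly in under-resourced languages such as Romanian. In this work, we aim to evaluate the ability of Large Language Models (LLMs) and Vision-Language Models (VLMs) to understand and reason about Romanian driving law through textual and visual question-answering tasks. To this end, we introduce RoD-TAL, a novel multimodal dataset of Romanian driving test questions, both text-based and image-based, together with annotated legal references and human explanations. We implement and assess retrieval-augmented generation (RAG) pipelines, dense retrievers, and reasoning-optimized models across tasks including Information Retrieval (IR), Question Answering (QA), Visual IR, and Visual QA. Our experiments show that domain-specific fine-tuning significantly improves retrieval performance. At the same time, chain-of-thought prompting and specialized reasoning models improve QA accuracy, surpassing the minimum grades required to pass driving exams. Visual reasoning nevertheless remains challenging, highlighting both the potential and the limitations of applying LLMs and VLMs to legal education.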

Title: Towards Inclusive NLP: Assessing Compressed Multilingual Transformers across Diverse Language Benchmarks

Authors: Maitha Alshehhi, Ahmed Sharshar, Mohsen Guizani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.19699
Pdf URL: https://arxiv.org/pdf/2507.19699
Copy Paste: [[2507.19699]] Towards Inclusive NLP: Assessing Compressed Multilingual Transformers across Diverse Language Benchmarks(https://arxiv.org/abs/2507.19699)
Keywords: language model, gpt, llm, hallucination
Abstract: Although LLMs have attained significant success in high-resource languages, their capacity in low-resource linguistic environments like Kannada and Arabic is not yet fully understood. This work benchmarks the performance of multilingual and monolingual Large Language Models (LLMs) across Arabic, English, and Indic languages, with particular emphasis on the effects of model compression strategies such as pruning and quantization. Findings show significant performance differences driven by linguistic diversity and resource availability on SOTA LLMs such as BLOOMZ, AceGPT, Jais, LLaMA-2, XGLM, and AraGPT2. We find that multilingual versions of the model outperform their language-specific counterparts across the board, indicating substantial cross-lingual transfer benefits. Quantization (4-bit and 8-bit) is effective in maintaining model accuracy while promoting efficiency, but aggressive pruning significantly compromises performance, especially in bigger models. Our findings pinpoint key strategies to construct scalable and fair multilingual NLP solutions and underscore the need for interventions to address hallucination and generalization errors in the low-resource setting.
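Summary (translated from Chinese): Although LLMs have achieved significant success in high-resource languages, their capacity in low-resource linguistic environments such as Kannada and Arabic is not yet fully understood. This work benchmarks the performance of multilingual and monolingual Large Language Models (LLMs) across Arabic, English, and Indic languages, with particular emphasis on the effects of model compression strategies such as pruning and quantization. The findings show significant performance differences, driven by linguistic diversity and resource availability, on SOTA LLMs such as BLOOMZ, AceGPT, Jais, LLaMA-2, XGLM, and AraGPT2. We find that multilingual versions of the model outperform their language-specific counterparts across the board, indicating substantial cross-lingual transfer benefits. Quantization (4-bit and 8-bit) is effective at maintaining model accuracy while improving efficiency, but aggressive pruning significantly harms performance, especially in larger models. Our findings identify key strategies for building scalable and fair multilingual NLP solutions and underscore the need for interventions that address hallucination and generalization errors in the low-resource setting.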
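To make the 4-bit versus 8-bit comparison concrete, here is a minimal sketch of symmetric round-to-nearest weight quantization. The abstract does not say which quantization scheme the paper uses, so the scheme, the function names, and the sample weights below are all illustrative assumptions.

```python
def quantize_dequantize(weights, bits):
    """Map weights to signed integers of the given width and back to floats.

    Symmetric per-tensor round-to-nearest quantization (an assumed scheme).
    """
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        return list(weights)              # all-zero tensor: nothing to quantize
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.8, -0.33]   # hypothetical weight values
    for bits in (8, 4):
        restored = quantize_dequantize(weights, bits)
        print(bits, "bits: mean abs error =", mean_abs_error(weights, restored))
```

Fewer bits mean a coarser grid of representable values, so the reconstruction error grows; whether that error affects task accuracy is the empirical question the paper studies.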

Title: Ta-G-T: Subjectivity Capture in Table to Text Generation via RDF Graphs

Authors: Ronak Upasham, Tathagata Dey, Pushpak Bhattacharyya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.19710
Pdf URL: https://arxiv.org/pdf/2507.19710
Copy Paste: [[2507.19710]] Ta-G-T: Subjectivity Capture in Table to Text Generation via RDF Graphs(https://arxiv.org/abs/2507.19710)
Keywords: language model, gpt, llm
Abstract: In Table-to-Text (T2T) generation, existing approaches predominantly focus on providing objective descriptions of tabular data. However, generating text that incorporates subjectivity, where subjectivity refers to interpretations beyond raw numerical data, remains underexplored. To address this, we introduce a novel pipeline that leverages intermediate representations to generate both objective and subjective text from tables. Our three-stage pipeline consists of: 1) extraction of Resource Description Framework (RDF) triples, 2) aggregation of text into coherent narratives, and 3) infusion of subjectivity to enrich the generated text. By incorporating RDFs, our approach enhances factual accuracy while maintaining interpretability. Unlike large language models (LLMs) such as GPT-3.5, Mistral-7B, and Llama-2, our pipeline employs smaller, fine-tuned T5 models while achieving comparable performance to GPT-3.5 and outperforming Mistral-7B and Llama-2 in several metrics. We evaluate our approach through quantitative and qualitative analyses, demonstrating its effectiveness in balancing factual accuracy with subjective interpretation. To the best of our knowledge, this is the first work to propose a structured pipeline for T2T generation that integrates intermediate representations to enhance both factual correctness and subjectivity.
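Summary (translated from Chinese): In Table-to-Text (T2T) generation, existing approaches focus mainly on providing objective descriptions of tabular data. Generating text that incorporates subjectivity, meaning interpretation that goes beyond the raw numerical data, remains underexplored. To address this, we introduce a novel pipeline that uses intermediate representations to generate both objective and subjective text from tables. Our three-stage pipeline consists of: 1) extraction of Resource Description Framework (RDF) triples, 2) aggregation of text into coherent narratives, and 3) infusion of subjectivity to enrich the generated text. By incorporating RDFs, our approach improves factual accuracy while maintaining interpretability. Unlike large language models (LLMs) such as GPT-3.5, Mistral-7B, and Llama-2, our pipeline uses smaller, fine-tuned T5 models, yet achieves performance comparable to GPT-3.5 and outperforms Mistral-7B and Llama-2 on several metrics. We evaluate the approach through quantitative and qualitative analyses, demonstrating its effectiveness in balancing factual accuracy with subjective interpretation. To the best of our knowledge, this is the first work to propose a structured pipeline for T2T generation that integrates intermediate representations to improve both factual correctness and subjectivity.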

Title: Basic Reading Distillation

Authors: Zhi Zhou, Sirui Miao, Xiangyu Duan, Hao Yang, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.19741
Pdf URL: https://arxiv.org/pdf/2507.19741
Copy Paste: [[2507.19741]] Basic Reading Distillation(https://arxiv.org/abs/2507.19741)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable abilities in various natural language processing areas, but they demand high computation resources, which limits their deployment in the real world. Distillation is one technique to solve this problem through either knowledge distillation or task distillation. Both distillation approaches train small models to imitate specific features of LLMs, but they all neglect basic reading education for small models on generic texts that are unrelated to downstream tasks. In this paper, we propose basic reading distillation (BRD), which educates a small model to imitate LLMs' basic reading behaviors, such as named entity recognition, question raising and answering, on each sentence. After such basic education, we apply the small model on various tasks including language inference benchmarks and BIG-bench tasks. It shows that the small model can outperform or perform comparably to over 20x bigger LLMs. Analysis reveals that BRD effectively influences the probability distribution of the small model, and has orthogonality to either knowledge distillation or task distillation.
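Summary (translated from Chinese): Large language models (LLMs) have shown remarkable abilities across natural language processing, but they demand substantial computational resources, which limits their deployment in the real world. Distillation is one technique for addressing this problem, through either knowledge distillation or task distillation. Both approaches train small models to imitate specific features of LLMs, but both neglect basic reading education for small models on generic texts that are unrelated to downstream tasks. In this paper we propose basic reading distillation (BRD), which teaches a small model to imitate an LLM's basic reading behaviors on each sentence, such as named entity recognition and question raising and answering. After this basic education, we apply the small model to various tasks, including language inference benchmarks and BIG-bench tasks. The results show that the small model can outperform, or perform comparably to, LLMs more than 20x larger. Analysis reveals that BRD effectively influences the probability distribution of the small model and is orthogonal to both knowledge distillation and task distillation.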

Title: JT-Math: A Multi-Stage Framework for Advanced Mathematical Reasoning in Large Language Models

Authors: Yifan Hao, Fangning Chao, Yaqian Hao, Zhaojun Cui, Huan Bai, Haiyu Zhang, Yankai Liu, Chao Deng, Junlan Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.19748
Pdf URL: https://arxiv.org/pdf/2507.19748
Copy Paste: [[2507.19748]] JT-Math: A Multi-Stage Framework for Advanced Mathematical Reasoning in Large Language Models(https://arxiv.org/abs/2507.19748)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Mathematical reasoning is a cornerstone of artificial general intelligence and a primary benchmark for evaluating the capabilities of Large Language Models (LLMs). While state-of-the-art models show promise, they often falter when faced with complex problems that demand deep conceptual understanding and intricate, multi-step deliberation. To address this challenge, we introduce JT-Math-8B, a series of open-source models comprising base, instruct, and thinking versions, built upon a systematic, multi-stage optimization framework. Our pre-training corpus is a high-quality, 210B-token dataset curated through a dedicated data pipeline that uses model-based validation to ensure quality and diversity. The Instruct Model is optimized for direct, concise answers through Supervised Fine-Tuning (SFT) and a GRPO-based reinforcement learning (RL) method. The Thinking Model is trained for complex problem-solving using a Long Chain-of-Thought (Long CoT) approach, combining SFT with a novel, multi-stage RL curriculum that progressively increases task difficulty and context length up to 32K tokens. JT-Math-8B achieves state-of-the-art results among open-source models of similar size, surpassing prominent models like OpenAI's O1-mini and GPT-4o, and demonstrating superior performance on competition-level mathematics.
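Summary (translated from Chinese): Mathematical reasoning is a cornerstone of artificial general intelligence and a primary benchmark for evaluating the capabilities of Large Language Models (LLMs). Although state-of-the-art models show promise, they often falter on complex problems that demand deep conceptual understanding and intricate, multi-step deliberation. To address this challenge, we introduce JT-Math-8B, a series of open-source models comprising base, instruct, and thinking versions, built on a systematic, multi-stage optimization framework. Our pre-training corpus is a high-quality, 210B-token dataset curated through a dedicated data pipeline that uses model-based validation to ensure quality and diversity. The Instruct Model is optimized for direct, concise answers through Supervised Fine-Tuning (SFT) and a GRPO-based reinforcement learning (RL) method. The Thinking Model is trained for complex problem solving with a Long Chain-of-Thought (Long CoT) approach that combines SFT with a novel, multi-stage RL curriculum, progressively increasing task difficulty and extending context length up to 32K tokens. JT-Math-8B achieves state-of-the-art results among open-source models of similar size, surpassing prominent models such as OpenAI's O1-mini and GPT-4o and demonstrating superior performance on competition-level mathematics.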

Title: UloRL:An Ultra-Long Output Reinforcement Learning Approach for Advancing Large Language Models' Reasoning Abilities

Authors: Dong Du, Shulin Liu, Tao Yang, Shaohua Chen, Yang Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.19766
Pdf URL: https://arxiv.org/pdf/2507.19766
Copy Paste: [[2507.19766]] UloRL:An Ultra-Long Output Reinforcement Learning Approach for Advancing Large Language Models' Reasoning Abilities(https://arxiv.org/abs/2507.19766)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) have highlighted the potential of reinforcement learning with verifiable rewards (RLVR) to enhance reasoning capabilities through extended output sequences. However, traditional RL frameworks face inefficiencies when handling ultra-long outputs due to long-tail sequence distributions and entropy collapse during training. To address these challenges, we propose an Ultra-Long Output Reinforcement Learning (UloRL) approach for advancing large language models' reasoning abilities. Specifically, we divide ultra-long output decoding into short segments, enabling efficient training by mitigating delays caused by long-tail samples. Additionally, we introduce dynamic masking of well-Mastered Positive Tokens (MPTs) to prevent entropy collapse. Experimental results demonstrate the effectiveness of our approach. On the Qwen3-30B-A3B model, RL with segment rollout achieved a 2.06x increase in training speed, while RL training with 128k-token outputs improves the model's performance on AIME2025 from 70.9% to 85.1% and on BeyondAIME from 50.7% to 61.9%, even surpassing Qwen3-235B-A22B with remarkable gains. These findings underscore the potential of our methods to advance the reasoning capabilities of LLMs with ultra-long sequence generation. We will release our code and model for further use by the community.
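Summary (translated from Chinese): Recent advances in large language models (LLMs) have highlighted the potential of reinforcement learning with verifiable rewards (RLVR) to strengthen reasoning through extended output sequences. However, traditional RL frameworks are inefficient when handling ultra-long outputs because of long-tail sequence distributions and entropy collapse during training. To address these challenges, we propose an Ultra-Long Output Reinforcement Learning (UloRL) approach for advancing the reasoning abilities of large language models. Specifically, we divide ultra-long output decoding into short segments, which enables efficient training by mitigating the delays caused by long-tail samples. We also introduce dynamic masking of well-Mastered Positive Tokens (MPTs) to prevent entropy collapse. Experimental results demonstrate the effectiveness of the approach. On the Qwen3-30B-A3B model, RL with segment rollout achieved a 2.06x increase in training speed, while RL training with 128k-token outputs improved the model's performance on AIME2025 from 70.9% to 85.1% and on BeyondAIME from 50.7% to 61.9%, even surpassing Qwen3-235B-A22B by a notable margin. These findings underscore the potential of our methods to advance the reasoning capabilities of LLMs through ultra-long sequence generation. We will release our code and model for further use by the community.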

Title: Flora: Effortless Context Construction to Arbitrary Length and Scale

Authors: Tianxiang Chen, Zhentao Tan, Xiaofan Bo, Yue Wu, Tao Gong, Qi Chu, Jieping Ye, Nenghai Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.19786
Pdf URL: https://arxiv.org/pdf/2507.19786
Copy Paste: [[2507.19786]] Flora: Effortless Context Construction to Arbitrary Length and Scale(https://arxiv.org/abs/2507.19786)
Keywords: language model, llm, long context
Abstract: Effectively handling long contexts is challenging for Large Language Models (LLMs) due to the rarity of long texts, high computational demands, and substantial forgetting of short-context abilities. Recent approaches have attempted to construct long contexts for instruction tuning, but these methods often require LLMs or human interventions, which are both costly and limited in length and diversity. Also, the drop in short-context performance of present long-context LLMs remains significant. In this paper, we introduce Flora, an effortless (human/LLM-free) long-context construction strategy. Flora can markedly enhance the long-context performance of LLMs by arbitrarily assembling short instructions based on categories and instructing LLMs to generate responses based on long-context meta-instructions. This enables Flora to produce contexts of arbitrary length and scale with rich diversity, while only slightly compromising short-context performance. Experiments on Llama3-8B-Instruct and QwQ-32B show that LLMs enhanced by Flora excel in three long-context benchmarks while maintaining strong performance in short-context tasks. Our data-construction code is available at this https URL.
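Summary (translated from Chinese): Handling long contexts effectively is challenging for Large Language Models (LLMs) because long texts are rare, computational demands are high, and short-context abilities are substantially forgotten. Recent approaches have tried to construct long contexts for instruction tuning, but these methods often require LLMs or human intervention, which is costly and limits length and diversity. In addition, the drop in short-context performance of current long-context LLMs remains significant. In this paper we introduce Flora, an effortless (human/LLM-free) long-context construction strategy. Flora markedly improves the long-context performance of LLMs by arbitrarily assembling short instructions according to category and instructing LLMs to generate responses based on long-context meta-instructions. This lets Flora produce contexts of arbitrary length and scale with rich diversity, while only slightly compromising short-context performance. Experiments on Llama3-8B-Instruct and QwQ-32B show that LLMs enhanced by Flora excel on three long-context benchmarks while maintaining strong performance on short-context tasks. Our data-construction code is available at this https URL.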

Title: HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs

Authors: Dongquan Yang, Yifan Yang, Xiaotian Yu, Xianbiao Qi, Rong Xiao
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.19823
Pdf URL: https://arxiv.org/pdf/2507.19823
Copy Paste: [[2507.19823]] HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs(https://arxiv.org/abs/2507.19823)
Keywords: language model, llm
Abstract: Processing long-context inputs with large language models presents a significant challenge due to the enormous memory requirements of the Key-Value (KV) cache during inference. Existing KV cache compression methods exhibit noticeable performance degradation when memory is reduced by more than 85%. Additionally, strategies that leverage GPU-CPU collaboration for approximate attention remain underexplored in this setting. We propose HCAttention, a heterogeneous attention computation framework that integrates key quantization, value offloading, and dynamic KV eviction to enable efficient inference under extreme memory constraints. The method is compatible with existing transformer architectures and does not require model fine-tuning. Experimental results on the LongBench benchmark demonstrate that our approach preserves the accuracy of the full-attention model while shrinking the KV cache memory footprint to 25% of its original size. Remarkably, it stays competitive with only 12.5% of the cache, setting a new state-of-the-art in LLM KV cache compression. To the best of our knowledge, HCAttention is the first to extend the Llama-3-8B model to process 4 million tokens on a single A100 GPU with 80GB memory.
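Summary (translated from Chinese): Processing long-context inputs with large language models is a significant challenge because of the enormous memory requirements of the Key-Value (KV) cache during inference. Existing KV cache compression methods show noticeable performance degradation when memory is reduced by more than 85%. Moreover, strategies that use GPU-CPU collaboration for approximate attention remain underexplored in this setting. We propose HCAttention, a heterogeneous attention computation framework that integrates key quantization, value offloading, and dynamic KV eviction to enable efficient inference under extreme memory constraints. The method is compatible with existing transformer architectures and requires no model fine-tuning. Experimental results on the LongBench benchmark show that our approach preserves the accuracy of the full-attention model while shrinking the KV cache memory footprint to 25% of its original size. Remarkably, it stays competitive with only 12.5% of the cache, setting a new state of the art in LLM KV cache compression. To the best of our knowledge, HCAttention is the first to extend the Llama-3-8B model to process 4 million tokens on a single A100 GPU with 80GB of memory.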
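A back-of-the-envelope calculation shows why the KV cache dominates memory at this scale. The sketch below is not from the paper: it assumes the publicly documented Llama-3-8B configuration (32 layers, 8 key-value heads, head dimension 128) and a 16-bit cache, and it simply scales the uncompressed size by the fractions quoted in the abstract. It does not model HCAttention's quantization, offloading, or eviction.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Uncompressed KV cache size; defaults are assumed Llama-3-8B values."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


if __name__ == "__main__":
    full = kv_cache_bytes(4_000_000)
    print("per token:", kv_cache_bytes(1) // 1024, "KiB")
    for fraction in (1.0, 0.25, 0.125):
        print(f"{fraction:>6.1%} of cache: {full * fraction / GIB:7.1f} GiB")
```

Under these assumptions the full cache for 4 million tokens is several hundred GiB, far beyond a single 80GB GPU, which is consistent with the abstract's emphasis on combining compression with GPU-CPU collaboration.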

Title: DRIVE: Disfluency-Rich Synthetic Dialog Data Generation Framework for Intelligent Vehicle Environments

Authors: Anshul Chavda, M Jagadeesh, Chintalapalli Raja Kullayappa, B Jayaprakash, Medchalimi Sruthi, Pushpak Bhattacharyya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.19867
Pdf URL: https://arxiv.org/pdf/2507.19867
Copy Paste: [[2507.19867]] DRIVE: Disfluency-Rich Synthetic Dialog Data Generation Framework for Intelligent Vehicle Environments(https://arxiv.org/abs/2507.19867)
Keywords: gpt, prompt
Abstract: In-car conversational AI is becoming increasingly critical as autonomous vehicles and smart assistants gain widespread adoption. Yet, existing datasets fail to capture the spontaneous disfluencies such as hesitations, false starts, repetitions, and self-corrections that characterize real driver-AI dialogs. To address this, we introduce DiscoDrive, a synthetic corpus of 3500 multi-turn dialogs across seven automotive domains, generated using a two-stage, prompt-driven pipeline that dynamically integrates disfluencies during synthesis. We show that DiscoDrive is effective both as a training resource, enabling DialoGPT-Medium and T5-Base to match or exceed KVRET-trained models on the MultiWOZ 2.2 and Schema-Guided Dialogue (SGD) relevant test sets (BLEU-4 improvements of 0.26 to 0.61; METEOR +2.10; ROUGE-L +3.48; BERTScore F1 improvements of 1.35 to 3.48), and as a data augmentation resource in low-resource scenarios, delivering additional gains of up to BLEU-4 +0.38, METEOR +1.95, ROUGE-L +2.87, and BERTScore F1 +4.00 when combined with 10 percent of KVRET. Human evaluations further confirm that dialogs sampled from DiscoDrive are rated higher than KVRET's human-collected dialogs in naturalness (3.8 vs 3.6) and coherence (4.1 vs 4.0), and are perceived as more context-appropriate than leading post-hoc methods (such as LARD), without compromising clarity. DiscoDrive fills a critical gap in existing resources and serves as a versatile corpus for both training and augmenting conversational AI, enabling robust handling of real-world, disfluent in-car interactions.

Title: Zero-shot Performance of Generative AI in Brazilian Portuguese Medical Exam

Authors: Cesar Augusto Madid Truyts, Amanda Gomes Rabelo, Gabriel Mesquita de Souza, Daniel Scaldaferri Lages, Adriano Jose Pereira, Uri Adrian Prync Flato, Eduardo Pontes dos Reis, Joaquim Edson Vieira, Paulo Sergio Panse Silveira, Edson Amaro Junior
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.19885
Pdf URL: https://arxiv.org/pdf/2507.19885
Copy Paste: [[2507.19885]] Zero-shot Performance of Generative AI in Brazilian Portuguese Medical Exam(https://arxiv.org/abs/2507.19885)
Keywords: language model, gpt, llm
Abstract: Artificial intelligence (AI) has shown the potential to revolutionize healthcare by improving diagnostic accuracy, optimizing workflows, and personalizing treatment plans. Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have achieved notable advancements in natural language processing and medical applications. However, the evaluation of these models has focused predominantly on the English language, leading to potential biases in their performance across different languages. This study investigates the capability of six LLMs (GPT-4.0 Turbo, LLaMA-3-8B, LLaMA-3-70B, Mixtral 8x7B Instruct, Titan Text G1-Express, and Command R+) and four MLLMs (Claude-3.5-Sonnet, Claude-3-Opus, Claude-3-Sonnet, and Claude-3-Haiku) to answer questions written in Brazilian Portuguese from the medical residency entrance exam of the Hospital das Clínicas da Faculdade de Medicina da Universidade de São Paulo (HCFMUSP), the largest health complex in South America. The performance of the models was benchmarked against human candidates, analyzing accuracy, processing time, and coherence of the generated explanations. The results show that while some models, particularly Claude-3.5-Sonnet and Claude-3-Opus, achieved accuracy levels comparable to human candidates, performance gaps persist, particularly in multimodal questions requiring image interpretation. Furthermore, the study highlights language disparities, emphasizing the need for further fine-tuning and data set augmentation for non-English medical AI applications. Our findings reinforce the importance of evaluating generative AI in various linguistic and clinical settings to ensure a fair and reliable deployment in healthcare. Future research should explore improved training methodologies, stronger multimodal reasoning, and real-world clinical integration of AI-driven medical assistance.

Title: A Gold Standard Dataset and Evaluation Framework for Depression Detection and Explanation in Social Media using LLMs

Authors: Prajval Bolegave, Pushpak Bhattacharya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.19899
Pdf URL: https://arxiv.org/pdf/2507.19899
Copy Paste: [[2507.19899]] A Gold Standard Dataset and Evaluation Framework for Depression Detection and Explanation in Social Media using LLMs(https://arxiv.org/abs/2507.19899)
Keywords: language model, gpt, llm, prompt
Abstract: Early detection of depression from online social media posts holds promise for providing timely mental health interventions. In this work, we present a high-quality, expert-annotated dataset of 1,017 social media posts labeled with depressive spans and mapped to 12 depression symptom categories. Unlike prior datasets that primarily offer coarse post-level labels (Cohan et al., 2018), our dataset enables fine-grained evaluation of both model predictions and generated explanations. We develop an evaluation framework that leverages this clinically grounded dataset to assess the faithfulness and quality of natural language explanations generated by large language models (LLMs). Through carefully designed prompting strategies, including zero-shot and few-shot approaches with domain-adapted examples, we evaluate state-of-the-art proprietary LLMs including GPT-4.1, Gemini 2.5 Pro, and Claude 3.7 Sonnet. Our comprehensive empirical analysis reveals significant differences in how these models perform on clinical explanation tasks under zero-shot and few-shot prompting. Our findings underscore the value of human expertise in guiding LLM behavior and offer a step toward safer, more transparent AI systems for psychological well-being.

Title: CaliDrop: KV Cache Compression with Calibration

Authors: Yi Su, Quantong Qiu, Yuechi Zhou, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.19906
Pdf URL: https://arxiv.org/pdf/2507.19906
Copy Paste: [[2507.19906]] CaliDrop: KV Cache Compression with Calibration(https://arxiv.org/abs/2507.19906)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) require substantial computational resources during generation. While the Key-Value (KV) cache significantly accelerates this process by storing attention intermediates, its memory footprint grows linearly with sequence length, batch size, and model size, creating a bottleneck in long-context scenarios. Various KV cache compression techniques, including token eviction, quantization, and low-rank projection, have been proposed to mitigate this bottleneck, often complementing each other. This paper focuses on enhancing token eviction strategies. Token eviction leverages the observation that the attention patterns are often sparse, allowing for the removal of less critical KV entries to save memory. However, this reduction usually comes at the cost of notable accuracy degradation, particularly under high compression ratios. To address this issue, we propose CaliDrop, a novel strategy that enhances token eviction through calibration. Our preliminary experiments show that queries at nearby positions exhibit high similarity. Building on this observation, CaliDrop performs speculative calibration on the discarded tokens to mitigate the accuracy loss caused by token eviction. Extensive experiments demonstrate that CaliDrop significantly improves the accuracy of existing token eviction methods.
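The token eviction idea described above can be illustrated with a minimal sketch. This is one common family of eviction heuristics (keep the cached tokens that have received the most attention so far), written under that assumption; it is not CaliDrop's calibration step, and the function name `evict_tokens` and the toy numbers are hypothetical.

```python
def evict_tokens(cache, attn_weights, keep):
    """Keep the `keep` cached tokens with the highest accumulated attention.

    cache: list of (key, value) entries, one per cached token.
    attn_weights: list of rows, one per recent query, each of len(cache).
    Returns the surviving entries and their original positions.
    """
    num_tokens = len(cache)
    # Total attention mass each cached token received across the queries.
    scores = [sum(row[t] for row in attn_weights) for t in range(num_tokens)]
    ranked = sorted(range(num_tokens), key=lambda t: scores[t], reverse=True)
    kept = sorted(ranked[:keep])  # restore original token order
    return [cache[t] for t in kept], kept

cache = [(f"k{t}", f"v{t}") for t in range(6)]
attn = [
    [0.50, 0.02, 0.03, 0.30, 0.05, 0.10],
    [0.40, 0.05, 0.05, 0.35, 0.05, 0.10],
]
survivors, kept = evict_tokens(cache, attn, keep=3)
print(kept)  # [0, 3, 5]
```

Because attention is sparse in this toy example, three of six entries carry most of the mass; the accuracy loss the abstract mentions comes from the cases where a discarded entry later turns out to matter.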

Title: KLAAD: Refining Attention Mechanisms to Reduce Societal Bias in Generative Language Models

Authors: Seorin Kim, Dongyoung Lee, Jaejin Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.19962
Pdf URL: https://arxiv.org/pdf/2507.19962
Copy Paste: [[2507.19962]] KLAAD: Refining Attention Mechanisms to Reduce Societal Bias in Generative Language Models(https://arxiv.org/abs/2507.19962)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) often exhibit societal biases in their outputs, prompting ethical concerns regarding fairness and harm. In this work, we propose KLAAD (KL-Attention Alignment Debiasing), an attention-based debiasing framework that implicitly aligns attention distributions between stereotypical and anti-stereotypical sentence pairs without directly modifying model weights. KLAAD introduces a composite training objective combining Cross-Entropy, KL divergence, and Triplet losses, guiding the model to consistently attend across biased and unbiased contexts while preserving fluency and coherence. Experimental evaluation of KLAAD demonstrates improved bias mitigation on both the BBQ and BOLD benchmarks, with minimal impact on language modeling quality. The results indicate that attention-level alignment offers a principled solution for mitigating bias in generative language models.
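As a rough illustration of the composite objective named above, the sketch below computes a KL divergence between two attention distributions and folds it into a weighted sum with cross-entropy and triplet terms. The weighted-sum form, the KL direction, the weights `w_kl` and `w_triplet`, and the toy attention values are all assumptions made for illustration; the abstract does not specify them.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same token positions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def composite_loss(ce, kl, triplet, w_kl=1.0, w_triplet=1.0):
    # Assumed form: a weighted sum; the weights are hypothetical hyperparameters.
    return ce + w_kl * kl + w_triplet * triplet

# Toy attention distributions for a stereotypical / anti-stereotypical sentence pair.
attn_stereo = [0.70, 0.20, 0.10]
attn_anti = [0.40, 0.35, 0.25]
kl = kl_divergence(attn_stereo, attn_anti)
print(round(kl, 4))
print(composite_loss(ce=2.0, kl=kl, triplet=0.3))
```

The KL term is zero when the two attention distributions coincide, so minimizing it pushes the model to attend the same way across both sentences of a pair.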

Title: Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text

Authors: Mizanur Rahman, Md Tahmid Rahman Laskar, Shafiq Joty, Enamul Hoque
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2507.19969
Pdf URL: https://arxiv.org/pdf/2507.19969
Copy Paste: [[2507.19969]] Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text(https://arxiv.org/abs/2507.19969)
Keywords: language model, gpt, llm, agent
Abstract: Automated data visualization plays a crucial role in simplifying data interpretation, enhancing decision-making, and improving efficiency. While large language models (LLMs) have shown promise in generating visualizations from natural language, the absence of comprehensive benchmarks limits the rigorous evaluation of their capabilities. We introduce Text2Vis, a benchmark designed to assess text-to-visualization models, covering 20+ chart types and diverse data science queries, including trend analysis, correlation, outlier detection, and predictive analytics. It comprises 1,985 samples, each with a data table, natural language query, short answer, visualization code, and annotated charts. The queries involve complex reasoning, conversational turns, and dynamic data retrieval. We benchmark 11 open-source and closed-source models, revealing significant performance gaps, highlighting key challenges, and offering insights for future advancements. To close this gap, we propose the first cross-modal actor-critic agentic framework that jointly refines the textual answer and visualization code, increasing GPT-4o's pass rate from 26% to 42% over the direct approach and improving chart quality. We also introduce an automated LLM-based evaluation framework that enables scalable assessment across thousands of samples without human annotation, measuring answer correctness, code execution success, visualization readability, and chart accuracy. We release Text2Vis at this https URL.

Title: Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory

Authors: Dan Song, Won-Chan Lee, Hong Jiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.19980
Pdf URL: https://arxiv.org/pdf/2507.19980
Copy Paste: [[2507.19980]] Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory(https://arxiv.org/abs/2507.19980)
Keywords: language model, llm
Abstract: This study investigates the estimation of reliability for large language models (LLMs) in scoring writing tasks from the AP Chinese Language and Culture Exam. Using generalizability theory, the research evaluates and compares score consistency between human and AI raters across two types of AP Chinese free-response writing tasks: story narration and email response. These essays were independently scored by two trained human raters and seven AI raters. Each essay received four scores: one holistic score and three analytic scores corresponding to the domains of task completion, delivery, and language use. Results indicate that although human raters produced more reliable scores overall, LLMs demonstrated reasonable consistency under certain conditions, particularly for story narration tasks. Composite scoring that incorporates both human and AI raters improved reliability, which suggests that hybrid scoring models may offer benefits for large-scale writing assessments.

Title: VLQA: The First Comprehensive, Large, and High-Quality Vietnamese Dataset for Legal Question Answering

Authors: Tan-Minh Nguyen, Hoang-Trung Nguyen, Trong-Khoi Dao, Xuan-Hieu Phan, Ha-Thanh Nguyen, Thi-Hai-Yen Vuong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.19995
Pdf URL: https://arxiv.org/pdf/2507.19995
Copy Paste: [[2507.19995]] VLQA: The First Comprehensive, Large, and High-Quality Vietnamese Dataset for Legal Question Answering(https://arxiv.org/abs/2507.19995)
Keywords: language model, llm
Abstract: The advent of large language models (LLMs) has led to significant achievements in various domains, including legal text processing. Leveraging LLMs for legal tasks is a natural evolution and an increasingly compelling choice. However, their capabilities are often portrayed as greater than they truly are. Despite the progress, we are still far from the ultimate goal of fully automating legal tasks using artificial intelligence (AI) and natural language processing (NLP). Moreover, legal systems are deeply domain-specific and exhibit substantial variation across different countries and languages. The need for building legal text processing applications for different natural languages is, therefore, large and urgent. However, there is a big challenge for legal NLP in low-resource languages such as Vietnamese due to the scarcity of resources and annotated data. The need for labeled legal corpora for supervised training, validation, and supervised fine-tuning is critical. In this paper, we introduce the VLQA dataset, a comprehensive and high-quality resource tailored for the Vietnamese legal domain. We also conduct a comprehensive statistical analysis of the dataset and evaluate its effectiveness through experiments with state-of-the-art models on legal information retrieval and question-answering tasks.

Title: FAEDKV: Infinite-Window Fourier Transform for Unbiased KV Cache Compression

Authors: Runchao Li, Yao Fu, Mu Sheng, Xianxuan Long, Haotian Yu, Pan Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20030
Pdf URL: https://arxiv.org/pdf/2507.20030
Copy Paste: [[2507.20030]] FAEDKV: Infinite-Window Fourier Transform for Unbiased KV Cache Compression(https://arxiv.org/abs/2507.20030)
Keywords: language model, llm
Abstract: The efficacy of Large Language Models (LLMs) in long-context tasks is often hampered by the substantial memory footprint and computational demands of the Key-Value (KV) cache. Current compression strategies, including token eviction and learned projections, frequently lead to biased representations -- either by overemphasizing recent/high-attention tokens or by repeatedly degrading information from earlier context -- and may require costly model retraining. We present FAEDKV (Frequency-Adaptive Infinite-Window for KV cache), a novel, training-free KV cache compression framework that ensures unbiased information retention. FAEDKV operates by transforming the KV cache into the frequency domain using a proposed Infinite-Window Fourier Transform (IWDFT). This approach allows for the equalized contribution of all tokens to the compressed representation, effectively preserving both early and recent contextual information. A preliminary frequency ablation study identifies critical spectral components for layer-wise, targeted compression. Experiments on the LongBench benchmark demonstrate FAEDKV's superiority over existing methods by up to 22%. In addition, our method shows superior, position-agnostic retrieval accuracy on the Needle-In-A-Haystack task compared to compression-based approaches.

Title: Infogen: Generating Complex Statistical Infographics from Documents

Authors: Akash Ghosh, Aparna Garimella, Pritika Ramu, Sambaran Bandyopadhyay, Sriparna Saha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20046
Pdf URL: https://arxiv.org/pdf/2507.20046
Copy Paste: [[2507.20046]] Infogen: Generating Complex Statistical Infographics from Documents(https://arxiv.org/abs/2507.20046)
Keywords: llm
Abstract: Statistical infographics are powerful tools that simplify complex data into visually engaging and easy-to-understand formats. Despite advancements in AI, particularly with LLMs, existing efforts have been limited to generating simple charts, with no prior work addressing the creation of complex infographics from text-heavy documents that demand a deep understanding of the content. We address this gap by introducing the task of generating statistical infographics composed of multiple sub-charts (e.g., line, bar, pie) that are contextually accurate, insightful, and visually aligned. To achieve this, we define infographic metadata that includes its title and textual insights, along with sub-chart-specific details such as their corresponding data and alignment. We also present Infodat, the first benchmark dataset for text-to-infographic metadata generation, where each sample links a document to its metadata. We propose Infogen, a two-stage framework where fine-tuned LLMs first generate metadata, which is then converted into infographic code. Extensive evaluations on Infodat demonstrate that Infogen achieves state-of-the-art performance, outperforming both closed and open-source LLMs in text-to-statistical infographic generation.

Title: RAG in the Wild: On the (In)effectiveness of LLMs with Mixture-of-Knowledge Retrieval Augmentation

Authors: Ran Xu, Yuchen Zhuang, Yue Yu, Haoyu Wang, Wenqi Shi, Carl Yang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.20059
Pdf URL: https://arxiv.org/pdf/2507.20059
Copy Paste: [[2507.20059]] RAG in the Wild: On the (In)effectiveness of LLMs with Mixture-of-Knowledge Retrieval Augmentation(https://arxiv.org/abs/2507.20059)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieved at inference time. While RAG demonstrates strong performance on benchmarks largely derived from general-domain corpora like Wikipedia, its effectiveness under realistic, diverse retrieval scenarios remains underexplored. We evaluated RAG systems using MassiveDS, a large-scale datastore with a mixture of knowledge, and identified critical limitations: retrieval mainly benefits smaller models, rerankers add minimal value, and no single retrieval source consistently excels. Moreover, current LLMs struggle to route queries across heterogeneous knowledge sources. These findings highlight the need for adaptive retrieval strategies before deploying RAG in real-world settings. Our code and data can be found at this https URL.

Title: ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models

Authors: Kaizhi Qian, Xulin Fan, Junrui Ni, Slava Shechtman, Mark Hasegawa-Johnson, Chuang Gan, Yang Zhang
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2507.20091
Pdf URL: https://arxiv.org/pdf/2507.20091
Copy Paste: [[2507.20091]] ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models(https://arxiv.org/abs/2507.20091)
Keywords: language model, llm, long context
Abstract: Speech language models refer to language models with speech processing and understanding capabilities. One key desirable capability for speech language models is the ability to capture the intricate interdependency between content and prosody. The existing mainstream paradigm of training speech language models, which converts speech into discrete tokens before feeding them into LLMs, is sub-optimal in learning prosody information -- we find that the resulting LLMs do not exhibit obvious emerging prosody processing capabilities via pre-training alone. To overcome this, we propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. Each speech utterance is first transcribed into text, followed by a sequence of word-level prosody tokens. Compared with conventional speech tokenization schemes, the proposed tokenization scheme retains more complete prosody information, and is more understandable to text-based LLMs. We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone, ranging from harnessing the prosody nuances in generated speech, such as contrastive focus, understanding emotion and stress in an utterance, to maintaining prosody consistency in long contexts.

Title: AI-Driven Generation of Old English: A Framework for Low-Resource Languages

Authors: Rodrigo Gabriel Salazar Alva, Matías Nuñez, Cristian López, Javier Martín Arista
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.20111
Pdf URL: https://arxiv.org/pdf/2507.20111
Copy Paste: [[2507.20111]] AI-Driven Generation of Old English: A Framework for Low-Resource Languages(https://arxiv.org/abs/2507.20111)
Keywords: language model, llm, agent
Abstract: Preserving ancient languages is essential for understanding humanity's cultural and linguistic heritage, yet Old English remains critically under-resourced, limiting its accessibility to modern natural language processing (NLP) techniques. We present a scalable framework that uses advanced large language models (LLMs) to generate high-quality Old English texts, addressing this gap. Our approach combines parameter-efficient fine-tuning (Low-Rank Adaptation, LoRA), data augmentation via backtranslation, and a dual-agent pipeline that separates the tasks of content generation (in English) and translation (into Old English). Evaluation with automated metrics (BLEU, METEOR, and CHRF) shows significant improvements over baseline models, with BLEU scores increasing from 26 to over 65 for English-to-Old English translation. Expert human assessment also confirms high grammatical accuracy and stylistic fidelity in the generated texts. Beyond expanding the Old English corpus, our method offers a practical blueprint for revitalizing other endangered languages, effectively uniting AI innovation with the goals of cultural preservation.

Title: Sem-DPO: Mitigating Semantic Inconsistency in Preference Optimization for Prompt Engineering

Authors: Anas Mohamed, Azal Ahmad Khan, Xinran Wang, Ahmad Faraz Khan, Shuwen Ge, Saman Bahzad Khan, Ayaan Ahmad, Ali Anwar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.20133
Pdf URL: https://arxiv.org/pdf/2507.20133
Copy Paste: [[2507.20133]] Sem-DPO: Mitigating Semantic Inconsistency in Preference Optimization for Prompt Engineering(https://arxiv.org/abs/2507.20133)
Keywords: language model, prompt
Abstract: Generative AI can now synthesize strikingly realistic images from text, yet output quality remains highly sensitive to how prompts are phrased. Direct Preference Optimization (DPO) offers a lightweight, off-policy alternative to RL for automatic prompt engineering, but its token-level regularization leaves semantic inconsistency unchecked as prompts that win higher preference scores can still drift away from the user's intended meaning. We introduce Sem-DPO, a variant of DPO that preserves semantic consistency yet retains its simplicity and efficiency. Sem-DPO scales the DPO loss by an exponential weight proportional to the cosine distance between the original prompt and winning candidate in embedding space, softly down-weighting training signals that would otherwise reward semantically mismatched prompts. We provide the first analytical bound on semantic drift for preference-tuned prompt generators, showing that Sem-DPO keeps learned prompts within a provably bounded neighborhood of the original text. On three standard text-to-image prompt-optimization benchmarks and two language models, Sem-DPO achieves 8-12% higher CLIP similarity and 5-9% higher human-preference scores (HPSv2.1, PickScore) than DPO, while also outperforming state-of-the-art baselines. These findings suggest that strong flat baselines augmented with semantic weighting should become the new standard for prompt-optimization studies and lay the groundwork for broader, semantics-aware preference optimization in language models.
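The weighting scheme described above can be sketched in a few lines. The DPO loss below is the standard form; the weight `exp(-alpha * d)` with `d` the cosine distance between the original prompt's embedding and the winning candidate's embedding is an assumed reading of "an exponential weight proportional to the cosine distance" that down-weights mismatched pairs. The names `alpha`, `sem_weighted_loss`, and the toy embeddings are hypothetical and not taken from the paper.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winner, loser) pair of log-probabilities."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

def sem_weighted_loss(base_loss, prompt_emb, winner_emb, alpha=1.0):
    # Assumed weight: shrinks toward 0 as the winner drifts from the prompt.
    weight = math.exp(-alpha * cosine_distance(prompt_emb, winner_emb))
    return weight * base_loss

base = dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-13.0, ref_logp_l=-14.0)
close = sem_weighted_loss(base, [1.0, 0.0], [0.9, 0.1])   # semantically close winner
far = sem_weighted_loss(base, [1.0, 0.0], [0.0, 1.0])     # semantically distant winner
print(close > far)  # True
```

A winner whose embedding matches the original prompt keeps the full DPO signal (weight 1), while a distant winner contributes less, which is the "soft down-weighting" behavior the abstract describes.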

Title: Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG

Authors: Baiyu Chen, Wilson Wongso, Xiaoqian Hu, Yue Tan, Flora Salim
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2507.20136
Pdf URL: https://arxiv.org/pdf/2507.20136
Copy Paste: [[2507.20136]] Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG(https://arxiv.org/abs/2507.20136)
Keywords: language model, hallucination
Abstract: This paper presents the technical solution developed by team CRUISE for the KDD Cup 2025 Meta Comprehensive RAG Benchmark for Multi-modal, Multi-turn (CRAG-MM) challenge. The challenge aims to address a critical limitation of modern Vision Language Models (VLMs): their propensity to hallucinate, especially when faced with egocentric imagery, long-tail entities, and complex, multi-hop questions. This issue is particularly problematic in real-world applications where users pose fact-seeking queries that demand high factual accuracy across diverse modalities. To tackle this, we propose a robust, multi-stage framework that prioritizes factual accuracy and truthfulness over completeness. Our solution integrates a lightweight query router for efficiency, a query-aware retrieval and summarization pipeline, dual-pathway generation, and post-hoc verification. This conservative strategy is designed to minimize hallucinations, which incur a severe penalty in the competition's scoring metric. Our approach achieved 3rd place in Task 1, demonstrating the effectiveness of prioritizing answer reliability in complex multi-modal RAG systems. Our implementation is available at this https URL.

Title: Multi-Agent Interactive Question Generation Framework for Long Document Understanding

Authors: Kesen Wang, Daulet Toibazar, Abdulrahman Alfulayt, Abdulaziz S. Albadawi, Ranya A. Alkahtani, Asma A. Ibrahim, Haneen A. Alhomoud, Sherif Mohamed, Pedro J. Moreno
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.20145
Pdf URL: https://arxiv.org/pdf/2507.20145
Copy Paste: [[2507.20145]] Multi-Agent Interactive Question Generation Framework for Long Document Understanding(https://arxiv.org/abs/2507.20145)
Keywords: language model, prompt, agent
Abstract: Document Understanding (DU) in long-context scenarios with complex layouts remains a significant challenge in vision-language research. Although Large Vision-Language Models (LVLMs) excel at short-context DU tasks, their performance declines in long-context settings. A key limitation is the scarcity of fine-grained training data, particularly for low-resource languages such as Arabic. Existing state-of-the-art techniques rely heavily on human annotation, which is costly and inefficient. We propose a fully automated, multi-agent interactive framework to generate long-context questions efficiently. Our approach generates high-quality single- and multi-page questions for extensive English and Arabic documents, covering hundreds of pages across diverse domains. This facilitates the development of LVLMs with enhanced long-context understanding ability. Experimental results show that our generated English and Arabic questions (AraEngLongBench) are quite challenging for major open- and closed-source LVLMs. The code and data proposed in this work can be found at this https URL. Sample Question and Answer (QA) pairs and structured system prompts can be found in the Appendix.

Title: Goal Alignment in LLM-Based User Simulators for Conversational AI

Authors: Shuhaib Mehri, Xiaocheng Yang, Takyoung Kim, Gokhan Tur, Shikib Mehri, Dilek Hakkani-Tür
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.20152
Pdf URL: https://arxiv.org/pdf/2507.20152
Copy Paste: [[2507.20152]] Goal Alignment in LLM-Based User Simulators for Conversational AI(https://arxiv.org/abs/2507.20152)
Keywords: language model, llm, agent
Abstract: User simulators are essential to conversational AI, enabling scalable agent development and evaluation through simulated interactions. While current Large Language Models (LLMs) have advanced user simulation capabilities, we reveal that they struggle to consistently demonstrate goal-oriented behavior across multi-turn conversations--a critical limitation that compromises their reliability in downstream applications. We introduce User Goal State Tracking (UGST), a novel framework that tracks user goal progression throughout conversations. Leveraging UGST, we present a three-stage methodology for developing user simulators that can autonomously track goal progression and reason to generate goal-aligned responses. Moreover, we establish comprehensive evaluation metrics for measuring goal alignment in user simulators, and demonstrate that our approach yields substantial improvements across two benchmarks (MultiWOZ 2.4 and τ-Bench). Our contributions address a critical gap in conversational AI and establish UGST as an essential framework for developing goal-aligned user simulators.

Title: SGPO: Self-Generated Preference Optimization based on Self-Improver

Authors: Hyeonji Lee, Daejin Jo, Seohwan Yun, Sungwoong Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.20181
Pdf URL: https://arxiv.org/pdf/2507.20181
Copy Paste: [[2507.20181]] SGPO: Self-Generated Preference Optimization based on Self-Improver(https://arxiv.org/abs/2507.20181)
Keywords: language model, llm
Abstract: Large language models (LLMs), despite their extensive pretraining on diverse datasets, require effective alignment to human preferences for practical and reliable deployment. Conventional alignment methods typically employ off-policy learning and depend on human-annotated datasets, which limits their broad applicability and introduces distribution shift issues during training. To address these challenges, we propose Self-Generated Preference Optimization based on Self-Improver (SGPO), an innovative alignment framework that leverages an on-policy self-improving mechanism. Specifically, the improver refines responses from a policy model to self-generate preference data for direct preference optimization (DPO) of the policy model. Here, the improver and policy are unified into a single model, and in order to generate higher-quality preference data, this self-improver learns to make incremental yet discernible improvements to the current responses by referencing supervised fine-tuning outputs. Experimental results on AlpacaEval 2.0 and Arena-Hard show that the proposed SGPO significantly improves performance over DPO and baseline self-improving methods without using external preference data.

Title: SessionIntentBench: A Multi-task Inter-session Intention-shift Modeling Benchmark for E-commerce Customer Behavior Understanding

Authors: Yuqi Yang, Weiqi Wang, Baixuan Xu, Wei Fan, Qing Zong, Chunkit Chan, Zheye Deng, Xin Liu, Yifan Gao, Changlong Yu, Chen Luo, Yang Li, Zheng Li, Qingyu Yin, Bing Yin, Yangqiu Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20185
Pdf URL: https://arxiv.org/pdf/2507.20185
Copy Paste: [[2507.20185]] SessionIntentBench: A Multi-task Inter-session Intention-shift Modeling Benchmark for E-commerce Customer Behavior Understanding(https://arxiv.org/abs/2507.20185)
Keywords: llm
Abstract: Session history is a common way of recording user interaction behavior throughout a browsing activity involving multiple products. For example, if a user clicks a product webpage and then leaves, it might be because certain features do not satisfy the user, which serves as an important indicator of on-the-spot user preferences. However, all prior works fail to capture and model customer intention effectively because they exploit the available information insufficiently and use only surface information such as descriptions and titles. There is also a lack of data and a corresponding benchmark for explicitly modeling intention in E-commerce product purchase sessions. To address these issues, we introduce the concept of an intention tree and propose a dataset curation pipeline. Together, we construct a sibling multimodal benchmark, SessionIntentBench, that evaluates L(V)LMs' capability to understand inter-session intention shift with four subtasks. With 1,952,177 intention entries, 1,132,145 session intention trajectories, and 13,003,664 available tasks mined using 10,905 sessions, we provide a scalable way to exploit existing session data for customer intention understanding. We conduct human annotation to collect ground-truth labels for a subset of the collected data to form an evaluation gold set. Extensive experiments on the annotated data further confirm that current L(V)LMs fail to capture and utilize intention across the complex session setting. Further analysis shows that injecting intention enhances LLMs' performance.

Title: Diversity-Enhanced Reasoning for Subjective Questions

Authors: Yumeng Wang, Zhiyuan Fan, Jiayu Liu, Yi R. Fung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20187
Pdf URL: https://arxiv.org/pdf/2507.20187
Copy Paste: [[2507.20187]] Diversity-Enhanced Reasoning for Subjective Questions(https://arxiv.org/abs/2507.20187)
Keywords: chain-of-thought
Abstract: Large reasoning models (LRMs) with long chain-of-thought (CoT) capabilities have shown strong performance on objective tasks, such as math reasoning and coding. However, their effectiveness on subjective questions that may have different responses from different perspectives is still limited by a tendency towards homogeneous reasoning, introduced by the reliance on a single ground truth in supervised fine-tuning and verifiable reward in reinforcement learning. Motivated by the finding that increasing role perspectives consistently improves performance, we propose MultiRole-R1, a diversity-enhanced framework with multiple role perspectives, to improve the accuracy and diversity in subjective reasoning tasks. MultiRole-R1 features an unsupervised data construction pipeline that generates reasoning chains that incorporate diverse role perspectives. We further employ reinforcement learning via Group Relative Policy Optimization (GRPO) with reward shaping, by taking diversity as a reward signal in addition to the verifiable reward. With specially designed reward functions, we successfully promote perspective diversity and lexical diversity, uncovering a positive relation between reasoning diversity and accuracy. Our experiment on six benchmarks demonstrates MultiRole-R1's effectiveness and generalizability in enhancing both subjective and objective reasoning, showcasing the potential of diversity-enhanced training in LRMs.
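The combination of GRPO with a diversity reward mentioned in this entry can be sketched as follows. GRPO scores each sampled response relative to the other responses in its group; the additive form of the shaped reward, the `lam` coefficient, and the use of the population standard deviation are assumptions made for illustration, not details from the paper.

```python
from statistics import mean, pstdev

def shaped_rewards(verifiable, diversity, lam=0.5):
    """Assumed shaping: verifiable reward plus a weighted diversity bonus."""
    return [v + lam * d for v, d in zip(verifiable, diversity)]

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy group of four responses to one question: two are correct (reward 1.0),
# and within each correctness level one response is more diverse.
rewards = shaped_rewards([1.0, 1.0, 0.0, 0.0], [0.2, 0.8, 0.2, 0.8])
advantages = group_relative_advantages(rewards)
```

With this shaping, two equally correct responses no longer receive the same advantage: the more diverse one is pushed up relative to its group.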

Title: IQ Test for LLMs: An Evaluation Framework for Uncovering Core Skills in LLMs

Authors: Aviya Maimon, Amir DN Cohen, Gal Vishne, Shauli Ravfogel, Reut Tsarfaty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20208
Pdf URL: https://arxiv.org/pdf/2507.20208
Copy Paste: [[2507.20208]] IQ Test for LLMs: An Evaluation Framework for Uncovering Core Skills in LLMs(https://arxiv.org/abs/2507.20208)
Keywords: language model, llm
Abstract: Current evaluations of large language models (LLMs) rely on benchmark scores, but it is difficult to interpret what these individual scores reveal about a model's overall skills. Specifically, as a community we lack understanding of how tasks relate to one another, what they measure in common, how they differ, or which ones are redundant. As a result, models are often assessed via a single score averaged across benchmarks, an approach that fails to capture the models' holistic strengths and limitations. Here, we propose a new evaluation paradigm that uses factor analysis to identify latent skills driving performance across benchmarks. We apply this method to a comprehensive new leaderboard showcasing the performance of 60 LLMs on 44 tasks, and identify a small set of latent skills that largely explain performance. Finally, we turn these insights into practical tools that identify redundant tasks, aid in model selection, and profile models along each latent skill.

Title: Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models

Authors: Yi Feng, Jiaqi Wang, Wenxuan Zhang, Zhuang Chen, Yutong Shen, Xiyao Xiao, Minlie Huang, Liping Jing, Jian Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20241
Pdf URL: https://arxiv.org/pdf/2507.20241
Copy Paste: [[2507.20241]] Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models(https://arxiv.org/abs/2507.20241)
Keywords: language model, llm
Abstract: Recent progress in large language models (LLMs) has opened new possibilities for mental health support, yet current approaches lack realism in simulating specialized psychotherapy and fail to capture therapeutic progression over time. Narrative therapy, which helps individuals transform problematic life stories into empowering alternatives, remains underutilized due to limited access and social stigma. We address these limitations through a comprehensive framework with two core components. First, INT (Interactive Narrative Therapist) simulates expert narrative therapists by planning therapeutic stages, guiding reflection levels, and generating contextually appropriate expert-like responses. Second, IMA (Innovative Moment Assessment) provides a therapy-centric evaluation method that quantifies effectiveness by tracking "Innovative Moments" (IMs), critical narrative shifts in client speech signaling therapy progress. Experimental results on 260 simulated clients and 230 human participants reveal that INT consistently outperforms standard LLMs in therapeutic quality and depth. We further demonstrate the effectiveness of INT in synthesizing high-quality support conversations to facilitate social applications.

Title: Modeling Professionalism in Expert Questioning through Linguistic Differentiation

Authors: Giulia D'Agostino, Chung-Chi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20249
Pdf URL: https://arxiv.org/pdf/2507.20249
Copy Paste: [[2507.20249]] Modeling Professionalism in Expert Questioning through Linguistic Differentiation(https://arxiv.org/abs/2507.20249)
Keywords: language model, llm
Abstract: Professionalism is a crucial yet underexplored dimension of expert communication, particularly in high-stakes domains like finance. This paper investigates how linguistic features can be leveraged to model and evaluate professionalism in expert questioning. We introduce a novel annotation framework to quantify structural and pragmatic elements in financial analyst questions, such as discourse regulators, prefaces, and request types. Using both human-authored and large language model (LLM)-generated questions, we construct two datasets: one annotated for perceived professionalism and one labeled by question origin. We show that the same linguistic features correlate strongly with both human judgments and authorship origin, suggesting a shared stylistic foundation. Furthermore, a classifier trained solely on these interpretable features outperforms gemini-2.0 and SVM baselines in distinguishing expert-authored questions. Our findings demonstrate that professionalism is a learnable, domain-general construct that can be captured through linguistically grounded modeling.

Title: Post-Completion Learning for Language Models

Authors: Xiang Fei, Siqi Wang, Shu Wei, Yuxiang Nie, Wei Shi, Hao Feng, Can Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.20252
Pdf URL: https://arxiv.org/pdf/2507.20252
Copy Paste: [[2507.20252]] Post-Completion Learning for Language Models(https://arxiv.org/abs/2507.20252)
Keywords: language model
Abstract: Current language model training paradigms typically terminate learning upon reaching the end-of-sequence token, overlooking the potential learning opportunities in the post-completion space. We propose Post-Completion Learning (PCL), a novel training framework that systematically utilizes the sequence space after model output completion to enhance both reasoning and self-evaluation abilities. PCL enables models to continue generating self-assessments and reward predictions during training, while maintaining efficient inference by stopping at the completion point. To fully utilize this post-completion space, we design a white-box reinforcement learning method: let the model evaluate the output content according to the reward rules, then calculate and align the score with the reward functions for supervision. We implement dual-track SFT to optimize both reasoning and evaluation capabilities, and mix it with RL training to achieve multi-objective hybrid optimization. Experimental results on different datasets and models demonstrate consistent improvements over traditional SFT and RL methods. Our method provides a new technical path for language model training that enhances output quality while preserving deployment efficiency.

Title: EMBRACE: Shaping Inclusive Opinion Representation by Aligning Implicit Conversations with Social Norms

Authors: Abeer Aldayel, Areej Alokaili
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20264
Pdf URL: https://arxiv.org/pdf/2507.20264
Copy Paste: [[2507.20264]] EMBRACE: Shaping Inclusive Opinion Representation by Aligning Implicit Conversations with Social Norms(https://arxiv.org/abs/2507.20264)
Keywords: language model
Abstract: Shaping inclusive representations that embrace diversity and ensure fair participation and reflections of values is at the core of many conversation-based models. However, many existing methods rely on surface inclusion using mention of user demographics or behavioral attributes of social groups. Such methods overlook the nuanced, implicit expression of opinion embedded in conversations. Furthermore, the over-reliance on overt cues can exacerbate misalignment and reinforce harmful or stereotypical representations in model outputs. Thus, we took a step back and recognized that equitable inclusion needs to account for the implicit expression of opinion and use the stance of responses to validate the normative alignment. This study aims to evaluate how opinions are represented in NLP or computational models by introducing an alignment evaluation framework that foregrounds implicit, often overlooked conversations and evaluates the normative social views and discourse. Our approach models the stance of responses as a proxy for the underlying opinion, enabling a considerate and reflective representation of diverse social viewpoints. We evaluate the framework using both (i) positive-unlabeled (PU) online learning with base classifiers, and (ii) instruction-tuned language models to assess post-training alignment. Through this, we provide a lens on how implicit opinions are (mis)represented and offer a pathway toward more inclusive model behavior.

Title: MoL-RL: Distilling Multi-Step Environmental Feedback into LLMs for Feedback-Independent Reasoning

Authors: Kang Yang, Jingxue Chen, Qingkun Tang, Tianxiang Zhang, Qianchun Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20278
Pdf URL: https://arxiv.org/pdf/2507.20278
Copy Paste: [[2507.20278]] MoL-RL: Distilling Multi-Step Environmental Feedback into LLMs for Feedback-Independent Reasoning(https://arxiv.org/abs/2507.20278)
Keywords: language model, llm, chain-of-thought, agent
Abstract: Large language models (LLMs) face significant challenges in effectively leveraging sequential environmental feedback (EF) signals, such as natural language evaluations, for feedback-independent chain-of-thought (CoT) reasoning. Existing approaches either convert EF into scalar rewards, losing rich contextual information, or employ refinement datasets, failing to exploit the multi-step and discrete nature of EF interactions. To address these limitations, we propose MoL-RL, a novel training paradigm that integrates multi-step EF signals into LLMs through a dual-objective optimization framework. Our method combines MoL (Mixture-of-Losses) continual training, which decouples domain-specific EF signals (optimized via cross-entropy loss) and general language capabilities (preserved via Kullback-Leibler divergence), with GRPO-based post-training to distill sequential EF interactions into single-step inferences. This synergy enables robust feedback-independent reasoning without relying on external feedback loops. Experimental results on mathematical reasoning (MATH-500, AIME24/AIME25) and code generation (CodeAgent-Test) benchmarks demonstrate that MoL-RL achieves state-of-the-art performance with the Qwen3-8B model, while maintaining strong generalization across model scales (Qwen3-4B). This work provides a promising approach for leveraging multi-step textual feedback to enhance LLMs' reasoning capabilities in diverse domains.
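The mixture-of-losses idea in this entry (cross-entropy on domain-specific feedback data, Kullback-Leibler divergence to preserve general language ability) can be sketched on toy next-token distributions. The batch layout, the KL direction (reference to model), and the `kl_weight` parameter are assumptions for illustration and may differ from the paper's formulation.

```python
import math

def cross_entropy(model_probs, target_index):
    """Negative log-likelihood of the target token under the model."""
    return -math.log(model_probs[target_index])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mol_loss(batch, kl_weight=1.0):
    """Average a per-example loss chosen by data source.

    Domain examples carry a supervised target and use cross-entropy.
    General examples carry reference-model probabilities and use KL, which
    penalizes drifting away from the reference model's behavior.
    """
    total = 0.0
    for example in batch:
        if example["kind"] == "domain":
            total += cross_entropy(example["model_probs"], example["target"])
        else:
            total += kl_weight * kl_divergence(example["ref_probs"],
                                               example["model_probs"])
    return total / len(batch)

# Toy batch over a three-token vocabulary.
batch = [
    {"kind": "domain", "model_probs": [0.7, 0.2, 0.1], "target": 0},
    {"kind": "general", "model_probs": [0.5, 0.3, 0.2],
     "ref_probs": [0.4, 0.4, 0.2]},
]
loss = mol_loss(batch)
```

Routing each example to exactly one loss term is what keeps the two objectives decoupled in this sketch: general-domain text never receives a cross-entropy gradient, and feedback data is never pulled toward the reference model.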

Title: What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations

Authors: Katharina Trinley, Toshiki Nakai, Tatiana Anikina, Tanja Baeumel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20279
Pdf URL: https://arxiv.org/pdf/2507.20279
Copy Paste: [[2507.20279]] What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations(https://arxiv.org/abs/2507.20279)
Keywords: language model, llm
Abstract: Large language models (LLMs) excel at multilingual tasks, yet their internal language processing remains poorly understood. We analyze how Aya-23-8B, a decoder-only LLM trained on balanced multilingual data, handles code-mixed, cloze, and translation tasks compared to predominantly monolingual models like Llama 3 and Chinese-LLaMA-2. Using logit lens and neuron specialization analyses, we find: (1) Aya-23 activates typologically related language representations during translation, unlike English-centric models that rely on a single pivot language; (2) code-mixed neuron activation patterns vary with mixing rates and are shaped more by the base language than the mixed-in one; and (3) Aya-23's language-specific neurons for code-mixed inputs concentrate in final layers, diverging from prior findings on decoder-only models. Neuron overlap analysis further shows that script similarity and typological relations impact processing across model types. These findings reveal how multilingual training shapes LLM internals and inform future cross-lingual transfer research.

Title: Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation

Authors: Abdullah Alabdullah, Lifeng Han, Chenghua Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20301
Pdf URL: https://arxiv.org/pdf/2507.20301
Copy Paste: [[2507.20301]] Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation(https://arxiv.org/abs/2507.20301)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Dialectal Arabic (DA) poses a persistent challenge for natural language processing (NLP), as most everyday communication in the Arab world occurs in dialects that diverge significantly from Modern Standard Arabic (MSA). This linguistic divide limits access to digital services and educational resources and impedes progress in Arabic machine translation. This paper presents two core contributions to advancing DA-MSA translation for the Levantine, Egyptian, and Gulf dialects, particularly in low-resource and computationally constrained settings: a comprehensive evaluation of training-free prompting techniques, and the development of a resource-efficient fine-tuning pipeline. Our evaluation of prompting strategies across six large language models (LLMs) found that few-shot prompting consistently outperformed zero-shot, chain-of-thought, and our proposed Ara-TEaR method. GPT-4o achieved the highest performance across all prompting settings. For fine-tuning, a quantized Gemma2-9B model achieved a CHrF++ score of 49.88, outperforming zero-shot GPT-4o (44.58). Joint multi-dialect trained models outperformed single-dialect counterparts by over 10% CHrF++, and 4-bit quantization reduced memory usage by 60% with less than 1% performance loss. The results and insights of our experiments offer a practical blueprint for improving dialectal inclusion in Arabic NLP, showing that high-quality DA-MSA machine translation is achievable even with limited resources and paving the way for more inclusive language technologies.

Title: RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing

Authors: Hao Xiang, Tianyi Tang, Yang Su, Bowen Yu, An Yang, Fei Huang, Yichang Zhang, Yaojie Lu, Hongyu Lin, Xianpei Han, Jingren Zhou, Junyang Lin, Le Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20352
Pdf URL: https://arxiv.org/pdf/2507.20352
Copy Paste: [[2507.20352]] RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing(https://arxiv.org/abs/2507.20352)
Keywords: language model, llm
Abstract: Recent advancements in Large Language Models (LLMs) have shown outstanding potential for role-playing applications. Evaluating these capabilities is becoming crucial yet remains challenging. Existing benchmarks mostly adopt a character-centric approach, simplify user-character interactions to isolated Q&A tasks, and fail to reflect real-world applications. To address this limitation, we introduce RMTBench, a comprehensive user-centric bilingual role-playing benchmark featuring 80 diverse characters and over 8,000 dialogue rounds. RMTBench includes custom characters with detailed backgrounds and abstract characters defined by simple traits, enabling evaluation across various user scenarios. Our benchmark constructs dialogues based on explicit user motivations rather than character descriptions, ensuring alignment with practical user applications. Furthermore, we construct an authentic multi-turn dialogue simulation mechanism. With carefully selected evaluation dimensions and LLM-based scoring, this mechanism captures the complex intention of conversations between the user and the character. By shifting focus from character background to user intention fulfillment, RMTBench bridges the gap between academic evaluation and practical deployment requirements, offering a more effective framework for assessing role-playing capabilities in LLMs. All code and datasets will be released soon.

Title: Length Representations in Large Language Models

Authors: Sangjun Moon, Dasom Choi, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20398
Pdf URL: https://arxiv.org/pdf/2507.20398
Copy Paste: [[2507.20398]] Length Representations in Large Language Models(https://arxiv.org/abs/2507.20398)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have shown remarkable capabilities across various tasks, which are learned from massive amounts of text-based data. Although LLMs can control output sequence length, particularly in instruction-based settings, the internal mechanisms behind this control remain unexplored. In this study, we provide empirical evidence on how output sequence length information is encoded within the internal representations in LLMs. In particular, our findings show that multi-head attention mechanisms are critical in determining output sequence length, which can be adjusted in a disentangled manner. By scaling specific hidden units within the model, we can control the output sequence length without losing the informativeness of the generated text, thereby indicating that length information is partially disentangled from semantic information. Moreover, some hidden units become increasingly active as prompts become more length-specific, thus reflecting the model's internal awareness of this attribute. Our findings suggest that LLMs have learned robust and adaptable internal mechanisms for controlling output length without any external control.

Title: Cognitive Chain-of-Thought: Structured Multimodal Reasoning about Social Situations

Authors: Eunkyu Park, Wesley Hanwen Deng, Gunhee Kim, Motahhare Eslami, Maarten Sap
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2507.20409
Pdf URL: https://arxiv.org/pdf/2507.20409
Copy Paste: [[2507.20409]] Cognitive Chain-of-Thought: Structured Multimodal Reasoning about Social Situations(https://arxiv.org/abs/2507.20409)
Keywords: prompt, chain-of-thought
Abstract: Chain-of-Thought (CoT) prompting helps models think step by step. But what happens when they must see, understand, and judge, all at once? In visual tasks grounded in social context, where bridging perception with norm-grounded judgments is essential, flat CoT often breaks down. We introduce Cognitive Chain-of-Thought (CoCoT), a prompting strategy that scaffolds VLM reasoning through three cognitively inspired stages: perception, situation, and norm. Our experiments show that, across multiple multimodal benchmarks (including intent disambiguation, commonsense reasoning, and safety), CoCoT consistently outperforms CoT and direct prompting (+8% on average). Our findings demonstrate that cognitively grounded reasoning stages enhance interpretability and social awareness in VLMs, paving the way for safer and more reliable multimodal systems.

Title: CONCAP: Seeing Beyond English with Concepts Retrieval-Augmented Captioning

Authors: George Ibrahim, Rita Ramos, Yova Kementchedjhieva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20411
Pdf URL: https://arxiv.org/pdf/2507.20411
Copy Paste: [[2507.20411]] CONCAP: Seeing Beyond English with Concepts Retrieval-Augmented Captioning(https://arxiv.org/abs/2507.20411)
Keywords: language model, retrieval-augmented generation
Abstract: Multilingual vision-language models have made significant strides in image captioning, yet they still lag behind their English counterparts due to limited multilingual training data and costly large-scale model parameterization. Retrieval-augmented generation (RAG) offers a promising alternative by conditioning caption generation on retrieved examples in the target language, reducing the need for extensive multilingual training. However, multilingual RAG captioning models often depend on retrieved captions translated from English, which can introduce mismatches and linguistic biases relative to the source language. We introduce CONCAP, a multilingual image captioning model that integrates retrieved captions with image-specific concepts, enhancing the contextualization of the input image and grounding the captioning process across different languages. Experiments on the XM3600 dataset indicate that CONCAP enables strong performance on low- and mid-resource languages, with highly reduced data requirements. Our findings highlight the effectiveness of concept-aware retrieval augmentation in bridging multilingual performance gaps.

Title: CodeNER: Code Prompting for Named Entity Recognition

Authors: Sungwoo Han, Hyeyeon Kim, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.20423
Pdf URL: https://arxiv.org/pdf/2507.20423
Copy Paste: [[2507.20423]] CodeNER: Code Prompting for Named Entity Recognition(https://arxiv.org/abs/2507.20423)
Keywords: language model, gpt, llm, prompt, chat, chain-of-thought
Abstract: Recent studies have explored various approaches for treating candidate named entity spans as both source and target sequences in named entity recognition (NER) by leveraging large language models (LLMs). Although previous approaches have successfully generated candidate named entity spans with suitable labels, they rely solely on input context information when using LLMs, particularly ChatGPT. However, NER inherently requires capturing detailed labeling requirements with input context information. To address this issue, we propose a novel method that leverages code-based prompting to improve the capabilities of LLMs in understanding and performing NER. By embedding code within prompts, we provide detailed BIO schema instructions for labeling, thereby exploiting the ability of LLMs to comprehend long-range scopes in programming languages. Experimental results demonstrate that the proposed code-based prompting method outperforms conventional text-based prompting on ten benchmarks across English, Arabic, Finnish, Danish, and German datasets, indicating the effectiveness of explicitly structuring NER instructions. We also verify that combining the proposed code-based prompting method with chain-of-thought prompting further improves performance.
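
For readers unfamiliar with the BIO schema mentioned in the abstract, the following is a minimal sketch of how BIO tags map to entity spans. It illustrates only the tagging convention, not CodeNER's prompts or method; the function name `bio_to_spans` and the sample sentence are hypothetical.

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_type, text) spans from parallel token and BIO tag lists."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag continues the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes any open entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical example sentence and tags.
tokens = ["Ada", "Lovelace", "visited", "London", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))
```

In this convention, "B-" marks the first token of an entity, "I-" a continuation, and "O" a token outside any entity.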

Title: Speaking in Words, Thinking in Logic: A Dual-Process Framework in QA Systems

Authors: Tuan Bui, Trong Le, Phat Thai, Sang Nguyen, Minh Hua, Ngan Pham, Thang Bui, Tho Quan
Subjects: cs.CL, cs.AI, cs.SC
Abstract URL: https://arxiv.org/abs/2507.20491
Pdf URL: https://arxiv.org/pdf/2507.20491
Copy Paste: [[2507.20491]] Speaking in Words, Thinking in Logic: A Dual-Process Framework in QA Systems(https://arxiv.org/abs/2507.20491)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) have significantly enhanced question-answering (QA) capabilities, particularly in open-domain contexts. However, in closed-domain scenarios such as education, healthcare, and law, users demand not only accurate answers but also transparent reasoning and explainable decision-making processes. While neural-symbolic (NeSy) frameworks have emerged as a promising solution, leveraging LLMs for natural language understanding and symbolic systems for formal reasoning, existing approaches often rely on large-scale models and exhibit inefficiencies in translating natural language into formal logic representations. To address these limitations, we introduce Text-JEPA (Text-based Joint-Embedding Predictive Architecture), a lightweight yet effective framework for converting natural language into first-order logic (NL2FOL). Drawing inspiration from dual-system cognitive theory, Text-JEPA emulates System 1 by efficiently generating logic representations, while the Z3 solver operates as System 2, enabling robust logical inference. To rigorously evaluate the NL2FOL-to-reasoning pipeline, we propose a comprehensive evaluation framework comprising three custom metrics: conversion score, reasoning score, and Spearman rho score, which collectively capture the quality of logical translation and its downstream impact on reasoning accuracy. Empirical results on domain-specific datasets demonstrate that Text-JEPA achieves competitive performance with significantly lower computational overhead compared to larger LLM-based systems. Our findings highlight the potential of structured, interpretable reasoning frameworks for building efficient and explainable QA systems in specialized domains.

Title: AQUA: A Large Language Model for Aquaculture & Fisheries

Authors: Praneeth Narisetty, Uday Kumar Reddy Kattamanchi, Lohit Akshant Nimma, Sri Ram Kaushik Karnati, Shiva Nagendra Babu Kore, Mounika Golamari, Tejashree Nageshreddy
Subjects: cs.CL, cs.AI, cs.CE, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2507.20520
Pdf URL: https://arxiv.org/pdf/2507.20520
Copy Paste: [[2507.20520]] AQUA: A Large Language Model for Aquaculture & Fisheries(https://arxiv.org/abs/2507.20520)
Keywords: language model, llm, agent
Abstract: Aquaculture plays a vital role in global food security and coastal economies by providing sustainable protein sources. As the industry expands to meet rising demand, it faces growing challenges such as disease outbreaks, inefficient feeding practices, rising labor costs, logistical inefficiencies, and critical hatchery issues, including high mortality rates and poor water quality control. Although artificial intelligence has made significant progress, existing machine learning methods fall short of addressing the domain-specific complexities of aquaculture. To bridge this gap, we introduce AQUA, the first large language model (LLM) tailored for aquaculture, designed to support farmers, researchers, and industry practitioners. Central to this effort is AQUADAPT (Data Acquisition, Processing and Tuning), an agentic framework for generating and refining high-quality synthetic data using a combination of expert knowledge, large-scale language models, and automated evaluation techniques. Our work lays the foundation for LLM-driven innovations in aquaculture research, advisory systems, and decision-making tools.

Title: SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers

Authors: Chaitanya Manem, Pratik Prabhanjan Brahma, Prakamya Mishra, Zicheng Liu, Emad Barsoum
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20527
Pdf URL: https://arxiv.org/pdf/2507.20527
Copy Paste: [[2507.20527]] SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers(https://arxiv.org/abs/2507.20527)
Keywords: language model, llm
Abstract: The demand for Large Language Models (LLMs) capable of sophisticated mathematical reasoning is growing across industries. However, the development of performant mathematical LLMs is critically bottlenecked by the scarcity of difficult, novel training data. We introduce SAND-Math (Synthetic Augmented Novel and Difficult Mathematics problems and solutions), a pipeline that addresses this by first generating high-quality problems from scratch and then systematically elevating their complexity via a new Difficulty Hiking step. We demonstrate the effectiveness of our approach through two key findings. First, augmenting a strong baseline with SAND-Math data significantly boosts performance, outperforming the next-best synthetic dataset by 17.85 absolute points on the AIME25 benchmark. Second, in a dedicated ablation study, we show our Difficulty Hiking process is highly effective: by increasing average problem difficulty from 5.02 to 5.98, this step lifts AIME25 performance from 46.38% to 49.23%. The full generation pipeline, final dataset, and a fine-tuned model form a practical and scalable toolkit for building more capable and efficient mathematical reasoning LLMs. The SAND-Math dataset is released here: this https URL

Title: Enhancing Hallucination Detection via Future Context

Authors: Joosung Lee, Cheonbok Park, Hwiyeol Jo, Jeonghoon Kim, Joonsuk Park, Kang Min Yoo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.20546
Pdf URL: https://arxiv.org/pdf/2507.20546
Copy Paste: [[2507.20546]] Enhancing Hallucination Detection via Future Context(https://arxiv.org/abs/2507.20546)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) are widely used to generate plausible text on online platforms, without revealing the generation process. As users increasingly encounter such black-box outputs, detecting hallucinations has become a critical challenge. To address this challenge, we focus on developing a hallucination detection framework for black-box generators. Motivated by the observation that hallucinations, once introduced, tend to persist, we sample future contexts. The sampled future contexts provide valuable clues for hallucination detection and can be effectively integrated with various sampling-based methods. We extensively demonstrate performance improvements across multiple methods using our proposed sampling approach.

Title: ZSE-Cap: A Zero-Shot Ensemble for Image Retrieval and Prompt-Guided Captioning

Authors: Duc-Tai Dinh, Duc Anh Khoa Dinh
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2507.20564
Pdf URL: https://arxiv.org/pdf/2507.20564
Copy Paste: [[2507.20564]] ZSE-Cap: A Zero-Shot Ensemble for Image Retrieval and Prompt-Guided Captioning(https://arxiv.org/abs/2507.20564)
Keywords: prompt
Abstract: We present ZSE-Cap (Zero-Shot Ensemble for Captioning), our 4th place system in the Event-Enriched Image Analysis (EVENTA) shared task on article-grounded image retrieval and captioning. Our zero-shot approach requires no finetuning on the competition's data. For retrieval, we ensemble similarity scores from CLIP, SigLIP, and DINOv2. For captioning, we leverage a carefully engineered prompt to guide the Gemma 3 model, enabling it to link high-level events from the article to the visual content in the image. Our system achieved a final score of 0.42002, securing a top-4 position on the private test set, demonstrating the effectiveness of combining foundation models through ensembling and prompting. Our code is available at this https URL.
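
The abstract does not say how the similarity scores are combined, so the following is only a sketch of one common way to ensemble scores from several models: z-normalize each model's scores so they share a scale, then average them per candidate. The fusion rule, the function names, and the numbers are assumptions for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def zscore(scores):
    """Standardize one model's scores so different models share a scale."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def ensemble_rank(per_model_scores):
    """Return candidate indices sorted by mean z-scored similarity, best first."""
    normalized = [zscore(scores) for scores in per_model_scores.values()]
    fused = [mean(column) for column in zip(*normalized)]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


# Hypothetical similarity scores for three candidate images (not real data).
scores = {
    "clip": [0.31, 0.28, 0.22],
    "siglip": [0.12, 0.15, 0.05],
    "dinov2": [0.60, 0.55, 0.20],
}
print(ensemble_rank(scores))
```

Normalizing before averaging keeps a model with a wider raw score range from dominating the fused ranking.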

Title: Before the Outrage: Challenges and Advances in Predicting Online Antisocial Behavior

Authors: Anaïs Ollagnier (CRISAM,CNRS,MARIANNE)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20614
Pdf URL: https://arxiv.org/pdf/2507.20614
Copy Paste: [[2507.20614]] Before the Outrage: Challenges and Advances in Predicting Online Antisocial Behavior(https://arxiv.org/abs/2507.20614)
Keywords: language model
Abstract: Antisocial behavior (ASB) on social media, including hate speech, harassment, and trolling, poses growing challenges for platform safety and societal wellbeing. While prior work has primarily focused on detecting harmful content after it appears, predictive approaches aim to forecast future harmful behaviors, such as hate speech propagation, conversation derailment, or user recidivism, before they fully unfold. Despite increasing interest, the field remains fragmented, lacking a unified taxonomy or clear synthesis of existing methods. This paper presents a systematic review of over 49 studies on ASB prediction, offering a structured taxonomy of five core task types: early harm detection, harm emergence prediction, harm propagation prediction, behavioral risk prediction, and proactive moderation support. We analyze how these tasks differ by temporal framing, prediction granularity, and operational goals. In addition, we examine trends in modeling techniques, from classical machine learning to pre-trained language models, and assess the influence of dataset characteristics on task feasibility and generalization. Our review highlights methodological challenges, such as dataset scarcity, temporal drift, and limited benchmarks, while outlining emerging research directions including multilingual modeling, cross-platform generalization, and human-in-the-loop systems. By organizing the field around a coherent framework, this survey aims to guide future work toward more robust and socially responsible ASB prediction.

Title: Ontology-Enhanced Knowledge Graph Completion using Large Language Models

Authors: Wenbin Guo, Xin Wang, Jiaoyan Chen, Zhao Li, Zirui Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.20643
Pdf URL: https://arxiv.org/pdf/2507.20643
Copy Paste: [[2507.20643]] Ontology-Enhanced Knowledge Graph Completion using Large Language Models(https://arxiv.org/abs/2507.20643)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have been extensively adopted in Knowledge Graph Completion (KGC), showcasing significant research advancements. However, as black-box models driven by deep neural architectures, current LLM-based KGC methods rely on implicit knowledge representation with parallel propagation of erroneous knowledge, thereby hindering their ability to produce conclusive and decisive reasoning outcomes. We aim to integrate neural-perceptual structural information with ontological knowledge, leveraging the powerful capabilities of LLMs to achieve a deeper understanding of the intrinsic logic of the knowledge. We propose an ontology-enhanced KGC method using LLMs, OL-KGC. It first leverages neural perceptual mechanisms to effectively embed structural information into the textual space, and then uses an automated extraction algorithm to retrieve ontological knowledge from the knowledge graphs (KGs) that need to be completed, which is further transformed into a textual format comprehensible to LLMs for providing logic guidance. We conducted extensive experiments on three widely used benchmarks: FB15K-237, UMLS, and WN18RR. The experimental results demonstrate that OL-KGC significantly outperforms existing mainstream KGC methods across multiple evaluation metrics, achieving state-of-the-art performance.

Title: Geometric-Mean Policy Optimization

Authors: Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, Fang Wan, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20673
Pdf URL: https://arxiv.org/pdf/2507.20673
Copy Paste: [[2507.20673]] Geometric-Mean Policy Optimization(https://arxiv.org/abs/2507.20673)
Keywords: language model
Abstract: Recent advancements, such as Group Relative Policy Optimization (GRPO), have enhanced the reasoning capabilities of large language models by optimizing the arithmetic mean of token-level rewards. However, GRPO suffers from unstable policy updates when processing tokens with outlier importance-weighted rewards, which manifests as extreme importance sampling ratios during training, i.e., the ratio between the sampling probabilities assigned to a token by the current and old policies. In this work, we propose Geometric-Mean Policy Optimization (GMPO), a stabilized variant of GRPO. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and maintains a more stable range of importance sampling ratios. In addition, we provide comprehensive theoretical and experimental analysis to justify the design and stability benefits of GMPO. Beyond improved stability, GMPO-7B outperforms GRPO by an average of 4.1% on multiple mathematical benchmarks and 1.4% on a multimodal reasoning benchmark, including AIME24, AMC, MATH500, OlympiadBench, Minerva, and Geometry3K. Code is available at this https URL.
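
The claim that a geometric mean is less sensitive to outliers than an arithmetic mean can be checked with a small numeric sketch. This compares only the two means on made-up values; it is not the GMPO objective, which is defined in the paper and not reproduced here. The sample ratios are hypothetical.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # Computed in log space for numerical stability; requires positive values.
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-token importance sampling ratios; the last one is an outlier.
ratios = [1.0, 1.0, 1.0, 10.0]
print(arithmetic_mean(ratios))  # 3.25
print(geometric_mean(ratios))   # 10 ** 0.25, roughly 1.78
```

A single outlier of 10 pulls the arithmetic mean to 3.25, while the geometric mean stays below 2, because the geometric mean averages in log space.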

Title: When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification

Authors: Hanna Shcharbakova, Tatiana Anikina, Natalia Skachkova, Josef van Genabith
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20700
Pdf URL: https://arxiv.org/pdf/2507.20700
Copy Paste: [[2507.20700]] When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification(https://arxiv.org/abs/2507.20700)
Keywords: language model, llm, prompt
Abstract: The rapid spread of multilingual misinformation requires robust automated fact verification systems capable of handling fine-grained veracity assessments across diverse languages. While large language models have shown remarkable capabilities across many NLP tasks, their effectiveness for multilingual claim verification with nuanced classification schemes remains understudied. We conduct a comprehensive evaluation of five state-of-the-art language models on the X-Fact dataset, which spans 25 languages with seven distinct veracity categories. Our experiments compare small language models (encoder-based XLM-R and mT5) with recent decoder-only LLMs (Llama 3.1, Qwen 2.5, Mistral Nemo) using both prompting and fine-tuning approaches. Surprisingly, we find that XLM-R (270M parameters) substantially outperforms all tested LLMs (7-12B parameters), achieving 57.7% macro-F1 compared to the best LLM performance of 16.9%. This represents a 15.8-point improvement over the previous state-of-the-art (41.9%), establishing new performance benchmarks for multilingual fact verification. Our analysis reveals problematic patterns in LLM behavior, including systematic difficulties in leveraging evidence and pronounced biases toward frequent categories in imbalanced data settings. These findings suggest that for fine-grained multilingual fact verification, smaller specialized models may be more effective than general-purpose large models, with important implications for practical deployment of fact-checking systems.
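
Macro-F1 is the unweighted mean of per-class F1 scores, which is why a bias toward frequent categories is penalized so heavily. The sketch below shows this on made-up labels; the label names and counts are hypothetical and unrelated to the X-Fact data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no hits scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical imbalanced data: the model always predicts the majority class.
y_true = ["true"] * 8 + ["false", "half-true"]
y_pred = ["true"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                  # 0.8
print(macro_f1(y_true, y_pred))  # roughly 0.30
```

Accuracy looks high at 0.8, but the two minority classes each contribute an F1 of 0, so macro-F1 falls to about 0.30.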

Title: Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models

Authors: Gabriel Downer, Sean Craven, Damian Ruck, Jake Thomas
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2507.20704
Pdf URL: https://arxiv.org/pdf/2507.20704
Copy Paste: [[2507.20704]] Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models(https://arxiv.org/abs/2507.20704)
Keywords: language model, prompt
Abstract: The increasing integration of Visual Language Models (VLMs) into AI systems necessitates robust model alignment, especially when handling multimodal content that combines text and images. Existing evaluation datasets lean heavily towards text-only prompts, leaving visual vulnerabilities under-evaluated. To address this gap, we propose Text2VLM, a novel multi-stage pipeline that adapts text-only datasets into multimodal formats, specifically designed to evaluate the resilience of VLMs against typographic prompt injection attacks. The Text2VLM pipeline identifies harmful content in the original text and converts it into a typographic image, creating a multimodal prompt for VLMs. Also, our evaluation of open-source VLMs highlights their increased susceptibility to prompt injection when visual inputs are introduced, revealing critical weaknesses in the current models' alignment. This is in addition to a significant performance gap compared to closed-source frontier models. We validate Text2VLM through human evaluations, ensuring that the extracted salient concepts, text summarization, and output classification align with human expectations. Text2VLM provides a scalable tool for comprehensive safety assessment, contributing to the development of more robust safety mechanisms for VLMs. By enhancing the evaluation of multimodal vulnerabilities, Text2VLM plays a role in advancing the safe deployment of VLMs in diverse, real-world applications.

Title: Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study

Authors: Yiran Huang, Lukas Thede, Massimiliano Mancini, Wenjia Xu, Zeynep Akata
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2507.20749
Pdf URL: https://arxiv.org/pdf/2507.20749
Copy Paste: [[2507.20749]] Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study(https://arxiv.org/abs/2507.20749)
Keywords: language model, llm
Abstract: While Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose significant barriers to practical deployment. Current parameter reduction techniques primarily involve training MLLMs from Small Language Models (SLMs), but these methods offer limited flexibility and remain computationally intensive. To address this gap, we propose to directly compress existing MLLMs through structural pruning combined with efficient recovery training. Specifically, we investigate two structural pruning paradigms, layerwise and widthwise pruning, applied to the language model backbone of MLLMs, alongside supervised finetuning and knowledge distillation. Additionally, we assess the feasibility of conducting recovery training with only a small fraction of the available data. Our results show that widthwise pruning generally maintains better performance in low-resource scenarios with limited computational resources or insufficient finetuning data. As for the recovery training, finetuning only the multimodal projector is sufficient at small compression levels (< 20%). Furthermore, a combination of supervised finetuning and hidden-state distillation yields optimal recovery across various pruning levels. Notably, effective recovery can be achieved with as little as 5% of the original training data, while retaining over 95% of the original performance. Through empirical study on two representative MLLMs, i.e., LLaVA-v1.5-7B and Bunny-v1.0-3B, this study offers actionable insights for practitioners aiming to compress MLLMs effectively without extensive computation resources or sufficient data.

Title: Multilingual Self-Taught Faithfulness Evaluators

Authors: Carlo Alfano, Aymen Al Marjani, Zeno Jonke, Amin Mantrach, Saab Mansour, Marcello Federico
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.20752
Pdf URL: https://arxiv.org/pdf/2507.20752
Copy Paste: [[2507.20752]] Multilingual Self-Taught Faithfulness Evaluators(https://arxiv.org/abs/2507.20752)
Keywords: language model, llm, hallucination
Abstract: The growing use of large language models (LLMs) has increased the need for automatic evaluation systems, particularly to address the challenge of information hallucination. Although existing faithfulness evaluation approaches have shown promise, they are predominantly English-focused and often require expensive human-labeled training data for fine-tuning specialized models. As LLMs see increased adoption in multilingual contexts, there is a need for accurate faithfulness evaluators that can operate across languages without extensive labeled data. This paper presents Self-Taught Evaluators for Multilingual Faithfulness, a framework that learns exclusively from synthetic multilingual summarization data while leveraging cross-lingual transfer learning. Through experiments comparing language-specific and mixed-language fine-tuning approaches, we demonstrate a consistent relationship between an LLM's general language capabilities and its performance in language-specific evaluation tasks. Our framework shows improvements over existing baselines, including state-of-the-art English evaluators and machine translation-based approaches.

Title: On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey

Authors: Meishan Zhang, Xin Zhang, Xinping Zhao, Shouzheng Huang, Baotian Hu, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20783
Pdf URL: https://arxiv.org/pdf/2507.20783
Copy Paste: [[2507.20783]] On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey(https://arxiv.org/abs/2507.20783)
Keywords: language model, gpt
Abstract: Text embeddings have attracted growing interest due to their effectiveness across a wide range of natural language processing (NLP) tasks, such as retrieval, classification, clustering, bitext mining, and summarization. With the emergence of pretrained language models (PLMs), general-purpose text embeddings (GPTE) have gained significant traction for their ability to produce rich, transferable representations. The general architecture of GPTE typically leverages PLMs to derive dense text representations, which are then optimized through contrastive learning on large-scale pairwise datasets. In this survey, we provide a comprehensive overview of GPTE in the era of PLMs, focusing on the roles PLMs play in driving its development. We first examine the fundamental architecture and describe the basic roles of PLMs in GPTE, i.e., embedding extraction, expressivity enhancement, training strategies, learning objectives, and data construction. Then, we describe advanced roles enabled by PLMs, such as multilingual support, multimodal integration, code understanding, and scenario-specific adaptation. Finally, we highlight potential future research directions that move beyond traditional improvement goals, including ranking integration, safety considerations, bias mitigation, structural information incorporation, and the cognitive extension of embeddings. This survey aims to serve as a valuable reference for both newcomers and established researchers seeking to understand the current state and future potential of GPTE.
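
The contrastive-learning step mentioned in the abstract above can be illustrated with a minimal sketch of an in-batch InfoNCE loss. This is a generic textbook formulation, not code from the survey; the function name `info_nce_loss`, the temperature of 0.05, and the random toy embeddings are all assumptions made for illustration.

```python
import numpy as np

def info_nce_loss(queries, positives, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss for paired text embeddings.

    queries, positives: arrays of shape (batch, dim); row i of each forms a
    positive pair, and every other row in the batch acts as a negative.
    """
    # L2-normalise so the dot product is cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = q @ p.T / temperature                 # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for query i is positive i (the diagonal).
    return float(-np.mean(np.diag(log_probs)))

# Toy embeddings (hypothetical data, only to exercise the function).
rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
aligned = anchors + 0.01 * rng.normal(size=(8, 16))  # near-copies of the anchors
shuffled = aligned[::-1]                             # pairs deliberately mismatched

loss_aligned = info_nce_loss(anchors, aligned)
loss_shuffled = info_nce_loss(anchors, shuffled)
```

The loss is low when each query is closest to its own positive and high when the pairs are mismatched, which is the signal that pulls paired texts together in embedding space.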

Title: Automating Thematic Review of Prevention of Future Deaths Reports: Replicating the ONS Child Suicide Study using Large Language Models

Authors: Sam Osian, Arpan Dutta, Sahil Bhandari, Iain E. Buchan, Dan W. Joyce
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20786
Pdf URL: https://arxiv.org/pdf/2507.20786
Copy Paste: [[2507.20786]] Automating Thematic Review of Prevention of Future Deaths Reports: Replicating the ONS Child Suicide Study using Large Language Models(https://arxiv.org/abs/2507.20786)
Keywords: language model, llm
Abstract: Prevention of Future Deaths (PFD) reports, issued by coroners in England and Wales, flag systemic hazards that may lead to further loss of life. Analysis of these reports has previously been constrained by the manual effort required to identify and code relevant cases. In 2025, the Office for National Statistics (ONS) published a national thematic review of child-suicide PFD reports ($\leq$ 18 years), identifying 37 cases from January 2015 to November 2023 - a process based entirely on manual curation and coding. We evaluated whether a fully automated, open source "text-to-table" language-model pipeline (PFD Toolkit) could reproduce the ONS's identification and thematic analysis of child-suicide PFD reports, and assessed gains in efficiency and reliability. All 4,249 PFD reports published from July 2013 to November 2023 were processed via PFD Toolkit's large language model pipelines. Automated screening identified cases where the coroner attributed death to suicide in individuals aged 18 or younger, and eligible reports were coded for recipient category and 23 concern sub-themes, replicating the ONS coding frame. PFD Toolkit identified 72 child-suicide PFD reports - almost twice the ONS count. Three blinded clinicians adjudicated a stratified sample of 144 reports to validate the child-suicide screening. Against the post-consensus clinical annotations, the LLM-based workflow showed substantial to almost-perfect agreement (Cohen's $\kappa$ = 0.82, 95% CI: 0.66-0.98, raw agreement = 91%). The end-to-end script runtime was 8m 16s, transforming a process that previously took months into one that can be completed in minutes. This demonstrates that automated LLM analysis can reliably and efficiently replicate manual thematic reviews of coronial data, enabling scalable, reproducible, and timely insights for public health and safety. The PFD Toolkit is openly available for future research.
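
For readers unfamiliar with the agreement statistic quoted in the abstract above, the following sketch shows how Cohen's kappa and raw agreement are computed from a 2x2 table of two raters' binary decisions. The four counts are hypothetical: they are not taken from the paper and were chosen only so that a 144-report sample lands near the reported figures.

```python
def cohens_kappa(both_yes, only_a, only_b, both_no):
    """Cohen's kappa for two raters making a binary decision.

    both_yes / both_no: items where the raters agree.
    only_a / only_b:    items that only rater A / only rater B marked positive.
    """
    n = both_yes + only_a + only_b + both_no
    observed = (both_yes + both_no) / n                  # raw agreement
    a_yes, b_yes = both_yes + only_a, both_yes + only_b  # marginal positives
    # Agreement expected by chance if the raters labelled independently.
    expected = (a_yes * b_yes + (n - a_yes) * (n - b_yes)) / n**2
    return observed, (observed - expected) / (1 - expected)

# Hypothetical counts for 144 adjudicated reports (NOT taken from the paper).
raw_agreement, kappa = cohens_kappa(both_yes=66, only_a=7, only_b=6, both_no=65)
print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.2f}")
```

Kappa is lower than raw agreement because it subtracts the agreement two raters would reach by chance given their marginal rates.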

Title: Latent Inter-User Difference Modeling for LLM Personalization

Authors: Yilun Qiu, Tianhao Shi, Xiaoyan Zhao, Fengbin Zhu, Yang Zhang, Fuli Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20849
Pdf URL: https://arxiv.org/pdf/2507.20849
Copy Paste: [[2507.20849]] Latent Inter-User Difference Modeling for LLM Personalization(https://arxiv.org/abs/2507.20849)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly integrated into users' daily lives, leading to a growing demand for personalized outputs. Previous work focuses on leveraging a user's own history, overlooking inter-user differences that are crucial for effective personalization. While recent work has attempted to model such differences, the reliance on language-based prompts often hampers the effective extraction of meaningful distinctions. To address these issues, we propose Difference-aware Embedding-based Personalization (DEP), a framework that models inter-user differences in the latent space instead of relying on language prompts. DEP constructs soft prompts by contrasting a user's embedding with those of peers who engaged with similar content, highlighting relative behavioral signals. A sparse autoencoder then filters and compresses both user-specific and difference-aware embeddings, preserving only task-relevant features before injecting them into a frozen LLM. Experiments on personalized review generation show that DEP consistently outperforms baseline methods across multiple metrics. Our code is publicly available.

Title: Leveraging Open-Source Large Language Models for Clinical Information Extraction in Resource-Constrained Settings

Authors: Luc Builtjes, Joeran Bosma, Mathias Prokop, Bram van Ginneken, Alessa Hering
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20859
Pdf URL: https://arxiv.org/pdf/2507.20859
Copy Paste: [[2507.20859]] Leveraging Open-Source Large Language Models for Clinical Information Extraction in Resource-Constrained Settings(https://arxiv.org/abs/2507.20859)
Keywords: language model, llm
Abstract: Medical reports contain rich clinical information but are often unstructured and written in domain-specific language, posing challenges for information extraction. While proprietary large language models (LLMs) have shown promise in clinical natural language processing, their lack of transparency and data privacy concerns limit their utility in healthcare. This study therefore evaluates nine open-source generative LLMs on the DRAGON benchmark, which includes 28 clinical information extraction tasks in Dutch. We developed llm_extractinator, a publicly available framework for information extraction using open-source generative LLMs, and used it to assess model performance in a zero-shot setting. Several 14-billion-parameter models, Phi-4-14B, Qwen-2.5-14B, and DeepSeek-R1-14B, achieved competitive results, while the larger Llama-3.3-70B model achieved slightly higher performance at greater computational cost. Translation to English prior to inference consistently degraded performance, highlighting the need for native-language processing. These findings demonstrate that open-source LLMs, when used with our framework, offer effective, scalable, and privacy-conscious solutions for clinical information extraction in low-resource settings.

Title: Soft Injection of Task Embeddings Outperforms Prompt-Based In-Context Learning

Authors: Jungwon Park, Wonjong Rhee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.20906
Pdf URL: https://arxiv.org/pdf/2507.20906
Copy Paste: [[2507.20906]] Soft Injection of Task Embeddings Outperforms Prompt-Based In-Context Learning(https://arxiv.org/abs/2507.20906)
Keywords: language model, llm, prompt
Abstract: In-Context Learning (ICL) enables Large Language Models (LLMs) to perform tasks by conditioning on input-output examples in the prompt, without requiring any update in model parameters. While widely adopted, it remains unclear whether prompting with multiple examples is the most effective and efficient way to convey task information. In this work, we propose Soft Injection of task embeddings. The task embeddings are constructed only once using few-shot ICL prompts and repeatedly used during inference. Soft injection is performed by softly mixing task embeddings with attention head activations using pre-optimized mixing parameters, referred to as soft head-selection parameters. This method not only allows a desired task to be performed without in-prompt demonstrations but also significantly outperforms existing ICL approaches while reducing memory usage and compute cost at inference time. An extensive evaluation is performed across 57 tasks and 12 LLMs, spanning four model families of sizes from 4B to 70B. Averaged across 57 tasks, our method outperforms 10-shot ICL by 10.1%-13.9% across 12 LLMs. Additional analyses show that our method also serves as an insightful tool for analyzing task-relevant roles of attention heads, revealing that task-relevant head positions selected by our method transfer across similar tasks but not across dissimilar ones -- underscoring the task-specific nature of head functionality. Our soft injection method opens a new paradigm for reducing prompt length and improving task performance by shifting task conditioning from the prompt space to the activation space.
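
One way to picture the "soft mixing" described in the abstract above is a per-head convex combination of the original attention-head activation and a precomputed task embedding. The abstract does not give the exact formula, so the blend below, the function name `soft_inject`, and the `alphas` parameter (standing in for the soft head-selection parameters) are assumptions, not the authors' implementation.

```python
import numpy as np

def soft_inject(head_activations, task_embeddings, alphas):
    """Blend per-head activations with precomputed task embeddings.

    head_activations, task_embeddings: shape (num_heads, head_dim).
    alphas: shape (num_heads,), each in [0, 1]; 0 keeps the original
    activation, 1 replaces it with the task embedding (assumed form).
    """
    alphas = np.asarray(alphas, dtype=float)[:, None]  # broadcast over head_dim
    return (1.0 - alphas) * head_activations + alphas * task_embeddings

heads = np.zeros((3, 4))   # stand-in activations for 3 heads
task = np.ones((3, 4))     # stand-in task embeddings
mixed = soft_inject(heads, task, alphas=[0.0, 0.25, 1.0])
```

Under this reading, a head with a mixing weight near zero is left untouched, while a head with a weight near one carries the task signal, which is why no in-prompt demonstrations are needed at inference time.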

Title: MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation

Authors: Adrien Bazoge
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.20917
Pdf URL: https://arxiv.org/pdf/2507.20917
Copy Paste: [[2507.20917]] MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation(https://arxiv.org/abs/2507.20917)
Keywords: language model
Abstract: This work introduces MediQAl, a French medical question answering dataset designed to evaluate the capabilities of language models in factual medical recall and reasoning over real-world clinical scenarios. MediQAl contains 32,603 questions sourced from French medical examinations across 41 medical subjects. The dataset includes three tasks: (i) Multiple-Choice Question with Unique answer, (ii) Multiple-Choice Question with Multiple answer, and (iii) Open-Ended Question with Short-Answer. Each question is labeled as Understanding or Reasoning, enabling a detailed analysis of models' cognitive capabilities. We validate the MediQAl dataset through extensive evaluation with 14 large language models, including recent reasoning-augmented models, and observe a significant performance gap between factual recall and reasoning tasks. Our evaluation provides a comprehensive benchmark for assessing language models' performance on French medical question answering, addressing a crucial gap in multilingual resources for the medical domain.

Title: FHSTP@EXIST 2025 Benchmark: Sexism Detection with Transparent Speech Concept Bottleneck Models

Authors: Roberto Labadie-Tamayo, Adrian Jaques Böck, Djordje Slijepčević, Xihui Chen, Andreas Babic, Matthias Zeppelzauer
Subjects: cs.CL, cs.AI, cs.CY, cs.SI
Abstract URL: https://arxiv.org/abs/2507.20924
Pdf URL: https://arxiv.org/pdf/2507.20924
Copy Paste: [[2507.20924]] FHSTP@EXIST 2025 Benchmark: Sexism Detection with Transparent Speech Concept Bottleneck Models(https://arxiv.org/abs/2507.20924)
Keywords: language model, llm
Abstract: Sexism has become widespread on social media and in online conversation. To help address this issue, the fifth Sexism Identification in Social Networks (EXIST) challenge is initiated at CLEF 2025. Among this year's international benchmarks, we concentrate on solving the first task aiming to identify and classify sexism in social media textual posts. In this paper, we describe our solutions and report results for three subtasks: Subtask 1.1 - Sexism Identification in Tweets, Subtask 1.2 - Source Intention in Tweets, and Subtask 1.3 - Sexism Categorization in Tweets. We implement three models to address each subtask which constitute three individual runs: Speech Concept Bottleneck Model (SCBM), Speech Concept Bottleneck Model with Transformer (SCBMT), and a fine-tuned XLM-RoBERTa transformer model. SCBM uses descriptive adjectives as human-interpretable bottleneck concepts. SCBM leverages large language models (LLMs) to encode input texts into a human-interpretable representation of adjectives, then used to train a lightweight classifier for downstream tasks. SCBMT extends SCBM by fusing adjective-based representation with contextual embeddings from transformers to balance interpretability and classification performance. Beyond competitive results, these two models offer fine-grained explanations at both instance (local) and class (global) levels. We also investigate how additional metadata, e.g., annotators' demographic profiles, can be leveraged. For Subtask 1.1, XLM-RoBERTa, fine-tuned on provided data augmented with prior datasets, ranks 6th for English and Spanish and 4th for English in the Soft-Soft evaluation. Our SCBMT achieves 7th for English and Spanish and 6th for Spanish.

Title: FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models

Authors: Likun Tan, Kuan-Wei Huang, Kevin Wu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.20930
Pdf URL: https://arxiv.org/pdf/2507.20930
Copy Paste: [[2507.20930]] FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models(https://arxiv.org/abs/2507.20930)
Keywords: language model, hallucination
Abstract: Hallucinations in large language models pose a critical challenge for applications requiring factual reliability, particularly in high-stakes domains such as finance. This work presents an effective approach for detecting and editing factually incorrect content in model-generated responses based on the provided context. Given a user-defined domain-specific error taxonomy, we construct a synthetic dataset by inserting tagged errors into financial question-answering corpora and then fine-tune four language models, Phi-4, Phi-4-mini, Qwen3-4B, and Qwen3-14B, to detect and edit these factual inaccuracies. Our best-performing model, fine-tuned Phi-4, achieves an 8% improvement in binary F1 score and a 30% gain in overall detection performance compared to OpenAI-o3. Notably, our fine-tuned Phi-4-mini model, despite having only 4 billion parameters, maintains competitive performance with just a 2% drop in binary detection and a 0.1% decline in overall detection compared to OpenAI-o3. Our work provides a practical solution for detecting and editing factual inconsistencies in financial text generation while introducing a generalizable framework that can enhance the trustworthiness and alignment of large language models across diverse applications beyond finance. Our code and data are publicly available.

Title: Mind the Gap: Conformative Decoding to Improve Output Diversity of Instruction-Tuned Large Language Models

Authors: Max Peeperkorn, Tom Kouwenhoven, Dan Brown, Anna Jordanous
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.20956
Pdf URL: https://arxiv.org/pdf/2507.20956
Copy Paste: [[2507.20956]] Mind the Gap: Conformative Decoding to Improve Output Diversity of Instruction-Tuned Large Language Models(https://arxiv.org/abs/2507.20956)
Keywords: language model, llm, prompt
Abstract: Instruction-tuning large language models (LLMs) reduces the diversity of their outputs, which has implications for many tasks, particularly for creative tasks. This paper investigates the "diversity gap" for a writing prompt narrative generation task. This gap emerges as measured by current diversity metrics for various open-weight and open-source LLMs. The results show significant decreases in diversity due to instruction-tuning. We explore the diversity loss at each fine-tuning stage for the OLMo and OLMo 2 models to further understand how output diversity is affected. The results indicate that DPO has the most substantial impact on diversity. Motivated by these findings, we present a new decoding strategy, conformative decoding, which guides an instruct model using its more diverse base model to reintroduce output diversity. We show that conformative decoding typically increases diversity and even maintains or improves quality.

Title: Memorization in Fine-Tuned Large Language Models

Authors: Danil Savine, Muni Sreenivas Pydi, Jamal Atif, Olivier Cappé
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.21009
Pdf URL: https://arxiv.org/pdf/2507.21009
Copy Paste: [[2507.21009]] Memorization in Fine-Tuned Large Language Models(https://arxiv.org/abs/2507.21009)
Keywords: language model, llm, prompt
Abstract: This study investigates the mechanisms and factors influencing memorization in fine-tuned large language models (LLMs), with a focus on the medical domain due to its privacy-sensitive nature. We examine how different aspects of the fine-tuning process affect a model's propensity to memorize training data, using the PHEE dataset of pharmacovigilance events. Our research employs two main approaches: a membership inference attack to detect memorized data, and a generation task with prompted prefixes to assess verbatim reproduction. We analyze the impact of adapting different weight matrices in the transformer architecture, the relationship between perplexity and memorization, and the effect of increasing the rank in low-rank adaptation (LoRA) fine-tuning. Key findings include: (1) Value and Output matrices contribute more significantly to memorization compared to Query and Key matrices; (2) Lower perplexity in the fine-tuned model correlates with increased memorization; (3) Higher LoRA ranks lead to increased memorization, but with diminishing returns at higher ranks. These results provide insights into the trade-offs between model performance and privacy risks in fine-tuned LLMs. Our findings have implications for developing more effective and responsible strategies for adapting large language models while managing data privacy concerns.
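
Finding (3) in the abstract above concerns the LoRA rank, so a small sketch of what the rank controls may help. It follows the standard LoRA formulation (a frozen weight plus a scaled low-rank update, with the up-projection initialised to zero); the matrix sizes, the `alpha` value, and the function name `lora_forward` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Apply a frozen weight W plus a rank-r update (alpha / r) * B @ A."""
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

d_out, d_in, rank = 64, 64, 8               # hypothetical layer size and rank
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, rank))                 # trainable, zero init

x = rng.normal(size=(2, d_in))
y_base = x @ W.T
y_lora = lora_forward(x, W, A, B, alpha=16)

trainable = A.size + B.size                 # rank * (d_in + d_out)
```

Because `B` starts at zero, the adapted layer initially matches the base layer exactly. The number of trainable parameters grows linearly with the rank, which is the adapter capacity that the abstract relates to memorization.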

Title: Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation

Authors: Jiaju Chen, Yuxuan Lu, Xiaojie Wang, Huimin Zeng, Jing Huang, Jiri Gesi, Ying Xu, Bingsheng Yao, Dakuo Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.21028
Pdf URL: https://arxiv.org/pdf/2507.21028
Copy Paste: [[2507.21028]] Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation(https://arxiv.org/abs/2507.21028)
Keywords: llm, agent
Abstract: Nearly all human work is collaborative; thus, the evaluation of real-world NLP applications often requires multiple dimensions that align with diverse human perspectives. As real human evaluator resources are often scarce and costly, the emerging "LLM-as-a-judge" paradigm sheds light on a promising approach to leverage LLM agents to believably simulate human evaluators. Yet, to date, existing LLM-as-a-judge approaches face two limitations: persona descriptions of agents are often arbitrarily designed, and the frameworks are not generalizable to other tasks. To address these challenges, we propose MAJ-EVAL, a Multi-Agent-as-Judge evaluation framework that can automatically construct multiple evaluator personas with distinct dimensions from relevant text documents (e.g., research papers), instantiate LLM agents with the personas, and engage in in-group debates with multiple agents to generate multi-dimensional feedback. Our evaluation experiments in both the educational and medical domains demonstrate that MAJ-EVAL can generate evaluation results that better align with human experts' ratings compared with conventional automated evaluation metrics and existing LLM-as-a-judge methods.