2025-08-01

Title: Large Language Models in the Travel Domain: An Industrial Experience

Authors: Sergio Di Meglio, Aniello Somma, Luigi Libero Lucio Starace, Fabio Scippacercola, Giancarlo Sperlì, Sergio Di Martino
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22910
Pdf URL: https://arxiv.org/pdf/2507.22910
Copy Paste: [[2507.22910]] Large Language Models in the Travel Domain: An Industrial Experience(https://arxiv.org/abs/2507.22910)
Keywords: language model, llm, hallucination, prompt
Abstract: Online property booking platforms are widely used and rely heavily on consistent, up-to-date information about accommodation facilities, often sourced from third-party providers. However, these external data sources are frequently affected by incomplete or inconsistent details, which can frustrate users and result in a loss of market share. In response to these challenges, we present an industrial case study involving the integration of Large Language Models (LLMs) into CALEIDOHOTELS, a property reservation platform developed by FERVENTO. We evaluate two well-known LLMs in this context: Mistral 7B, fine-tuned with QLoRA, and Mixtral 8x7B, used with a refined system prompt. Both models were assessed on their ability to generate consistent and homogeneous descriptions while minimizing hallucinations. Mixtral 8x7B outperformed Mistral 7B in terms of completeness (99.6% vs. 93%), precision (98.8% vs. 96%), and hallucination rate (1.2% vs. 4%), while producing shorter, more concise content (249 vs. 277 words on average). However, this came at a significantly higher computational cost: 50GB VRAM and $1.61/hour versus 5GB and $0.16/hour for Mistral 7B. Our findings provide practical insights into the trade-offs between model quality and resource efficiency, offering guidance for deploying LLMs in production environments and demonstrating their effectiveness in enhancing the consistency and reliability of accommodation data.
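The quality-versus-cost trade-off quoted in the abstract can be made concrete with a little arithmetic. The sketch below only restates the figures reported above; the variable names are illustrative and not taken from the paper.

```python
# Figures as quoted in the abstract; the dictionary keys are illustrative names.
mixtral = {"vram_gb": 50, "usd_per_hour": 1.61, "hallucination_pct": 1.2}
mistral = {"vram_gb": 5, "usd_per_hour": 0.16, "hallucination_pct": 4.0}

# How much more expensive the larger model is to run.
cost_ratio = mixtral["usd_per_hour"] / mistral["usd_per_hour"]
vram_ratio = mixtral["vram_gb"] / mistral["vram_gb"]

# Relative drop in hallucination rate when moving from Mistral 7B to Mixtral 8x7B.
relative_hallucination_reduction = (
    mistral["hallucination_pct"] - mixtral["hallucination_pct"]
) / mistral["hallucination_pct"]

print(f"hourly cost ratio: {cost_ratio:.2f}x")
print(f"VRAM ratio: {vram_ratio:.1f}x")
print(f"relative hallucination reduction: {relative_hallucination_reduction:.0%}")
```

Read this way, roughly ten times the hourly cost and memory buys a hallucination rate that is about 70% lower in relative terms.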

Title: ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing

Authors: Jinzhi Wang, Qingke Peng, Haozhou Li, Zeyuan Zeng, Qinfeng Song, Kaixuan Yang, Jiangbo Zhang, Yaoying Wang, Ruimeng Li, Biyi Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22911
Pdf URL: https://arxiv.org/pdf/2507.22911
Copy Paste: [[2507.22911]] ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing(https://arxiv.org/abs/2507.22911)
Keywords: language model, gpt, llm
Abstract: Electric power marketing customer service plays a critical role in addressing inquiries, complaints, and service requests. However, current systems, such as China's 95598 hotline, often struggle with slow response times, inflexible procedures, and limited accuracy in domain-specific tasks. While large language models (LLMs) like GPT-4o and Claude 3 demonstrate strong general capabilities, they lack the domain expertise and empathy required in this field. To bridge this gap, we introduce ElectriQ, the first benchmark designed to evaluate and enhance LLMs in electric power marketing scenarios. ElectriQ consists of a dialogue dataset covering six key service categories and introduces four evaluation metrics: professionalism, popularity, readability, and user-friendliness. We further incorporate a domain-specific knowledge base and propose a knowledge augmentation method to boost model performance. Experiments on 13 LLMs reveal that smaller models such as LLama3-8B, when fine-tuned and augmented, can surpass GPT-4o in terms of professionalism and user-friendliness. ElectriQ establishes a comprehensive foundation for developing LLMs tailored to the needs of power marketing services.

Title: A Language Model-Driven Semi-Supervised Ensemble Framework for Illicit Market Detection Across Deep/Dark Web and Social Platforms

Authors: Navid Yazdanjue, Morteza Rakhshaninejad, Hossein Yazdanjouei, Mohammad Sadegh Khorshidi, Mikko S. Niemela, Fang Chen, Amir H. Gandomi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.22912
Pdf URL: https://arxiv.org/pdf/2507.22912
Copy Paste: [[2507.22912]] A Language Model-Driven Semi-Supervised Ensemble Framework for Illicit Market Detection Across Deep/Dark Web and Social Platforms(https://arxiv.org/abs/2507.22912)
Keywords: language model
Abstract: Illegal marketplaces have increasingly shifted to concealed parts of the internet, including the deep and dark web, as well as platforms such as Telegram, Reddit, and Pastebin. These channels enable the anonymous trade of illicit goods including drugs, weapons, and stolen credentials. Detecting and categorizing such content remains challenging due to limited labeled data, the evolving nature of illicit language, and the structural heterogeneity of online sources. This paper presents a hierarchical classification framework that combines fine-tuned language models with a semi-supervised ensemble learning strategy to detect and classify illicit marketplace content across diverse platforms. We extract semantic representations using ModernBERT, a transformer model for long documents, finetuned on domain-specific data from deep and dark web pages, Telegram channels, Subreddits, and Pastebin pastes to capture specialized jargon and ambiguous linguistic patterns. In addition, we incorporate manually engineered features such as document structure, embedded patterns including Bitcoin addresses, emails, and IPs, and metadata, which complement language model embeddings. The classification pipeline operates in two stages. The first stage uses a semi-supervised ensemble of XGBoost, Random Forest, and SVM with entropy-based weighted voting to detect sales-related documents. The second stage further classifies these into drug, weapon, or credential sales. Experiments on three datasets, including our multi-source corpus, DUTA, and CoDA, show that our model outperforms several baselines, including BERT, ModernBERT, DarkBERT, ALBERT, Longformer, and BigBird. The model achieves an accuracy of 0.96489, an F1-score of 0.93467, and a TMCC of 0.95388, demonstrating strong generalization, robustness under limited supervision, and effectiveness in real-world illicit content detection.
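The abstract mentions entropy-based weighted voting but does not give the formula. One plausible reading, sketched below under that assumption, is to weight each classifier's predicted distribution by how confident it is (one minus its normalized entropy) and then average. The function names and the weighting rule are illustrative, not the authors' implementation.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_weighted_vote(prob_rows):
    """Combine per-classifier class probabilities for a single document.

    prob_rows: one probability list per classifier (e.g. XGBoost, Random
    Forest, SVM). Assumed weighting: weight = 1 - normalized entropy, so
    confident classifiers count for more.
    """
    n_classes = len(prob_rows[0])
    max_entropy = math.log(n_classes)
    weights = [1.0 - entropy(row) / max_entropy for row in prob_rows]
    total = sum(weights)
    if total == 0:  # every classifier is maximally uncertain: fall back to equal weights
        weights = [1.0] * len(prob_rows)
        total = float(len(prob_rows))
    combined = [
        sum(w * row[c] for w, row in zip(weights, prob_rows)) / total
        for c in range(n_classes)
    ]
    return combined.index(max(combined)), combined


# One confident vote for class 0 outweighs two near-coin-flip votes for class 1.
label, combined = entropy_weighted_vote([[0.9, 0.1], [0.45, 0.55], [0.4, 0.6]])
print(label, combined)
```

In this toy input a plain majority vote would pick class 1, whereas the entropy weighting favours the single confident classifier.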

Title: A Hybrid Framework for Subject Analysis: Integrating Embedding-Based Regression Models with Large Language Models

Authors: Jinyu Liu, Xiaoying Song, Diana Zhang, Jason Thomale, Daqing He, Lingzi Hong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22913
Pdf URL: https://arxiv.org/pdf/2507.22913
Copy Paste: [[2507.22913]] A Hybrid Framework for Subject Analysis: Integrating Embedding-Based Regression Models with Large Language Models(https://arxiv.org/abs/2507.22913)
Keywords: language model, llm, hallucination
Abstract: Providing subject access to information resources is an essential function of any library management system. Large language models (LLMs) have been widely used in classification and summarization tasks, but their capability to perform subject analysis is underexplored. Multi-label classification with traditional machine learning (ML) models has been used for subject analysis but struggles with unseen cases. LLMs offer an alternative but often over-generate and hallucinate. Therefore, we propose a hybrid framework that integrates embedding-based ML models with LLMs. This approach uses ML models to (1) predict the optimal number of LCSH labels to guide LLM predictions and (2) post-edit the predicted terms with actual LCSH terms to mitigate hallucinations. We experimented with LLMs and the hybrid framework to predict the subject terms of books using the Library of Congress Subject Headings (LCSH). Experiment results show that providing initial predictions to guide LLM generations and imposing post-edits result in more controlled and vocabulary-aligned outputs.
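The two control steps described in the abstract (cap the number of labels, then post-edit predicted terms to actual LCSH terms) can be sketched as follows. This is a simplified illustration: the paper uses embedding-based ML models, whereas this sketch swaps in string similarity from Python's difflib, and the function name, vocabulary, and value of k are all hypothetical.

```python
import difflib


def post_edit(llm_terms, vocabulary, k):
    """Map free-form LLM terms onto a controlled vocabulary and keep at most k.

    llm_terms: subject terms generated by an LLM (may be hallucinated).
    vocabulary: list of valid controlled-vocabulary headings.
    k: number of labels predicted by a separate model (assumed given here).
    String similarity stands in for the embedding similarity used in the paper.
    """
    edited = []
    for term in llm_terms:
        # cutoff=0.0 always returns the closest heading for a non-empty vocabulary.
        match = difflib.get_close_matches(term, vocabulary, n=1, cutoff=0.0)
        if match and match[0] not in edited:
            edited.append(match[0])
    return edited[:k]


vocabulary = [
    "Machine learning",
    "Natural language processing (Computer science)",
    "Libraries--Automation",
]
print(post_edit(["machine learning", "Library automation"], vocabulary, k=2))
```

Every returned term is guaranteed to exist in the vocabulary, which is the property the post-editing step is meant to enforce.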

Title: Theoretical Foundations and Mitigation of Hallucination in Large Language Models

Authors: Esmail Gumaan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22915
Pdf URL: https://arxiv.org/pdf/2507.22915
Copy Paste: [[2507.22915]] Theoretical Foundations and Mitigation of Hallucination in Large Language Models(https://arxiv.org/abs/2507.22915)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Hallucination in Large Language Models (LLMs) refers to the generation of content that is not faithful to the input or the real-world facts. This paper provides a rigorous treatment of hallucination in LLMs, including formal definitions and theoretical analyses. We distinguish between intrinsic and extrinsic hallucinations, and define a "hallucination risk" for models. We derive bounds on this risk using learning-theoretic frameworks (PAC-Bayes and Rademacher complexity). We then survey detection strategies for hallucinations, such as token-level uncertainty estimation, confidence calibration, and attention alignment checks. On the mitigation side, we discuss approaches including retrieval-augmented generation, hallucination-aware fine-tuning, logit calibration, and the incorporation of fact-verification modules. We propose a unified detection and mitigation workflow, illustrated with a diagram, to integrate these strategies. Finally, we outline evaluation protocols for hallucination, recommending datasets, metrics, and experimental setups to quantify and reduce hallucinations. Our work lays a theoretical foundation and practical guidelines for addressing the crucial challenge of hallucination in LLMs.

Title: Reading Between the Timelines: RAG for Answering Diachronic Questions

Authors: Kwun Hang Lau, Ruiyuan Zhang, Weijie Shi, Xiaofang Zhou, Xiaojun Cheng
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2507.22917
Pdf URL: https://arxiv.org/pdf/2507.22917
Copy Paste: [[2507.22917]] Reading Between the Timelines: RAG for Answering Diachronic Questions(https://arxiv.org/abs/2507.22917)
Keywords: language model, llm, retrieval-augmented generation
Abstract: While Retrieval-Augmented Generation (RAG) excels at injecting static, factual knowledge into Large Language Models (LLMs), it exhibits a critical deficit in handling longitudinal queries that require tracking entities and phenomena across time. This blind spot arises because conventional, semantically-driven retrieval methods are not equipped to gather evidence that is both topically relevant and temporally coherent for a specified duration. We address this challenge by proposing a new framework that fundamentally redesigns the RAG pipeline to infuse temporal logic. Our methodology begins by disentangling a user's query into its core subject and its temporal window. It then employs a specialized retriever that calibrates semantic matching against temporal relevance, ensuring the collection of a contiguous evidence set that spans the entire queried period. To enable rigorous evaluation of this capability, we also introduce the Analytical Diachronic Question Answering Benchmark (ADQAB), a challenging evaluation suite grounded in a hybrid corpus of real and synthetic financial news. Empirical results on ADQAB show that our approach yields substantial gains in answer accuracy, surpassing standard RAG implementations by 13% to 27%. This work provides a validated pathway toward RAG systems capable of performing the nuanced, evolutionary analysis required for complex, real-world questions. The dataset and code for this study are publicly available at this https URL.
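To illustrate what "temporally coherent" retrieval might look like, the sketch below keeps only documents inside the queried window and then takes the best semantic match from each year, so that the evidence set spans the whole period. The yearly bucketing, the tuple layout, and the function name are assumptions made for this example; the abstract does not specify the retriever at this level of detail.

```python
from datetime import date


def temporal_retrieve(docs, start, end, per_year=1):
    """Pick evidence that is on-topic and covers every year of the window.

    docs: list of (doc_id, published_date, semantic_score) tuples, where the
    semantic score is assumed to come from an ordinary dense retriever.
    """
    selected = []
    for year in range(start.year, end.year + 1):
        in_year = [d for d in docs if start <= d[1] <= end and d[1].year == year]
        in_year.sort(key=lambda d: d[2], reverse=True)
        selected.extend(in_year[:per_year])
    return [d[0] for d in selected]


docs = [
    ("a", date(2021, 3, 1), 0.90),
    ("b", date(2021, 6, 1), 0.95),
    ("c", date(2022, 1, 5), 0.40),
    ("d", date(2023, 2, 1), 0.80),
    ("e", date(2019, 1, 1), 0.99),  # best semantic match, but outside the window
]
print(temporal_retrieve(docs, date(2021, 1, 1), date(2023, 12, 31)))
```

A purely semantic top-3 would return e, b, and a, leaving 2022 and 2023 uncovered; the bucketed version trades some similarity for coverage across the period.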

Title: Semantic Convergence: Investigating Shared Representations Across Scaled LLMs

Authors: Daniel Son, Sanjana Rathore, Andrew Rufail, Adrian Simon, Daniel Zhang, Soham Dave, Cole Blondin, Kevin Zhu, Sean O'Brien
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.22918
Pdf URL: https://arxiv.org/pdf/2507.22918
Copy Paste: [[2507.22918]] Semantic Convergence: Investigating Shared Representations Across Scaled LLMs(https://arxiv.org/abs/2507.22918)
Keywords: language model, llm
Abstract: We investigate feature universality in Gemma-2 language models (Gemma-2-2B and Gemma-2-9B), asking whether models with a four-fold difference in scale still converge on comparable internal concepts. Using the Sparse Autoencoder (SAE) dictionary-learning pipeline, we utilize SAEs on each model's residual-stream activations, align the resulting monosemantic features via activation correlation, and compare the matched feature spaces with SVCCA and RSA. Middle layers yield the strongest overlap, while early and late layers show far less similarity. Preliminary experiments extend the analysis from single tokens to multi-token subspaces, showing that semantically similar subspaces interact similarly with language models. These results strengthen the case that large language models carve the world into broadly similar, interpretable features despite size differences, reinforcing universality as a foundation for cross-model interpretability.
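Aligning features "via activation correlation" can be pictured as follows: record each SAE feature's activation over the same token sequence in both models, then pair every feature in one model with the most correlated feature in the other. The sketch below is a minimal pure-Python version under that reading; the feature names and toy activations are made up, and the real pipeline operates on far larger activation matrices.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length activation vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y) if std_x and std_y else 0.0


def align_features(acts_a, acts_b):
    """Match each feature of model A to its most correlated feature of model B.

    acts_a, acts_b: dicts mapping a feature name to its activations over the
    same tokens (hypothetical toy format).
    """
    matches = {}
    for name_a, vec_a in acts_a.items():
        best = max(acts_b, key=lambda name_b: pearson(vec_a, acts_b[name_b]))
        matches[name_a] = (best, pearson(vec_a, acts_b[best]))
    return matches


acts_small = {"a0": [0, 1, 0, 2], "a1": [3, 0, 1, 0]}
acts_large = {"b0": [2.9, 0.1, 1.2, 0], "b1": [0, 2, 0, 4.1]}
print(align_features(acts_small, acts_large))
```

The matched pairs would then be the input to a space-level comparison such as SVCCA or RSA, which this sketch does not attempt.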

Title: A novel language model for predicting serious adverse event results in clinical trials from their prospective registrations

Authors: Qixuan Hu, Xumou Zhang, Jinman Kim, Florence Bourgeois, Adam G. Dunn
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22919
Pdf URL: https://arxiv.org/pdf/2507.22919
Copy Paste: [[2507.22919]] A novel language model for predicting serious adverse event results in clinical trials from their prospective registrations(https://arxiv.org/abs/2507.22919)
Keywords: language model
Abstract: Objectives: With accurate estimates of expected safety results, clinical trials could be designed to avoid terminations and limit the exposure of participants to unnecessary risks. We evaluated methods for predicting serious adverse event (SAE) results in clinical trials using only information from their registrations prior to the trial. Material and Methods: We analysed 22,107 two-arm parallel interventional clinical trials from this http URL with structured summary results. Two prediction models were developed: a classifier predicting whether the experimental arm will have a higher SAE rate than the control arm (evaluated by area under the receiver operating characteristic curve; AUC), and a regression model predicting the proportion of SAEs in control arms (evaluated by root mean squared error; RMSE). A transfer learning approach using pretrained language models (e.g., ClinicalT5, BioBERT) was used for feature extraction, combined with a downstream model for prediction. To maintain semantic representation in long trial texts exceeding localised language model input limits, a sliding window method was developed for embedding extraction. Results: The best model (ClinicalT5+Transformer+MLP) had 77.6% AUC when predicting which trial arm has a higher proportion of patients with SAEs. When predicting the proportion of participants experiencing SAEs in the control arm, the same model achieved an RMSE of 18.6%. The sliding window approach consistently outperformed methods without it. Across 12 classifiers, the average absolute AUC increase was 2.00%; across 12 regressors, the average absolute RMSE reduction was 1.58%. Discussion: Summary results data available at this http URL remains underutilised. The potential to estimate results of trials before they start is an opportunity to improve trial design and flag discrepancies between expected and reported safety results.
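A sliding window for embedding extraction generally works by cutting a long token sequence into overlapping chunks that each fit the encoder, embedding every chunk, and combining the results. The sketch below shows that pattern. The window size, stride, mean pooling, and the toy embed_fn are assumptions for illustration; the abstract does not say how the authors aggregate window embeddings.

```python
def sliding_windows(tokens, window, stride):
    """Split tokens into overlapping windows that cover the whole sequence."""
    if len(tokens) <= window:
        return [tokens]
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return windows


def embed_long_text(tokens, embed_fn, window, stride):
    """Embed each window with embed_fn and mean-pool the vectors (assumed pooling)."""
    vectors = [embed_fn(w) for w in sliding_windows(tokens, window, stride)]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def toy_embed(window_tokens):
    """Stand-in for a pretrained encoder: returns [length, sum] of the window."""
    return [len(window_tokens), sum(window_tokens)]


tokens = list(range(10))
print(sliding_windows(tokens, window=4, stride=3))
print(embed_long_text(tokens, toy_embed, window=4, stride=3))
```

In practice embed_fn would wrap a pretrained encoder such as the ones named in the abstract, and the stride controls how much context neighbouring windows share.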

Title: Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey

Authors: Jindong Li, Yali Fu, Jiahong Liu, Linxiao Cao, Wei Ji, Menglin Yang, Irwin King, Ming-Hsuan Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22920
Pdf URL: https://arxiv.org/pdf/2507.22920
Copy Paste: [[2507.22920]] Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey(https://arxiv.org/abs/2507.22920)
Keywords: language model, llm
Abstract: The rapid advancement of large language models (LLMs) has intensified the need for effective mechanisms to transform continuous multimodal data into discrete representations suitable for language-based processing. Discrete tokenization, with vector quantization (VQ) as a central approach, offers both computational efficiency and compatibility with LLM architectures. Despite its growing importance, there is a lack of a comprehensive survey that systematically examines VQ techniques in the context of LLM-based systems. This work fills this gap by presenting the first structured taxonomy and analysis of discrete tokenization methods designed for LLMs. We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. Beyond algorithm-level investigation, we discuss existing research in terms of classical applications without LLMs, LLM-based single-modality systems, and LLM-based multimodal systems, highlighting how quantization strategies influence alignment, reasoning, and generation performance. In addition, we identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints. Finally, we discuss emerging research directions such as dynamic and task-adaptive quantization, unified tokenization frameworks, and biologically inspired codebook learning. This survey bridges the gap between traditional vector quantization and modern LLM applications, serving as a foundational reference for the development of efficient and generalizable multimodal systems. A continuously updated version is available at: this https URL.

Title: Fast and Accurate Contextual Knowledge Extraction Using Cascading Language Model Chains and Candidate Answers

Authors: Lee Harris
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2507.22921
Pdf URL: https://arxiv.org/pdf/2507.22921
Copy Paste: [[2507.22921]] Fast and Accurate Contextual Knowledge Extraction Using Cascading Language Model Chains and Candidate Answers(https://arxiv.org/abs/2507.22921)
Keywords: language model, hallucination, prompt
Abstract: Language models can capture complex relationships in given text, but these are notorious for being costly and for producing information that does not exist (i.e., hallucinations). Furthermore, the resources invested into producing this information would be wasted if it were incorrect. We address these issues by proposing, implementing, and applying the Language Model Chain (LMC) algorithm. In this, a language model's response to a given prompt about given text is only correct if it exists in the collection of possible (i.e., candidate) answers, and text corresponding to incorrect responses is fed into a more predictive (but slower) language model. This process is repeated for a collection of language models, or until all predictions about the text are correct. We used the LMC algorithm to extract patient dates of birth from medical documents, and combining a collection of language models in a multi-stage cascade significantly increased prediction speed and accuracy over individual language models, while greatly reducing the number of corresponding hallucinations. We believe that the novel LMC algorithm significantly contributes to the knowledge extraction field, and that this should be explored much further in the future.
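The cascade described in the abstract can be sketched in a few lines: try the fastest model first, accept its answer only if it appears among the candidate answers for that text, and pass the remaining texts to the next (slower) model. The function names, the regex-based candidate extractor, and the toy "models" below are hypothetical stand-ins, not the author's code.

```python
import re


def language_model_chain(texts, models, candidates_for):
    """Run texts through models ordered from fastest to slowest.

    A response is accepted only if it is one of the candidate answers for
    that text; rejected texts are passed on to the next model in the chain.
    Returns (accepted answers by text, texts that no model could answer).
    """
    results = {}
    pending = list(texts)
    for model in models:
        still_pending = []
        for text in pending:
            answer = model(text)
            if answer in candidates_for(text):
                results[text] = answer
            else:
                still_pending.append(text)
        pending = still_pending
        if not pending:
            break
    return results, pending


def date_candidates(text):
    """Candidate answers: every ISO-formatted date that occurs in the text."""
    return set(re.findall(r"\d{4}-\d{2}-\d{2}", text))


def fast_model(text):
    """Toy fast model: blindly assumes the date starts at character 5."""
    return text[5:15]


def slow_model(text):
    """Toy slower model: actually searches the text for a date."""
    found = re.search(r"\d{4}-\d{2}-\d{2}", text)
    return found.group(0) if found else ""


texts = ["DOB: 1980-02-03", "Patient born 1975-11-30"]
print(language_model_chain(texts, [fast_model, slow_model], date_candidates))
```

Note that the candidate check only filters out answers that cannot be right; a candidate that is present in the text but is the wrong date would still be accepted.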

Title: Predicting stock prices with ChatGPT-annotated Reddit sentiment

Authors: Mateusz Kmak, Kamil Chmurzyński, Kamil Matejuk, Paweł Kotzbach, Jan Kocoń
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2507.22922
Pdf URL: https://arxiv.org/pdf/2507.22922
Copy Paste: [[2507.22922]] Predicting stock prices with ChatGPT-annotated Reddit sentiment(https://arxiv.org/abs/2507.22922)
Keywords: gpt, chat
Abstract: The surge of retail investor activity on social media, exemplified by the 2021 GameStop short squeeze, raised questions about the influence of online sentiment on stock prices. This paper explores whether sentiment derived from social media discussions can meaningfully predict stock market movements. We focus on Reddit's r/wallstreetbets and analyze sentiment related to two companies: GameStop (GME) and AMC Entertainment (AMC). To assess sentiment's role, we employ two existing text-based sentiment analysis methods and introduce a third, a ChatGPT-annotated and fine-tuned RoBERTa-based model designed to better interpret the informal language and emojis prevalent in social media discussions. We use correlation and causality metrics to determine these models' predictive power. Surprisingly, our findings suggest that social media sentiment has only a weak correlation with stock prices. At the same time, simpler metrics, such as the volume of comments and Google search trends, exhibit stronger predictive signals. These results highlight the complexity of retail investor behavior and suggest that traditional sentiment analysis may not fully capture the nuances of market-moving online discussions.

Title: How and Where to Translate? The Impact of Translation Strategies in Cross-lingual LLM Prompting

Authors: Aman Gupta, Yingying Zhuang, Zhou Yu, Ziji Zhang, Anurag Beniwal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22923
Pdf URL: https://arxiv.org/pdf/2507.22923
Copy Paste: [[2507.22923]] How and Where to Translate? The Impact of Translation Strategies in Cross-lingual LLM Prompting(https://arxiv.org/abs/2507.22923)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Despite advances in the multilingual capabilities of Large Language Models (LLMs), their performance varies substantially across different languages and tasks. In multilingual retrieval-augmented generation (RAG)-based systems, knowledge bases (KB) are often shared from high-resource languages (such as English) to low-resource ones, resulting in retrieved information from the KB being in a different language than the rest of the context. In such scenarios, two common practices are pre-translation to create a mono-lingual prompt and cross-lingual prompting for direct inference. However, the impact of these choices remains unclear. In this paper, we systematically evaluate the impact of different prompt translation strategies for classification tasks with RAG-enhanced LLMs in multilingual systems. Experimental results show that an optimized prompting strategy can significantly improve knowledge sharing across languages and thereby improve performance on the downstream classification task. The findings advocate for a broader utilization of multilingual resource sharing and cross-lingual prompt optimization for non-English languages, especially the low-resource ones.

Title: Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents

Authors: Haoran Sun, Shaoning Zeng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22925
Pdf URL: https://arxiv.org/pdf/2507.22925
Copy Paste: [[2507.22925]] Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents(https://arxiv.org/abs/2507.22925)
Keywords: language model, llm, agent
Abstract: Long-term memory is one of the key factors influencing the reasoning capabilities of Large Language Model Agents (LLM Agents). Incorporating a memory mechanism that effectively integrates past interactions can significantly enhance the decision-making and contextual coherence of LLM Agents. While recent works have made progress in memory storage and retrieval, such as encoding memory into dense vectors for similarity-based search or organizing knowledge in the form of graphs, these approaches often fall short in structured memory organization and efficient retrieval. To address these limitations, we propose a Hierarchical Memory (H-MEM) architecture for LLM Agents that organizes and updates memory in a multi-level fashion based on the degree of semantic abstraction. Each memory vector is embedded with a positional index encoding pointing to its semantically related sub-memories in the next layer. During the reasoning phase, an index-based routing mechanism enables efficient, layer-by-layer retrieval without performing exhaustive similarity computations. We evaluate our method on five task settings from the LoCoMo dataset. Experimental results show that our approach consistently outperforms five baseline methods, demonstrating its effectiveness in long-term dialogue scenarios.
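Index-based routing through a layered memory can be illustrated with a small sketch: each node stores the indices of its related sub-memories in the next layer, so retrieval compares the query only against the children of the node chosen one level up rather than against every stored vector. The class name, fields, and dot-product scoring below are assumptions for illustration and are not the H-MEM implementation.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    text: str
    vector: list
    # Indices of semantically related sub-memories in the next layer.
    children: list = field(default_factory=list)


def dot(a, b):
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def route(query, layers):
    """Descend layer by layer, scoring only the children of the chosen node."""
    candidates = range(len(layers[0]))
    node = None
    for layer in layers:
        best = max(candidates, key=lambda i: dot(query, layer[i].vector))
        node = layer[best]
        candidates = node.children
        if not candidates:
            break
    return node


layers = [
    [MemoryNode("travel", [1, 0], [0, 1]), MemoryNode("work", [0, 1], [2])],
    [
        MemoryNode("trip to Rome", [0.9, 0.1]),
        MemoryNode("trip to Oslo", [0.6, 0.4]),
        MemoryNode("quarterly report", [0.1, 0.9]),
    ],
]
print(route([1, 0.2], layers).text)
```

With this layout the number of similarity computations grows with the branching factor at each level instead of with the total number of stored memories.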

Title: PRGB Benchmark: A Robust Placeholder-Assisted Algorithm for Benchmarking Retrieval-Augmented Generation

Authors: Zhehao Tan, Yihan Jiao, Dan Yang, Lei Liu, Jie Feng, Duolin Sun, Yue Shen, Jian Wang, Peng Wei, Jinjie Gu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.22927
Pdf URL: https://arxiv.org/pdf/2507.22927
Copy Paste: [[2507.22927]] PRGB Benchmark: A Robust Placeholder-Assisted Algorithm for Benchmarking Retrieval-Augmented Generation(https://arxiv.org/abs/2507.22927)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, where the LLM's ability to generate responses based on the combination of a given query and retrieved documents is crucial. However, most benchmarks focus on overall RAG system performance, rarely assessing LLM-specific capabilities. Current benchmarks emphasize broad aspects such as noise robustness, but lack a systematic and granular evaluation framework on document utilization. To this end, we introduce Placeholder-RAG-Benchmark, a multi-level fine-grained benchmark, emphasizing the following progressive dimensions: (1) multi-level filtering abilities, (2) combination abilities, and (3) reference reasoning. To provide a more nuanced understanding of LLMs' roles in RAG systems, we formulate an innovative placeholder-based approach to decouple the contributions of the LLM's parametric knowledge and the external knowledge. Experiments demonstrate the limitations of representative LLMs in the RAG system's generation capabilities, particularly in error resilience and context faithfulness. Our benchmark provides a reproducible framework for developing more reliable and efficient RAG systems. Our code is available in this https URL.

Title: How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding

Authors: Xi Chen, Aske Plaat, Niki van Stein
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22928
Pdf URL: https://arxiv.org/pdf/2507.22928
Copy Paste: [[2507.22928]] How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding(https://arxiv.org/abs/2507.22928)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Chain-of-thought (CoT) prompting boosts the accuracy of Large Language Models on multi-step tasks, yet whether the generated "thoughts" reflect the true internal reasoning process remains unresolved. We present the first feature-level causal study of CoT faithfulness. Combining sparse autoencoders with activation patching, we extract monosemantic features from Pythia-70M and Pythia-2.8B while they tackle GSM8K math problems under CoT and plain (noCoT) prompting. Swapping a small set of CoT-reasoning features into a noCoT run raises answer log-probabilities significantly in the 2.8B model, but has no reliable effect in 70M, revealing a clear scale threshold. CoT also leads to significantly higher activation sparsity and feature interpretability scores in the larger model, signalling more modular internal computation. For example, the model's confidence in generating correct answers improves from 1.2 to 4.3. We introduce patch-curves and random-feature patching baselines, showing that useful CoT information is not only present in the top-K patches but widely distributed. Overall, our results indicate that CoT can induce more interpretable internal structures in high-capacity LLMs, validating its role as a structured prompting method.
摘要：促使经营链（COT）提高了多步骤任务上的大型语言模型的精度，但是生成的“思想”是否反映了真正的内部推理过程。我们介绍了COT忠诚的第一个特征级别因果研究。将稀疏的自动编码器与激活补丁相结合，我们提取了毕曲（Pythia-70m）和毕曲（Pythia-2.8b）的单义特征，同时它们解决了COT和Plain（Nocot）提示下的GSM8K数学问题。将一小部分的婴儿床特征交换为NoCot Run，在2.8B型号中显着提高了答案对数探测，但在7000万中没有可靠的效果，揭示了明确的尺度阈值。 COT还会导致更高的激活稀疏性和较大模型中的可解释性得分，从而发出更模块化的内部计算。例如，模型对正确答案的信心从1.2提高到4.3。我们介绍了补丁曲线和随机功能补丁基线，表明有用的COT信息不仅存在于Top-K贴片中，而且分布广泛。总体而言，我们的结果表明，COT可以在高容量LLM中诱导更可解释的内部结构，从而证实其作为结构化提示方法的作用。

Title: EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow

Authors: Xiaoyu Pan, Yang Bai, Ke Zou, Yang Zhou, Jun Zhou, Huazhu Fu, Yih-Chung Tham, Yong Liu
Subjects: cs.CL, cs.CV, cs.MA
Abstract URL: https://arxiv.org/abs/2507.22929
Pdf URL: https://arxiv.org/pdf/2507.22929
Copy Paste: [[2507.22929]] EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow(https://arxiv.org/abs/2507.22929)
Keywords: language model, llm, hallucination, agent
Abstract: Medical Large Language Models (MLLMs) play a crucial role in ophthalmic diagnosis, holding significant potential to address vision-threatening diseases. However, their accuracy is constrained by hallucinations stemming from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which collectively impede precise lesion detection and disease diagnosis. Furthermore, existing medical benchmarks fail to effectively evaluate various types of hallucinations or provide actionable solutions to mitigate them. To address the above challenges, we introduce EH-Benchmark, a novel ophthalmology benchmark designed to evaluate hallucinations in MLLMs. We categorize MLLMs' hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition, each comprising multiple subclasses. Given that MLLMs predominantly rely on language-based reasoning rather than visual processing, we propose an agent-centric, three-phase framework, including the Knowledge-Level Retrieval stage, the Task-Level Case Studies stage, and the Result-Level Validation stage. Experimental results show that our multi-agent framework significantly mitigates both types of hallucinations, enhancing accuracy, interpretability, and reliability. Our project is available at this https URL.
摘要：医学大语言模型（MLLM）在眼科诊断中起着至关重要的作用，具有应对威胁性疾病的巨大潜力。然而，它们的准确性受到幻觉的限制，这是由于眼科知识有限，视觉定位和推理能力不足以及多模式眼科数据的稀缺性而产生的，这些数据集体妨碍了精确的病变检测和疾病诊断。此外，现有的医疗基准无法有效评估各种幻觉或提供可行的解决方案来减轻它们。为了应对上述挑战，我们引入了EH基准，这是一种新型的眼科基准，旨在评估MLLM的幻觉。我们根据特定任务和错误类型将MLLM的幻觉分为两个主要类：视觉理解和逻辑组成，每个组合包括多个子类。鉴于MLLM主要依靠基于语言的推理而不是视觉处理，我们建议以代理为中心的三相框架，包括知识级检索阶段，任务级别的案例研究阶段以及结果级别的验证阶段。实验结果表明，我们的多代理框架可显着减轻两种类型的幻觉，增强准确性，可解释性和可靠性。我们的项目可在此HTTPS URL上找到。

Title: Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection

Authors: Shalini Jangra, Suparna De, Nishanth Sastry, Saeed Fadaei
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2507.22930
Pdf URL: https://arxiv.org/pdf/2507.22930
Copy Paste: [[2507.22930]] Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection(https://arxiv.org/abs/2507.22930)
Keywords: language model, llm, prompt
Abstract: Social platforms such as Reddit have a network of communities of shared interests, with a prevalence of posts and comments from which one can infer users' Personal Information Identifiers (PIIs). While such self-disclosures can lead to rewarding social interactions, they pose privacy risks and the threat of online harms. Research into the identification and retrieval of such risky self-disclosures of PIIs is hampered by the lack of open-source labeled datasets. To foster reproducible research into PII-revealing text detection, we develop a novel methodology to create synthetic equivalents of PII-revealing data that can be safely shared. Our contributions include creating a taxonomy of 19 PII-revealing categories for vulnerable populations and the creation and release of a synthetic PII-labeled multi-text span dataset generated from 3 text generation Large Language Models (LLMs), Llama2-7B, Llama3-8B, and zephyr-7b-beta, with sequential instruction prompting to resemble the original Reddit posts. The utility of our methodology to generate this synthetic dataset is evaluated with three metrics: First, we require reproducibility equivalence, i.e., results from training a model on the synthetic data should be comparable to those obtained by training the same models on the original posts. Second, we require that the synthetic data be unlinkable to the original users, through common mechanisms such as Google Search. Third, we wish to ensure that the synthetic data be indistinguishable from the original, i.e., trained humans should not be able to tell them apart. We release our dataset and code at this https URL to foster reproducible research into PII privacy risks in online social media.
摘要：诸如Reddit之类的社交平台具有共同利益的社区网络，帖子和评论的普遍性可以推断用户的个人信息标识符（PII）。尽管这种自我策划可以导致奖励社交互动，但它们构成了隐私风险和在线危害的威胁。缺乏开源标记的数据集对这种风险自我限制的识别和检索的研究受到阻碍。为了促进对PII浏览文本检测的可重复研究，我们开发了一种新的方法，以创建可以安全共享的PII-Revealing数据的合成等效物。 Our contributions include creating a taxonomy of 19 PII-revealing categories for vulnerable populations and the creation and release of a synthetic PII-labeled multi-text span dataset generated from 3 text generation Large Language Models (LLMs), Llama2-7B, Llama3-8B, and zephyr-7b-beta, with sequential instruction prompting to resemble the original Reddit posts.我们的方法论生成此合成数据集的实用性通过三个指标进行评估：首先，我们需要可重复性等效性，即培训培训模型的综合数据模型应与通过在原始帖子上训练相同模型获得的模型相当。其次，我们要求原始用户通过诸如Google搜索之类的常见机制无法链接综合数据。第三，我们希望确保合成数据与原始数据没有区别，即训练有素的人类不应将它们分开。我们在此HTTPS URL上发布数据集和代码，以在在线社交媒体中促进对PII隐私风险的可重现研究。

Title: Enhancing RAG Efficiency with Adaptive Context Compression

Authors: Shuyu Guo, Zhaochun Ren
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22931
Pdf URL: https://arxiv.org/pdf/2507.22931
Copy Paste: [[2507.22931]] Enhancing RAG Efficiency with Adaptive Context Compression(https://arxiv.org/abs/2507.22931)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but incurs significant inference costs due to lengthy retrieved contexts. While context compression mitigates this issue, existing methods apply fixed compression rates, over-compressing simple queries or under-compressing complex ones. We propose Adaptive Context Compression for RAG (ACC-RAG), a framework that dynamically adjusts compression rates based on input complexity, optimizing inference efficiency without sacrificing accuracy. ACC-RAG combines a hierarchical compressor (for multi-granular embeddings) with a context selector to retain minimal sufficient information, akin to human skimming. Evaluated on Wikipedia and five QA datasets, ACC-RAG outperforms fixed-rate methods and achieves over 4 times faster inference than standard RAG while maintaining or improving accuracy.
摘要：检索增强的生成（RAG）增强了具有外部知识的大语言模型（LLM），但由于漫长的检索环境而产生了明显的推理成本。尽管上下文压缩减轻了此问题，但现有方法采用固定的压缩率，过度压缩简单查询或压缩不足的复杂率。我们建议对抹布（ACC-rag）的自适应上下文压缩，该框架会根据输入复杂性动态调整压缩率，在不牺牲准确性的情况下优化推理效率。 Acc-rag结合了分层压缩机（用于多个粒度嵌入）和上下文选择器，以保留最小的足够信息，类似于人类撇油。在Wikipedia和五个QA数据集上进行了评估，ACC-RAG的表现要优于固定速率方法和匹配/匹配/解锁，超过4倍的推理与标准抹布，同时保持或提高准确性。

Title: FinMarBa: A Market-Informed Dataset for Financial Sentiment Classification

Authors: Baptiste Lefort, Eric Benhamou, Beatrice Guez, Jean-Jacques Ohana, Ethan Setrouk, Alban Etienne
Subjects: cs.CL, q-fin.GN
Abstract URL: https://arxiv.org/abs/2507.22932
Pdf URL: https://arxiv.org/pdf/2507.22932
Copy Paste: [[2507.22932]] FinMarBa: A Market-Informed Dataset for Financial Sentiment Classification(https://arxiv.org/abs/2507.22932)
Keywords: language model, llm, agent
Abstract: This paper presents a novel hierarchical framework for portfolio optimization, integrating lightweight Large Language Models (LLMs) with Deep Reinforcement Learning (DRL) to combine sentiment signals from financial news with traditional market indicators. Our three-tier architecture employs base RL agents to process hybrid data, meta-agents to aggregate their decisions, and a super-agent to merge decisions based on market data and sentiment analysis. Evaluated on data from 2018 to 2024, after training on 2000-2017, the framework achieves a 26% annualized return and a Sharpe ratio of 1.2, outperforming equal-weighted and S&P 500 benchmarks. Key contributions include scalable cross-modal integration, a hierarchical RL structure for enhanced stability, and open-source reproducibility.
摘要：本文介绍了用于投资组合优化的新型层次结构框架，将轻型大语言模型（LLMS）与深度强化学习（DRL）相结合，以将财务新闻中的情感信号与传统市场指标相结合。我们的三层体系结构采用基本RL代理来处理混合数据，元代理来汇总他们的决策，并根据市场数据和情感分析来合并决策。该框架在2018年至2024年的数据中进行了评估，该框架在2000年至2017年进行了培训，其年化收益率为26％，Sharpe比率为1.2，表现优于同等加权和S＆P 500基准。关键贡献包括可扩展的跨模式集成，一种层次RL结构，可增强稳定性以及开源可重复性。
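
The abstract above reports a Sharpe ratio of 1.2. As a reminder of what that figure measures, here is a minimal sketch of the standard annualized Sharpe ratio computed from per-period returns. The function name, the default of 252 periods per year, and the sample returns are illustrative assumptions; the abstract does not describe the paper's exact computation.

```python
import statistics

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from per-period simple returns.

    risk_free is the per-period risk-free rate (assumed constant here).
    """
    excess = [r - risk_free for r in returns]
    mean = statistics.fmean(excess)
    stdev = statistics.stdev(excess)  # sample standard deviation
    if stdev == 0:
        raise ValueError("returns have zero variance")
    # Scale the per-period ratio by sqrt(number of periods per year).
    return (mean / stdev) * periods_per_year ** 0.5

# Hypothetical per-period returns, for illustration only.
sample = [0.01, -0.01, 0.02, 0.0]
print(sharpe_ratio(sample, periods_per_year=1))
```

A higher value means more excess return per unit of volatility, which is why the ratio is commonly quoted alongside the raw annualized return.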

Title: Augmented Vision-Language Models: A Systematic Review

Authors: Anthony C Davis, Burhan Sadiq, Tianmin Shu, Chien-Ming Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22933
Pdf URL: https://arxiv.org/pdf/2507.22933
Copy Paste: [[2507.22933]] Augmented Vision-Language Models: A Systematic Review(https://arxiv.org/abs/2507.22933)
Keywords: language model
Abstract: Recent advances in visual-language machine learning models have demonstrated exceptional ability to use natural language and understand visual scenes by training on large, unstructured datasets. However, this training paradigm cannot produce interpretable explanations for its outputs, requires retraining to integrate new information, is highly resource-intensive, and struggles with certain forms of logical reasoning. One promising solution involves integrating neural networks with external symbolic information systems, forming neural symbolic systems that can enhance reasoning and memory abilities. These neural symbolic systems provide more interpretable explanations to their outputs and the capacity to assimilate new information without extensive retraining. Utilizing powerful pre-trained Vision-Language Models (VLMs) as the core neural component, augmented by external systems, offers a pragmatic approach to realizing the benefits of neural-symbolic integration. This systematic literature review aims to categorize techniques through which visual-language understanding can be improved by interacting with external symbolic information systems.
摘要：视觉语言机器学习模型的最新进展表明，通过在大型非结构化数据集中训练使用自然语言并理解视觉场景的出色能力。但是，这种培训范式不能为其产出产生可解释的解释，需要重新集成新信息，是资源密集的高度，并且与某些形式的逻辑推理斗争。一种有希望的解决方案涉及将神经网络与外部符号信息系统集成，形成可以增强推理和记忆能力的神经符号系统。这些神经符号系统为它们的产出提供了更可解释的解释，以及在不进行广泛培训的情况下吸收新信息的能力。利用强大的预训练的视觉模型（VLM）作为核心神经组件，并通过外部系统增强，提供了一种实用的方法来实现神经符号整合的好处。这项系统的文献综述旨在通过与外部符号信息系统互动来对技术进行分类，从而通过该技术来改善视觉语言的理解。

Title: Trusted Knowledge Extraction for Operations and Maintenance Intelligence

Authors: Kathleen Mealey, Jonathan A. Karr Jr., Priscila Saboia Moreira, Paul R. Brenner, Charles F. Vardeman II
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22935
Pdf URL: https://arxiv.org/pdf/2507.22935
Copy Paste: [[2507.22935]] Trusted Knowledge Extraction for Operations and Maintenance Intelligence(https://arxiv.org/abs/2507.22935)
Keywords: language model, llm
Abstract: Deriving operational intelligence from organizational data repositories is a key challenge due to the dichotomy of data confidentiality vs data integration objectives, as well as the limitations of Natural Language Processing (NLP) tools relative to the specific knowledge structure of domains such as operations and maintenance. In this work, we discuss Knowledge Graph construction and break down the Knowledge Extraction process into its Named Entity Recognition, Coreference Resolution, Named Entity Linking, and Relation Extraction functional components. We then evaluate sixteen NLP tools in concert with or in comparison to the rapidly advancing capabilities of Large Language Models (LLMs). We focus on the operational and maintenance intelligence use case for trusted applications in the aircraft industry. A baseline dataset is derived from a rich public domain US Federal Aviation Administration dataset focused on equipment failures or maintenance requirements. We assess the zero-shot performance of NLP and LLM tools that can be operated within a controlled, confidential environment (no data is sent to third parties). Based on our observation of significant performance limitations, we discuss the challenges related to trusted NLP and LLM tools as well as their Technical Readiness Level for wider use in mission-critical industries such as aviation. We conclude with recommendations to enhance trust and provide our open-source curated dataset to support further baseline testing and evaluation.
摘要：由于数据机密性与数据集成目标的二分法以及自然语言处理（NLP）工具的局限性相对于域的特定知识结构（例如操作和维护），因此从组织数据存储库中得出操作智能是一个关键挑战。在这项工作中，我们讨论了知识图构建，并将知识提取过程分解为其命名的实体识别，核心分辨率，命名实体链接以及关系提取功能组件。然后，我们与大型语言模型（LLMS）快速前进的功能进行比较，评估了16个NLP工具。我们专注于飞机行业可信化应用程序的运营和维护情报用例。基线数据集源自美国联邦航空管理局的丰富公共领域数据集，该数据集的重点是设备故障或维护要求。我们评估了可以在受控的机密环境中操作的NLP和LLM工具的零拍摄性能（没有将数据发送给第三方）。根据我们对重大绩效限制的观察，我们讨论了与受信任的NLP和LLM工具有关的挑战，以及它们在关键任务行业（如航空）中更广泛使用的技术准备水平。我们最终提出了提高信任的建议，并提供了我们的开源策划数据集，以支持进一步的基线测试和评估。

Title: Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis

Authors: Md Talha Mohsin
Subjects: cs.CL, cs.AI, cs.CE, cs.HC, q-fin.CP
Abstract URL: https://arxiv.org/abs/2507.22936
Pdf URL: https://arxiv.org/pdf/2507.22936
Copy Paste: [[2507.22936]] Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis(https://arxiv.org/abs/2507.22936)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide variety of Financial Natural Language Processing (FinNLP) tasks. However, systematic comparisons among widely used LLMs remain underexplored. Given the rapid advancement and growing influence of LLMs in financial analysis, this study conducts a thorough comparative evaluation of five leading LLMs, GPT, Claude, Perplexity, Gemini and DeepSeek, using 10-K filings from the 'Magnificent Seven' technology companies. We create a set of domain-specific prompts and then use three methodologies to evaluate model performance: human annotation, automated lexical-semantic metrics (ROUGE, Cosine Similarity, Jaccard), and model behavior diagnostics (prompt-level variance and across-model similarity). The results show that GPT gives the most coherent, semantically aligned, and contextually relevant answers; followed by Claude and Perplexity. Gemini and DeepSeek, on the other hand, have more variability and less agreement. Also, the similarity and stability of outputs change from company to company and over time, showing that they are sensitive to how prompts are written and what source material is used.
摘要：大型语言模型（LLMS）在各种财务自然语言处理（FINNLP）任务中表现出了出色的功能。但是，广泛使用的LLMS之间的系统比较仍未得到充实。鉴于LLM在财务分析中的快速进步和不断增长的影响，这项研究使用了“宏伟的七个”技术公司的10-K文件，对五个领先的LLM，GPT，Claude，Perpolxity，Gemini和DeepSeek进行了彻底的比较评估。我们创建一组域特异性提示，然后使用三种方法来评估模型性能：人体注释，自动词法 - 语义指标（Rouge，Cosine相似性，JACCARD）和模型行为诊断（提示级别差异和跨模型相似性）。结果表明，GPT给出了最连贯，具有语义和上下文相关的答案。其次是克劳德和困惑。另一方面，双子座和DeepSeek具有更大的可变性和更少的一致性。同样，产出的相似性和稳定性随着时间的流逝而变化，表明它们对提示的写入以及使用哪种原始材料很敏感。
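
Two of the automated metrics named in the abstract above, Jaccard and cosine similarity, can be sketched in a few lines. This is an illustrative version that uses whitespace tokenization and raw term counts; the tokenizer, the function names, and the example sentences are assumptions, and the paper may use embeddings or a different preprocessing pipeline.

```python
import math
from collections import Counter

def tokenize(text):
    # Simplistic lowercase whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()

def jaccard(a, b):
    """Size of the token-set intersection divided by the size of the union."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def cosine(a, b):
    """Cosine similarity between bag-of-words count vectors."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)

x = "revenue grew in 2023"
y = "revenue fell in 2023"
print(jaccard(x, y), cosine(x, y))
```

Note that both scores stay high for this pair even though the sentences disagree, which is one reason such lexical metrics are usually paired with human annotation.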

Title: CoE-Ops: Collaboration of LLM-based Experts for AIOps Question-Answering

Authors: Jinkun Zhao, Yuanshuai Wang, Xingjian Zhang, Ruibo Chen, Xingchuang Liao, Junle Wang, Lei Huang, Kui Zhang, Wenjun Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22937
Pdf URL: https://arxiv.org/pdf/2507.22937
Copy Paste: [[2507.22937]] CoE-Ops: Collaboration of LLM-based Experts for AIOps Question-Answering(https://arxiv.org/abs/2507.22937)
Keywords: language model, llm, retrieval-augmented generation
Abstract: With the rapid evolution of artificial intelligence, AIOps has emerged as a prominent paradigm in DevOps. Much work has been proposed to improve the performance of different AIOps phases. However, constrained by domain-specific knowledge, a single model can only handle the operation requirements of a specific task, such as log parsing or root cause analysis. Meanwhile, combining multiple models can achieve more efficient results, as has been shown in both earlier ensemble learning and the recent LLM training domain. Inspired by these works, to address similar challenges in AIOps, this paper first proposes a collaboration-of-expert framework (CoE-Ops) incorporating a general-purpose large language model task classifier. A retrieval-augmented generation mechanism is introduced to improve the framework's capability in handling Question-Answering tasks at both the high level (code, build, test, etc.) and the low level (fault analysis, anomaly detection, etc.). Finally, the proposed method is implemented in the AIOps domain, and extensive experiments are conducted on the DevOps-EVAL dataset. Experimental results demonstrate that CoE-Ops achieves a 72% improvement in routing accuracy for high-level AIOps tasks compared to existing CoE methods, delivers up to 8% accuracy enhancement over single AIOps models in DevOps problem resolution, and outperforms larger-scale Mixture-of-Experts (MoE) models by up to 14% in accuracy.
摘要：随着人工智能的快速发展，AIOPS已成为Devops中的重要范式。已经提出了许多工作来提高不同AIOPS阶段的性能。但是，受域特异性知识约束，单个模型只能处理特定任务的操作要求，例如日志解析器，根本原因分析。同时，组合多个模型可以实现更有效的结果，这在以前的集合学习和最近的LLM培训领域都证明了这一点。受这些作品的启发，以应对AIOPS中类似挑战的启发，本文首先提出了一个融合了通用大型语言模型任务分类器的Expert框架（COE-OPS）。引入了一种检索增强的生成机制，以提高框架在处理高级（代码，构建，测试等）和低级别（故障分析，异常检测等）方面的解决方面的能力。最后，提出的方法是在AIOPS域中实现的，并且在DevOps-eval数据集上进行了广泛的实验。实验结果表明，与现有的COE方法相比，COE-OPS在高级AIOPS任务方面的路由准确性提高了72％，在DevOps问题解决方案中，单个AIOPS模型的精度提高了高达8％的精度，并且在精确度中最多优于较大的Expexperts（MOE）模型（MOE）模型，最多可通过精确度获得14％。

Title: A Graph-based Approach for Multi-Modal Question Answering from Flowcharts in Telecom Documents

Authors: Sumit Soman, H. G. Ranjani, Sujoy Roychowdhury, Venkata Dharma Surya Narayana Sastry, Akshat Jain, Pranav Gangrade, Ayaaz Khan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22938
Pdf URL: https://arxiv.org/pdf/2507.22938
Copy Paste: [[2507.22938]] A Graph-based Approach for Multi-Modal Question Answering from Flowcharts in Telecom Documents(https://arxiv.org/abs/2507.22938)
Keywords: language model, retrieval augmented generation
Abstract: Question-Answering (QA) from technical documents often involves questions whose answers are present in figures, such as flowcharts or flow diagrams. Text-based Retrieval Augmented Generation (RAG) systems may fail to answer such questions. We leverage graph representations of flowcharts obtained from Visual Large Language Models (VLMs) and incorporate them in a text-based RAG system to show that this approach can enable image retrieval for QA in the telecom domain. We present the end-to-end approach from processing technical documents, classifying image types, building graph representations, and incorporating them with the text embedding pipeline for efficient retrieval. We benchmark the same on a QA dataset created based on proprietary telecom product information documents. Results show that the graph representations obtained using a fine-tuned VLM model have lower edit distance with respect to the ground truth, which illustrates the robustness of these representations for flowchart images. Further, the approach for QA using these representations gives good retrieval performance using text-based embedding models, including a telecom-domain adapted one. Our approach also alleviates the need for a VLM in inference, which is an important cost benefit for deployed QA systems.
摘要：来自技术文档的问题 - 诉讼（QA）通常涉及在图中存在答案的问题，例如流程图或流程图。基于文本的检索增强发电（RAG）系统可能无法回答此类问题。我们利用从视觉大语言模型（VLM）获得的流程图的图形表示形式，并将它们合并到基于文本的抹布系统中，以表明此方法可以在电信域中启用QA图像检索。我们介绍了从处理技术文档，对图像类型进行分类，构建图表表示并将其与文本嵌入管道合并以进行有效检索的端到端方法。我们在基于专有电信产品信息文档创建的质量检查数据集上进行基准测试。结果表明，使用微调VLM模型获得的图表表示相对于地面真相的编辑距离较低，这说明了流程图图像的这些表示的鲁棒性。此外，使用这些表示形式的质量检查方法可以使用基于文本的嵌入模型（包括Telecom-posped apped opaped opting）提供良好的检索性能。我们的方法还减轻了对VLM推理的需求，这对于已部署的质量检查系统是重要的成本收益。
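
The abstract above scores generated graph representations by their edit distance to the ground truth. A minimal sketch of character-level Levenshtein distance is shown below as one plausible reading of that metric. The abstract does not say whether the distance is computed over characters, tokens, or graph edits, so the string-based variant and the example edge strings are assumptions.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

# Hypothetical serialized flowchart edges, for illustration only.
truth = "Start -> Check signal -> End"
predicted = "Start -> Check signals -> End"
print(levenshtein(truth, predicted))
```

A lower distance means the predicted serialization needs fewer corrections to match the reference.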

Title: Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes

Authors: Rui Jiao, Yue Zhang, Jinku Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22940
Pdf URL: https://arxiv.org/pdf/2507.22940
Copy Paste: [[2507.22940]] Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes(https://arxiv.org/abs/2507.22940)
Keywords: language model, gpt, llm
Abstract: We present RELIANCE (Reasoning Evaluation with Logical Integrity and Accuracy for Confidence Enhancement), a novel framework addressing a critical vulnerability in Large Language Models (LLMs): the prevalence of factual inaccuracies within intermediate reasoning steps despite correct final answers. This phenomenon poses substantial risks in high-stakes domains including healthcare, legal analysis, and scientific research, where erroneous yet confidently presented reasoning can mislead users into dangerous decisions. Our framework integrates three core components: (1) a specialized fact-checking classifier trained on counterfactually augmented data to detect subtle factual inconsistencies within reasoning chains; (2) a Group Relative Policy Optimization (GRPO) reinforcement learning approach that balances factuality, coherence, and structural correctness through multi-dimensional rewards; and (3) a mechanistic interpretability module examining how factuality improvements manifest in model activations during reasoning processes. Extensive evaluation across ten state-of-the-art models reveals concerning patterns: even leading models like Claude-3.7 and GPT-o1 demonstrate reasoning factual accuracy of only 81.93% and 82.57% respectively. RELIANCE significantly enhances factual robustness (up to 49.90% improvement) while maintaining or improving performance on challenging benchmarks including Math-500, AIME-2024, and GPQA. Furthermore, our activation-level analysis provides actionable insights into how factual enhancements reshape reasoning trajectories within model architectures, establishing foundations for future training methodologies that explicitly target factual robustness through activation-guided optimization.
摘要：我们提出了依赖（以逻辑完整性和置信度增强的准确性评估），这是一个新的框架，解决了大语言模型（LLMS）中的关键脆弱性：尽管最终答案正确，但中等推理步骤中事实上不准确的事实不准确。这种现象在包括医疗保健，法律分析和科学研究在内的高风险领域中构成了重大风险，在这些领域中，错误但自信地提出的推理可能会误导用户做出危险的决定。我们的框架集成了三个核心组成部分：（1）对反事实增强数据培训的专业事实检查分类器，以检测推理链中的细微事实不一致；（2）通过多维奖励通过多维奖励来平衡事实，连贯性和结构正确性的团体相对政策优化（GRPO）的增强学习方法；（3）一种机械性解释性模块，研究了在推理过程中的事实改善中如何在模型激活中表现出来。跨十个最先进模型进行的广泛评估揭示了有关模式的信息：即使是Claude-3.7和GPT-O1等领先的模型，也仅证明了仅81.93％和82.57％的推理事实准确性。依赖大大提高了事实鲁棒性（提高了49.90％），同时维持或提高了包括Math-500，AIME-2024和GPQA在内的具有挑战性的基准性能。此外，我们的激活级分析提供了可行的见解，即事实增强如何在模型体系结构中重新设计推理轨迹，为未来的培训方法建立基础，这些方法通过激活指导的优化明确针对事实鲁棒性。
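
The abstract above trains with Group Relative Policy Optimization (GRPO) using multi-dimensional rewards. The core of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt. Below is a minimal sketch of that normalization step. The reward weights and the three-component reward are hypothetical placeholders; the abstract names the dimensions (factuality, coherence, structural correctness) but not how they are combined.

```python
import statistics

def combined_reward(factuality, coherence, structure, weights=(0.5, 0.3, 0.2)):
    # Hypothetical weighted sum; the actual weighting is not given in the abstract.
    w_f, w_c, w_s = weights
    return w_f * factuality + w_c * coherence + w_s * structure

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, each with made-up component scores.
group = [
    combined_reward(1.0, 0.8, 1.0),
    combined_reward(0.2, 0.9, 1.0),
    combined_reward(0.9, 0.7, 0.5),
    combined_reward(0.1, 0.4, 0.5),
]
print(group_relative_advantages(group))
```

Responses that score above the group mean receive positive advantages and are reinforced; no separate value network is needed for the baseline.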

Title: C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

Authors: Chengqian Ma, Wei Tao, Yiwen Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.22968
Pdf URL: https://arxiv.org/pdf/2507.22968
Copy Paste: [[2507.22968]] C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations(https://arxiv.org/abs/2507.22968)
Keywords: language model, llm
Abstract: Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users' spoken queries. Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterographs, heteronyms, and stress patterns. Additionally, context-dependency, like omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese. Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges.
摘要：口语对话模型（SDM）最近因其直接对用户的口语查询而产生语音响应的能力引起了极大的关注。尽管他们的知名度越来越大，但研究的差距还是集中于全面理解它们在理解和模仿人类对话方面的实际有效性。与基于文本的大语言模型（LLM）相比，这尤其如此，该模型受益于广泛的基准测试。由于口语对话独有的特征，人的声音互动本质上比文本更复杂。歧义构成了一个挑战，源于语义因素，例如多义因素，以及诸如异质，异型和压力模式之类的语音方面。此外，上下文依赖性（如省略，核心和多转交互）为人类对话动力学增添了进一步的复杂性。为了阐明当前的SDM开发状态并应对这些挑战，我们在本文中提出了一个基准数据集，其中包括1,079个英语和中文实例。伴随着一种基于LLM的评估方法，该方法与人类的判断密切相符，该数据集促进了对SDM在应对这些实际挑战方面的性能的全面探索。

Title: Math Natural Language Inference: this should be easy!

Authors: Valeria de Paiva, Qiyue Gao, Hai Hu, Pavel Kovalev, Yikang Liu, Lawrence S. Moss, Zhiheng Qian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.23063
Pdf URL: https://arxiv.org/pdf/2507.23063
Copy Paste: [[2507.23063]] Math Natural Language Inference: this should be easy!(https://arxiv.org/abs/2507.23063)
Keywords: llm
Abstract: We ask whether contemporary LLMs are able to perform natural language inference (NLI) tasks on mathematical texts. We call this the Math NLI problem. We construct a corpus of Math NLI pairs whose premises are from extant mathematical text and whose hypotheses and gold labels were provided by people with experience in both research-level mathematics and the NLI field. We also investigate the quality of corpora using the same premises but whose hypotheses are provided by LLMs themselves. We investigate not only the performance but also the inter-group consistency of the diverse group of LLMs. We have both positive and negative findings. Among our positive findings: in some settings, using a majority vote of LLMs is approximately equivalent to using human-labeled data in the Math NLI area. On the negative side: LLMs still struggle with mathematical language. They occasionally fail at even basic inferences. Current models are not as prone to hypothesis-only "inference" in our data as the previous generation had been. In addition to our findings, we also provide our corpora as data to support future work on Math NLI.
摘要：我们询问当代LLM是否能够在数学文本上执行自然语言推断（NLI）任务。我们将其称为数学NLI问题。我们构建了数学成对的语料库，其前提来自现有的数学文本，其假设和金标签是由在研究级数学和NLI领域中具有经验的人们提供的。我们还使用相同的前提调查了语料库的质量，但其假设由LLM本身提供。我们不仅调查了各种LLMS组的绩效，而且还研究了组间一致性。我们有正面和负面的发现。在我们的积极发现中：在某些情况下，使用LLM的多数票大约等同于在数学NLI地区使用人体标记的数据。在负面方面：LLM仍在数学语言上挣扎。他们偶尔甚至基本推论都失败了。当前的模型不像上一代一样，在我们的数据中不容易假设“推断”。除了我们的发现外，我们还提供我们的语料库作为数据，以支持Math NLI的未来工作。
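
The abstract above finds that a majority vote over several LLMs can approximate human labels in some settings. A minimal sketch of such a vote follows. The tie handling (returning `None`) and the example labels are assumptions; the abstract does not describe how the paper breaks ties.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label, or None when the top two counts tie."""
    if not labels:
        raise ValueError("no labels to vote on")
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: leave the item unlabeled (an assumption)
    return ranked[0][0]

# Hypothetical NLI predictions from three models for one premise/hypothesis pair.
print(majority_vote(["entailment", "entailment", "neutral"]))
```

Using an odd number of voters reduces ties for two-way decisions, though three-way NLI labels can still split evenly.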

Title: Exploring In-Context Learning for Frame-Semantic Parsing

Authors: Diego Garat, Guillermo Moncecchi, Dina Wonsever
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.23082
Pdf URL: https://arxiv.org/pdf/2507.23082
Copy Paste: [[2507.23082]] Exploring In-Context Learning for Frame-Semantic Parsing(https://arxiv.org/abs/2507.23082)
Keywords: language model, llm, prompt
Abstract: Frame Semantic Parsing (FSP) entails identifying predicates and labeling their arguments according to Frame Semantics. This paper investigates the use of In-Context Learning (ICL) with Large Language Models (LLMs) to perform FSP without model fine-tuning. We propose a method that automatically generates task-specific prompts for the Frame Identification (FI) and Frame Semantic Role Labeling (FSRL) subtasks, relying solely on the FrameNet database. These prompts, constructed from frame definitions and annotated examples, are used to guide six different LLMs. Experiments are conducted on a subset of frames related to violent events. The method achieves competitive results, with F1 scores of 94.3% for FI and 77.4% for FSRL. The findings suggest that ICL offers a practical and effective alternative to traditional fine-tuning for domain-specific FSP tasks.
摘要：框架语义解析（FSP）需要根据框架语义来识别谓词并标记其参数。本文调查了使用大语言模型（LLM）在不使用模型进行微调的情况下执行FSP的使用中文化学习（ICL）。我们提出了一种方法，该方法将自动为框架标识（FI）和框架语义角色标签（FSRL）子任务生成特定任务的提示，仅依赖于Framenet数据库。这些提示是根据框架定义和带注释的示例构建的，用于指导六个不同的LLM。实验是在与暴力事件相关的框架子集上进行的。该方法取得了竞争性的结果，FI的F1得分为94.3％，FSRL为77.4％。研究结果表明，ICL为特定领域的FSP任务提供了传统微调的实用有效替代方案。
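
The abstract above reports F1 scores for Frame Identification and Frame Semantic Role Labeling. For reference, F1 is the harmonic mean of precision and recall, which a short sketch makes concrete. The counts used in the example are made up and are not taken from the paper.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0  # nothing predicted and nothing to find
    # Algebraically equal to 2 * P * R / (P + R).
    return 2 * tp / denominator

# Hypothetical counts, for illustration only.
print(f1_score(tp=8, fp=2, fn=2))
```

Because the harmonic mean is dominated by the smaller of the two values, a system cannot reach a high F1 by trading away either precision or recall.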

Title: Context-aware Rotary Position Embedding

Authors: Ali Veisi, Delaram Fartoot, Hamidreza Amirzadeh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.23083
Pdf URL: https://arxiv.org/pdf/2507.23083
Copy Paste: [[2507.23083]] Context-aware Rotary Position Embedding(https://arxiv.org/abs/2507.23083)
Keywords: gpt
Abstract: Positional encoding is a vital component of Transformer architectures, enabling models to incorporate sequence order into self-attention mechanisms. Rotary Positional Embeddings (RoPE) have become a widely adopted solution due to their compatibility with relative position encoding and computational efficiency. However, RoPE relies on static, input-independent sinusoidal frequency patterns, limiting its ability to model context-sensitive relationships. In this work, we propose CARoPE (Context-Aware Rotary Positional Embedding), a novel generalization of RoPE that dynamically generates head-specific frequency patterns conditioned on token embeddings. This design introduces token- and context-sensitive positional representations while preserving RoPE efficiency and architectural simplicity. CARoPE computes input-dependent phase shifts using a bounded transformation of token embeddings and integrates them into the rotary mechanism across attention heads. We evaluate CARoPE on the FineWeb-Edu-10B dataset using GPT-2 variants trained on next-token prediction tasks. Experimental results show that CARoPE consistently outperforms RoPE and other common positional encoding baselines, achieving significantly lower perplexity, even at longer context lengths. Additionally, CARoPE enables faster training throughput without sacrificing model stability. These findings demonstrate that CARoPE offers a scalable, expressive, and efficient upgrade to existing positional encoding strategies in Transformer models.
摘要：位置编码是变压器体系结构的重要组成部分，使模型能够将序列顺序纳入自我发项机制。旋转位置嵌入（绳索）由于其与相对位置编码和计算效率的兼容性，已成为广泛采用的解决方案。但是，绳索依赖于静态，独立的正弦频率模式，从而限制了其对上下文敏感关系建模的能力。在这项工作中，我们提出了Carope（上下文感知的旋转位置嵌入），这是一种动态生成头部特定频率模式的绳索的新颖概括，该模式在令牌嵌入上。该设计引入了令牌和上下文敏感的位置表示，同时保留了绳索效率和建筑简单性。 Carope使用令牌嵌入的有界变换来计算输入依赖性相移，并将其整合到跨注意力头上的旋转机制中。我们使用对下一步预测任务培训的GPT-2变体在FineWeb-EDU-10B数据集上评估Carope。实验结果表明，Carope始终胜过绳索和其他常见的位置编码基线，即使在更长的上下文长度下，也会达到明显降低的困惑。此外，Carope可以在不牺牲模型稳定性的情况下更快的训练吞吐量。这些发现表明，Carope提供了可扩展，表现力和高效的升级，以升级到变压器模型中现有的位置编码策略。
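
To make the baseline in the abstract above concrete, here is a minimal sketch of standard RoPE, which rotates each pair of embedding dimensions by an angle proportional to the token position. It shows the static, input-independent frequencies that CARoPE is described as replacing with frequencies conditioned on token embeddings. It does not implement CARoPE itself, whose transformation is not specified in the abstract. The function names and the example vectors are assumptions.

```python
import math

def rope(vec, position, base=10000.0):
    """Apply standard rotary position embedding to one vector."""
    d = len(vec)
    if d % 2:
        raise ValueError("dimension must be even")
    out = []
    for i in range(0, d, 2):
        # Fixed frequency for this pair: depends only on the index, not the input.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Attention scores depend only on the relative offset (2 in both cases).
print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 2), rope(k, 0)))
```

The two printed scores agree up to floating-point error, which is the relative-position property that makes RoPE attractive and that a context-aware variant would presumably need to preserve or deliberately relax.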

Title: SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity

Authors: Ishani Mondal, Meera Bharadwaj, Ayush Roy, Aparna Garimella, Jordan Lee Boyd-Graber
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.23095
Pdf URL: https://arxiv.org/pdf/2507.23095
Copy Paste: [[2507.23095]] SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity(https://arxiv.org/abs/2507.23095)
Keywords: agent
Abstract: We present SMART-Editor, a framework for compositional layout and content editing across structured (posters, websites) and unstructured (natural images) domains. Unlike prior models that perform local edits, SMART-Editor preserves global coherence through two strategies: Reward-Refine, an inference-time reward-guided refinement method, and RewardDPO, a training-time preference optimization approach using reward-aligned layout pairs. To evaluate model performance, we introduce SMARTEdit-Bench, a benchmark covering multi-domain, cascading edit scenarios. SMART-Editor outperforms strong baselines like InstructPix2Pix and HIVE, with RewardDPO achieving up to 15% gains in structured settings and Reward-Refine showing advantages on natural images. Automatic and human evaluations confirm the value of reward-guided planning in producing semantically consistent and visually aligned edits.
摘要：我们提出了智能编辑器，这是一个跨结构化（海报，网站）和非结构化（自然图像）域进行组成布局和内容编辑的框架。与先前执行本地编辑的模型不同，Smart-编辑器通过两种策略来维护全球连贯性：奖励 - 雷福内斯（Reward-Refine），一种推理时间奖励指导的改进方法和奖励多德（RewardDPO），这是一种使用奖励平衡的布局对的培训时间优先优化方法。为了评估模型性能，我们介绍了Smartedit-Bench，这是一个涵盖多域，级联编辑方案的基准。智能编辑器的表现优于诸如ConstructPix2Pix和Hive之类的强大基线，在结构化设置和奖励中，RewardDPO在自然图像上的优势中获得了多达15％的收益。自动和人类评估证实了奖励指导计划在产生语义一致和视觉上对齐的编辑中的价值。

Title: RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL

Authors: Jeffrey Eben, Aitzaz Ahmad, Stephen Lau
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.23104
Pdf URL: https://arxiv.org/pdf/2507.23104
Copy Paste: [[2507.23104]] RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL(https://arxiv.org/abs/2507.23104)
Keywords: language model, llm
Abstract: Despite advances in large language model (LLM)-based natural language interfaces for databases, scaling to enterprise-level data catalogs remains an under-explored challenge. Prior works addressing this challenge rely on domain-specific fine-tuning, which complicates deployment, and fail to leverage important semantic context contained within database metadata. To address these limitations, we introduce a component-based retrieval architecture that decomposes database schemas and metadata into discrete semantic units, each separately indexed for targeted retrieval. Our approach prioritizes effective table identification while leveraging column-level information, ensuring the total number of retrieved tables remains within a manageable context budget. Experiments demonstrate that our method maintains high recall and accuracy, with our system outperforming baselines over massive databases with varying structure and available metadata. Our solution enables practical text-to-SQL systems deployable across diverse enterprise settings without specialized fine-tuning, addressing a critical scalability gap in natural language database interfaces.
摘要：尽管大型语言模型（LLM）基于数据库的自然语言界面的进步，但扩展到企业级数据目录仍然是一个不足的挑战。应对这一挑战的先前工作依赖于特定领域的微调 - 复杂的部署 - 并且无法利用数据库元数据中包含的重要语义上下文。为了解决这些限制，我们引入了一个基于组件的检索体系结构，该架构将数据库模式和元数据分解为离散的语义单元，每个单元分别索引了目标检索。我们的方法在利用列级信息的同时优先考虑有效的表格，以确保检索表的总数保持在可管理的上下文预算之内。实验表明，我们的方法保持较高的召回和准确性，我们的系统在具有不同结构和可用元数据的大规模数据库上的表现优于基础。我们的解决方案启用了可在不同企业设置中部署的实用文本到SQL系统，而无需专门的微调，从而解决了自然语言数据库接口中的关键可扩展性差距。

Title: Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity

Authors: Xinwei Wu, Haojie Li, Hongyu Liu, Xinyu Ji, Ruohan Li, Yule Chen, Yigeng Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.23121
Pdf URL: https://arxiv.org/pdf/2507.23121
Copy Paste: [[2507.23121]] Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity(https://arxiv.org/abs/2507.23121)
Keywords: language model, llm
Abstract: In this work, we study a critical research problem regarding the trustworthiness of large language models (LLMs): how LLMs behave when encountering ambiguous narrative text, with a particular focus on Chinese textual ambiguity. We created a benchmark dataset by collecting and generating ambiguous sentences with context and their corresponding disambiguated pairs, representing multiple possible interpretations. These annotated examples are systematically categorized into 3 main categories and 9 subcategories. Through experiments, we discovered significant fragility in LLMs when handling ambiguity, revealing behavior that differs substantially from humans. Specifically, LLMs cannot reliably distinguish ambiguous text from unambiguous text, show overconfidence in interpreting ambiguous text as having a single meaning rather than multiple meanings, and exhibit overthinking when attempting to understand the various possible meanings. Our findings highlight a fundamental limitation in current LLMs that has significant implications for their deployment in real-world applications where linguistic ambiguity is common, calling for improved approaches to handle uncertainty in language understanding. The dataset and code are publicly available at this GitHub repository: this https URL.
摘要：在这项工作中，我们研究了一个有关大语言模型（LLMS）的可信度的关键研究问题：LLM在遇到模棱两可的叙事文本时的行为，特别关注中国文本歧义。我们通过收集和生成具有上下文及其相应歧义对的模棱两可的句子来创建一个基准数据集，代表了多种可能的解释。这些注释的示例系统地分为3个主要类别和9个子类别。通过实验，我们在处理歧义时发现了LLM中的显着脆弱性，揭示了与人类有很大不同的行为。具体而言，LLM不能可靠地将模棱两可的文本与明确的文本区分开来，在解释模棱两可的文本中表现出过度自信，因为文本具有单一的含义而不是多种含义，并且在尝试理解各种可能的含义时表现出过度思考。我们的发现突出了当前LLM的基本局限性，该限制对它们在语言歧义很普遍的现实应用程序中的部署具有重要意义，呼吁改进方法来处理语言理解中的不确定性。数据集和代码可在此GitHub存储库中公开可用：此HTTPS URL。

Title: ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans

Authors: Ananya Sadana, Yash Kumar Lal, Jiawei Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.23135
Pdf URL: https://arxiv.org/pdf/2507.23135
Copy Paste: [[2507.23135]] ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans(https://arxiv.org/abs/2507.23135)
Keywords: language model, chain-of-thought
Abstract: Understanding causal relationships across modalities is a core challenge for multimodal models operating in real-world environments. We introduce ISO-Bench, a benchmark for evaluating whether models can infer causal dependencies between visual observations and procedural text. Each example presents an image of a task step and a text snippet from a plan, with the goal of deciding whether the visual step occurs before or after the referenced text step. Evaluation results on ten frontier vision-language models show underwhelming performance: the best zero-shot F1 is only 0.57, and chain-of-thought reasoning yields only modest gains (up to 0.62 F1), remaining far behind humans (0.98 F1). Our analysis further highlights concrete directions for improving causal understanding in multimodal models.
摘要：对于在现实世界环境中运行的多模式模型，了解跨模式的因果关系是一个核心挑战。我们介绍了ISO基础，这是评估模型是否可以在视觉观察和程序文本之间推断因果关系的基准。每个示例都会从计划中介绍一个任务步骤的图像和一个文本片段，目的是确定视觉步骤是在引用文本步骤之前还是之后发生的。十个边界视觉模型的评估结果表现出巨大的性能：最佳的零射击F1仅为0.57，经营链的推理仅产生适度的增长（高达0.62 f1），很大程度上落后于人类（0.98 F1）。我们的分析进一步凸显了具体方向，以改善多模型模型中的因果理解。

Title: User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal

Authors: Yuhan Liu, Michael J.Q. Zhang, Eunsol Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.23158
Pdf URL: https://arxiv.org/pdf/2507.23158
Copy Paste: [[2507.23158]] User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal(https://arxiv.org/abs/2507.23158)
Keywords: language model, llm, prompt, chat
Abstract: Once language models (LMs) are deployed, they can interact with users long-term, ideally evolving continuously based on their feedback. Asking for direct user feedback can be disruptive; thus, we study harvesting user feedback from user-LM interaction logs. We study implicit user feedback in two user-LM interaction datasets (WildChat and LMSYS). First, we analyze user feedback in the user-LLM conversation trajectory, providing insights into when and why such feedback occurs. Second, we study harvesting learning signals from such implicit user feedback. We find that the contents of user feedback (e.g., the user wanted clarification), not just the polarity (e.g., the user was unhappy with the previous model response), can improve model performance on short human-designed questions (MTBench) but not on longer and more complex questions (WildBench). We also find that the usefulness of user feedback is largely tied to the quality of the user's initial prompt. Together, we provide an in-depth study of implicit user feedback, showing its potential and limitations.
摘要：一旦部署了语言模型（LMS），他们就可以根据反馈来长期与用户进行长期互动。要求直接的用户反馈可能会破坏；因此，我们研究了从用户LM交互日志中收获用户反馈。我们在两个用户LM交互数据集（Wildchat和LMSYS）中研究隐式用户反馈。首先，我们在用户-LLM对话轨迹中分析用户反馈，从而提供有关何时以及为什么发生这种反馈的见解。其次，我们从这种隐式用户反馈中研究收集学习信号。我们发现，用户反馈的内容（例如，用户想要澄清），不仅是极性（例如，用户对先前的模型响应不满意），还可以在短期人类设计的问题（mtbench）中改善模型性能（MTBENCH），而不是在更长和更复杂的问题上（Wildbench）。我们还发现，用户反馈的有用性在很大程度上与用户初始提示的质量有关。一起，我们提供了一项对隐性用户反馈的深入研究，显示了其潜力和局限性。

Title: LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration

Authors: Jizhou Guo
Subjects: cs.CL, cs.AI, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2507.23167
Pdf URL: https://arxiv.org/pdf/2507.23167
Copy Paste: [[2507.23167]] LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration(https://arxiv.org/abs/2507.23167)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various tasks, with different models excelling in distinct domains and specific abilities. Effectively combining the predictions of multiple LLMs is crucial for enhancing system robustness and performance. However, existing ensemble methods often rely on simple techniques like voting or logits ensembling, which overlook the varying confidence and reliability of models in different contexts. In this work, we propose LENS (Learning ENsemble confidence from Neural States), a novel approach that learns to estimate model confidence by analyzing internal representations. For each LLM, we train a lightweight linear confidence predictor that leverages layer-wise hidden states and normalized probabilities as inputs. This allows for more nuanced weighting of model predictions based on their context-dependent reliability. Our method does not require modifying the model parameters and requires negligible additional computation. Experimental results on multiple-choice and boolean question-answering tasks demonstrate that LENS outperforms traditional ensemble methods by a substantial margin. Our findings suggest that internal representations provide valuable signals for determining model confidence and can be effectively leveraged for ensemble learning.
摘要：大型语言模型（LLMS）在各种任务中都表现出了令人印象深刻的表现，不同的模型在不同的领域和特定能力方面表现出色。有效地结合多个LLM的预测对于增强系统鲁棒性和性能至关重要。但是，现有的合奏方法通常依赖于简单的技术，例如投票或逻辑结合，它们忽略了不同情况下模型的不同信心和可靠性。在这项工作中，我们提出了镜头（从神经状态中学习集成信心），这种新方法通过分析内部表示来估计模型信心。对于每个LLM，我们训练一个轻巧的线性置信预测指标，该预测指标利用层面的隐藏状态和标准化概率作为输入。这允许根据模型预测的上下文相关的可靠性对预测进行更细微的权重。我们的方法不需要修改模型参数，并且需要可忽略的其他计算。多项选择和布尔问题提问任务的实验结果表明，镜头的表现优于传统的合奏方法。我们的发现表明，内部表示为确定模型置信度提供了有价值的信号，并且可以有效利用整体学习。
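
The confidence-weighted ensembling described in the abstract above can be sketched as follows. This is an illustrative sketch only: the sigmoid link, the feature layout, and the names `linear_confidence` and `ensemble_answer` are assumptions, and the weights here are supplied by hand rather than trained as in the paper.

```python
import math
from collections import defaultdict

def linear_confidence(features, weights, bias):
    """Linear predictor over a model's internal features, squashed to (0, 1)."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def ensemble_answer(model_outputs, predictors):
    """Pick the answer with the highest total predicted confidence.

    model_outputs: list of (answer, features), one entry per LLM, where
                   features stands in for hidden states and probabilities.
    predictors:    list of (weights, bias), one linear predictor per LLM.
    """
    scores = defaultdict(float)
    for (answer, features), (weights, bias) in zip(model_outputs, predictors):
        scores[answer] += linear_confidence(features, weights, bias)
    return max(scores, key=scores.get)
```

Unlike plain majority voting, a single model with high predicted confidence can outweigh several models whose predictors assign them low confidence on the same input.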

Title: Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

Authors: Jianghui Wang, Vinay Joshi, Saptarshi Majumder, Xu Chao, Bin Ding, Ziqiong Liu, Pratik Prabhanjan Brahma, Dong Li, Zicheng Liu, Emad Barsoum
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.23194
Pdf URL: https://arxiv.org/pdf/2507.23194
Copy Paste: [[2507.23194]] Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks(https://arxiv.org/abs/2507.23194)
Keywords: llm, prompt, agent
Abstract: The demand for AI-generated GPU kernels is rapidly growing, influenced by the need for scalable, hardware-optimized solutions in both industry and academia. As deep learning workloads grow in complexity and diversity, it is imperative to automate low-level kernel development to meet performance and productivity demands. Major cloud providers, semiconductor companies, and research institutions are now investing heavily in AI-driven code generation for GPUs, aiming to reduce manual optimization efforts while achieving near-expert performance on hardware like AMD MI300X. The Triton language, a Python-based DSL for GPU programming, has emerged as a popular target for such AI-generated kernels due to its balance of performance and ease-of-coding. In this work, we present an evaluation suite for Triton-based GPU kernels and GEAK (Generating Efficient AI-centric GPU Kernels), a framework that leverages cutting-edge LLMs to generate performant Triton code specifically for AMD GPUs, including the AMD MI300X and MI250. GEAK leverages inference-time compute scaling to produce Triton-based GPU kernels using a reasoning loop adapted from Reflexion-style feedback mechanisms. On two evaluation benchmarks, GEAK significantly outperformed the baselines of directly prompting frontier LLMs as well as Reflexion-based generation pipelines by achieving correctness of up to 63% and execution speedups of up to 2.59x. These results highlight the promise of GEAK-like agentic code generation for accelerating the adoption of diverse hardware platforms and democratizing access to expert-level kernel performance.
摘要：对AI生成的GPU内核的需求正在迅速增长，这受到行业和学术界对可扩展，硬件优化解决方案的需求的影响。随着深度学习工作负载的复杂性和多样性的增长，必须自动化低级内核开发以满足性能和生产力需求。主要的云提供商，半导体公司和研究机构现在正在为GPU的AI驱动代码生成大量投资，旨在减少手动优化工作，同时在AMD MI300X（例如AMD MI300X）上实现近乎专家的性能。 Triton语言是GPU编程的基于Python的DSL，由于其性能平衡和易于编码，因此已成为这种AI生成的内核的流行目标。在这项工作中，我们提出了一个基于Triton的GPU内核和GEAK（生成高效AI以AI为中心的GPU内核）的评估套件 - 一个框架，该框架利用了最先进的LLMS来为AMD GPU，包括AMD MI300X和MI250，为AMD GPUS生成了性能Triton代码。 GEAK利用推理时间计算缩放量表，使用改编自反射式反馈机制的推理循环产生基于Triton的GPU内核。在两个评估基准测试中，Geak明显优于直接提示Frontier LLM的基准以及基于反射的生成管道的基准，可实现高达$ 63 $％的正确性，执行速度高达$ 2.59 $ x。这些结果凸显了类似Geak型代理代码生成的希望，即加速采用多种硬件平台并使访问专家级内核的性能民主化。
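
The abstract above mentions a reasoning loop adapted from Reflexion-style feedback. A generic generate, evaluate, and retry loop of that kind might look like the sketch below. The callables `generate` and `evaluate` are hypothetical placeholders for an LLM call and a kernel test harness; nothing here reflects GEAK's actual interfaces.

```python
def refine_with_feedback(generate, evaluate, task, max_rounds=3):
    """Generate a candidate, evaluate it, and feed failures back into the next try.

    generate(task, feedback) -> candidate   (hypothetical LLM call)
    evaluate(candidate)      -> (ok, msg)   (hypothetical correctness/perf check)
    Returns (last_candidate, succeeded).
    """
    feedback = []          # accumulated evaluator messages from failed rounds
    candidate = None
    for _ in range(max_rounds):
        candidate = generate(task, feedback)
        ok, message = evaluate(candidate)
        if ok:
            return candidate, True
        feedback.append(message)
    return candidate, False
```

Spending more rounds is one simple form of inference-time compute scaling: each extra round costs another generation and evaluation but gives the generator more failure information to work with.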

Title: Failures Are the Stepping Stones to Success: Enhancing Few-Shot In-Context Learning by Leveraging Negative Samples

Authors: Yunhao Liang, Ruixuan Ying, Takuya Taniguchi, Zhe Cui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.23211
Pdf URL: https://arxiv.org/pdf/2507.23211
Copy Paste: [[2507.23211]] Failures Are the Stepping Stones to Success: Enhancing Few-Shot In-Context Learning by Leveraging Negative Samples(https://arxiv.org/abs/2507.23211)
Keywords: language model
Abstract: Large Language Models exhibit powerful few-shot in-context learning (ICL) capabilities, but the performance is highly sensitive to the provided examples. Recent research has focused on retrieving corresponding examples for each input query, not only enhancing the efficiency and scalability of the learning process but also mitigating inherent biases in manual example selection. However, these studies have primarily emphasized leveraging Positive samples while overlooking the additional information within Negative samples for contextual learning. We propose a novel method that utilizes Negative samples to better select Positive sample examples, thereby enhancing the performance of few-shot ICL. Initially, we construct Positive and Negative sample corpora based on Zero-Shot-CoT. Then, during inference, we employ a semantic similarity-based approach to select the most similar examples from both the Positive and Negative corpora for a given query. Subsequently, we further retrieve Positive examples from the Positive sample corpus based on semantic similarity to the Negative examples, and then concatenate them with the previously selected Positive examples to serve as ICL demonstrations. Experimental results demonstrate that our approach surpasses methods solely relying on the most similar Positive examples for context, validating that the additional information in Negative samples aids in enhancing ICL performance through improved Positive sample selection.
摘要：大型语言模型表现出强大的少数镜头学习（ICL）功能，但性能对提供的示例非常敏感。最近的研究重点是检索每个输入查询的相应示例，不仅提高了学习过程的效率和可扩展性，而且还可以减轻手动示例选择中固有的偏见。但是，这些研究主要强调利用积极样本，同时忽略负面样本中的其他信息进行上下文学习。我们提出了一种新型方法，该方法利用负样本更好地选择了阳性样本示例，从而增强了少量ICL的性能。最初，我们基于零拍摄的构建正面和负样本语料库。然后，在推断期间，我们采用基于语义相似的方法来从给定查询中从正面和负面语料库中选择最相似的示例。随后，我们基于与阴性示例的语义相似性从阳性样品语料库中进一步检索了阳性示例，然后将它们与先前选择的阳性示例相连，以作为ICL示范。实验结果表明，我们的方法超过了仅依靠上下文最相似的积极示例的方法，从而证实了阴性样本中的其他信息有助于通过改进的阳性样品选择来增强ICL性能。
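
The two-stage selection described in the abstract above can be sketched with cosine similarity over toy embedding vectors. The corpus layout (lists of `(text, vector)` pairs), the value of `k`, and the order of concatenation are assumptions made for illustration; the abstract does not specify them.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, corpus, k):
    """Return the k (text, vector) items most similar to query_vec."""
    return sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)[:k]

def select_demonstrations(query_vec, positives, negatives, k=1):
    """Pick positives near the query, plus positives near the query's nearest negatives."""
    direct = top_k(query_vec, positives, k)          # stage 1: positives for the query
    near_negatives = top_k(query_vec, negatives, k)  # stage 1: negatives for the query
    extra = []
    for _, neg_vec in near_negatives:                # stage 2: positives near each negative
        for cand in top_k(neg_vec, positives, k):
            if cand not in direct and cand not in extra:
                extra.append(cand)
    return [text for text, _ in direct + extra]
```

The second stage surfaces positive examples that resemble cases where the model previously failed, which plain query-to-positive retrieval would miss.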

Title: Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders

Authors: Carolina Zheng, Nicolas Beltran-Velez, Sweta Karlekar, Claudia Shi, Achille Nazaret, Asif Mallik, Amir Feder, David M. Blei
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.23220
Pdf URL: https://arxiv.org/pdf/2507.23220
Copy Paste: [[2507.23220]] Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders(https://arxiv.org/abs/2507.23220)
Keywords: llm
Abstract: Traditional topic models are effective at uncovering latent themes in large text collections. However, due to their reliance on bag-of-words representations, they struggle to capture semantically abstract features. While some neural variants use richer representations, they are similarly constrained by expressing topics as word lists, which limits their ability to articulate complex topics. We introduce Mechanistic Topic Models (MTMs), a class of topic models that operate on interpretable features learned by sparse autoencoders (SAEs). By defining topics over this semantically rich space, MTMs can reveal deeper conceptual themes with expressive feature descriptions. Moreover, uniquely among topic models, MTMs enable controllable text generation using topic-based steering vectors. To properly evaluate MTM topics against word-list-based approaches, we propose "topic judge", an LLM-based pairwise comparison evaluation framework. Across five datasets, MTMs match or exceed traditional and neural baselines on coherence metrics, are consistently preferred by topic judge, and enable effective steering of LLM outputs.
摘要：传统的主题模型可有效地发现大型文本集中的潜在主题。但是，由于他们依靠词袋表示，他们努力捕获语义上抽象的特征。尽管某些神经变体使用更丰富的表示，但它们同样通过将主题表示为单词列表来限制，这限制了它们阐明复杂主题的能力。我们介绍了机械主题模型（MTMS），这是一类主题模型，该模型以稀疏自动编码器（SAE）学到的可解释功能运行。通过在这个语义丰富的空间中定义主题，MTM可以通过表达性特征描述来揭示更深入的概念主题。此外，在主题模型中，MTMS独特地使用基于主题的转向向量启用可控制的文本生成。为了适当评估MTM主题针对基于单词列表的方法，我们提出了一个基于LLM的成对比较评估框架\ textit {topic judge}。在五个数据集中，MTM匹配或超过传统和神经基线在相干指标上，受主题法官一致地优先，并启用LLM输出的有效转向。

Title: Enabling Few-Shot Alzheimer's Disease Diagnosis on Tabular Biomarker Data with LLMs

Authors: Sophie Kearney, Shu Yang, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Jason Moore, Marylyn Ritchie, Li Shen
Subjects: cs.CL, cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2507.23227
Pdf URL: https://arxiv.org/pdf/2507.23227
Copy Paste: [[2507.23227]] Enabling Few-Shot Alzheimer's Disease Diagnosis on Tabular Biomarker Data with LLMs(https://arxiv.org/abs/2507.23227)
Keywords: language model, gpt, llm, prompt, agent
Abstract: Early and accurate diagnosis of Alzheimer's disease (AD), a complex neurodegenerative disorder, requires analysis of heterogeneous biomarkers (e.g., neuroimaging, genetic risk factors, cognitive tests, and cerebrospinal fluid proteins) typically represented in a tabular format. With flexible few-shot reasoning, multimodal integration, and natural-language-based interpretability, large language models (LLMs) offer unprecedented opportunities for prediction with structured biomedical data. We propose a novel framework called TAP-GPT, Tabular Alzheimer's Prediction GPT, that adapts TableGPT2, a multimodal tabular-specialized LLM originally developed for business intelligence tasks, for AD diagnosis using structured biomarker data with small sample sizes. Our approach constructs few-shot tabular prompts using in-context learning examples from structured biomedical data and fine-tunes TableGPT2 using the parameter-efficient QLoRA adaptation for a clinical binary classification task of AD or cognitively normal (CN). The TAP-GPT framework harnesses the powerful tabular understanding ability of TableGPT2 and the encoded prior knowledge of LLMs to outperform more advanced general-purpose LLMs and a tabular foundation model (TFM) developed for prediction tasks. To our knowledge, this is the first application of LLMs to the prediction task using tabular biomarker data, paving the way for future LLM-driven multi-agent frameworks in biomedical informatics.
摘要：早期，准确的诊断阿尔茨海默氏病（AD）是一种复杂的神经退行性疾病，需要分析异质生物标志物（例如神经影像学，遗传危险因素，认知测试和脑脊髓液体蛋白）通常以表格形式表示。通过灵活的几次推理，多模式集成和基于自然语言的解释性，大语言模型（LLMS）为使用结构化的生物医学数据提供了前所未有的预测机会。我们提出了一个名为Tap-gpt的新型框架，即Alzheimer的Tap-GPT GPT，它适应了TableGpt2，这是一种最初用于商业智能任务的多模式表格特殊的LLM，用于使用带有小样本量的结构化生物标记数据进行AD诊断。我们的方法使用在结构化的生物医学数据中使用中文学习示例构建了很少的表格提示，并使用参数效率的Qlora适应性进行了AD或认知正常（CN）的临床二元分类任务。 TAP-GPT框架可以利用TableGPT2和LLMS编码的先验知识的强大表达理解能力，以优于更高级的通用LLM和为预测任务而开发的表格基础模型（TFM）。据我们所知，这是LLM使用表格生物标志物数据在预测任务中的首次应用，为生物医学信息学中的未来LLM驱动的多代理框架铺平了道路。

Title: P-ReMIS: Pragmatic Reasoning in Mental Health and a Social Implication

Authors: Sneha Oram, Pushpak Bhattacharyya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.23247
Pdf URL: https://arxiv.org/pdf/2507.23247
Copy Paste: [[2507.23247]] P-ReMIS: Pragmatic Reasoning in Mental Health and a Social Implication(https://arxiv.org/abs/2507.23247)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Recent work has advanced the explainability and development of personalized chatbots for mental health. However, the reasoning aspects of explainability and dialogue discourse have not previously been explored for mental health. Hence, we investigate the pragmatic reasoning capability of large language models (LLMs) in this domain. We introduce the P-ReMe dataset and propose a modified definition for the pragmatic phenomena of implicature (implied meaning) and presupposition (implicit assumption) in mental health. Following the definition, we formulate two tasks in implicature and one task in presupposition. To benchmark the dataset and the presented tasks, we consider four models: Llama3.1, Mistral, MentaLLaMa, and Qwen. The results of the experiments suggest that Mistral and Qwen show substantial reasoning capabilities in the domain. In addition, we propose StiPRompts to study the stigma around mental health with the state-of-the-art LLMs GPT-4o mini, Deepseek-chat, and Claude-3.5-haiku. Our evaluated findings show that Claude-3.5-haiku deals with the stigma more responsibly compared to the other two LLMs.
摘要：在个性化聊天机器人的心理健康聊天机器人的解释性和开发方面，最近的进步有所提高。但是，以前尚未为心理健康探索解释性和对话话语的推理方面。因此，我们正在研究该领域中大语言模型（LLM）的务实推理能力。我们介绍了P-Reme数据集，并提出了对含义（隐含含义）和预设（隐式假设）的修改定义。按照定义，我们在预设中含义为两个任务和一个任务。为了基准数据集和提交的任务，我们考虑了四种模型-Llama3.1，Mistral，Mentallama和Qwen。实验的结果表明，Mistral和Qwen在域中显示出很大的推理能力。此外，我们还建议使用最先进的LLM，GPT-4O Mini，DeepSeek-Chat和Claude-3.5-Haiku研究心理健康的污名。我们的评估发现表明，与其他两个LLM相比，Claude-3.5-Haiku更负责任地处理污名。

Title: Evaluating LLMs' Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis

Authors: Shimanto Bhowmik, Tawsif Tashwar Dipto, Md Sazzad Islam, Sheryl Hsu, Tahsin Reasat
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.23248
Pdf URL: https://arxiv.org/pdf/2507.23248
Copy Paste: [[2507.23248]] Evaluating LLMs' Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis(https://arxiv.org/abs/2507.23248)
Keywords: language model, llm
Abstract: Bengali is an underrepresented language in NLP research, and it remains a challenge due to its unique linguistic structure and computational constraints. In this work, we systematically investigate the challenges that hinder Bengali NLP performance by focusing on the absence of standardized evaluation benchmarks. We then evaluated 10 recent open source Large Language Models (LLMs) on 8 of the translated datasets and performed a comprehensive error analysis to pinpoint their primary failure modes. Our findings reveal consistent performance gaps for Bengali compared to English, particularly for smaller models and specific model families like Mistral. We also identified promising robustness in certain architectures, such as DeepSeek, that maintain more stable performance across languages. Our analysis reveals an inverse relationship between tokenization efficiency and LLM accuracy where models tend to perform worse when inputs are excessively tokenized, whereas more efficient and concise tokenization results in improved performance. These findings highlight critical areas where current models fall short and underscore the need for improved dataset quality and evaluation methodologies tailored to multilingual contexts. This work will catalyze further research on NLP for underrepresented languages, helping to democratize access to advanced language technologies worldwide. The code and dataset used in this research are publicly available at this https URL.
摘要：孟加拉语是NLP研究中代表性不足的语言。但是，由于其独特的语言结构和计算约束，这仍然是一个挑战。在这项工作中，我们通过专注于缺乏标准化的评估基准，系统地研究了阻碍孟加拉NLP绩效的挑战。然后，我们评估了8个翻译数据集中的10个最近的开源大语言模型（LLM），并进行了全面的错误分析以查明其主要故障模式。与英语相比，我们的发现揭示了孟加拉语的稳定绩效差距，尤其是对于较小的模型和诸如Mistral之类的特定模型家族。我们还确定了在某些架构（例如DeepSeek）中的有希望的鲁棒性，这些结构在跨语言中保持更稳定的性能。我们的分析揭示了令牌化效率与LLM精度之间的反比关系，当输入过度象征化时，模型往往会表现较差，而更有效的\＆简洁的令牌化会导致性能提高。这些发现突出了关键的领域，这些领域当前模型短暂而强调了针对多语言环境量身定制的改进数据集质量和评估方法的需求。这项工作将促进对代表性不足的语言的NLP的进一步研究，从而使全球访问先进的语言技术的访问民主化。本研究中使用的代码和数据集可在此HTTPS URL上公开获得。

Title: Unveiling Super Experts in Mixture-of-Experts Large Language Models

Authors: Zunhai Su, Qingyuan Li, Hao Zhang, YuLei Qian, Yuchen Xie, Kehong Yuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.23279
Pdf URL: https://arxiv.org/pdf/2507.23279
Copy Paste: [[2507.23279]] Unveiling Super Experts in Mixture-of-Experts Large Language Models(https://arxiv.org/abs/2507.23279)
Keywords: language model, llm
Abstract: Sparsely activated Mixture-of-Experts (MoE) models have shown promise in enhancing the learning capacity of large language models (LLMs). Leveraging the intrinsic importance differences among experts, recent research has explored expert-level compression techniques to improve the efficiency of MoE LLMs. However, existing approaches often rely on empirical criteria to identify critical experts, lacking a deeper exploration and understanding of the heterogeneous importance of experts. In this study, we present the first discovery and investigation of a distinct subset of experts that play a crucial role in the underlying mechanisms during the model's forward inference. These experts are prevalent in open-source MoE LLMs, and despite their limited number, pruning them leads to a significant decline in model performance (e.g., pruning three causes Qwen3-30B-A3B to produce repetitive and uninformative outputs). We refer to these experts as Super Experts (SEs). Our comprehensive analysis provides progressively deeper insights into SEs. (i) SEs are characterized by rare but extreme activation outliers in the output of the down_proj, which give rise to massive activations in the hidden states between decoder layers. Moreover, the distribution of SEs remains model-specific and is unaffected by post-training processes. (ii) By pruning SEs, we assess their significance across a variety of tasks, revealing their considerable impact on the model's overall performance, particularly in mathematical reasoning. (iii) We further enhance our understanding of the influence of SE compression. Our findings confirm that MoE LLMs rely on SEs to induce attention sinks, which are crucial for the distribution of attention scores but are significantly disrupted by SE pruning. The code is available at this https URL.
摘要：稀疏激活的专家（MOE）模型在增强大语言模型（LLMS）的学习能力方面表现出了希望。利用专家之间的内在重要性差异，最近的研究探索了专家级压缩技术，以提高MOE LLM的效率。但是，现有方法通常依靠经验标准来确定批判专家，缺乏对专家的异质重要性的更深入的探索和理解。在这项研究中，我们介绍了对模型前进推断期间的基本机制中至关重要的专家子集的第一个发现和研究。这些专家在开源Moe LLM中很普遍，尽管它们的数量有限，但它们的修剪导致模型性能的显着下降（例如，修剪三个会导致QWEN3-30B-A3B产生重复性和非信息输出）。我们将这些专家称为超级专家（SES）。我们的全面分析为SES提供了更深入的见解。（i）SES的特征是down_proj的输出中的罕见但极端的激活异常值，这在解码器层之间的隐藏状态引起了大量激活。此外，SES的分布仍然是特定于模型的，并且不受训练后过程的影响。（ii）通过修剪SES，我们评估了它们在各种任务中的重要性，从而揭示了它们对模型的整体表现的重大影响，尤其是在数学推理中。（iii）我们进一步增强了对SES压缩影响的理解。我们的发现证实，Moe LLMS依靠SES引起注意力集，这对于注意力评分的分布至关重要，但被SE修剪严重破坏了。该代码可在此HTTPS URL上找到。

Title: What's Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content

Authors: Alfio Ferrara, Sergio Picascia, Laura Pinnavaia, Vojimir Ranitovic, Elisabetta Rocchetti, Alice Tuveri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.23319
Pdf URL: https://arxiv.org/pdf/2507.23319
Copy Paste: [[2507.23319]] What's Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content(https://arxiv.org/abs/2507.23319)
Keywords: language model, gpt, llm
Abstract: Proprietary Large Language Models (LLMs) have shown tendencies toward politeness, formality, and implicit content moderation. While previous research has primarily focused on explicitly training models to moderate and detoxify sensitive content, there has been limited exploration of whether LLMs implicitly sanitize language without explicit instructions. This study empirically analyzes the implicit moderation behavior of GPT-4o-mini when paraphrasing sensitive content and evaluates the extent of sensitivity shifts. Our experiments indicate that GPT-4o-mini systematically moderates content toward less sensitive classes, with substantial reductions in derogatory and taboo language. Also, we evaluate the zero-shot capabilities of LLMs in classifying sentence sensitivity, comparing their performances against traditional methods.
摘要：专有的大型语言模型（LLMS）显示了对礼貌，形式和隐性内容适度的趋势。尽管以前的研究主要集中于明确培训模型以适度和排毒敏感内容，但对LLMS是否在没有明确指示的情况下隐式消毒语言的探索有限。这项研究从经验上分析了某些敏感含量并评估敏感性转移程度时，GPT-4O-Mini的隐式适度行为。我们的实验表明，GPT-4O-MINI系统地将内容缓和了较低的敏感类别，贬义和禁忌语言大大降低。此外，我们评估了LLM在分类句子灵敏度时的零射击功能，将其表现与传统方法进行比较。

Title: MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation

Authors: Daeyong Kwon, SeungHeon Doh, Juhan Nam
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2507.23334
Pdf URL: https://arxiv.org/pdf/2507.23334
Copy Paste: [[2507.23334]] MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation(https://arxiv.org/abs/2507.23334)
Keywords: language model, llm, retrieval augmented generation
Abstract: Recent advancements in large language models (LLMs) have demonstrated remarkable capabilities across diverse domains. While they exhibit strong zero-shot performance on various tasks, LLMs' effectiveness in music-related applications remains limited due to the relatively small proportion of music-specific knowledge in their training data. To address this limitation, we propose MusT-RAG, a comprehensive framework based on Retrieval Augmented Generation (RAG) to adapt general-purpose LLMs for text-only music question answering (MQA) tasks. RAG is a technique that provides external knowledge to LLMs by retrieving relevant context information when generating answers to questions. To optimize RAG for the music domain, we (1) propose MusWikiDB, a music-specialized vector database for the retrieval stage, and (2) utilize context information during both inference and fine-tuning processes to effectively transform general-purpose LLMs into music-specific models. Our experiment demonstrates that MusT-RAG significantly outperforms traditional fine-tuning approaches in enhancing LLMs' music domain adaptation capabilities, showing consistent improvements across both in-domain and out-of-domain MQA benchmarks. Additionally, our MusWikiDB proves substantially more effective than general Wikipedia corpora, delivering superior performance and computational efficiency.
摘要：大型语言模型（LLM）的最新进步表明，各个领域都具有显着的功能。尽管他们在各种任务上表现出强劲的零射击性能，但由于其培训数据中，LLMS在音乐相关应用程序中的有效性仍然有限。为了解决这一限制，我们提出了必不可少的rag，这是一个基于检索增强发电（RAG）的综合框架，以适应通用的LLM，以适应仅文本音乐问题答录（MQA）任务。 RAG是一种通过在产生问题的答案时检索相关上下文信息来为LLM提供外部知识的技术。为了优化音乐领域的抹布，我们（1）提出了MuswikidB，这是一个用于检索阶段的音乐专用矢量数据库，（2）在推理和微调过程中都利用上下文信息，以有效地将通用用途LLMS转化为音乐特异性模型。我们的实验表明，必须抹布在增强LLMS音乐领域的适应能力方面的传统微调方法明显胜过传统的微调方法，从而显示出始终如一的内域和室外MQA基准的一致改进。此外，我们的MuswikidB比一般的Wikipedia Corpora更有效，具有较高的性能和计算效率。

Title: Text-to-SQL Task-oriented Dialogue Ontology Construction

Authors: Renato Vukovic, Carel van Niekerk, Michael Heck, Benjamin Ruppik, Hsien-Chin Lin, Shutong Feng, Nurul Lubis, Milica Gasic
Subjects: cs.CL, cs.AI, cs.DB, cs.IR
Abstract URL: https://arxiv.org/abs/2507.23358
Pdf URL: https://arxiv.org/pdf/2507.23358
Copy Paste: [[2507.23358]] Text-to-SQL Task-oriented Dialogue Ontology Construction(https://arxiv.org/abs/2507.23358)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are widely used as general-purpose knowledge sources, but they rely on parametric knowledge, limiting explainability and trustworthiness. In task-oriented dialogue (TOD) systems, this separation is explicit, using an external database structured by an explicit ontology to ensure explainability and controllability. However, building such ontologies requires manual labels or supervised training. We introduce TeQoDO: a Text-to-SQL task-oriented Dialogue Ontology construction method. Here, an LLM autonomously builds a TOD ontology from scratch without supervision using its inherent SQL programming capabilities combined with dialogue theory provided in the prompt. We show that TeQoDO outperforms transfer learning approaches, and its constructed ontology is competitive on a downstream dialogue state tracking task. Ablation studies demonstrate the key role of dialogue theory. TeQoDO also scales to allow construction of much larger ontologies, which we investigate on a Wikipedia and ArXiv dataset. We view this as a step towards broader application of ontologies to increase LLM explainability.

Title: MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models

Authors: Yiyan Ji, Haoran Chen, Qiguang Chen, Chengyue Wu, Libo Qin, Wanxiang Che
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2507.23382
Pdf URL: https://arxiv.org/pdf/2507.23382
Copy Paste: [[2507.23382]] MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models(https://arxiv.org/abs/2507.23382)
Keywords: language model, llm, prompt
Abstract: Multimodal planning capabilities refer to the ability to predict, reason, and design steps for task execution with multimodal context, which is essential for complex reasoning and decision-making across multiple steps. However, current benchmarks face two key challenges: (1) they cannot directly assess multimodal real-world planning capabilities, and (2) they lack constraints or implicit constraints across modalities. To address these issues, we introduce Multimodal Planning with Complex Constraints (MPCC), the first benchmark to systematically evaluate MLLMs' ability to handle multimodal constraints in planning. To address the first challenge, MPCC focuses on three real-world tasks: Flight Planning, Calendar Planning, and Meeting Planning. To solve the second challenge, we introduce complex constraints (e.g. budget, temporal, and spatial) in these tasks, with graded difficulty levels (EASY, MEDIUM, HARD) to separate constraint complexity from search space expansion. Experiments on 13 advanced MLLMs reveal significant challenges: closed-source models achieve only 21.3% feasible plans, while open-source models average below 11%. Additionally, we observe that MLLMs are highly sensitive to constraint complexity and that traditional multimodal prompting strategies fail in multi-constraint scenarios. Our work formalizes multimodal constraints in planning, provides a rigorous evaluation framework, and highlights the need for advancements in constraint-aware reasoning for real-world MLLM applications.
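To make the notion of a "feasible plan" under budget and temporal constraints concrete, here is a minimal sketch. The plan representation (a list of `(start, end, cost)` legs), the specific constraints checked, and the function names are all assumptions for illustration; the benchmark's actual task format and scoring are not given in the abstract.

```python
def is_feasible(plan, budget, deadline):
    """Check a plan against a budget constraint and simple temporal constraints.

    plan: list of (start_hour, end_hour, cost) legs in execution order
    (a hypothetical format chosen for this sketch).
    """
    if sum(cost for _, _, cost in plan) > budget:
        return False  # budget constraint violated
    for start, end, _ in plan:
        if end <= start:
            return False  # a leg must end after it starts
    for prev, nxt in zip(plan, plan[1:]):
        if nxt[0] < prev[1]:
            return False  # legs must not overlap in time
    return not plan or plan[-1][1] <= deadline


def feasible_rate(plans, budget, deadline):
    """Fraction of plans that satisfy every constraint."""
    return sum(is_feasible(p, budget, deadline) for p in plans) / len(plans)


plans = [
    [(8, 10, 100), (11, 13, 150)],  # within budget, no overlap
    [(8, 10, 100), (9, 12, 50)],    # second leg starts before the first ends
    [(8, 10, 400)],                 # over budget
]
rate = feasible_rate(plans, budget=300, deadline=14)
```

A rate computed this way is the kind of quantity the abstract reports as the percentage of feasible plans, although the benchmark's own checker presumably handles richer, multimodal constraints.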

Title: Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models

Authors: Ailiang Lin, Zhuoyun Li, Kotaro Funakoshi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.23386
Pdf URL: https://arxiv.org/pdf/2507.23386
Copy Paste: [[2507.23386]] Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models(https://arxiv.org/abs/2507.23386)
Keywords: language model, llm
Abstract: Decoder-only large language models (LLMs) are increasingly used to build embedding models that effectively encode the semantic information of natural language texts into dense vector representations for various embedding tasks. However, many existing methods primarily focus on removing the causal attention mask in LLMs to enable bidirectional attention, potentially undermining the model's ability to extract semantic information acquired during pretraining. Additionally, leading unidirectional approaches often rely on extra input text to overcome the inherent limitations of causal attention, inevitably increasing computational costs. In this work, we propose Causal2Vec, a general-purpose embedding model tailored to enhance the performance of decoder-only LLMs without altering their original architectures or introducing significant computational overhead. Specifically, we first employ a lightweight BERT-style model to pre-encode the input text into a single Contextual token, which is then prepended to the LLM's input sequence, allowing each token to capture contextualized information even without attending to future tokens. Furthermore, to mitigate the recency bias introduced by last-token pooling and help LLMs better leverage the semantic information encoded in the Contextual token, we concatenate the last hidden states of Contextual and EOS tokens as the final text embedding. In practice, Causal2Vec achieves state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB) among models trained solely on publicly available retrieval datasets, while reducing the required sequence length by up to 85% and inference time by up to 82% compared to best-performing methods.
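The pooling scheme in the Causal2Vec abstract (prepend a pre-encoded Contextual token, run the causal model, then concatenate the final hidden states of the Contextual and EOS positions) can be sketched with toy components. The assumptions are substantial and deliberate: `pre_encode` uses a plain mean in place of the lightweight BERT-style encoder, and `causal_layer` uses a prefix mean in place of a decoder-only LLM. Only the data flow mirrors the abstract.

```python
def mean_vec(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def pre_encode(token_vecs):
    """Hypothetical stand-in for the BERT-style encoder: one vector per text."""
    return mean_vec(token_vecs)


def causal_layer(seq):
    """Toy causal mixing: position i only sees positions 0..i."""
    return [mean_vec(seq[: i + 1]) for i in range(len(seq))]


def causal2vec_style_embedding(token_vecs, eos_vec):
    """Concatenate the hidden states at the Contextual and EOS positions."""
    contextual = pre_encode(token_vecs)
    seq = [contextual] + token_vecs + [eos_vec]  # Contextual token goes first
    hidden = causal_layer(seq)
    return hidden[0] + hidden[-1]  # list concatenation, not addition


embedding = causal2vec_style_embedding([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Because the Contextual token sits at position 0, every later position can attend to a summary of the whole text without any change to the causal mask, which is the property the abstract emphasizes.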

Title: Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators

Authors: Peter Sandrini
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2507.23399
Pdf URL: https://arxiv.org/pdf/2507.23399
Copy Paste: [[2507.23399]] Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators(https://arxiv.org/abs/2507.23399)
Keywords: language model, llm, chat
Abstract: The rapid proliferation of Large Language Models presents both opportunities and challenges for the translation field. While commercial, cloud-based AI chatbots have garnered significant attention in translation studies, concerns regarding data privacy, security, and equitable access necessitate exploration of alternative deployment models. This paper investigates the feasibility and performance of locally deployable, free language models as a viable alternative to proprietary, cloud-based AI solutions. This study evaluates three open-source models installed on CPU-based platforms and compares them against commercially available online chatbots. The evaluation focuses on functional performance rather than a comparative analysis of human-machine translation quality, an area already subject to extensive research. The platforms assessed were chosen for their accessibility and ease of use across various operating systems. While local deployment introduces its own challenges, the benefits of enhanced data control, improved privacy, and reduced dependency on cloud services are compelling. The findings of this study contribute to a growing body of knowledge concerning the democratization of AI technology and inform future research and development efforts aimed at making LLMs more accessible and practical for a wider range of users, specifically focusing on the needs of individual translators and small businesses.

Title: MRGSEM-Sum: An Unsupervised Multi-document Summarization Framework based on Multi-Relational Graphs and Structural Entropy Minimization

Authors: Yongbing Zhang, Fang Nan, Shengxiang Gao, Yuxin Huang, Kaiwen Tan, Zhengtao Yu
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2507.23400
Pdf URL: https://arxiv.org/pdf/2507.23400
Copy Paste: [[2507.23400]] MRGSEM-Sum: An Unsupervised Multi-document Summarization Framework based on Multi-Relational Graphs and Structural Entropy Minimization(https://arxiv.org/abs/2507.23400)
Keywords: language model
Abstract: The core challenge faced by multi-document summarization is the complexity of relationships among documents and the presence of information redundancy. Graph clustering is an effective paradigm for addressing this issue, as it models the complex relationships among documents using graph structures and reduces information redundancy through clustering, achieving significant research progress. However, existing methods often only consider single-relational graphs and require a predefined number of clusters, which hinders their ability to fully represent rich relational information and adaptively partition sentence groups to reduce redundancy. To overcome these limitations, we propose MRGSEM-Sum, an unsupervised multi-document summarization framework based on multi-relational graphs and structural entropy minimization. Specifically, we construct a multi-relational graph that integrates semantic and discourse relations between sentences, comprehensively modeling the intricate and dynamic connections among sentences across documents. We then apply a two-dimensional structural entropy minimization algorithm for clustering, automatically determining the optimal number of clusters and effectively organizing sentences into coherent groups. Finally, we introduce a position-aware compression mechanism to distill each cluster, generating concise and informative summaries. Extensive experiments on four benchmark datasets (Multi-News, DUC-2004, PubMed, and WikiSum) demonstrate that our approach consistently outperforms previous unsupervised methods and, in several cases, achieves performance comparable to supervised models and large language models. Human evaluation demonstrates that the summaries generated by MRGSEM-Sum exhibit high consistency and coverage, approaching human-level quality.

Title: Enhanced Arabic Text Retrieval with Attentive Relevance Scoring

Authors: Salah Eddine Bekhouche, Azeddine Benlamoudi, Yazid Bounab, Fadi Dornaika, Abdenour Hadid
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.23404
Pdf URL: https://arxiv.org/pdf/2507.23404
Copy Paste: [[2507.23404]] Enhanced Arabic Text Retrieval with Attentive Relevance Scoring(https://arxiv.org/abs/2507.23404)
Keywords: language model
Abstract: Arabic poses a particular challenge for natural language processing (NLP) and information retrieval (IR) due to its complex morphology, optional diacritics and the coexistence of Modern Standard Arabic (MSA) and various dialects. Despite the growing global significance of Arabic, it is still underrepresented in NLP research and benchmark resources. In this paper, we present an enhanced Dense Passage Retrieval (DPR) framework developed specifically for Arabic. At the core of our approach is a novel Attentive Relevance Scoring (ARS) that replaces standard interaction mechanisms with an adaptive scoring function that more effectively models the semantic relevance between questions and passages. Our method integrates pre-trained Arabic language models and architectural refinements to improve retrieval performance and significantly increase ranking accuracy when answering Arabic questions. The code is made publicly available on GitHub at this https URL.

Title: Role-Aware Language Models for Secure and Contextualized Access Control in Organizations

Authors: Saeed Almheiri, Yerulan Kongrat, Adrian Santosh, Ruslan Tasmukhanov, Josemaria Vera, Muhammad Dehan Al Kautsar, Fajri Koto
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.23465
Pdf URL: https://arxiv.org/pdf/2507.23465
Copy Paste: [[2507.23465]] Role-Aware Language Models for Secure and Contextualized Access Control in Organizations(https://arxiv.org/abs/2507.23465)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) are increasingly deployed in enterprise settings, controlling model behavior based on user roles becomes an essential requirement. Existing safety methods typically assume uniform access and focus on preventing harmful or toxic outputs, without addressing role-specific access constraints. In this work, we investigate whether LLMs can be fine-tuned to generate responses that reflect the access privileges associated with different organizational roles. We explore three modeling strategies: a BERT-based classifier, an LLM-based classifier, and role-conditioned generation. To evaluate these approaches, we construct two complementary datasets. The first is adapted from existing instruction-tuning corpora through clustering and role labeling, while the second is synthetically generated to reflect realistic, role-sensitive enterprise scenarios. We assess model performance across varying organizational structures and analyze robustness to prompt injection, role mismatch, and jailbreak attempts.

Title: A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

Authors: Shirui Wang, Zhihui Tang, Huaxia Yang, Qiuhong Gong, Tiantian Gu, Hongyang Ma, Yongxin Wang, Wubin Sun, Zeliang Lian, Kehang Mao, Yinan Jiang, Zhicheng Huang, Lingyun Ma, Wenjie Shen, Yajie Ji, Yunhui Tan, Chunbo Wang, Yunlu Gao, Qianling Ye, Rui Lin, Mingyu Chen, Lijuan Niu, Zhihao Wang, Peng Yu, Mengran Lang, Yue Liu, Huimin Zhang, Haitao Shen, Long Chen, Qiguang Zhao, Si-Xuan Liu, Lina Zhou, Hua Gao, Dongqiang Ye, Lingmin Meng, Youtao Yu, Naixin Liang, Jianxiong Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.23486
Pdf URL: https://arxiv.org/pdf/2507.23486
Copy Paste: [[2507.23486]] A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains(https://arxiv.org/abs/2507.23486)
Keywords: language model, llm
Abstract: Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus, encompassing 30 criteria covering critical areas like critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861). The findings of this study not only provide a standardized metric for evaluating the clinical application of medical LLMs, facilitating comparative analyses, risk exposure identification, and improvement directions across different scenarios, but also hold the potential to promote safer and more effective deployment of large language models in healthcare environments.

Title: Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning

Authors: Keer Lu, Zheng Liang, Youquan Li, Jiejun Tan, Da Pan, Shusen Zhang, Guosheng Dong, Huang Leng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.23541
Pdf URL: https://arxiv.org/pdf/2507.23541
Copy Paste: [[2507.23541]] Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning(https://arxiv.org/abs/2507.23541)
Keywords: gpt, llm
Abstract: In medical scenarios, effectively retrieving external knowledge and leveraging it for rigorous logical reasoning is of significant importance. Existing work has predominantly focused on enhancing either the retrieval or the reasoning capabilities of models in isolation, with little attention given to their joint optimization, which leads to limited coordination between the two processes. Additionally, current methods rely heavily on supervised fine-tuning (SFT), which can cause models to memorize existing problem-solving pathways, thereby restricting their generalization ability when confronted with novel problem contexts. Furthermore, while some studies have explored improving retrieval-augmented reasoning in general domains via reinforcement learning, their reward function designs do not adequately capture the specific demands of the medical domain. To address these challenges, we introduce **Med-R$^3$**, a **Med**ical **R**etrieval-augmented **R**easoning framework driven by progressive **R**einforcement learning. In this framework, we first develop the model's ability to perform logical reasoning over medical problems. Subsequently, on the basis of this foundation, we adaptively optimize the retrieval capability to better align with the characteristics of the knowledge corpus and external information utilization throughout the reasoning process. Finally, we conduct joint optimization of the model's retrieval and reasoning coordination. Extensive experiments indicate that **Med-R$^3$** could achieve state-of-the-art performance, with LLaMA3.1-8B-Instruct + Med-R$^3$ surpassing closed-source GPT-4o-mini by 3.93% at a comparable parameter scale, while Qwen2.5-14B augmented with Med-R$^3$ shows a more substantial gain of 13.53%.

Title: DiffLoRA: Differential Low-Rank Adapters for Large Language Models

Authors: Alexandre Misrahi, Nadezhda Chirkova, Maxime Louis, Vassilina Nikoulina
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.23588
Pdf URL: https://arxiv.org/pdf/2507.23588
Copy Paste: [[2507.23588]] DiffLoRA: Differential Low-Rank Adapters for Large Language Models(https://arxiv.org/abs/2507.23588)
Keywords: language model
Abstract: Differential Transformer has recently been proposed to improve performance in Transformer models by canceling out noise through a denoiser attention mechanism. In this work, we introduce DiffLoRA, a parameter-efficient adaptation of the differential attention mechanism, with low-rank adapters on both positive and negative attention terms. This approach retains the efficiency of LoRA while aiming to benefit from the performance gains of differential attention. We evaluate DiffLoRA across a broad range of NLP tasks, including general benchmarks, many-shot in-context learning, RAG, and long-context tests. We observe that, although DiffLoRA falls short of other parameter-efficient fine-tuning methods in most evaluation tasks, it shows interesting results in certain domains (+11 points over LoRA on HumanEval). We analyze the attention patterns post-finetuning to identify the reasons for this behavior.
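The "positive and negative attention terms" mentioned in the DiffLoRA abstract refer to differential attention, in which one softmax attention map is subtracted from another before weighting the values. The sketch below shows that subtraction for a single query. It is an illustration under assumptions: the scalar `lam` is a fixed hypothetical value rather than a learned parameter, the scores are supplied directly instead of being computed from query and key projections, and the low-rank adapters that DiffLoRA adds to both terms are omitted entirely.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def differential_attention(pos_scores, neg_scores, values, lam=0.5):
    """Single-query differential attention.

    weights = softmax(pos_scores) - lam * softmax(neg_scores)
    output  = sum_i weights[i] * values[i]
    """
    pos = softmax(pos_scores)
    neg = softmax(neg_scores)
    weights = [p - lam * n for p, n in zip(pos, neg)]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


values = [[1.0, 0.0], [0.0, 1.0]]
plain = differential_attention([0.0, 0.0], [0.0, 0.0], values, lam=0.0)
cancelled = differential_attention([0.0, 0.0], [0.0, 0.0], values, lam=1.0)
```

With `lam=0.0` the function reduces to ordinary softmax attention; with identical positive and negative maps and `lam=1.0` the two terms cancel completely, which is the noise-cancelling intuition in its most extreme form.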

Title: Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs

Authors: Nasim Shirvani-Mahdavi, Devin Wingfield, Amin Ghasemi, Chengkai Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.23740
Pdf URL: https://arxiv.org/pdf/2507.23740
Copy Paste: [[2507.23740]] Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs(https://arxiv.org/abs/2507.23740)
Keywords: language model, hallucination, prompt, chain-of-thought
Abstract: Knowledge graphs (KGs) often contain sufficient information to support the inference of new facts. Identifying logical rules not only improves the completeness of a knowledge graph but also enables the detection of potential errors, reveals subtle data patterns, and enhances the overall capacity for reasoning and interpretation. However, the complexity of such rules, combined with the unique labeling conventions of each KG, can make them difficult for humans to understand. In this paper, we explore the potential of large language models to generate natural language explanations for logical rules. Specifically, we extract logical rules using the AMIE 3.5.1 rule discovery algorithm from the benchmark dataset FB15k-237 and two large-scale datasets, FB-CVT-REV and FB+CVT-REV. We examine various prompting strategies, including zero- and few-shot prompting, the inclusion of variable entity types, and chain-of-thought reasoning. We conduct a comprehensive human evaluation of the generated explanations based on correctness, clarity, and hallucination, and also assess the use of large language models as automatic judges. Our results demonstrate promising performance in terms of explanation correctness and clarity, although several challenges remain for future research. All scripts and data used in this study are publicly available at this https URL.

Title: Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities

Authors: Yunxiang Yan, Tomohiro Sawada, Kartik Goyal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.23776
Pdf URL: https://arxiv.org/pdf/2507.23776
Copy Paste: [[2507.23776]] Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities(https://arxiv.org/abs/2507.23776)
Keywords: llm
Abstract: While question-answering (QA) benchmark performance is an automatic and scalable method to compare LLMs, it is an indirect method of evaluating their underlying problem-solving capabilities. Therefore, we propose a holistic and generalizable framework based on cascaded question disclosure that provides a more accurate estimate of the models' problem-solving capabilities while maintaining scalability and automation. This approach collects model responses in a stagewise manner with each stage revealing partial information about the question designed to elicit generalized reasoning in LLMs. We find that our approach not only provides a better comparison between LLMs, but also induces better intermediate traces in models compared to the standard QA paradigm. We empirically verify this behavior on diverse reasoning and knowledge-heavy QA datasets by comparing LLMs of varying sizes and families. Our approach narrows the performance gap observed in the standard QA evaluation settings, indicating that the prevalent indirect QA paradigm of evaluation overestimates the differences in performance between models. We further validate our findings by extensive ablation studies.