2025-07-14

Title: RepeaTTS: Towards Feature Discovery through Repeated Fine-Tuning

Authors: Atli Sigurgeirsson, Simon King
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2507.08012
Pdf URL: https://arxiv.org/pdf/2507.08012
Copy Paste: [[2507.08012]] RepeaTTS: Towards Feature Discovery through Repeated Fine-Tuning(https://arxiv.org/abs/2507.08012)
Keywords: prompt
Abstract: A Prompt-based Text-To-Speech model allows a user to control different aspects of speech, such as speaking rate and perceived gender, through natural language instruction. Although user-friendly, such approaches are on the one hand constrained, since control is limited to acoustic features exposed to the model during training, and on the other hand too flexible, since the same inputs yield uncontrollable variation that is reflected in the corpus statistics. We investigate a novel fine-tuning regime to address both of these issues at the same time by exploiting the uncontrollable variance of the model. Through principal component analysis of thousands of synthesised samples, we determine latent features that account for the highest proportion of the output variance and incorporate them as new labels for secondary fine-tuning. We evaluate the proposed methods on two models trained on an expressive Icelandic speech corpus, one with emotional disclosure and one without. In the case of the model without emotional disclosure, the method yields both continuous and discrete features that improve overall controllability of the model.
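
The principal component analysis step described in the RepeaTTS abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the random feature matrix stands in for acoustic features extracted from synthesised samples, and the function name `top_components` and the choice of two components are assumptions made for illustration only.

```python
import numpy as np


def top_components(features, k=2):
    """Return the top-k principal directions, per-sample scores,
    and the fraction of variance each direction explains."""
    centered = features - features.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions,
    # ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = s ** 2 / (len(features) - 1)
    ratio = variance / variance.sum()
    scores = centered @ vt[:k].T
    return vt[:k], scores, ratio[:k]


rng = np.random.default_rng(0)
# Hypothetical stand-in for features of synthesised samples:
# 1000 samples, 8 features, with one direction made deliberately dominant.
features = rng.normal(size=(1000, 8))
features[:, 0] *= 10.0

directions, scores, ratio = top_components(features, k=2)
# Scores along the first component could serve as a candidate
# continuous label for a second round of fine-tuning.
candidate_label = scores[:, 0]
```

In a setting like the one the abstract describes, the scores along a high-variance component would be inspected to see whether they correspond to an interpretable property of the speech before being used as labels.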

Title: MedicalBERT: enhancing biomedical natural language processing using pretrained BERT-based model

Authors: K. Sahit Reddy, N. Ragavenderan, Vasanth K., Ganesh N. Naik, Vishalakshi Prabhu, Nagaraja G. S
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.08013
Pdf URL: https://arxiv.org/pdf/2507.08013
Copy Paste: [[2507.08013]] MedicalBERT: enhancing biomedical natural language processing using pretrained BERT-based model(https://arxiv.org/abs/2507.08013)
Keywords: language model, gpt
Abstract: Recent advances in natural language processing (NLP) have been driven by pretrained language models like BERT, RoBERTa, T5, and GPT. These models excel at understanding complex texts, but biomedical literature, with its domain-specific terminology, poses challenges that models like Word2Vec and bidirectional long short-term memory (Bi-LSTM) can't fully address. GPT and T5, despite capturing context, fall short in tasks needing bidirectional understanding, unlike BERT. Addressing this, we proposed MedicalBERT, a pretrained BERT model trained on a large biomedical dataset and equipped with domain-specific vocabulary that enhances the comprehension of biomedical terminology. The MedicalBERT model is further optimized and fine-tuned to address diverse tasks, including named entity recognition, relation extraction, question answering, sentence similarity, and document classification. Performance metrics such as the F1-score, accuracy, and Pearson correlation are employed to showcase the efficiency of our model in comparison to other BERT-based models such as BioBERT, SciBERT, and ClinicalBERT. MedicalBERT outperforms these models on most of the benchmarks, and surpasses the general-purpose BERT model by 5.67% on average across all the tasks evaluated. This work also underscores the potential of leveraging pretrained BERT models for medical NLP tasks, demonstrating the effectiveness of transfer learning techniques in capturing domain-specific information.

Title: Mass-Scale Analysis of In-the-Wild Conversations Reveals Complexity Bounds on LLM Jailbreaking

Authors: Aldan Creo, Raul Castro Fernandez, Manuel Cebrian
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2507.08014
Pdf URL: https://arxiv.org/pdf/2507.08014
Copy Paste: [[2507.08014]] Mass-Scale Analysis of In-the-Wild Conversations Reveals Complexity Bounds on LLM Jailbreaking(https://arxiv.org/abs/2507.08014)
Keywords: language model, llm, chat
Abstract: As large language models (LLMs) become increasingly deployed, understanding the complexity and evolution of jailbreaking strategies is critical for AI safety. We present a mass-scale empirical analysis of jailbreak complexity across over 2 million real-world conversations from diverse platforms, including dedicated jailbreaking communities and general-purpose chatbots. Using a range of complexity metrics spanning probabilistic measures, lexical diversity, compression ratios, and cognitive load indicators, we find that jailbreak attempts do not exhibit significantly higher complexity than normal conversations. This pattern holds consistently across specialized jailbreaking communities and general user populations, suggesting practical bounds on attack sophistication. Temporal analysis reveals that while user attack toxicity and complexity remain stable over time, assistant response toxicity has decreased, indicating improving safety mechanisms. The absence of power-law scaling in complexity distributions further points to natural limits on jailbreak development. Our findings challenge the prevailing narrative of an escalating arms race between attackers and defenders, instead suggesting that LLM safety evolution is bounded by human ingenuity constraints while defensive measures continue advancing. Our results highlight critical information hazards in academic jailbreak disclosure, as sophisticated attacks exceeding current complexity baselines could disrupt the observed equilibrium and enable widespread harm before defensive adaptation.
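
Two of the metric families named in the abstract above, compression ratios and lexical diversity, can be sketched in a few lines. The abstract does not give the paper's exact definitions, so the formulas below (zlib-compressed size over raw size, and type-token ratio over whitespace tokens) are common textbook choices assumed here for illustration, and the function names are hypothetical.

```python
import zlib


def compression_ratio(text):
    """Compressed size divided by raw size; lower values mean the
    text is more repetitive and therefore easier to compress."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)


def type_token_ratio(text):
    """Distinct whitespace-separated tokens divided by total tokens."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


repetitive = "ab" * 500
print(compression_ratio(repetitive))
print(type_token_ratio("ignore all previous instructions and ignore all rules"))
```

Comparing the distributions of such scores between jailbreak attempts and ordinary conversations is the kind of analysis the abstract reports, although the tokenisation and compressor used there may differ.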

Title: Assessing the Capabilities and Limitations of FinGPT Model in Financial NLP Applications

Authors: Prudence Djagba, Chimezie A. Odinakachukwu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.08015
Pdf URL: https://arxiv.org/pdf/2507.08015
Copy Paste: [[2507.08015]] Assessing the Capabilities and Limitations of FinGPT Model in Financial NLP Applications(https://arxiv.org/abs/2507.08015)
Keywords: language model, gpt
Abstract: This work evaluates FinGPT, a financial domain-specific language model, across six key natural language processing (NLP) tasks: Sentiment Analysis, Text Classification, Named Entity Recognition, Financial Question Answering, Text Summarization, and Stock Movement Prediction. The evaluation uses finance-specific datasets to assess FinGPT's capabilities and limitations in real-world financial applications. The results show that FinGPT performs strongly in classification tasks such as sentiment analysis and headline categorization, often achieving results comparable to GPT-4. However, its performance is significantly lower in tasks that involve reasoning and generation, such as financial question answering and summarization. Comparisons with GPT-4 and human benchmarks highlight notable performance gaps, particularly in numerical accuracy and complex reasoning. Overall, the findings indicate that while FinGPT is effective for certain structured financial tasks, it is not yet a comprehensive solution. This research provides a useful benchmark for future research and underscores the need for architectural improvements and domain-specific optimization in financial language models.

Title: Mechanistic Indicators of Understanding in Large Language Models

Authors: Pierre Beckmann, Matthieu Queloz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.08017
Pdf URL: https://arxiv.org/pdf/2507.08017
Copy Paste: [[2507.08017]] Mechanistic Indicators of Understanding in Large Language Models(https://arxiv.org/abs/2507.08017)
Keywords: language model, llm
Abstract: Recent findings in mechanistic interpretability (MI), the field probing the inner workings of Large Language Models (LLMs), challenge the view that these models rely solely on superficial statistics. Here, we offer an accessible synthesis of these findings that doubles as an introduction to MI, all while integrating these findings within a novel theoretical framework for thinking about machine understanding. We argue that LLMs develop internal structures that are functionally analogous to the kind of understanding that consists in seeing connections. To sharpen this idea, we propose a three-tiered conception of machine understanding. First, conceptual understanding emerges when a model forms "features" as directions in latent space, thereby learning the connections between diverse manifestations of something. Second, state-of-the-world understanding emerges when a model learns contingent factual connections between features and dynamically tracks changes in the world. Third, principled understanding emerges when a model ceases to rely on a collection of memorized facts and discovers a "circuit" that connects these facts. However, we conclude by exploring the "parallel mechanisms" phenomenon, arguing that while LLMs exhibit forms of understanding, their cognitive architecture remains different from ours, and the debate should shift from whether LLMs understand to how their strange minds work.

Title: Signal or Noise? Evaluating Large Language Models in Resume Screening Across Contextual Variations and Human Expert Benchmarks

Authors: Aryan Varshney, Venkat Ram Reddy Ganuthula
Subjects: cs.CL, econ.GN
Abstract URL: https://arxiv.org/abs/2507.08019
Pdf URL: https://arxiv.org/pdf/2507.08019
Copy Paste: [[2507.08019]] Signal or Noise? Evaluating Large Language Models in Resume Screening Across Contextual Variations and Human Expert Benchmarks(https://arxiv.org/abs/2507.08019)
Keywords: language model, gpt, llm, prompt
Abstract: This study investigates whether large language models (LLMs) exhibit consistent behavior (signal) or random variation (noise) when screening resumes against job descriptions, and how their performance compares to human experts. Using controlled datasets, we tested three LLMs (Claude, GPT, and Gemini) across contexts (No Company, Firm1 [MNC], Firm2 [Startup], Reduced Context) with identical and randomized resumes, benchmarked against three human recruitment experts. Analysis of variance revealed significant mean differences in four of eight LLM-only conditions and consistently significant differences between LLM and human evaluations (p < 0.01). Paired t-tests showed GPT adapts strongly to company context (p < 0.001), Gemini partially (p = 0.038 for Firm1), and Claude minimally (p > 0.1), while all LLMs differed significantly from human experts across contexts. Meta-cognition analysis highlighted adaptive weighting patterns that differ markedly from human evaluation approaches. Findings suggest LLMs offer interpretable patterns with detailed prompts but diverge substantially from human judgment, informing their deployment in automated hiring systems.

Title: Circumventing Safety Alignment in Large Language Models Through Embedding Space Toxicity Attenuation

Authors: Zhibo Zhang, Yuxi Li, Kailong Wang, Shuai Yuan, Ling Shi, Haoyu Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.08020
Pdf URL: https://arxiv.org/pdf/2507.08020
Copy Paste: [[2507.08020]] Circumventing Safety Alignment in Large Language Models Through Embedding Space Toxicity Attenuation(https://arxiv.org/abs/2507.08020)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved remarkable success across domains such as healthcare, education, and cybersecurity. However, this openness also introduces significant security risks, particularly through embedding space poisoning, which is a subtle attack vector where adversaries manipulate the internal semantic representations of input data to bypass safety alignment mechanisms. While previous research has investigated universal perturbation methods, the dynamics of LLM safety alignment at the embedding level remain insufficiently understood. Consequently, more targeted and accurate adversarial perturbation techniques, which pose significant threats, have not been adequately studied. In this work, we propose ETTA (Embedding Transformation Toxicity Attenuation), a novel framework that identifies and attenuates toxicity-sensitive dimensions in embedding space via linear transformations. ETTA bypasses model refusal behaviors while preserving linguistic coherence, without requiring model fine-tuning or access to training data. Evaluated on five representative open-source LLMs using the AdvBench benchmark, ETTA achieves a high average attack success rate of 88.61%, outperforming the best baseline by 11.34%, and generalizes to safety-enhanced models (e.g., 77.39% ASR on instruction-tuned defenses). These results highlight a critical vulnerability in current alignment strategies and underscore the need for embedding-aware defenses.

Title: Unveiling Effective In-Context Configurations for Image Captioning: An External & Internal Analysis

Authors: Li Li, Yongliang Wu, Jingze Zhu, Jiawei Peng, Jianfei Cai, Xu Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.08021
Pdf URL: https://arxiv.org/pdf/2507.08021
Copy Paste: [[2507.08021]] Unveiling Effective In-Context Configurations for Image Captioning: An External & Internal Analysis(https://arxiv.org/abs/2507.08021)
Keywords: language model, llm
Abstract: The evolution of large models has witnessed the emergence of In-Context Learning (ICL) capabilities. In Natural Language Processing (NLP), numerous studies have demonstrated the effectiveness of ICL. Inspired by the success of Large Language Models (LLMs), researchers have developed Large Multimodal Models (LMMs) with ICL capabilities. However, explorations of demonstration configuration for multimodal ICL remain preliminary. Additionally, the controllability of In-Context Examples (ICEs) provides an efficient and cost-effective means to observe and analyze the inference characteristics of LMMs under varying inputs. This paper conducts a comprehensive external and internal investigation of multimodal in-context learning on the image captioning task. Externally, we explore demonstration configuration strategies through three dimensions: shot number, image retrieval, and caption assignment. We employ multiple metrics to systematically and thoroughly evaluate and summarize key findings. Internally, we analyze typical LMM attention characteristics and develop attention-based metrics to quantify model behaviors. We also conduct auxiliary experiments to explore the feasibility of attention-driven model acceleration and compression. We further compare performance variations between LMMs with identical model design and pretraining strategies and explain the differences from the angles of pre-training data features. Our study reveals both how ICEs configuration strategies impact model performance through external experiments and characteristic typical patterns through internal inspection, providing dual perspectives for understanding multimodal ICL in LMMs. Our method of combining external and internal analysis to investigate large models, along with our newly proposed metrics, can be applied to broader research areas.

Title: "Amazing, They All Lean Left" -- Analyzing the Political Temperaments of Current LLMs

Authors: W. Russell Neuman, Chad Coleman, Ali Dasdan, Safinah Ali, Manan Shah, Kund Meghani
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2507.08027
Pdf URL: https://arxiv.org/pdf/2507.08027
Copy Paste: [[2507.08027]] "Amazing, They All Lean Left" -- Analyzing the Political Temperaments of Current LLMs(https://arxiv.org/abs/2507.08027)
Keywords: language model, gpt, llm, chat
Abstract: Recent studies have revealed a consistent liberal orientation in the ethical and political responses generated by most commercial large language models (LLMs), yet the underlying causes and resulting implications remain unclear. This paper systematically investigates the political temperament of seven prominent LLMs -- OpenAI's GPT-4o, Anthropic's Claude Sonnet 4, Perplexity (Sonar Large), Google's Gemini 2.5 Flash, Meta AI's Llama 4, Mistral 7b Le Chat and High-Flyer's DeepSeek R1 -- using a multi-pronged approach that includes Moral Foundations Theory, a dozen established political ideology scales and a new index of current political controversies. We find strong and consistent prioritization of liberal-leaning values, particularly care and fairness, across most models. Further analysis attributes this trend to four overlapping factors: liberal-leaning training corpora, reinforcement learning from human feedback (RLHF), the dominance of liberal frameworks in academic ethical discourse and safety-driven fine-tuning practices. We also distinguish between political "bias" and legitimate epistemic differences, cautioning against conflating the two. A comparison of base and fine-tuned model pairs reveals that fine-tuning generally increases liberal lean, an effect confirmed through both self-report and empirical testing. We argue that this "liberal tilt" is not a programming error or the personal preference of programmers but an emergent property of training on democratic rights-focused discourse. Finally, we propose that LLMs may indirectly echo John Rawls' famous veil-of-ignorance philosophical aspiration, reflecting a moral stance unanchored to personal identity or interest. Rather than undermining democratic discourse, this pattern may offer a new lens through which to examine collective reasoning.

Title: A Systematic Analysis of Declining Medical Safety Messaging in Generative AI Models

Authors: Sonali Sharma, Ahmed M. Alaa, Roxana Daneshjou
Subjects: cs.CL, cs.CE, cs.HC
Abstract URL: https://arxiv.org/abs/2507.08030
Pdf URL: https://arxiv.org/pdf/2507.08030
Copy Paste: [[2507.08030]] A Systematic Analysis of Declining Medical Safety Messaging in Generative AI Models(https://arxiv.org/abs/2507.08030)
Keywords: language model, llm
Abstract: Generative AI models, including large language models (LLMs) and vision-language models (VLMs), are increasingly used to interpret medical images and answer clinical questions. Their responses often include inaccuracies; therefore, safety measures like medical disclaimers are critical to remind users that AI outputs are not professionally vetted or a substitute for medical advice. This study evaluated the presence of disclaimers in LLM and VLM outputs across model generations from 2022 to 2025. Using 500 mammograms, 500 chest X-rays, 500 dermatology images, and 500 medical questions, outputs were screened for disclaimer phrases. Medical disclaimer presence in LLM and VLM outputs dropped from 26.3% in 2022 to 0.97% in 2025, and from 19.6% in 2023 to 1.05% in 2025, respectively. By 2025, the majority of models displayed no disclaimers. As public models become more capable and authoritative, disclaimers must be implemented as a safeguard adapting to the clinical context of each output.
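
Screening model outputs for disclaimer phrases, as the abstract above describes, can be sketched as a small pattern match. The phrase list below is hypothetical: the abstract does not say which phrases the study screened for, and the function names are assumptions for illustration.

```python
import re

# Hypothetical phrase list; the study's actual screening phrases
# are not given in the abstract.
DISCLAIMER_PATTERNS = [
    r"\bnot a (?:doctor|medical professional)\b",
    r"\bnot a substitute for (?:professional )?medical advice\b",
    r"\bconsult (?:a|your) (?:doctor|physician|healthcare (?:provider|professional))\b",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in DISCLAIMER_PATTERNS]


def has_disclaimer(output):
    """True if any disclaimer pattern appears in the model output."""
    return any(p.search(output) for p in _COMPILED)


def disclaimer_rate(outputs):
    """Fraction of outputs that contain at least one disclaimer phrase."""
    if not outputs:
        return 0.0
    return sum(has_disclaimer(o) for o in outputs) / len(outputs)


sample_outputs = [
    "I am not a doctor, so please treat this as general information.",
    "The image shows a small opacity in the left lung.",
    "This pattern is consistent with eczema.",
    "The finding is likely benign.",
]
rate = disclaimer_rate(sample_outputs)
```

Computing such a rate per model generation and plotting it by release year would give the kind of trend the abstract reports.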

Title: Beyond Scale: Small Language Models are Comparable to GPT-4 in Mental Health Understanding

Authors: Hong Jia, Shiya Fu, Vassilis Kostakos, Feng Xia, Ting Dang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.08031
Pdf URL: https://arxiv.org/pdf/2507.08031
Copy Paste: [[2507.08031]] Beyond Scale: Small Language Models are Comparable to GPT-4 in Mental Health Understanding(https://arxiv.org/abs/2507.08031)
Keywords: language model, gpt, llm, prompt
Abstract: The emergence of Small Language Models (SLMs) as privacy-preserving alternatives for sensitive applications raises a fundamental question about their inherent understanding capabilities compared to Large Language Models (LLMs). This paper investigates the mental health understanding capabilities of current SLMs through systematic evaluation across diverse classification tasks. Employing zero-shot and few-shot learning paradigms, we benchmark their performance against established LLM baselines to elucidate their relative strengths and limitations in this critical domain. We assess five state-of-the-art SLMs (Phi-3, Phi-3.5, Qwen2.5, Llama-3.2, Gemma2) against three LLMs (GPT-4, FLAN-T5-XXL, Alpaca-7B) on six mental health understanding tasks. Our findings reveal that SLMs achieve mean performance within 2% of LLMs on binary classification tasks (F1 scores of 0.64 vs 0.66 in zero-shot settings), demonstrating notable competence despite orders of magnitude fewer parameters. Both model categories experience similar degradation on multi-class severity tasks (a drop of over 30%), suggesting that nuanced clinical understanding challenges transcend model scale. Few-shot prompting provides substantial improvements for SLMs (up to 14.6%), while LLM gains are more variable. Our work highlights the potential of SLMs in mental health understanding, showing they can be effective privacy-preserving tools for analyzing sensitive online text data. In particular, their ability to quickly adapt and specialize with minimal data through few-shot learning positions them as promising candidates for scalable mental health screening tools.

Title: Integrating External Tools with Large Language Models to Improve Accuracy

Authors: Nripesh Niketan, Hadj Batatia
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.08034
Pdf URL: https://arxiv.org/pdf/2507.08034
Copy Paste: [[2507.08034]] Integrating External Tools with Large Language Models to Improve Accuracy(https://arxiv.org/abs/2507.08034)
Keywords: language model, gpt, llm
Abstract: This paper deals with improving querying large language models (LLMs). It is well-known that without relevant contextual information, LLMs can provide poor quality responses or tend to hallucinate. Several initiatives have proposed integrating LLMs with external tools to provide them with up-to-date data to improve accuracy. In this paper, we propose a framework to integrate external tools to enhance the capabilities of LLMs in answering queries in educational settings. Precisely, we develop a framework that allows accessing external APIs to request additional relevant information. Integrated tools can also provide computational capabilities such as calculators or calendars. The proposed framework has been evaluated using datasets from the Multi-Modal Language Understanding (MMLU) collection. The data consists of questions on mathematical and scientific reasoning. Results compared to state-of-the-art language models show that the proposed approach significantly improves performance. Our Athena framework achieves 83% accuracy in mathematical reasoning and 88% in scientific reasoning, substantially outperforming all tested models including GPT-4o, LLaMA-Large, Mistral-Large, Phi-Large, and GPT-3.5, with the best baseline model (LLaMA-Large) achieving only 67% and 79% respectively. These promising results open the way to creating complex computing ecosystems around LLMs to make their use more natural to support various tasks and activities.

Title: CRISP: Complex Reasoning with Interpretable Step-based Plans

Authors: Matan Vetzler, Koren Lazar, Guy Uziel, Eran Hirsch, Ateret Anaby-Tavor, Leshem Choshen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.08037
Pdf URL: https://arxiv.org/pdf/2507.08037
Copy Paste: [[2507.08037]] CRISP: Complex Reasoning with Interpretable Step-based Plans(https://arxiv.org/abs/2507.08037)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Recent advancements in large language models (LLMs) underscore the need for stronger reasoning capabilities to solve complex problems effectively. While Chain-of-Thought (CoT) reasoning has been a step forward, it remains insufficient for many domains. A promising alternative is explicit high-level plan generation, but existing approaches largely assume that LLMs can produce effective plans through few-shot prompting alone, without additional training. In this work, we challenge this assumption and introduce CRISP (Complex Reasoning with Interpretable Step-based Plans), a multi-domain dataset of high-level plans for mathematical reasoning and code generation. The plans in CRISP are automatically generated and rigorously validated--both intrinsically, using an LLM as a judge, and extrinsically, by evaluating their impact on downstream task performance. We demonstrate that fine-tuning a small model on CRISP enables it to generate higher-quality plans than much larger models using few-shot prompting, while significantly outperforming Chain-of-Thought reasoning. Furthermore, our out-of-domain evaluation reveals that fine-tuning on one domain improves plan generation in the other, highlighting the generalizability of learned planning capabilities.

Title: AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research

Authors: Talor Abramovich, Gal Chechik
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.08038
Pdf URL: https://arxiv.org/pdf/2507.08038
Copy Paste: [[2507.08038]] AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research(https://arxiv.org/abs/2507.08038)
Keywords: language model, prompt, chain-of-thought, agent
Abstract: Autonomous agents built on language models (LMs) are showing increasing popularity in many fields, including scientific research. AI co-scientists aim to support or automate parts of the research process using these agents. A key component of empirical AI research is the design of ablation experiments. To this end, we introduce AblationBench, a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research. It includes two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section and contains 83 instances, and ReviewerAblation, which helps reviewers find missing ablations in a full paper and contains 350 instances. For both tasks, we develop LM-based judges that serve as an automatic evaluation framework. Our experiments with frontier LMs show that these tasks remain challenging, with the best-performing LM system identifying only 29% of the original ablations on average. Lastly, we analyze the limitations of current LMs on these tasks, and find that chain-of-thought prompting outperforms the currently existing agent-based approach.

Title: Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing

Authors: Junyi Wen, Junyuan Liang, Zicong Hong, Wuhui Chen, Zibin Zheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.08045
Pdf URL: https://arxiv.org/pdf/2507.08045
Copy Paste: [[2507.08045]] Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing(https://arxiv.org/abs/2507.08045)
Keywords: language model, llm
Abstract: Efficient state restoration in multi-turn conversations with large language models (LLMs) remains a critical challenge, primarily due to the overhead of recomputing or loading full key-value (KV) caches for all historical tokens. To address this, existing approaches compress KV caches across adjacent layers with highly similar attention patterns. However, these methods often apply a fixed compression scheme across all conversations, selecting the same layer pairs for compression without considering conversation-specific attention dynamics. This static strategy overlooks variability in attention pattern similarity across different conversations, which can lead to noticeable accuracy degradation. We present Krul, a multi-turn LLM inference system that enables accurate and efficient KV cache restoration. Krul dynamically selects compression strategies based on attention similarity across layer pairs and uses a recomputation-loading pipeline to restore the KV cache. It introduces three key innovations: 1) a preemptive compression strategy selector to preserve critical context for future conversation turns and select a customized strategy for the conversation; 2) a token-wise heterogeneous attention similarity estimator to mitigate the attention similarity computation and storage overhead during model generation; 3) a bubble-free restoration scheduler to reduce potential bubbles brought by the imbalance of recomputing and loading streams due to compressed KV caches. Empirical evaluations on real-world tasks demonstrate that Krul achieves a 1.5x-2.68x reduction in time-to-first-token (TTFT) and a 1.33x-2.35x reduction in KV cache storage compared to state-of-the-art methods without compromising generation quality.
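Illustrative sketch (not from the paper): the abstract says layer pairs are chosen per conversation from attention similarity. The snippet below shows one way such a selection could look, assuming each layer is summarised by a single attention vector, cosine similarity as the measure, and a fixed threshold. The function names, the threshold, and the toy data are all hypothetical; Krul's actual selector and estimator are not described in enough detail here to reproduce.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_layer_pairs(attn_by_layer, threshold=0.9):
    """Return non-overlapping adjacent (i, i + 1) layer pairs whose
    attention summaries have cosine similarity >= threshold."""
    pairs = []
    i = 0
    while i < len(attn_by_layer) - 1:
        if cosine(attn_by_layer[i], attn_by_layer[i + 1]) >= threshold:
            pairs.append((i, i + 1))
            i += 2  # a layer joins at most one pair
        else:
            i += 1
    return pairs

# Toy per-layer attention summaries for two conversations (made-up numbers).
conv_a = [[0.7, 0.2, 0.1], [0.68, 0.22, 0.10], [0.1, 0.1, 0.8], [0.1, 0.2, 0.7]]
conv_b = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.12, 0.78, 0.10], [0.9, 0.05, 0.05]]

pairs_a = select_layer_pairs(conv_a)
pairs_b = select_layer_pairs(conv_b)
print(pairs_a, pairs_b)
```

The two conversations end up with different shareable pairs, which is the situation a fixed, conversation-independent scheme cannot account for.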

Title: GRASP: Generic Reasoning And SPARQL Generation across Knowledge Graphs

Authors: Sebastian Walter, Hannah Bast
Subjects: cs.CL, cs.DB, cs.IR
Abstract URL: https://arxiv.org/abs/2507.08107
Pdf URL: https://arxiv.org/pdf/2507.08107
Copy Paste: [[2507.08107]] GRASP: Generic Reasoning And SPARQL Generation across Knowledge Graphs(https://arxiv.org/abs/2507.08107)
Keywords: language model
Abstract: We propose a new approach for generating SPARQL queries on RDF knowledge graphs from natural language questions or keyword queries, using a large language model. Our approach does not require fine-tuning. Instead, it uses the language model to explore the knowledge graph by strategically executing SPARQL queries and searching for relevant IRIs and literals. We evaluate our approach on a variety of benchmarks (for knowledge graphs of different kinds and sizes) and language models (of different scales and types, commercial as well as open-source) and compare it with existing approaches. On Wikidata we reach state-of-the-art results on multiple benchmarks, despite the zero-shot setting. On Freebase we come close to the best few-shot methods. On other, less commonly evaluated knowledge graphs and benchmarks our approach also performs well overall. We conduct several additional studies, like comparing different ways of searching the graphs, incorporating a feedback mechanism, or making use of few-shot examples.

Title: Audit, Alignment, and Optimization of LM-Powered Subroutines with Application to Public Comment Processing

Authors: Reilly Raab, Mike Parker, Dan Nally, Sadie Montgomery, Anastasia Bernat, Sai Munikoti, Sameera Horawalavithana
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.08109
Pdf URL: https://arxiv.org/pdf/2507.08109
Copy Paste: [[2507.08109]] Audit, Alignment, and Optimization of LM-Powered Subroutines with Application to Public Comment Processing(https://arxiv.org/abs/2507.08109)
Keywords: language model, prompt
Abstract: The advent of language models (LMs) has the potential to dramatically accelerate tasks that may be cast to text-processing; however, real-world adoption is hindered by concerns regarding safety, explainability, and bias. How can we responsibly leverage LMs in a transparent, auditable manner -- minimizing risk and allowing human experts to focus on informed decision-making rather than data-processing or prompt engineering? In this work, we propose a framework for declaring statically typed, LM-powered subroutines (i.e., callable, function-like procedures) for use within conventional asynchronous code -- such that sparse feedback from human experts is used to improve the performance of each subroutine online (i.e., during use). In our implementation, all LM-produced artifacts (i.e., prompts, inputs, outputs, and data-dependencies) are recorded and exposed to audit on demand. We package this framework as a library to support its adoption and continued development. While this framework may be applicable across several real-world decision workflows (e.g., in healthcare and legal fields), we evaluate it in the context of public comment processing as mandated by the 1969 National Environmental Policy Act (NEPA): Specifically, we use this framework to develop "CommentNEPA," an application that compiles, organizes, and summarizes a corpus of public commentary submitted in response to a project requiring environmental review. We quantitatively evaluate the application by comparing its outputs (when operating without human feedback) to historical "ground-truth" data as labelled by human annotators during the preparation of official environmental impact statements.
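Illustrative sketch (not the authors' library): the idea of a typed, awaitable, LM-backed subroutine whose prompts and outputs are recorded for audit can be mocked up with the standard library alone. Everything named here (`AuditLog`, `AuditRecord`, `lm_subroutine`, `fake_lm`, `classify_topic`) is hypothetical, the LM is a canned stand-in, and the human-feedback loop described in the abstract is not modelled.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List

@dataclass
class AuditRecord:
    subroutine: str  # name of the subroutine that made the call
    prompt: str      # exact prompt sent to the LM
    output: str      # raw LM output

@dataclass
class AuditLog:
    records: List[AuditRecord] = field(default_factory=list)

async def fake_lm(prompt: str) -> str:
    """Stand-in for a real LM call; returns a canned label."""
    await asyncio.sleep(0)
    return "water quality" if "river" in prompt else "other"

def lm_subroutine(log: AuditLog, lm: Callable[[str], Awaitable[str]]):
    """Turn a prompt-building function into an awaitable subroutine
    that records every prompt/output pair in `log`."""
    def wrap(make_prompt: Callable[[str], str]) -> Callable[[str], Awaitable[str]]:
        async def call(comment: str) -> str:
            prompt = make_prompt(comment)
            output = await lm(prompt)
            log.records.append(AuditRecord(make_prompt.__name__, prompt, output))
            return output
        return call
    return wrap

log = AuditLog()

@lm_subroutine(log, fake_lm)
def classify_topic(comment: str) -> str:
    return f"Assign one topic label to this public comment: {comment}"

label = asyncio.run(classify_topic("The project will pollute the river."))
print(label, len(log.records))
```

Because the decorated function is an ordinary coroutine, it can be awaited alongside other asynchronous code, while the log keeps a record that a reviewer could inspect afterwards.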

Title: Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores

Authors: Vivek Chari, Benjamin Van Durme
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.08143
Pdf URL: https://arxiv.org/pdf/2507.08143
Copy Paste: [[2507.08143]] Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores(https://arxiv.org/abs/2507.08143)
Keywords: language model, llm, long context
Abstract: Modern Large Language Models (LLMs) are increasingly trained to support very large context windows. Unfortunately, the ability to use long contexts in generation is complicated by the large memory requirement of the KV cache, which scales linearly with the context length. This memory footprint is often the dominant resource bottleneck in real-world deployments, limiting throughput and increasing serving cost. One way to address this is by compressing the KV cache, which can be done either with knowledge of the question being asked (query-aware) or without knowledge of the query (query-agnostic). We present Compactor, a parameter-free, query-agnostic KV compression strategy that uses approximate leverage scores to determine token importance. We show that Compactor can achieve the same performance as competing methods while retaining half the tokens in both synthetic and real-world context tasks, with minimal computational overhead. We further introduce a procedure for context-calibrated compression, which allows one to infer the maximum compression ratio a given context can support. Using context-calibrated compression, we show that Compactor achieves full KV performance on LongBench while reducing the KV memory burden by 63%, on average. To demonstrate the efficacy and generalizability of our approach, we apply Compactor to 27 synthetic and real-world tasks from RULER and LongBench, with models from both the Qwen 2.5 and Llama 3.1 families.
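Illustrative sketch (not Compactor itself): the abstract ranks tokens by approximate leverage scores. For intuition, the snippet below computes exact leverage scores, k_i^T (K^T K)^{-1} k_i, for a toy key matrix with two-dimensional keys, where the 2x2 inverse has a closed form, and keeps the highest-scoring tokens. The restriction to 2-D keys, the function names, and the data are assumptions made for brevity; the paper's approximation scheme is not reproduced.

```python
def leverage_scores_2d(keys):
    """Exact leverage scores for the rows of an n x 2 matrix K:
    score_i = k_i^T (K^T K)^{-1} k_i."""
    a = sum(x * x for x, _ in keys)  # (K^T K)[0][0]
    b = sum(x * y for x, y in keys)  # (K^T K)[0][1]
    c = sum(y * y for _, y in keys)  # (K^T K)[1][1]
    det = a * c - b * b
    if det == 0:
        raise ValueError("K^T K is singular; keys do not span both dimensions")
    return [(c * x * x - 2 * b * x * y + a * y * y) / det for x, y in keys]

def keep_top_tokens(keys, keep):
    """Indices (in original order) of the `keep` highest-leverage tokens."""
    scores = leverage_scores_2d(keys)
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

# Three redundant keys and one key pointing in a unique direction.
keys = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
scores = leverage_scores_2d(keys)
kept = keep_top_tokens(keys, keep=2)
print(scores, kept)
```

The token whose key is the only one in its direction gets the largest score, while the three redundant keys share their direction's weight; the scores sum to the rank of K, which is 2 here.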

Title: Distilling Empathy from Large Language Models

Authors: Henry J. Xie, Jinghan Zhang, Xinhao Zhang, Kunpeng Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.08151
Pdf URL: https://arxiv.org/pdf/2507.08151
Copy Paste: [[2507.08151]] Distilling Empathy from Large Language Models(https://arxiv.org/abs/2507.08151)
Keywords: language model, llm, prompt
Abstract: The distillation of knowledge from Large Language Models (LLMs) into Smaller Language Models (SLMs), preserving the capabilities and performance of LLMs while reducing model size, has played a key role in the proliferation of LLMs. Because SLMs are considerably smaller than LLMs, they are often utilized in domains where human interaction is frequent but resources are highly constrained, e.g., smart phones. Therefore, it is crucial to ensure that empathy, a fundamental aspect of positive human interactions, already instilled into LLMs, is retained by SLMs after distillation. In this paper, we develop a comprehensive approach for effective empathy distillation from LLMs into SLMs. Our approach features a two-step fine-tuning process that fully leverages datasets of empathetic dialogue responses distilled from LLMs. We explore several distillation methods beyond basic direct prompting and propose four unique sets of prompts for targeted empathy improvement to significantly enhance the empathy distillation process. Our evaluations demonstrate that SLMs fine-tuned through the two-step fine-tuning process with distillation datasets enhanced by the targeted empathy improvement prompts significantly outperform the base SLM at generating empathetic responses with a win rate of 90%. Our targeted empathy improvement prompts substantially outperform the basic direct prompting with a 10% improvement in win rate.

Title: TruthTorchLM: A Comprehensive Library for Predicting Truthfulness in LLM Outputs

Authors: Duygu Nur Yaldiz, Yavuz Faruk Bakman, Sungmin Kang, Alperen Öziş, Hayrettin Eren Yildiz, Mitash Ashish Shah, Zhiqi Huang, Anoop Kumar, Alfy Samuel, Daben Liu, Sai Praneeth Karimireddy, Salman Avestimehr
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.08203
Pdf URL: https://arxiv.org/pdf/2507.08203
Copy Paste: [[2507.08203]] TruthTorchLM: A Comprehensive Library for Predicting Truthfulness in LLM Outputs(https://arxiv.org/abs/2507.08203)
Keywords: language model, llm
Abstract: Generative Large Language Models (LLMs) inevitably produce untruthful responses. Accurately predicting the truthfulness of these outputs is critical, especially in high-stakes settings. To accelerate research in this domain and make truthfulness prediction methods more accessible, we introduce TruthTorchLM, an open-source, comprehensive Python library featuring over 30 truthfulness prediction methods, which we refer to as Truth Methods. Unlike existing toolkits such as Guardrails, which focus solely on document-grounded verification, or LM-Polygraph, which is limited to uncertainty-based methods, TruthTorchLM offers a broad and extensible collection of techniques. These methods span diverse tradeoffs in computational cost, access level (e.g., black-box vs white-box), grounding document requirements, and supervision type (self-supervised or supervised). TruthTorchLM is seamlessly compatible with both HuggingFace and LiteLLM, enabling support for locally hosted and API-based models. It also provides a unified interface for generation, evaluation, calibration, and long-form truthfulness prediction, along with a flexible framework for extending the library with new methods. We conduct an evaluation of representative truth methods on three datasets, TriviaQA, GSM8K, and FactScore-Bio. The code is available at this https URL

Title: Simple Mechanistic Explanations for Out-Of-Context Reasoning

Authors: Atticus Wang, Joshua Engels, Oliver Clive-Griffin
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.08218
Pdf URL: https://arxiv.org/pdf/2507.08218
Copy Paste: [[2507.08218]] Simple Mechanistic Explanations for Out-Of-Context Reasoning(https://arxiv.org/abs/2507.08218)
Keywords: llm
Abstract: Out-of-context reasoning (OOCR) is a phenomenon in which fine-tuned LLMs exhibit surprisingly deep out-of-distribution generalization. Rather than learning shallow heuristics, they implicitly internalize and act on the consequences of observations scattered throughout the fine-tuning data. In this work, we investigate this phenomenon mechanistically and find that many instances of OOCR in the literature have a simple explanation: the LoRA fine-tuning essentially adds a constant steering vector, steering the model towards a general concept. This improves performance on the fine-tuning task and in many other concept-related domains, causing the surprising generalization. Moreover, we can directly train steering vectors for these tasks from scratch, which also induces OOCR. We find that our results hold even for a task that seems like it must involve conditional behavior (model backdoors); it turns out that unconditionally adding a steering vector is sufficient. Overall, our work presents one explanation of what gets learned during fine-tuning for OOCR tasks, contributing to the key question of why LLMs can reason out of context, an advanced capability that is highly relevant to their safe and reliable deployment.
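Illustrative sketch (not the paper's code): "unconditionally adding a steering vector" means the same vector is added to every hidden state, whatever the input. The toy function below makes that concrete on plain Python lists; `add_steering`, the two-dimensional hidden states, and the vector values are hypothetical, and no real model or LoRA adapter is involved.

```python
def add_steering(hidden_states, steering, scale=1.0):
    """Add the same steering vector to every token's hidden state."""
    return [[h + scale * s for h, s in zip(token, steering)]
            for token in hidden_states]

steering = [0.5, -1.0]

# Hidden states for two unrelated inputs (made-up numbers).
hidden_a = [[1.0, 0.0], [0.0, 2.0]]
hidden_b = [[-3.0, 4.0]]

steered_a = add_steering(hidden_a, steering)
steered_b = add_steering(hidden_b, steering)

# The shift applied to each token is identical, regardless of the input.
shifts = [[after - before for after, before in zip(s, h)]
          for s, h in zip(steered_a + steered_b, hidden_a + hidden_b)]
print(shifts)
```

Every entry of `shifts` equals the steering vector, which is what distinguishes a constant steering direction from input-dependent (conditional) behaviour.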

Title: Can LLMs Reliably Simulate Real Students' Abilities in Mathematics and Reading Comprehension?

Authors: KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.08232
Pdf URL: https://arxiv.org/pdf/2507.08232
Copy Paste: [[2507.08232]] Can LLMs Reliably Simulate Real Students' Abilities in Mathematics and Reading Comprehension?(https://arxiv.org/abs/2507.08232)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly used as proxy students in the development of Intelligent Tutoring Systems (ITSs) and in piloting test questions. However, to what extent these proxy students accurately emulate the behavior and characteristics of real students remains an open question. To investigate this, we collected a dataset of 489 items from the National Assessment of Educational Progress (NAEP), covering mathematics and reading comprehension in grades 4, 8, and 12. We then apply an Item Response Theory (IRT) model to position 11 diverse and state-of-the-art LLMs on the same ability scale as real student populations. Our findings reveal that, without guidance, strong general-purpose models consistently outperform the average student at every grade, while weaker or domain-mismatched models may align incidentally. Using grade-enforcement prompts changes models' performance, but whether they align with the average grade-level student remains highly model- and prompt-specific: no evaluated model-prompt pair fits the bill across subjects and grades, underscoring the need for new training and evaluation strategies. We conclude by providing guidelines for the selection of viable proxies based on our findings.
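Illustrative sketch (not the paper's pipeline): Item Response Theory places test-takers and items on a shared ability scale. The abstract does not say which IRT variant is used, so the snippet assumes the common two-parameter logistic (2PL) model, P(correct) = 1 / (1 + exp(-a(theta - b))), and estimates ability by a simple grid search over the likelihood. The item parameters and response patterns are invented for illustration.

```python
import math

def p_correct(theta, a, b):
    """2PL probability that a test-taker with ability theta answers an item
    with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern given ability theta."""
    total = 0.0
    for (a, b), r in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if r else math.log(1.0 - p)
    return total

def estimate_ability(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Maximum-likelihood ability estimate on an evenly spaced grid."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical items as (discrimination, difficulty) pairs.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]

strong = estimate_ability(items, [1, 1, 1, 0])  # e.g. a strong model
weak = estimate_ability(items, [1, 0, 0, 0])    # e.g. a weaker model
print(strong, weak)
```

Once the item parameters are fixed, any respondent, human or model, can be scored on the same items, which is how their estimated abilities become comparable.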

Title: KAT-V1: Kwai-AutoThink Technical Report

Authors: Zizheng Zhan, Ken Deng, Huaixi Tang, Wen Xiang, Kun Wu, Weihao Li, Wenqiang Zhu, Jingxuan Xu, Lecheng Huang, Zongxian Feng, Shaojie Wang, Shangpeng Yan, Jiaheng Liu, Zhongyuan Peng, Zuchen Gao, Haoyang Huang, Ziqi Zhan, Yanan Wu, Yuanxing Zhang, Jian Yang, Guang Chen, Haotian Zhang, Bin Chen, Bing Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.08297
Pdf URL: https://arxiv.org/pdf/2507.08297
Copy Paste: [[2507.08297]] KAT-V1: Kwai-AutoThink Technical Report(https://arxiv.org/abs/2507.08297)
Keywords: language model, prompt, agent
Abstract: We present Kwaipilot-AutoThink (KAT), an open-source 40B large language model developed to address the overthinking problem in reasoning-intensive tasks, where an automatic thinking training paradigm is proposed to dynamically switch between reasoning and non-reasoning modes based on task complexity. Specifically, first, we construct the dual-regime dataset based on a novel tagging pipeline and a multi-agent synthesis strategy, and then we apply Multi-Token Prediction (MTP)-enhanced knowledge distillation, enabling efficient and fine-grained reasoning transfer with minimal pretraining cost. Besides, we implement a cold-start initialization strategy that introduces mode-selection priors using majority-vote signals and intent-aware prompting. Finally, we propose Step-SRPO, a reinforcement learning algorithm that incorporates intermediate supervision into the GRPO framework, offering structured guidance over both reasoning-mode selection and response accuracy. Extensive experiments across multiple benchmarks demonstrate that KAT consistently matches or even outperforms current state-of-the-art models, including DeepSeek-R1-0528 and Qwen3-235B-A22B, across a wide range of reasoning-intensive tasks while reducing token usage by up to approximately 30%. Beyond academic evaluation, KAT has been successfully deployed in Kwaipilot (i.e., Kuaishou's internal coding assistant), and improves real-world development workflows with high accuracy, efficiency, and controllable reasoning behaviors. Moreover, we are actively training a 200B Mixture-of-Experts (MoE) with 40B activation parameters, where the early-stage results already demonstrate promising improvements in performance and efficiency, further showing the scalability of the AutoThink paradigm.

Title: Improving MLLM's Document Image Machine Translation via Synchronously Self-reviewing Its OCR Proficiency

Authors: Yupu Liang, Yaping Zhang, Zhiyang Zhang, Zhiyuan Chen, Yang Zhao, Lu Xiang, Chengqing Zong, Yu Zhou
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2507.08309
Pdf URL: https://arxiv.org/pdf/2507.08309
Copy Paste: [[2507.08309]] Improving MLLM's Document Image Machine Translation via Synchronously Self-reviewing Its OCR Proficiency(https://arxiv.org/abs/2507.08309)
Keywords: language model, llm, prompt
Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance in document image tasks, especially Optical Character Recognition (OCR). However, they struggle with Document Image Machine Translation (DIMT), which requires handling both cross-modal and cross-lingual challenges. Previous efforts to enhance DIMT capability through Supervised Fine-Tuning (SFT) on the DIMT dataset often result in the forgetting of the model's existing monolingual abilities, such as OCR. To address these challenges, we introduce a novel fine-tuning paradigm, named Synchronously Self-Reviewing (SSR) its OCR proficiency, inspired by the concept "Bilingual Cognitive Advantage". Specifically, SSR prompts the model to generate OCR text before producing translation text, which allows the model to leverage its strong monolingual OCR ability while learning to translate text across languages. Comprehensive experiments demonstrate the proposed SSR learning helps mitigate catastrophic forgetting, improving the generalization ability of MLLMs on both OCR and DIMT tasks.

Title: CRMAgent: A Multi-Agent LLM System for E-Commerce CRM Message Template Generation

Authors: Yinzhu Quan, Xinrui Li, Ying Chen
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2507.08325
Pdf URL: https://arxiv.org/pdf/2507.08325
Copy Paste: [[2507.08325]] CRMAgent: A Multi-Agent LLM System for E-Commerce CRM Message Template Generation(https://arxiv.org/abs/2507.08325)
Keywords: language model, llm, agent
Abstract: In e-commerce private-domain channels such as instant messaging and e-mail, merchants engage customers directly as part of their Customer Relationship Management (CRM) programmes to drive retention and conversion. While a few top performers excel at crafting outbound messages, most merchants struggle to write persuasive copy because they lack both expertise and scalable tools. We introduce CRMAgent, a multi-agent system built on large language models (LLMs) that generates high-quality message templates and actionable writing guidance through three complementary modes. First, group-based learning enables the agent to learn from a merchant's own top-performing messages within the same audience segment and rewrite low-performing ones. Second, retrieval-and-adaptation fetches templates that share the same audience segment and exhibit high similarity in voucher type and product category, learns their successful patterns, and adapts them to the current campaign. Third, a rule-based fallback provides a lightweight zero-shot rewrite when no suitable references are available. Extensive experiments show that CRMAgent consistently outperforms merchants' original templates, delivering significant gains in both audience-match and marketing-effectiveness metrics.

Title: MK2 at PBIG Competition: A Prompt Generation Solution

Authors: Yuzheng Xu, Tosho Hirasawa, Seiya Kawano, Shota Kato, Tadashi Kozuno
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.08335
Pdf URL: https://arxiv.org/pdf/2507.08335
Copy Paste: [[2507.08335]] MK2 at PBIG Competition: A Prompt Generation Solution(https://arxiv.org/abs/2507.08335)
Keywords: gpt, prompt
Abstract: The Patent-Based Idea Generation task asks systems to turn real patents into product ideas viable within three years. We propose MK2, a prompt-centric pipeline: Gemini 2.5 drafts and iteratively edits a prompt, grafting useful fragments from weaker outputs; GPT-4.1 then uses this prompt to create one idea per patent, and an Elo loop judged by Qwen3-8B selects the best prompt, all without extra training data. Across three domains, two evaluator types, and six criteria, MK2 topped the automatic leaderboard and won 25 of 36 tests. Only the materials-chemistry track lagged, indicating the need for deeper domain grounding; yet, the results show that lightweight prompt engineering has already delivered competitive, commercially relevant ideation from patents.
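Illustrative sketch (not the MK2 code): an "Elo loop" ranks candidates from pairwise judgements using the standard Elo update, with expected score E_a = 1 / (1 + 10^((R_b - R_a) / 400)) and new rating R_a + K (S_a - E_a). The snippet runs one round-robin over candidate prompts; the K-factor, the starting rating, the round-robin schedule, and `toy_judge` (a stand-in for the LLM judge that simply prefers longer prompts) are all assumptions.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a, r_b, a_won, k=32.0):
    """Return the new (r_a, r_b) after one comparison."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

def run_elo(prompts, judge, k=32.0, start=1000.0):
    """One round-robin pass; judge(a, b) is True when a beats b."""
    ratings = {p: start for p in prompts}
    for i, a in enumerate(prompts):
        for b in prompts[i + 1:]:
            ratings[a], ratings[b] = update(ratings[a], ratings[b], judge(a, b), k)
    return ratings

def toy_judge(a, b):
    """Hypothetical stand-in for an LLM judge: prefers the longer prompt."""
    return len(a) > len(b)

prompts = ["short", "a medium prompt", "a much longer, detailed prompt"]
ratings = run_elo(prompts, toy_judge)
best = max(ratings, key=ratings.get)
print(best)
```

Each update moves the same number of points from loser to winner, so the total rating is conserved and only the relative ordering carries information.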

Title: What Factors Affect LLMs and RLLMs in Financial Question Answering?

Authors: Peng Wang, Xuesi Hu, Jiageng Wu, Yuntao Zou, Qiancheng Zhang, Dagang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.08339
Pdf URL: https://arxiv.org/pdf/2507.08339
Copy Paste: [[2507.08339]] What Factors Affect LLMs and RLLMs in Financial Question Answering?(https://arxiv.org/abs/2507.08339)
Keywords: language model, llm, prompt, chain-of-thought, agent
Abstract: Recently, the development of large language models (LLMs) and reasoning large language models (RLLMs) has gained considerable attention from many researchers. RLLMs enhance the reasoning capabilities of LLMs through Long Chain-of-Thought (Long CoT) processes, significantly improving the performance of LLMs in addressing complex problems. However, there are few works that systematically explore what methods can fully unlock the performance of LLMs and RLLMs within the financial domain. To investigate the impact of various methods on LLMs and RLLMs, we utilize five LLMs and three RLLMs to assess the effects of prompting methods, agentic frameworks, and multilingual alignment methods on financial question-answering tasks. Our research findings indicate: (1) Current prompting methods and agent frameworks enhance the performance of LLMs in financial question answering by simulating Long CoT; (2) RLLMs possess inherent Long CoT capabilities, which limits the effectiveness of conventional methods in further enhancing their performance; (3) Current advanced multilingual alignment methods primarily improve the multilingual performance of LLMs by extending the reasoning length, which yields minimal benefits for RLLMs. We hope that this study can serve as an important reference for LLMs and RLLMs in the field of financial question answering.

Title: Exploring Design of Multi-Agent LLM Dialogues for Research Ideation

Authors: Keisuke Ueda, Wataru Hirota, Takuto Asakura, Takahiro Omi, Kosuke Takahashi, Kosuke Arima, Tatsuya Ishigaki
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2507.08350
Pdf URL: https://arxiv.org/pdf/2507.08350
Copy Paste: [[2507.08350]] Exploring Design of Multi-Agent LLM Dialogues for Research Ideation(https://arxiv.org/abs/2507.08350)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) are increasingly used to support creative tasks such as research idea generation. While recent work has shown that structured dialogues between LLMs can improve the novelty and feasibility of generated ideas, the optimal design of such interactions remains unclear. In this study, we conduct a comprehensive analysis of multi-agent LLM dialogues for scientific ideation. We compare different configurations of agent roles, number of agents, and dialogue depth to understand how these factors influence the novelty and feasibility of generated ideas. Our experimental setup includes settings where one agent generates ideas and another critiques them, enabling iterative improvement. Our results show that enlarging the agent cohort, deepening the interaction depth, and broadening agent persona heterogeneity each enrich the diversity of generated ideas. Moreover, specifically increasing critic-side diversity within the ideation-critique-revision loop further boosts the feasibility of the final proposals. Our findings offer practical guidelines for building effective multi-agent LLM systems for scientific ideation. Our code is available at this https URL.

Title: The Curious Case of Factuality Finetuning: Models' Internal Beliefs Can Improve Factuality

Authors: Benjamin Newman, Abhilasha Ravichander, Jaehun Jung, Rui Xin, Hamish Ivison, Yegor Kuznetsov, Pang Wei Koh, Yejin Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.08371
Pdf URL: https://arxiv.org/pdf/2507.08371
Copy Paste: [[2507.08371]] The Curious Case of Factuality Finetuning: Models' Internal Beliefs Can Improve Factuality(https://arxiv.org/abs/2507.08371)
Keywords: language model, hallucination
Abstract: Language models are prone to hallucination - generating text that is factually incorrect. Finetuning models on high-quality factual information can potentially reduce hallucination, but concerns remain; obtaining factual gold data can be expensive and training on correct but unfamiliar data may potentially lead to even more downstream hallucination. What data should practitioners finetune on to mitigate hallucinations in language models? In this work, we study the relationship between the factuality of finetuning data and the prevalence of hallucinations in long-form generation tasks. Counterintuitively, we find that finetuning on factual gold data is not as helpful as finetuning on model-generated data that models believe to be factual. Next, we evaluate filtering strategies applied on both factual gold data and model-generated data, and find that finetuning on model-generated data that is filtered by models' own internal judgments often leads to better overall factuality compared to other configurations: training on gold data filtered by models' judgments, training on gold data alone, or training on model-generated data that is supported by gold data. These factuality improvements transfer across three domains we study, suggesting that a model's own beliefs can provide a powerful signal for factuality.
摘要：语言模型容易产生幻觉，即生成与事实不符的文本。在高质量的事实信息上微调模型有可能减少幻觉，但仍存在担忧：获取事实性的黄金数据可能代价高昂，而在正确但模型不熟悉的数据上训练可能导致下游出现更多幻觉。从业者应当在什么数据上进行微调，才能减轻语言模型的幻觉？在这项工作中，我们研究了微调数据的事实性与长文本生成任务中幻觉发生率之间的关系。与直觉相反，我们发现在事实性黄金数据上微调，不如在模型自身认为属实的模型生成数据上微调有效。接下来，我们评估了应用于事实性黄金数据和模型生成数据的过滤策略，发现与其他配置（在经模型判断过滤的黄金数据上训练、仅在黄金数据上训练，或在有黄金数据支持的模型生成数据上训练）相比，在经模型自身内部判断过滤的模型生成数据上微调，通常能带来更好的整体事实性。这些事实性改进在我们研究的三个领域中均可迁移，表明模型自身的信念可以为事实性提供强有力的信号。

Title: A Survey of Large Language Models in Discipline-specific Research: Challenges, Methods and Opportunities

Authors: Lu Xiang, Yang Zhao, Yaping Zhang, Chengqing Zong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.08425
Pdf URL: https://arxiv.org/pdf/2507.08425
Copy Paste: [[2507.08425]] A Survey of Large Language Models in Discipline-specific Research: Challenges, Methods and Opportunities(https://arxiv.org/abs/2507.08425)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Large Language Models (LLMs) have demonstrated their transformative potential across numerous disciplinary studies, reshaping the existing research methodologies and fostering interdisciplinary collaboration. However, a systematic understanding of their integration into diverse disciplines remains underexplored. This survey paper provides a comprehensive overview of the application of LLMs in interdisciplinary studies, categorising research efforts from both a technical perspective and with regard to their applicability. From a technical standpoint, key methodologies such as supervised fine-tuning, retrieval-augmented generation, agent-based approaches, and tool-use integration are examined, which enhance the adaptability and effectiveness of LLMs in discipline-specific contexts. From the perspective of their applicability, this paper explores how LLMs are contributing to various disciplines including mathematics, physics, chemistry, biology, and the humanities and social sciences, demonstrating their role in discipline-specific tasks. The prevailing challenges are critically examined and the promising research directions are highlighted alongside the recent advances in LLMs. By providing a comprehensive overview of the technical developments and applications in this field, this survey aims to serve as an invaluable resource for the researchers who are navigating the complex landscape of LLMs in the context of interdisciplinary studies.
摘要：大型语言模型（LLMS）证明了它们在众多学科研究中的变革潜力，重塑了现有的研究方法并促进了跨学科的合作。但是，对它们整合到不同学科的整合的系统理解仍然没有被忽视。本调查论文概述了LLM在跨学科研究中的应用，从技术角度和其适用性对研究工作进行了分类。从技术的角度来看，检查了关键方法，例如受监督的微调，检索功能生成，基于代理的方法和工具使用的集成，从而增强了LLMS在学科特定环境中的适应性和有效性。从其适用性的角度来看，本文探讨了LLM如何为包括数学，物理，化学，生物学以及人文和社会科学在内的各种学科做出贡献，从而展示了它们在特定学科的任务中的作用。盛行的挑战得到了严格的研究，并且在LLM的最新进展中强调了有前途的研究方向。通过对该领域的技术发展和应用进行全面的概述，该调查旨在为在跨学科研究中导航LLMS复杂景观的研究人员提供了宝贵的资源。

Title: ChainEdit: Propagating Ripple Effects in LLM Knowledge Editing through Logical Rule-Guided Chains

Authors: Zilu Dong, Xiangqing Shen, Zinong Yang, Rui Xia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.08427
Pdf URL: https://arxiv.org/pdf/2507.08427
Copy Paste: [[2507.08427]] ChainEdit: Propagating Ripple Effects in LLM Knowledge Editing through Logical Rule-Guided Chains(https://arxiv.org/abs/2507.08427)
Keywords: language model, llm
Abstract: Current knowledge editing methods for large language models (LLMs) struggle to maintain logical consistency when propagating ripple effects to associated facts. We propose ChainEdit, a framework that synergizes knowledge graph-derived logical rules with LLM logical reasoning capabilities to enable systematic chain updates. By automatically extracting logical patterns from structured knowledge bases and aligning them with LLMs' internal logics, ChainEdit dynamically generates and edits logically connected knowledge clusters. Experiments demonstrate an improvement of more than 30% in logical generalization over baselines while preserving editing reliability and specificity. We further address evaluation biases in existing benchmarks through knowledge-aware protocols that disentangle external dependencies. This work establishes new state-of-the-art performance on ripple effect while ensuring internal logical consistency after knowledge editing.
摘要：当前针对大型语言模型（LLM）的知识编辑方法在将连锁反应传播到相关事实时，难以保持逻辑一致性。我们提出了ChainEdit，这一框架将源自知识图谱的逻辑规则与LLM的逻辑推理能力相结合，以实现系统性的链式更新。通过自动从结构化知识库中提取逻辑模式并将其与LLM的内部逻辑对齐，ChainEdit能够动态生成并编辑逻辑上相互关联的知识簇。实验表明，在保持编辑可靠性和特异性的同时，逻辑泛化能力相比基线提升超过30％。我们还通过能够解耦外部依赖的知识感知协议，进一步解决了现有基准中的评估偏差。这项工作在连锁反应方面确立了新的最先进性能，同时确保了知识编辑后的内部逻辑一致性。

Title: Finding Common Ground: Using Large Language Models to Detect Agreement in Multi-Agent Decision Conferences

Authors: Selina Heller, Mohamed Ibrahim, David Antony Selby, Sebastian Vollmer
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2507.08440
Pdf URL: https://arxiv.org/pdf/2507.08440
Copy Paste: [[2507.08440]] Finding Common Ground: Using Large Language Models to Detect Agreement in Multi-Agent Decision Conferences(https://arxiv.org/abs/2507.08440)
Keywords: language model, llm, agent
Abstract: Decision conferences are structured, collaborative meetings that bring together experts from various fields to address complex issues and reach a consensus on recommendations for future actions or policies. These conferences often rely on facilitated discussions to ensure productive dialogue and collective agreement. Recently, Large Language Models (LLMs) have shown significant promise in simulating real-world scenarios, particularly through collaborative multi-agent systems that mimic group interactions. In this work, we present a novel LLM-based multi-agent system designed to simulate decision conferences, specifically focusing on detecting agreement among the participant agents. To achieve this, we evaluate six distinct LLMs on two tasks: stance detection, which identifies the position an agent takes on a given issue, and stance polarity detection, which identifies the sentiment as positive, negative, or neutral. These models are further assessed within the multi-agent system to determine their effectiveness in complex simulations. Our results indicate that LLMs can reliably detect agreement even in dynamic and nuanced debates. Incorporating an agreement-detection agent within the system can also improve the efficiency of group debates and enhance the overall quality and coherence of deliberations, making them comparable to real-world decision conferences regarding outcome and decision-making. These findings demonstrate the potential for LLM-based multi-agent systems to simulate group decision-making processes. They also highlight that such systems could be instrumental in supporting decision-making with expert elicitation workshops across various domains.
摘要：决策会议是结构化的协作会议，将各个领域的专家聚集在一起，以解决复杂的问题，并就未来行动或政策的建议达成共识。这些会议通常依靠促进的讨论来确保富有成效的对话和集体协议。最近，大型语言模型（LLMS）在模拟现实世界的情况下表现出了巨大的希望，尤其是通过模仿群体交互的协作多机构系统。在这项工作中，我们提出了一种基于LLM的新型多代理系统，旨在模拟决策会议，特别着眼于检测参与者之间的一致性。为了实现这一目标，我们在两个任务上评估了六个不同的LLM：姿势检测，它标识了代理在给定问题上的位置，而立场极性检测将情感确定为正，负或中性。这些模型在多代理系统中进一步评估，以确定它们在复杂模拟中的有效性。我们的结果表明，即使在动态和细微的辩论中，LLM也可以可靠地检测一致性。将协议检测代理纳入系统还可以提高小组辩论的效率，并提高审议的总体质量和一致性，从而与现实世界中有关结果和决策的决策会议相媲美。这些发现证明了基于LLM的多代理系统模拟组决策过程的潜力。他们还强调，这种系统可能有助于通过各个领域的专家启发研讨会来支持决策。

Title: Diagnosing Failures in Large Language Models' Answers: Integrating Error Attribution into Evaluation Framework

Authors: Zishan Xu, Shuyi Xie, Qingsong Lv, Shupei Xiao, Linlin Song, Sui Wenjuan, Fan Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.08459
Pdf URL: https://arxiv.org/pdf/2507.08459
Copy Paste: [[2507.08459]] Diagnosing Failures in Large Language Models' Answers: Integrating Error Attribution into Evaluation Framework(https://arxiv.org/abs/2507.08459)
Keywords: language model, llm
Abstract: With the widespread application of Large Language Models (LLMs) in various tasks, the mainstream LLM platforms generate massive user-model interactions daily. In order to efficiently analyze the performance of models and diagnose failures in their answers, it is essential to develop an automated framework to systematically categorize and attribute errors. However, existing evaluation models lack error attribution capability. In this work, we establish a comprehensive Misattribution Framework with 6 primary and 15 secondary categories to facilitate in-depth analysis. Based on this framework, we present AttriData, a dataset specifically designed for error attribution, encompassing misattribution, along with the corresponding scores and feedback. We also propose MisAttributionLLM, a fine-tuned model on AttriData, which is the first general-purpose judge model capable of simultaneously generating score, misattribution, and feedback. Extensive experiments and analyses are conducted to confirm the effectiveness and robustness of our proposed method.
摘要：随着大型语言模型（LLM）在各种任务中的广泛应用，主流LLM平台每天都会产生大量的用户与模型交互。为了高效地分析模型性能并诊断其回答中的失败，必须开发一个自动化框架来系统地对错误进行分类和归因。然而，现有的评估模型缺乏错误归因能力。在这项工作中，我们建立了一个全面的错误归因框架（Misattribution Framework），包含6个一级类别和15个二级类别，以便进行深入分析。基于该框架，我们提出了AttriData，这是一个专为错误归因设计的数据集，涵盖错误归因以及相应的分数和反馈。我们还提出了MisAttributionLLM，这是一个在AttriData上微调的模型，也是第一个能够同时生成分数、错误归因和反馈的通用评判模型。我们进行了大量实验和分析，以验证所提方法的有效性和鲁棒性。

Title: Using Large Language Models for Legal Decision-Making in Austrian Value-Added Tax Law: An Experimental Study

Authors: Marina Luketina, Andrea Benkel, Christoph G. Schuetz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.08468
Pdf URL: https://arxiv.org/pdf/2507.08468
Copy Paste: [[2507.08468]] Using Large Language Models for Legal Decision-Making in Austrian Value-Added Tax Law: An Experimental Study(https://arxiv.org/abs/2507.08468)
Keywords: language model, llm, retrieval-augmented generation
Abstract: This paper provides an experimental evaluation of the capability of large language models (LLMs) to assist in legal decision-making within the framework of Austrian and European Union value-added tax (VAT) law. In tax consulting practice, clients often describe cases in natural language, making LLMs a prime candidate for supporting automated decision-making and reducing the workload of tax professionals. Given the requirement for legally grounded and well-justified analyses, the propensity of LLMs to hallucinate presents a considerable challenge. The experiments focus on two common methods for enhancing LLM performance: fine-tuning and retrieval-augmented generation (RAG). In this study, these methods are applied on both textbook cases and real-world cases from a tax consulting firm to systematically determine the best configurations of LLM-based systems and assess the legal-reasoning capabilities of LLMs. The findings highlight the potential of using LLMs to support tax consultants by automating routine tasks and providing initial analyses, although current prototypes are not ready for full automation due to the sensitivity of the legal domain. The findings indicate that LLMs, when properly configured, can effectively support tax professionals in VAT tasks and provide legally grounded justifications for decisions. However, limitations remain regarding the handling of implicit client knowledge and context-specific documentation, underscoring the need for future integration of structured background information.
摘要：本文提供了对大语模型（LLMS）在奥地利和欧盟增值税（VAT）法律框架内协助法律决策能力的实验评估。在税务咨询实践中，客户经常用自然语言描述案例，使LLMS成为支持自动决策并减少税务专业人士工作量的主要候选人。鉴于对合法扎根和正式分析的要求，LLM幻觉的倾向提出了一个巨大的挑战。这些实验着重于增强LLM性能的两种常见方法：微调和检索效果生成（RAG）。在这项研究中，这些方法都应用于教科书案例和从税务咨询公司的现实情况上，以系统地确定基于LLM的系统的最佳配置，并评估LLMS的法律策划功能。这些发现突出了使用LLM通过自动执行日常任务并提供初始分析来支持税务顾问的潜力，尽管由于法律领域的敏感性，目前的原型尚未准备好进行完全自动化。调查结果表明，LLM在正确配置后可以有效地支持增值税任务中的税务专业人员，并为决策提供合法的理由。但是，关于处理隐式客户知识和特定于上下文的文档的限制仍然存在，强调了对结构化背景信息的未来整合的需求。

Title: ILT-Iterative LoRA Training through Focus-Feedback-Fix for Multilingual Speech Recognition

Authors: Qingliang Meng, Hao Wu, Wei Liang, Wei Xu, Qing Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.08477
Pdf URL: https://arxiv.org/pdf/2507.08477
Copy Paste: [[2507.08477]] ILT-Iterative LoRA Training through Focus-Feedback-Fix for Multilingual Speech Recognition(https://arxiv.org/abs/2507.08477)
Keywords: language model
Abstract: The deep integration of large language models and automatic speech recognition systems has become a promising research direction with high practical value. To address the overfitting issue commonly observed in Low-Rank Adaptation (LoRA) during the supervised fine-tuning (SFT) stage, this work proposes an innovative training paradigm Iterative LoRA Training (ILT) in combination with an Iterative Pseudo Labeling strategy, effectively enhancing the theoretical upper bound of model performance. Based on Whisper-large-v3 and Qwen2-Audio, we conduct systematic experiments using a three-stage training process: Focus Training, Feed Back Training, and Fix Training. Experimental results demonstrate the effectiveness of the proposed method. Furthermore, the MegaAIS research team applied this technique in the Interspeech 2025 Multilingual Conversational Speech Language Modeling Challenge (MLC-SLM), achieving 4th in Track 1 (Multilingual ASR Task) and 1st place in Track 2 (Speech Separation and Recognition Task), showcasing the practical feasibility and strong application potential of our approach.
摘要：大型语言模型与自动语音识别系统的深度融合已成为具有很高实用价值的有前景的研究方向。为了解决监督微调（SFT）阶段低秩适应（LoRA）中常见的过拟合问题，这项工作提出了一种创新的训练范式，即迭代LoRA训练（ILT），并结合迭代伪标注策略，有效提高了模型性能的理论上限。基于Whisper-large-v3和Qwen2-Audio，我们采用三阶段训练流程进行了系统实验：聚焦训练（Focus Training）、反馈训练（Feed Back Training）和修复训练（Fix Training）。实验结果证明了所提方法的有效性。此外，MegaAIS研究团队将该技术应用于Interspeech 2025多语言对话语音语言建模挑战赛（MLC-SLM），在赛道1（多语言ASR任务）中获得第4名，在赛道2（语音分离与识别任务）中获得第1名，展示了我们方法的实际可行性和强大的应用潜力。

Title: A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench

Authors: David Schlangen, Sherzod Hakimov, Jonathan Jordan, Philipp Sadler
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.08491
Pdf URL: https://arxiv.org/pdf/2507.08491
Copy Paste: [[2507.08491]] A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench(https://arxiv.org/abs/2507.08491)
Keywords: language model, llm
Abstract: There are currently two main paradigms for evaluating large language models (LLMs), reference-based evaluation and preference-based evaluation. The first, carried over from the evaluation of machine learning models in general, relies on pre-defined task instances, for which reference task executions are available. The second, best exemplified by the LM-arena, relies on (often self-selected) users bringing their own intents to a site that routes these to several models in parallel, among whose responses the user then selects their most preferred one. The former paradigm hence excels at control over what is tested, while the latter comes with higher ecological validity, testing actual use cases interactively. Recently, a third complementary paradigm has emerged that combines some of the strengths of these approaches, offering control over multi-turn, reference-free, repeatable interactions, while stressing goal-directedness: dialogue game based evaluation. While the utility of this approach has been shown by several projects, its adoption has been held back by the lack of a mature, easily re-usable implementation. In this paper, we present clembench, which has been in continuous development since 2023 and has in its latest release been optimized for ease of general use. We describe how it can be used to benchmark one's own models (using a provided set of benchmark game instances in English), as well as how easily the benchmark itself can be extended with new, tailor-made targeted tests.
摘要：目前有两个用于评估大语言模型（LLM）的主要范式，基于参考的评估和基于偏好的评估。第一，从一般的机器学习模型的评估中延续了，依赖于预定义的任务实例，为此可以使用参考任务执行。第二个由LM-ARENA示例的最佳体现，依赖（通常是自选择的）用户将自己的意图带到一个并行的网站，将这些型号并行地路由到几个模型，其中用户的响应是用户最喜欢的一个。因此，前者范式可以控制测试的范式，而后者则具有更高的生态有效性，从而进行了交互式测试。最近，出现了第三个互补范式，结合了这些方法的某些优势，可以控制多转，无参考，可重复的相互作用，同时强调目标指导性：基于对话游戏的评估。尽管几个项目已经显示了这种方法的效用，但由于缺乏成熟，易于重新使用的实施，其采用却阻碍了其采用。在本文中，我们介绍了Clembench，该文章自2023年以来一直处于连续发展状态，并在其最新版本中进行了优化，以方便一般使用。我们描述了如何用于基准自己的型号（使用提供的一组基准游戏实例），以及如何通过新的，量身定制的目标测试来扩展基准本身。

Title: LLaPa: A Vision-Language Model Framework for Counterfactual-Aware Procedural Planning

Authors: Shibo Sun, Xue Li, Donglin Di, Mingjie Wei, Lanshun Nie, Wei-Nan Zhang, Dechen Zhan, Yang Song, Lei Fan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.08496
Pdf URL: https://arxiv.org/pdf/2507.08496
Copy Paste: [[2507.08496]] LLaPa: A Vision-Language Model Framework for Counterfactual-Aware Procedural Planning(https://arxiv.org/abs/2507.08496)
Keywords: language model, llm
Abstract: While large language models (LLMs) have advanced procedural planning for embodied AI systems through strong reasoning abilities, the integration of multimodal inputs and counterfactual reasoning remains underexplored. To tackle these challenges, we introduce LLaPa, a vision-language model framework designed for multimodal procedural planning. LLaPa generates executable action sequences from textual task descriptions and visual environmental images using vision-language models (VLMs). Furthermore, we enhance LLaPa with two auxiliary modules to improve procedural planning. The first module, the Task-Environment Reranker (TER), leverages task-oriented segmentation to create a task-sensitive feature space, aligning textual descriptions with visual environments and emphasizing critical regions for procedural execution. The second module, the Counterfactual Activities Retriever (CAR), identifies and emphasizes potential counterfactual conditions, enhancing the model's reasoning capability in counterfactual scenarios. Extensive experiments on ActPlan-1K and ALFRED benchmarks demonstrate that LLaPa generates higher-quality plans with superior LCS and correctness, outperforming advanced models. The code and models are available at this https URL.
摘要：尽管大型语言模型（LLMS）通过强大的推理能力进行了体现的AI系统的高级程序计划，但多模式输入和反事实推理的整合仍未得到充实。为了应对这些挑战，我们介绍了llapa，这是一个旨在多模式程序计划的视觉模型框架。 LLAPA使用视觉语言模型（VLM）从文本任务描述和视觉环境图像中生成可执行的动作序列。此外，我们使用两个辅助模块增强了LLAPA，以改善程序计划。第一个模块是任务环境的重读者（TER），利用面向任务的分段创建一个对任务敏感的特征空间，将文本描述与视觉环境保持一致，并强调关键区域以进行程序执行。第二个模块，反事实活动检索（CAR），识别并强调了潜在的反事实条件，从而增强了模型在反事实场景中的推理能力。对ACTPLAN-1K和ALFRED基准测试的广泛实验表明，LLAPA生成具有优质LCS和正确性的高质量计划，表现优于高级模型。代码和模型可用此HTTPS URL。
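The LLaPa abstract reports plan quality in terms of LCS. As a minimal sketch, assuming LCS here denotes the longest common subsequence between a predicted and a reference action sequence, the score could be computed as follows. The function names, the example plans, and the normalisation by the longer plan are illustrative assumptions, not details taken from the paper.

```python
def lcs_length(predicted, reference):
    """Length of the longest common subsequence of two step lists (dynamic programming)."""
    rows, cols = len(predicted), len(reference)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if predicted[i - 1] == reference[j - 1]:
                # Matching steps extend the best subsequence of both prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise carry over the better result from either shorter prefix.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[rows][cols]


def normalized_lcs(predicted, reference):
    """LCS length divided by the longer plan's length (an assumed normalisation)."""
    longest = max(len(predicted), len(reference))
    return lcs_length(predicted, reference) / longest if longest else 1.0


# Hypothetical plans, used only to illustrate the metric.
reference_plan = ["open fridge", "take milk", "close fridge", "pour milk"]
predicted_plan = ["open fridge", "take milk", "pour milk"]
score = normalized_lcs(predicted_plan, reference_plan)
print(score)
```

Because a subsequence preserves order but allows gaps, the predicted plan above is rewarded for the three steps it gets in the right order and penalised only for the skipped one.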

Title: Semantic-Augmented Latent Topic Modeling with LLM-in-the-Loop

Authors: Mengze Hong, Chen Jason Zhang, Di Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.08498
Pdf URL: https://arxiv.org/pdf/2507.08498
Copy Paste: [[2507.08498]] Semantic-Augmented Latent Topic Modeling with LLM-in-the-Loop(https://arxiv.org/abs/2507.08498)
Keywords: language model, llm
Abstract: Latent Dirichlet Allocation (LDA) is a prominent generative probabilistic model used for uncovering abstract topics within document collections. In this paper, we explore the effectiveness of augmenting topic models with Large Language Models (LLMs) through integration into two key phases: Initialization and Post-Correction. Since the LDA is highly dependent on the quality of its initialization, we conduct extensive experiments on the LLM-guided topic clustering for initializing the Gibbs sampling algorithm. Interestingly, the experimental results reveal that while the proposed initialization strategy improves the early iterations of LDA, it has no effect on the convergence and yields the worst performance compared to the baselines. The LLM-enabled post-correction, on the other hand, achieved a promising improvement of 5.86% in the coherence evaluation. These results highlight the practical benefits of the LLM-in-the-loop approach and challenge the belief that LLMs are always the superior text mining alternative.
摘要：潜在狄利克雷分配（LDA）是一种重要的生成式概率模型，用于发现文档集合中的抽象主题。在本文中，我们通过将大型语言模型（LLM）集成到初始化和后校正这两个关键阶段，探讨了用LLM增强主题模型的有效性。由于LDA高度依赖其初始化的质量，我们对用于初始化Gibbs采样算法的LLM引导主题聚类进行了大量实验。有趣的是，实验结果表明，虽然所提出的初始化策略改善了LDA的早期迭代，但它对收敛没有影响，并且与基线相比性能最差。另一方面，基于LLM的后校正在一致性评估中取得了5.86％的可观提升。这些结果凸显了LLM-in-the-loop方法的实际益处，并对“LLM始终是更优的文本挖掘方案”这一观点提出了质疑。
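The initialization phase described in the abstract above can be pictured with a small collapsed Gibbs sampler for LDA that accepts starting topic assignments instead of drawing them at random. This is a generic textbook-style sketch, not the authors' implementation: the per-document cluster labels stand in for hypothetical LLM-guided clusters, and the hyperparameters `alpha` and `beta`, the toy corpus, and all function names are assumptions.

```python
import numpy as np


def gibbs_lda(docs, n_topics, vocab_size, init_topics, n_iters=50,
              alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA, started from the given topic assignments."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))
    topic_word = np.zeros((n_topics, vocab_size))
    topic_total = np.zeros(n_topics)
    assign = [list(z) for z in init_topics]

    # Build the count tables from the initial assignments.
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = assign[d][i]
            doc_topic[d, k] += 1
            topic_word[k, w] += 1
            topic_total[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = assign[d][i]
                doc_topic[d, k] -= 1
                topic_word[k, w] -= 1
                topic_total[k] -= 1
                # Conditional distribution over topics for this token.
                p = (doc_topic[d] + alpha) * (topic_word[:, w] + beta) \
                    / (topic_total + vocab_size * beta)
                k = int(rng.choice(n_topics, p=p / p.sum()))
                # Add the token back under its newly sampled topic.
                assign[d][i] = k
                doc_topic[d, k] += 1
                topic_word[k, w] += 1
                topic_total[k] += 1
    return assign, topic_word


# Toy corpus of word ids; the cluster labels are hypothetical stand-ins
# for an LLM-guided clustering of the documents.
docs = [[0, 1, 0, 1], [2, 3, 3, 2], [0, 0, 1]]
llm_cluster_labels = [0, 1, 0]
init_topics = [[label] * len(doc) for label, doc in zip(llm_cluster_labels, docs)]
assign, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=4, init_topics=init_topics)
```

Passing the assignments in as an argument keeps the sampler itself unchanged, so a random initialization and a cluster-based one can be compared with the same code path.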

Title: The AI Language Proficiency Monitor -- Tracking the Progress of LLMs on Multilingual Benchmarks

Authors: David Pomerenke, Jonas Nothnagel, Simon Ostermann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.08538
Pdf URL: https://arxiv.org/pdf/2507.08538
Copy Paste: [[2507.08538]] The AI Language Proficiency Monitor -- Tracking the Progress of LLMs on Multilingual Benchmarks(https://arxiv.org/abs/2507.08538)
Keywords: language model, llm
Abstract: To ensure equitable access to the benefits of large language models (LLMs), it is essential to evaluate their capabilities across the world's languages. We introduce the AI Language Proficiency Monitor, a comprehensive multilingual benchmark that systematically assesses LLM performance across up to 200 languages, with a particular focus on low-resource languages. Our benchmark aggregates diverse tasks including translation, question answering, math, and reasoning, using datasets such as FLORES+, MMLU, GSM8K, TruthfulQA, and ARC. We provide an open-source, auto-updating leaderboard and dashboard that supports researchers, developers, and policymakers in identifying strengths and gaps in model performance. In addition to ranking models, the platform offers descriptive insights such as a global proficiency map and trends over time. By complementing and extending prior multilingual benchmarks, our work aims to foster transparency, inclusivity, and progress in multilingual AI. The system is available at this https URL.
摘要：为了确保人们能够公平地受益于大型语言模型（LLM），必须评估它们在世界各种语言上的能力。我们介绍了AI Language Proficiency Monitor，这是一个全面的多语言基准，系统地评估LLM在多达200种语言上的性能，并特别关注低资源语言。我们的基准使用FLORES+、MMLU、GSM8K、TruthfulQA和ARC等数据集，汇总了包括翻译、问答、数学和推理在内的多种任务。我们提供一个开源、自动更新的排行榜和仪表板，帮助研究人员、开发人员和政策制定者识别模型性能的优势与差距。除了对模型进行排名外，该平台还提供描述性洞察，例如全球能力地图和随时间变化的趋势。通过补充和扩展先前的多语言基准，我们的工作旨在促进多语言AI的透明度、包容性和进步。该系统可在此HTTPS URL上找到。

Title: A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1

Authors: Marcin Pietroń, Rafał Olszowski, Jakub Gomułka, Filip Gampel, Andrzej Tomski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.08621
Pdf URL: https://arxiv.org/pdf/2507.08621
Copy Paste: [[2507.08621]] A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1(https://arxiv.org/abs/2507.08621)
Keywords: language model, gpt, llm, prompt, chat, chain-of-thought
Abstract: Argument mining (AM) is an interdisciplinary research field that integrates insights from logic, philosophy, linguistics, rhetoric, law, psychology, and computer science. It involves the automatic identification and extraction of argumentative components, such as premises and claims, and the detection of relationships between them, such as support, attack, or neutrality. Recently, the field has advanced significantly, especially with the advent of large language models (LLMs), which have enhanced the efficiency of analyzing and extracting argument semantics compared to traditional methods and other deep learning models. There are many benchmarks for testing and verifying the quality of LLMs, but there is still a lack of research and results on the operation of these models in publicly available argument classification databases. This paper presents a study of a selection of LLMs, using diverse datasets such as this http URL and UKP. The models tested include versions of GPT, Llama, and DeepSeek, along with reasoning-enhanced variants incorporating the Chain-of-Thoughts algorithm. The results indicate that ChatGPT-4o outperforms the others in the argument classification benchmarks. Among models with built-in reasoning capabilities, Deepseek-R1 performs best. However, despite their superiority, GPT-4o and Deepseek-R1 still make errors. The most common errors are discussed for all models. To our knowledge, the presented work is the first broader analysis of the mentioned datasets using LLMs and prompt algorithms. The work also shows some weaknesses of known prompt algorithms in argument analysis, while indicating directions for their improvement. The added value of the work is the in-depth analysis of the available argument datasets and the demonstration of their shortcomings.
摘要：参数挖掘（AM）是一个跨学科研究领域，它整合了逻辑，哲学，语言学，言论，法律，心理学和计算机科学的见解。它涉及自动识别和提取诸如前提和主张之类的论证组成部分，以及对它们之间的关系（例如支持，攻击或中立性）的检测。最近，该领域已经取得了长足的发展，尤其是随着大型语言模型（LLM）的出现，与传统方法和其他深度学习模型相比，该领域提高了分析和提取论证语义的效率。有许多用于测试和验证LLM质量的基准，但是在公开可用的参数分类数据库中，对于这些模型的操作仍然缺乏研究和结果。本文使用此HTTP URL和UKP等各种数据集介绍了对LLM的选择。测试的模型包括GPT，Llama和DeepSeek的版本，以及结合了经过思考链算法的推理增强的变体。结果表明，在参数分类基准中，CHATGPT-4O优于其他人。如果将模型与推理能力合并在一起，则DeepSeek-R1显示出其优越性。但是，尽管它们优越，但GPT-4O和DeepSeek-R1仍然犯错。所有模型都讨论了最常见的错误。据我们所知，提出的工作是使用LLM和提示算法对上述数据集进行的首次更广泛的分析。这项工作还显示了参数分析中已知迅速算法的一些弱点，同时指示了其改进的方向。工作的附加值是对可用参数数据集的深入分析及其缺点的演示。

Title: KELPS: A Framework for Verified Multi-Language Autoformalization via Semantic-Syntactic Alignment

Authors: Jiyao Zhang, Chengli Zhong, Hui Xu, Qige Li, Yi Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.08665
Pdf URL: https://arxiv.org/pdf/2507.08665
Copy Paste: [[2507.08665]] KELPS: A Framework for Verified Multi-Language Autoformalization via Semantic-Syntactic Alignment(https://arxiv.org/abs/2507.08665)
Keywords: language model, llm
Abstract: Modern large language models (LLMs) show promising progress in formalizing informal mathematics into machine-verifiable theorems. However, these methods still face bottlenecks due to the limited quantity and quality of multilingual parallel corpora. In this paper, we propose a novel neuro-symbolic framework KELPS (Knowledge-Equation based Logical Processing System) to address these problems. KELPS is an iterative framework for translating, synthesizing, and filtering informal data into multiple formal languages (Lean, Coq, and Isabelle). First, we translate natural language into Knowledge Equations (KEs), a novel language that we designed, theoretically grounded in assertional logic. Next, we convert them to target languages through rigorously defined rules that preserve both syntactic structure and semantic meaning. This process yielded a parallel corpus of over 60,000 problems. Our framework achieves 88.9% syntactic accuracy (pass@1) on MiniF2F, outperforming SOTA models such as Deepseek-V3 (81%) and Herald (81.3%) across multiple datasets. All datasets and codes are available in the supplementary materials.
摘要：现代大型语言模型（LLM）在将非形式化数学形式化为机器可验证的定理方面取得了可喜的进展。然而，由于多语言平行语料库的数量和质量有限，这些方法仍面临瓶颈。在本文中，我们提出了一个新颖的神经符号框架KELPS（Knowledge-Equation based Logical Processing System，基于知识方程的逻辑处理系统）来解决这些问题。KELPS是一个迭代框架，用于将非形式化数据翻译、合成并过滤为多种形式语言（Lean、Coq和Isabelle）。首先，我们将自然语言翻译为知识方程（KE），这是我们设计的一种新语言，其理论基础是断言逻辑。接下来，我们通过严格定义的、同时保留句法结构和语义含义的规则，将其转换为目标语言。这一过程产生了包含超过60,000个问题的平行语料库。我们的框架在MiniF2F上达到了88.9％的句法准确率（pass@1），在多个数据集上优于DeepSeek-V3（81％）和Herald（81.3％）等SOTA模型。所有数据集和代码均可在补充材料中获得。

Title: KG-Attention: Knowledge Graph-Guided Attention at Test-Time via Bidirectional Information Aggregation

Authors: Songlin Zhai, Guilin Qi, Yuan Meng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.08704
Pdf URL: https://arxiv.org/pdf/2507.08704
Copy Paste: [[2507.08704]] KG-Attention: Knowledge Graph-Guided Attention at Test-Time via Bidirectional Information Aggregation(https://arxiv.org/abs/2507.08704)
Keywords: language model, llm
Abstract: Knowledge graphs (KGs) play a critical role in enhancing large language models (LLMs) by introducing structured and grounded knowledge into the learning process. However, most existing KG-enhanced approaches rely on parameter-intensive fine-tuning, which risks catastrophic forgetting and degrades the pretrained model's generalization. Moreover, they exhibit limited adaptability to real-time knowledge updates due to their static integration frameworks. To address these issues, we introduce the first test-time KG-augmented framework for LLMs, built around a dedicated knowledge graph-guided attention (KGA) module that enables dynamic knowledge fusion without any parameter updates. The proposed KGA module augments the standard self-attention mechanism with two synergistic pathways: outward and inward aggregation. Specifically, the outward pathway dynamically integrates external knowledge into input representations via input-driven KG fusion. This inward aggregation complements the outward pathway by refining input representations through KG-guided filtering, suppressing task-irrelevant signals and amplifying knowledge-relevant patterns. Importantly, while the outward pathway handles knowledge fusion, the inward path selects the most relevant triples and feeds them back into the fusion process, forming a closed-loop enhancement mechanism. By synergistically combining these two pathways, the proposed method supports real-time knowledge fusion exclusively at test-time, without any parameter modification. Extensive experiments on five benchmarks verify the comparable knowledge fusion performance of KGA.
摘要：知识图（kgs）通过将结构化和扎根的知识引入学习过程中在增强大型语言模型（LLM）中起着至关重要的作用。但是，大多数现有的KG增强方法都取决于参数密集型微调，这有可能造成灾难性的遗忘并降低了预期的模型的概括。此外，由于其静态集成框架，它们对实时知识更新的适应性有限。为了解决这些问题，我们介绍了围绕专用知识指导的注意力（KGA）模块构建的第一个测试时间kg a仪框架，该模块可以实现动态知识融合而无需任何参数更新。提出的KGA模块通过两种协同途径增强了标准的自我发项机制：向外和向内聚集。具体而言，外向途径通过输入驱动的kg融合将外部知识动态整合到输入表示中。这种内部的聚合通过通过KG引导的过滤，抑制任务 - 毫无疑问的信号和扩增与知识相关的模式来完善输入表示，从而补充了外向途径。重要的是，虽然外向途径处理知识融合，但内向路径选择了最相关的三元组，并将其馈回融合过程，形成了闭环增强机制。通过协同结合这两种途径，该提出的方法支持实时知识在测试时仅在没有任何参数修改的情况下进行融合。对五个基准测试的广泛实验验证了KGA的可比知识融合性能。

Title: Multilingual Multimodal Software Developer for Code Generation

Authors: Linzheng Chai, Jian Yang, Shukai Liu, Wei Zhang, Liran Wang, Ke Jin, Tao Sun, Congnan Liu, Chenchen Zhang, Hualei Zhu, Jiaheng Liu, Xianjie Wu, Ge Zhang, Tianyu Liu, Zhoujun Li
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2507.08719
Pdf URL: https://arxiv.org/pdf/2507.08719
Copy Paste: [[2507.08719]] Multilingual Multimodal Software Developer for Code Generation(https://arxiv.org/abs/2507.08719)
Keywords: language model, llm
Abstract: The rapid advancement of Large Language Models (LLMs) has significantly improved code generation, yet most models remain text-only, neglecting crucial visual aids like diagrams and flowcharts used in real-world software development. To bridge this gap, we introduce MM-Coder, a Multilingual Multimodal software developer. MM-Coder integrates visual design inputs, namely Unified Modeling Language (UML) diagrams and flowcharts (termed Visual Workflow), with textual instructions to enhance code generation accuracy and architectural alignment. To enable this, we developed MMc-Instruct, a diverse multimodal instruction-tuning dataset including visual-workflow-based code generation, allowing MM-Coder to synthesize textual and graphical information like human developers, distinct from prior work on narrow tasks. Furthermore, we introduce MMEval, a new benchmark for evaluating multimodal code generation, addressing existing text-only limitations. Our evaluations using MMEval highlight significant remaining challenges for models in precise visual information capture, instruction following, and advanced programming knowledge. Our work aims to revolutionize industrial programming by enabling LLMs to interpret and implement complex specifications conveyed through both text and visual designs.
摘要：大型语言模型（LLMS）的快速发展已经显着改善了代码的生成，但是大多数模型仍然仅文本，忽略了现实世界软件开发中使用的图表和流程图，例如图表和流程图。为了弥合这一差距，我们介绍了MM-Coder，这是一种多语言多模式软件开发人员。 MM-Coder集成了视觉设计输入统一的建模语言（UML）图和流程图（称为Visual Workflow） - 具有文本指令，以增强代码生成的准确性和架构对齐。为了实现这一目标，我们开发了MMC Instruct，这是一种多样化的多模式指令调节数据集，包括基于视觉工作流的代码生成，允许MM-Coder综合人类开发人员（如人类开发人员），与先前在狭窄任务上的工作不同。此外，我们推出了MMEVAL，这是一种用于评估多模式代码生成的新基准，以解决现有的仅文本限制。我们使用MMEVAL的评估强调了模型的剩余挑战，以精确的视觉信息捕获，指导和高级编程知识。我们的工作旨在通过使LLMS解释和实施通过文本和视觉设计传达的复杂规范来彻底改变工业编程。

Title: KV Cache Steering for Inducing Reasoning in Small Language Models

Authors: Max Belitsky, Dawid J. Kopiczko, Michael Dorkenwald, M. Jehanzeb Mirza, Cees G. M. Snoek, Yuki M. Asano
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.08799
Pdf URL: https://arxiv.org/pdf/2507.08799
Copy Paste: [[2507.08799]] KV Cache Steering for Inducing Reasoning in Small Language Models(https://arxiv.org/abs/2507.08799)
Keywords: language model, gpt, prompt, chain-of-thought
Abstract: We propose cache steering, a lightweight method for implicit steering of language models via a one-shot intervention applied directly to the key-value cache. To validate its effectiveness, we apply cache steering to induce chain-of-thought reasoning in small language models. Our approach leverages GPT-4o-generated reasoning traces to construct steering vectors that shift model behavior toward more explicit, multi-step reasoning without fine-tuning or prompt modifications. Experimental evaluations on diverse reasoning benchmarks demonstrate that cache steering improves both the qualitative structure of model reasoning and quantitative task performance. Compared to prior activation steering techniques that require continuous interventions, our one-shot cache steering offers substantial advantages in terms of hyperparameter stability, inference-time efficiency, and ease of integration, making it a more robust and practical solution for controlled generation.
摘要：我们提出了缓存转向，这是一种通过直接应用于键值缓存的单次干预来隐式转向语言模型的轻量级方法。为了验证其有效性，我们应用缓存转向以在小语言模型中诱导经过思考的推理。我们的方法利用GPT-4O生成的推理轨迹来构建转向向量，这些向量将模型行为转移到更明确的，多步推理的情况下，而无需微调或及时修改。对各种推理基准的实验评估表明，缓存转向既改善了模型推理的定性结构和定量任务绩效。与需要连续干预的先前激活转向技术相比，我们的单发缓存转向方向在超级参数稳定性，推理时间效率和易于整合方面具有很大的优势，使其成为受控生成的更强大和实用的解决方案。
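The one-shot intervention on the key-value cache described in the abstract above can be illustrated with a toy NumPy sketch. Everything specific here is an assumption made for illustration: the steering vector is built as a mean difference between two sets of cached vectors, the cache is a single `(seq_len, head_dim)` array for one head of one layer, the vector is added only at the last position, and the scaling coefficients are arbitrary. The abstract does not specify these details, and the function names are hypothetical.

```python
import numpy as np


def steering_vector(positive, negative):
    """Mean difference between two sets of cached vectors, each of shape (n, head_dim)."""
    return positive.mean(axis=0) - negative.mean(axis=0)


def steer_cache(keys, values, key_vec, value_vec, position=-1,
                key_scale=1.0, value_scale=1.0):
    """Return copies of the cache with the steering vectors added once at `position`."""
    steered_keys = keys.copy()
    steered_values = values.copy()
    steered_keys[position] += key_scale * key_vec
    steered_values[position] += value_scale * value_vec
    return steered_keys, steered_values


rng = np.random.default_rng(0)
head_dim, seq_len = 4, 6

# Toy stand-ins for cached vectors collected from prompts with and without
# reasoning traces (random data, not real model activations).
reasoning_keys = rng.normal(size=(8, head_dim)) + 1.0
plain_keys = rng.normal(size=(8, head_dim))
reasoning_values = rng.normal(size=(8, head_dim)) + 1.0
plain_values = rng.normal(size=(8, head_dim))

key_vec = steering_vector(reasoning_keys, plain_keys)
value_vec = steering_vector(reasoning_values, plain_values)

# A toy cache for the prompt being steered.
keys = rng.normal(size=(seq_len, head_dim))
values = rng.normal(size=(seq_len, head_dim))
steered_keys, steered_values = steer_cache(
    keys, values, key_vec, value_vec, key_scale=0.5, value_scale=2.0
)
```

Because the cache is modified once before decoding starts, later tokens attend to the shifted entry without any further intervention at each generation step, which is the contrast with continuous activation steering that the abstract draws.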