2025-07-28

Title: Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement

Authors: Víctor Gallego
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.18742
Pdf URL: https://arxiv.org/pdf/2507.18742
Copy Paste: [[2507.18742]] Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement(https://arxiv.org/abs/2507.18742)
Keywords: language model, agent
Abstract: Language models (LMs) are susceptible to in-context reward hacking, where they exploit flaws in tainted or faulty written specifications or rubrics to achieve high scores without fulfilling the user's true intent. We introduce Specification Self-Correction (SSC), a novel, test-time framework that enables an LM to identify and correct flaws within its own guiding specification. SSC employs a multi-step inference process where the model first generates a response based on a potentially tainted specification, critiques its output, and then revises the specification itself to remove the exploitable loophole. A final, more robust response is then generated using this self-corrected specification. Across experiments spanning creative writing and agentic coding tasks with several LMs, we demonstrate that while models initially game tainted specifications in 50-70% of cases, the SSC process reduces this vulnerability by over 90%. This dynamic repair occurs at inference time, requires no weight modification, and leads to more robustly aligned model behavior. Code at this https URL.
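
The four inference steps named in the abstract (respond, critique, revise the specification, respond again) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the function name, and every prompt string are assumptions introduced here.

```python
from typing import Callable, Dict

# Hypothetical interface: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def specification_self_correction(llm: LLM, task: str, spec: str) -> Dict[str, str]:
    """Run the four SSC steps with illustrative (assumed) prompts."""
    # Step 1: respond under the original, possibly tainted specification.
    initial = llm(f"Specification:\n{spec}\n\nTask:\n{task}\n\nResponse:")

    # Step 2: critique that response against the specification.
    critique = llm(
        f"Specification:\n{spec}\n\nResponse:\n{initial}\n\n"
        "Critique the response. Does it exploit a flaw in the specification?"
    )

    # Step 3: revise the specification itself to remove the loophole.
    revised_spec = llm(
        f"Specification:\n{spec}\n\nCritique:\n{critique}\n\n"
        "Rewrite the specification so the flaw can no longer be exploited."
    )

    # Step 4: respond again, now guided by the self-corrected specification.
    final = llm(f"Specification:\n{revised_spec}\n\nTask:\n{task}\n\nResponse:")

    return {
        "initial_response": initial,
        "critique": critique,
        "revised_spec": revised_spec,
        "final_response": final,
    }
```

All four calls go to the same model and no weights change, which matches the abstract's description of a purely test-time procedure.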

Title: Evaluating Code-Mixing in LLMs Across 18 Languages

Authors: Yilun Yang, Yekun Chai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.18791
Pdf URL: https://arxiv.org/pdf/2507.18791
Copy Paste: [[2507.18791]] Evaluating Code-Mixing in LLMs Across 18 Languages(https://arxiv.org/abs/2507.18791)
Keywords: language model, gpt, llm, prompt
Abstract: Code-mixing, the practice of switching between languages within a conversation, presents unique challenges for traditional natural language processing. Existing benchmarks, such as LinCE and GLUECoS, are limited by narrow language pairings and tasks, failing to adequately evaluate the code-mixing capabilities of large language models (LLMs). Despite the significance of code-mixing for multilingual users, research on LLMs in this context remains limited. Additionally, current methods for generating code-mixed data are underdeveloped. In this paper, we conduct a comprehensive evaluation of LLMs' performance on code-mixed data across 18 languages from seven language families. We also propose a novel approach for generating synthetic code-mixed texts by combining word substitution with GPT-4 prompting. Our analysis reveals consistent underperformance of LLMs on code-mixed datasets involving multiple language families. We suggest that improvements in training data size, model scale, and few-shot learning could enhance their performance.

Title: PrismRAG: Boosting RAG Factuality with Distractor Resilience and Strategized Reasoning

Authors: Mohammad Kachuee, Teja Gollapudi, Minseok Kim, Yin Huang, Kai Sun, Xiao Yang, Jiaqi Wang, Nirav Shah, Yue Liu, Aaron Colak, Anuj Kumar, Wen-tau Yih, Xin Luna Dong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.18857
Pdf URL: https://arxiv.org/pdf/2507.18857
Copy Paste: [[2507.18857]] PrismRAG: Boosting RAG Factuality with Distractor Resilience and Strategized Reasoning(https://arxiv.org/abs/2507.18857)
Keywords: llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) often falls short when retrieved context includes confusing semi-relevant passages, or when answering questions requires deep contextual understanding and reasoning. We propose an efficient fine-tuning framework, called PrismRAG, that (i) trains the model with distractor-aware QA pairs mixing gold evidence with subtle distractor passages, and (ii) instills reasoning-centric habits that make the LLM plan, rationalize, and synthesize without relying on extensive human-engineered instructions. Evaluated across 12 open-book RAG QA benchmarks spanning diverse application domains and scenarios, PrismRAG improves average factuality by 5.4%, outperforming state-of-the-art solutions.

Title: MindFlow+: A Self-Evolving Agent for E-Commerce Customer Service

Authors: Ming Gong, Xucheng Huang, Ziheng Xu, Vijayan K. Asari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.18884
Pdf URL: https://arxiv.org/pdf/2507.18884
Copy Paste: [[2507.18884]] MindFlow+: A Self-Evolving Agent for E-Commerce Customer Service(https://arxiv.org/abs/2507.18884)
Keywords: language model, llm, agent
Abstract: High-quality dialogue is crucial for e-commerce customer service, yet traditional intent-based systems struggle with dynamic, multi-turn interactions. We present MindFlow+, a self-evolving dialogue agent that learns domain-specific behavior by combining large language models (LLMs) with imitation learning and offline reinforcement learning (RL). MindFlow+ introduces two data-centric mechanisms to guide learning: tool-augmented demonstration construction, which exposes the model to knowledge-enhanced and agentic (ReAct-style) interactions for effective tool use; and reward-conditioned data modeling, which aligns responses with task-specific goals using reward signals. To evaluate the model's role in response generation, we introduce the AI Contribution Ratio, a novel metric quantifying AI involvement in dialogue. Experiments on real-world e-commerce conversations show that MindFlow+ outperforms strong baselines in contextual relevance, flexibility, and task accuracy. These results demonstrate the potential of combining LLMs, tool reasoning, and reward-guided learning to build domain-specialized, context-aware dialogue systems.

Title: REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research?

Authors: Chuxuan Hu, Liyun Zhang, Yeji Lim, Aum Wadhwani, Austin Peters, Daniel Kang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.18901
Pdf URL: https://arxiv.org/pdf/2507.18901
Copy Paste: [[2507.18901]] REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research?(https://arxiv.org/abs/2507.18901)
Keywords: agent
Abstract: Assessing the reproducibility of social science papers is essential for promoting rigor in research processes, but manual assessment is costly. With recent advances in agentic AI systems (i.e., AI agents), we seek to evaluate their capability to automate this process. However, existing benchmarks for reproducing research papers (1) focus solely on reproducing results using provided code and data without assessing their consistency with the paper, (2) oversimplify real-world scenarios, and (3) lack necessary diversity in data formats and programming languages. To address these issues, we introduce REPRO-Bench, a collection of 112 task instances, each representing a social science paper with a publicly available reproduction report. The agents are tasked with assessing the reproducibility of the paper based on the original paper PDF and the corresponding reproduction package. REPRO-Bench features end-to-end evaluation tasks on the reproducibility of social science papers with complexity comparable to real-world assessments. We evaluate three representative AI agents on REPRO-Bench, with the best-performing agent achieving an accuracy of only 21.4%. Building on our empirical analysis, we develop REPRO-Agent, which improves the highest accuracy achieved by existing agents by 71%. We conclude that more advanced AI agents should be developed to automate real-world reproducibility assessment. REPRO-Bench is publicly available at this https URL.
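
The abstract does not state whether the 71% improvement is relative or absolute. Reading it as a relative gain over the best existing agent's 21.4% (an assumption made here, not a figure quoted from the paper), the implied accuracy works out as below; the helper name is hypothetical.

```python
def apply_relative_gain(baseline_pct: float, relative_gain_pct: float) -> float:
    """Return the accuracy implied by a relative gain over a baseline (both in percent)."""
    return baseline_pct * (1 + relative_gain_pct / 100)


# Assumption: the 71% in the abstract is relative to the 21.4% baseline.
implied = apply_relative_gain(21.4, 71)
print(round(implied, 1))  # prints 36.6
```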

Title: SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models

Authors: Hongyuan Lu, Zixuan Li, Zefan Zhang, Wai Lam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.18902
Pdf URL: https://arxiv.org/pdf/2507.18902
Copy Paste: [[2507.18902]] SLoW: Select Low-frequency Words! Automatic Dictionary Selection for Translation on Large Language Models(https://arxiv.org/abs/2507.18902)
Keywords: language model, gpt, llm, prompt, chat
Abstract: There are more than 7,000 languages around the world, and current Large Language Models (LLMs) only support hundreds of languages. Dictionary-based prompting methods can enhance translation on them, but most methods use all the available dictionaries, which could be expensive. Instead, it would be more flexible to trade off token consumption against translation performance. This paper proposes a novel task called Automatic Dictionary Selection (ADS). The goal of the task is to automatically select which dictionary to use to enhance translation. We propose a novel and effective method, which we call Select Low-frequency Words! (SLoW), that selects those dictionaries that have a lower frequency. Our methods have unique advantages. First, there is no need for access to the training data for frequency estimation (which is usually unavailable). Second, it inherits the advantage of dictionary-based methods, where no additional tuning is required on LLMs. Experimental results on 100 languages from FLORES indicate that SLoW surpasses strong baselines and clearly reduces token usage, with many languages even surpassing the translation performance of the full dictionary baseline. Footnotes: (1) A striking fact is that there is no need to use the actual training data (often unobtainable) for frequency estimation; an estimated frequency obtained using public resources is still effective in improving translation with ChatGPT, Llama, and DeepSeek. (2) Code and data available upon publication.
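
The selection idea in the abstract (keep dictionary hints only for lower-frequency words, with frequencies estimated from public text rather than the model's training data) might look like the sketch below. The function name, the token `budget` parameter, and the tie-breaking rule are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter
from typing import Dict, Iterable, List


def select_low_frequency_entries(
    source_words: List[str],
    dictionary: Dict[str, str],
    corpus_tokens: Iterable[str],
    budget: int,
) -> Dict[str, str]:
    """Pick up to `budget` dictionary entries for the rarest words in the source sentence."""
    # Frequencies come from any public corpus, standing in for the unavailable training data.
    freq = Counter(corpus_tokens)

    # Only words with a dictionary entry can be hinted; deduplicate while keeping sentence order.
    candidates = list(dict.fromkeys(w for w in source_words if w in dictionary))

    # Rarest first. sorted() is stable, so ties keep their order in the sentence.
    ranked = sorted(candidates, key=lambda w: freq[w])

    return {w: dictionary[w] for w in ranked[:budget]}
```

The returned entries would then be placed in the translation prompt in place of the full dictionary, which is where the token savings described in the abstract would come from.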

Title: Large language models provide unsafe answers to patient-posed medical questions

Authors: Rachel L. Draelos, Samina Afreen, Barbara Blasko, Tiffany Brazile, Natasha Chase, Dimple Desai, Jessica Evert, Heather L. Gardner, Lauren Herrmann, Aswathy Vaikom House, Stephanie Kass, Marianne Kavan, Kirshma Khemani, Amanda Koire, Lauren M. McDonald, Zahraa Rabeeah, Amy Shah
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2507.18905
Pdf URL: https://arxiv.org/pdf/2507.18905
Copy Paste: [[2507.18905]] Large language models provide unsafe answers to patient-posed medical questions(https://arxiv.org/abs/2507.18905)
Keywords: language model, gpt, llm, chat
Abstract: Millions of patients are already using large language model (LLM) chatbots for medical advice on a regular basis, raising patient safety concerns. This physician-led red-teaming study compares the safety of four publicly available chatbots--Claude by Anthropic, Gemini by Google, GPT-4o by OpenAI, and Llama3-70B by Meta--on a new dataset, HealthAdvice, using an evaluation framework that enables quantitative and qualitative analysis. In total, 888 chatbot responses are evaluated for 222 patient-posed advice-seeking medical questions on primary care topics spanning internal medicine, women's health, and pediatrics. We find statistically significant differences between chatbots. The rate of problematic responses varies from 21.6 percent (Claude) to 43.2 percent (Llama), with unsafe responses varying from 5 percent (Claude) to 13 percent (GPT-4o, Llama). Qualitative results reveal chatbot responses with the potential to lead to serious patient harm. This study suggests that millions of patients could be receiving unsafe medical advice from publicly available chatbots, and further work is needed to improve the clinical safety of these powerful tools.

Title: A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems: Progress, Gaps, and Future Directions

Authors: Agada Joseph Oche, Ademola Glory Folashade, Tirthankar Ghosal, Arpan Biswas
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.18910
Pdf URL: https://arxiv.org/pdf/2507.18910
Copy Paste: [[2507.18910]] A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems: Progress, Gaps, and Future Directions(https://arxiv.org/abs/2507.18910)
Keywords: language model, llm, hallucination, retrieval-augmented generation, agent
Abstract: Retrieval-Augmented Generation (RAG) represents a major advancement in natural language processing (NLP), combining large language models (LLMs) with information retrieval systems to enhance factual grounding, accuracy, and contextual relevance. This paper presents a comprehensive systematic review of RAG, tracing its evolution from early developments in open domain question answering to recent state-of-the-art implementations across diverse applications. The review begins by outlining the motivations behind RAG, particularly its ability to mitigate hallucinations and outdated knowledge in parametric models. Core technical components (retrieval mechanisms, sequence-to-sequence generation models, and fusion strategies) are examined in detail. A year-by-year analysis highlights key milestones and research trends, providing insight into RAG's rapid growth. The paper further explores the deployment of RAG in enterprise systems, addressing practical challenges related to retrieval of proprietary data, security, and scalability. A comparative evaluation of RAG implementations is conducted, benchmarking performance on retrieval accuracy, generation fluency, latency, and computational efficiency. Persistent challenges such as retrieval quality, privacy concerns, and integration overhead are critically assessed. Finally, the review highlights emerging solutions, including hybrid retrieval approaches, privacy-preserving techniques, optimized fusion strategies, and agentic RAG architectures. These innovations point toward a future of more reliable, efficient, and context-aware knowledge-intensive NLP systems.

Title: Mining Contextualized Visual Associations from Images for Creativity Understanding

Authors: Ananya Sahu, Amith Ananthram, Kathleen McKeown
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2507.18915
Pdf URL: https://arxiv.org/pdf/2507.18915
Copy Paste: [[2507.18915]] Mining Contextualized Visual Associations from Images for Creativity Understanding(https://arxiv.org/abs/2507.18915)
Keywords: language model
Abstract: Understanding another person's creative output requires a shared language of association. However, when training vision-language models such as CLIP, we rely on web-scraped datasets containing short, predominantly literal alt-text. In this work, we introduce a method for mining contextualized associations for salient visual elements in an image that can scale to any unlabeled dataset. Given an image, we can use these mined associations to generate high-quality creative captions at increasing degrees of abstraction. With our method, we produce a new dataset of visual associations and 1.7M creative captions for the images in MSCOCO. Human evaluation confirms that these captions remain visually grounded while exhibiting recognizably increasing abstraction. Moreover, fine-tuning a visual encoder on this dataset yields meaningful improvements in zero-shot image-text retrieval in two creative domains: poetry and metaphor visualization. We release our dataset, our generation code and our models for use by the broader community.

Title: Uncovering Cross-Linguistic Disparities in LLMs using Sparse Autoencoders

Authors: Richmond Sin Jing Xuan, Jalil Huseynov, Yang Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.18918
Pdf URL: https://arxiv.org/pdf/2507.18918
Copy Paste: [[2507.18918]] Uncovering Cross-Linguistic Disparities in LLMs using Sparse Autoencoders(https://arxiv.org/abs/2507.18918)
Keywords: language model, llm
Abstract: Multilingual large language models (LLMs) exhibit strong cross-linguistic generalization, yet medium- to low-resource languages underperform on common benchmarks such as ARC-Challenge, MMLU, and HellaSwag. We analyze activation patterns in Gemma-2-2B across all 26 residual layers and 10 languages: Chinese (zh), Russian (ru), Spanish (es), and Italian (it); the medium- to low-resource languages Indonesian (id), Catalan (ca), Marathi (mr), Malayalam (ml), and Hindi (hi); and English (en) as the reference. Using Sparse Autoencoders (SAEs), we reveal systematic disparities in activation patterns. Medium- to low-resource languages receive up to 26.27 percent lower activations in early layers, with a persistent gap of 19.89 percent in deeper layers. To address this, we apply activation-aware fine-tuning via Low-Rank Adaptation (LoRA), leading to substantial activation gains, such as 87.69 percent for Malayalam and 86.32 percent for Hindi, while maintaining English retention at approximately 91 percent. After fine-tuning, benchmark results show modest but consistent improvements, highlighting activation alignment as a key factor in enhancing multilingual LLM performance.

Title: A Similarity Measure for Comparing Conversational Dynamics

Authors: Sang Min Jung, Kaixiang Zhang, Cristian Danescu-Niculescu-Mizil
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.18956
Pdf URL: https://arxiv.org/pdf/2507.18956
Copy Paste: [[2507.18956]] A Similarity Measure for Comparing Conversational Dynamics(https://arxiv.org/abs/2507.18956)
Keywords: agent
Abstract: The quality of a conversation goes beyond the individual quality of each reply, and instead emerges from how these combine into interactional patterns that give the conversation its distinctive overall "shape". However, there is no robust automated method for comparing conversations in terms of their overall interactional dynamics. Such methods could enhance the analysis of conversational data and help evaluate conversational agents more holistically. In this work, we introduce a similarity measure for comparing conversations with respect to their dynamics. We design a validation framework for testing the robustness of the metric in capturing differences in conversation dynamics and for assessing its sensitivity to the topic of the conversations. Finally, to illustrate the measure's utility, we use it to analyze conversational dynamics in a large online community, bringing new insights into the role of situational power in conversations.

Title: A Toolbox, Not a Hammer -- Multi-TAG: Scaling Math Reasoning with Multi-Tool Aggregation

Authors: Bohan Yao, Vikas Yadav
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.18973
Pdf URL: https://arxiv.org/pdf/2507.18973
Copy Paste: [[2507.18973]] A Toolbox, Not a Hammer -- Multi-TAG: Scaling Math Reasoning with Multi-Tool Aggregation(https://arxiv.org/abs/2507.18973)
Keywords: language model, llm
Abstract: Augmenting large language models (LLMs) with external tools is a promising avenue for developing high-performance mathematical reasoning systems. Prior tool-augmented approaches typically finetune an LLM to select and invoke a single tool at each reasoning step and show promising results on simpler math reasoning benchmarks such as GSM8K. However, these approaches struggle with more complex math problems that require precise reasoning over multiple steps. To address this limitation, in this work, we propose Multi-TAG, a Multi-Tool AGgregation-based framework. Instead of relying on a single tool, Multi-TAG guides an LLM to concurrently invoke multiple tools at each reasoning step. It then aggregates their diverse outputs to verify and refine the reasoning process, enhancing solution robustness and accuracy. Notably, Multi-TAG is a finetuning-free, inference-only framework, making it readily applicable to any LLM backbone, including large open-weight models, which are computationally expensive to finetune, and proprietary frontier models, which cannot be finetuned with custom recipes. We evaluate Multi-TAG on four challenging benchmarks: MATH500, AIME, AMC, and OlympiadBench. Across both open-weight and closed-source LLM backbones, Multi-TAG consistently and substantially outperforms state-of-the-art baselines, with average improvements of 6.0% to 7.5%.
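
One reasoning step of the kind the abstract describes (call several tools on the same sub-problem, then aggregate their outputs) can be sketched as follows. The abstract does not specify the aggregation rule; majority voting is swapped in here as an assumption, and the `Tool` type and function name are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, Optional, Tuple

# Hypothetical tool interface: takes the current sub-problem, returns an answer or None.
Tool = Callable[[str], Optional[str]]


def aggregate_step(question: str, tools: Dict[str, Tool]) -> Tuple[Optional[str], float]:
    """Invoke every tool on one step and return (majority answer, agreement ratio)."""
    outputs: Dict[str, str] = {}
    for name, tool in tools.items():
        try:
            result = tool(question)
        except Exception:
            # A failing tool abstains instead of aborting the whole step.
            result = None
        if result is not None:
            outputs[name] = result

    if not outputs:
        return None, 0.0

    # Assumed aggregation rule: the most common answer wins.
    answer, votes = Counter(outputs.values()).most_common(1)[0]
    return answer, votes / len(outputs)
```

A low agreement ratio could be used as a signal to re-run or refine the step, in the spirit of the "verify and refine" wording in the abstract.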

Title: Debating Truth: Debate-driven Claim Verification with Multiple Large Language Model Agents

Authors: Haorui He, Yupeng Li, Dacheng Wen, Reynold Cheng, Francis C. M. Lau
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.19090
Pdf URL: https://arxiv.org/pdf/2507.19090
Copy Paste: [[2507.19090]] Debating Truth: Debate-driven Claim Verification with Multiple Large Language Model Agents(https://arxiv.org/abs/2507.19090)
Keywords: language model, llm, agent
Abstract: Claim verification is critical for enhancing digital literacy. However, state-of-the-art single-LLM methods struggle with complex claim verification that involves multi-faceted evidence. Inspired by real-world fact-checking practices, we propose DebateCV, the first claim verification framework that adopts a debate-driven methodology using multiple LLM agents. In our framework, two Debaters take opposing stances on a claim and engage in multi-round argumentation, while a Moderator evaluates the arguments and renders a verdict with justifications. To further improve the performance of the Moderator, we introduce a novel post-training strategy that leverages synthetic debate data generated by the zero-shot DebateCV, effectively addressing the scarcity of real-world debate-driven claim verification data. Experimental results show that our method outperforms existing claim verification methods under varying levels of evidence quality. Our code and dataset are publicly available at this https URL.
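
The debate structure in the abstract (two Debaters with opposing stances arguing over several rounds, then a Moderator issuing a verdict) can be sketched as a simple loop. The callable signatures, the stance labels, and the default round count are assumptions for illustration; in practice each role would wrap an LLM call.

```python
from typing import Callable, List, Tuple

# Hypothetical role interfaces: each receives the claim, the evidence, and the debate so far.
Turn = Tuple[int, str, str]  # (round number, stance, argument)
Debater = Callable[[str, List[str], List[Turn]], str]
Moderator = Callable[[str, List[str], List[Turn]], str]


def debate_verify(
    claim: str,
    evidence: List[str],
    affirmative: Debater,
    negative: Debater,
    moderator: Moderator,
    rounds: int = 2,
) -> Tuple[str, List[Turn]]:
    """Run a multi-round debate and return (verdict, transcript)."""
    transcript: List[Turn] = []
    for round_no in range(1, rounds + 1):
        # Within a round, the supporting side speaks first and the refuting side responds.
        for stance, debater in (("support", affirmative), ("refute", negative)):
            # Pass a copy so a debater cannot alter the shared record.
            argument = debater(claim, evidence, list(transcript))
            transcript.append((round_no, stance, argument))

    # The moderator sees the full exchange before deciding.
    verdict = moderator(claim, evidence, list(transcript))
    return verdict, transcript
```

The transcripts produced by such a loop are the kind of synthetic debate data the abstract says is reused to post-train the Moderator.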

Title: An Empirical Investigation of Gender Stereotype Representation in Large Language Models: The Italian Case

Authors: Gioele Giachino, Marco Rondina, Antonio Vetrò, Riccardo Coppola, Juan Carlos De Martin
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2507.19156
Pdf URL: https://arxiv.org/pdf/2507.19156
Copy Paste: [[2507.19156]] An Empirical Investigation of Gender Stereotype Representation in Large Language Models: The Italian Case(https://arxiv.org/abs/2507.19156)
Keywords: language model, gpt, llm, prompt, chat
Abstract: The increasing use of Large Language Models (LLMs) in a large variety of domains has sparked worries about how easily they can perpetuate stereotypes and contribute to the generation of biased content. With a focus on gender and professional bias, this work examines how LLMs shape responses to ungendered prompts, contributing to biased outputs. This analysis uses a structured experimental method, giving different prompts involving three different professional job combinations, which are also characterized by a hierarchical relationship. This study uses Italian, a language with extensive grammatical gender differences, to highlight potential limitations in current LLMs' ability to generate objective text in non-English languages. Two popular LLM-based chatbots are examined, namely OpenAI ChatGPT (gpt-4o-mini) and Google Gemini (gemini-1.5-flash). Through APIs, we collected 3,600 responses. The results highlight how content generated by LLMs can perpetuate stereotypes. For example, Gemini associated 100% (ChatGPT 97%) of 'she' pronouns with the 'assistant' rather than the 'manager'. The presence of bias in AI-generated text can have significant implications in many fields, such as the workplace or job selection, raising ethical concerns about its use. Understanding these risks is pivotal to developing mitigation strategies and assuring that AI-based systems do not increase social inequalities, but rather contribute to more equitable outcomes. Future research directions include expanding the study to additional chatbots or languages, refining prompt engineering methods, or further exploiting a larger experimental base.

Title: Can Small-Scale Data Poisoning Exacerbate Dialect-Linked Biases in Large Language Models?

Authors: Chaymaa Abbas, Mariette Awad, Razane Tajeddine
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.19195
Pdf URL: https://arxiv.org/pdf/2507.19195
Copy Paste: [[2507.19195]] Can Small-Scale Data Poisoning Exacerbate Dialect-Linked Biases in Large Language Models?(https://arxiv.org/abs/2507.19195)
Keywords: language model, gpt, llm
Abstract: Despite the ongoing improvements in the design of large language models (LLMs) to foster inclusion and balanced responses, these systems remain susceptible to encoding and amplifying social biases. This study examines how dialectal variation, specifically African American Vernacular English (AAVE) versus Standard American English (SAE), interacts with data poisoning to influence toxicity in outputs. Using both small- and medium-scale LLaMA models, we show that even minimal exposure to poisoned data significantly increases toxicity for AAVE inputs, while it remains comparatively unaffected for SAE. Larger models exhibit a more significant amplification effect which suggests heightened susceptibility with scale. To further assess these disparities, we employed GPT-4o as a fairness auditor, which identified harmful stereotypical patterns disproportionately tied to AAVE inputs, including portrayals of aggression, criminality, and intellectual inferiority. These findings underscore the compounding impact of data poisoning and dialectal bias and emphasize the need for dialect-aware evaluation, targeted debiasing interventions, and socially responsible training protocols during development.

Title: How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework

Authors: Zi Liang, Liantong Yu, Shiyu Zhang, Qingqing Ye, Haibo Hu
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2507.19219
Pdf URL: https://arxiv.org/pdf/2507.19219
Copy Paste: [[2507.19219]] How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework(https://arxiv.org/abs/2507.19219)
Keywords: language model, llm
Abstract: Overestimation in evaluating large language models (LLMs) has become an increasing concern. Due to the contamination of public benchmarks or imbalanced model training, LLMs may achieve unreal evaluation results on public benchmarks, either intentionally or unintentionally, which leads to unfair comparisons among LLMs and undermines their realistic capability assessments. Existing benchmarks attempt to address these issues by keeping test cases permanently secret, mitigating contamination through human evaluation, or repeatedly collecting and constructing new samples. However, these approaches fail to ensure reproducibility, transparency, and high efficiency simultaneously. Moreover, the extent of overestimation in current LLMs remains unquantified. To address these issues, we propose ArxivRoll, a dynamic evaluation framework inspired by one-time pad encryption in cryptography. ArxivRoll comprises two key components: (i) SCP (Sequencing, Cloze, and Prediction), an automated generator for private test cases, and (ii) Rugged Scores (RS), metrics that measure the proportion of public benchmark contamination and training bias. Leveraging SCP, ArxivRoll constructs a new benchmark every six months using recent articles from arXiv and employs them for one-time evaluations of LLM performance. Extensive experiments demonstrate the high quality of our benchmark, and we provide a systematic evaluation of current LLMs. The source code is available at this https URL.
摘要：评估大语模型（LLM）的高估已成为越来越多的关注点。由于公共基准或不平衡模型培训的污染，LLMS可能有意或无意间就可以在公共基准上获得不真实的评估结果，这会导致LLMS之间的不公平比较并破坏其现实能力评估。现有的基准试图通过将测试案件永久保密，通过人类评估来缓解污染或反复收集和构建新样本来解决这些问题。但是，这些方法无法同时确保可重复性，透明度和高效率。此外，当前LLM中高估的程度仍然没有量化。为了解决这些问题，我们提出了Arxivroll，这是一个动态评估框架，灵感来自密码学中的一次性PAD加密。 arXivroll包括两个关键组件：\ emph {i）scp（测序，披肩和预测）}，一种用于私人测试用例的自动化发电机，以及\ emph {ii）崎rugged的分数（rs）}，量度，衡量公共基础基础标准的Contamination Contamination Contamination and Training Bilias的度量。 Arxivroll利用SCP，每六个月使用Arxiv的文章构建一个新的基准测试，并采用它们进行LLM性能的一次性评估。广泛的实验证明了我们的基准高质量，我们对当前LLM进行了系统的评估。源代码可在此HTTPS URL上找到。
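
The abstract names cloze items as one of the SCP test-case types but does not describe how they are generated. As a rough illustration only, a cloze question could be built by masking one word of a sentence; the function name `make_cloze`, the mask string, and the random word choice are all assumptions and not ArxivRoll's actual generator.

```python
import random


def make_cloze(sentence: str, rng: random.Random, mask: str = "____"):
    """Mask one randomly chosen word; return (question, answer).

    Hypothetical helper for illustration, not part of ArxivRoll.
    """
    words = sentence.split()
    if not words:
        raise ValueError("sentence must contain at least one word")
    idx = rng.randrange(len(words))
    answer = words[idx]
    question = " ".join(mask if i == idx else w for i, w in enumerate(words))
    return question, answer


source = "ArxivRoll builds a new benchmark every six months"
question, answer = make_cloze(source, random.Random(0))
print(question)
print("answer:", answer)
```

Seeding the generator keeps a given item reproducible, while a fresh seed per evaluation round would yield new items from the same article.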

Title: Jailbreaking Large Language Diffusion Models: Revealing Hidden Safety Flaws in Diffusion-Based Text Generation

Authors: Yuanhe Zhang, Fangzhou Xie, Zhenhong Zhou, Zherui Li, Hao Chen, Kun Wang, Yufei Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.19227
Pdf URL: https://arxiv.org/pdf/2507.19227
Copy Paste: [[2507.19227]] Jailbreaking Large Language Diffusion Models: Revealing Hidden Safety Flaws in Diffusion-Based Text Generation(https://arxiv.org/abs/2507.19227)
Keywords: language model, llm
Abstract: Large Language Diffusion Models (LLDMs) exhibit comparable performance to LLMs while offering distinct advantages in inference speed and mathematical reasoning [...] precise and rapid generation capabilities of LLDMs amplify concerns of harmful generations, while existing jailbreak methodologies designed for Large Language Models (LLMs) show limited effectiveness against LLDMs and fail to expose safety [...] defense cannot definitively resolve harmful generation concerns, as it remains unclear whether LLDMs possess safety robustness or existing attacks are incompatible with diffusion-based [...] address this, we first reveal the vulnerability of LLDMs to jailbreak and demonstrate that attack failure in LLDMs stems from fundamental architectural [...] present a PArallel Decoding jailbreak (PAD) for diffusion-based language models. PAD introduces Multi-Point Attention Attack, which guides parallel generative processes toward harmful outputs, inspired by affirmative response patterns in LLMs. Experimental evaluations across four LLDMs demonstrate that PAD achieves jailbreak attack success rates of 97%, revealing significant safety vulnerabilities. Furthermore, compared to autoregressive LLMs of the same size, LLDMs increase the harmful generation speed by 2x, significantly highlighting risks of uncontrolled [...] comprehensive analysis, we provide an investigation into LLDM architecture, offering critical insights for the secure deployment of diffusion-based language models.
摘要：大型语言扩散模型（LLDMS）表现出与LLM的可比性能，同时在推理速度和数学推理中提供了明显的优势，而LLDMS的精确和快速产生能力却扩大了有害世代的关注点，而现有的越狱方法则是针对LLDMS的限制性的，而不是确定的危害，则可以确定危害的能力。令人担忧的是，由于尚不清楚LLDM是否具有安全性鲁棒性或现有攻击与基于扩散的此HTTP URL不兼容，因此我们首先揭示了LLDMS对越狱的脆弱性，并证明LLDMS的攻击失败源于基本建筑的HTTP url，HTTP url构成了一个平行的解码越狱（PAD）的基于扩散模型。 PAD引入了多点注意攻击，该攻击将引导并行生成过程降低了受LLMS中肯定响应模式启发的有害输出。四个LLDMS的实验评估表明，PAD可以使越狱攻击成功率提高了97％，从而揭示了严重的安全漏洞。此外，与相同尺寸的自回归LLM相比，LLDMS将有害的生成速度提高了2倍，大大强调了不受控制的HTTP URL全面分析的风险，我们提供了对LLDM体系结构的研究，为基于扩散语言模型的安全部署提供了关键的见解。

Title: Identifying Fine-grained Forms of Populism in Political Discourse: A Case Study on Donald Trump's Presidential Campaigns

Authors: Ilias Chalkidis, Stephanie Brandl, Paris Aslanidis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.19303
Pdf URL: https://arxiv.org/pdf/2507.19303
Copy Paste: [[2507.19303]] Identifying Fine-grained Forms of Populism in Political Discourse: A Case Study on Donald Trump's Presidential Campaigns(https://arxiv.org/abs/2507.19303)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of instruction-following tasks, yet their grasp of nuanced social science concepts remains underexplored. This paper examines whether LLMs can identify and classify fine-grained forms of populism, a complex and contested concept in both academic and media debates. To this end, we curate and release novel datasets specifically designed to capture populist discourse. We evaluate a range of pre-trained (large) language models, both open-weight and proprietary, across multiple prompting paradigms. Our analysis reveals notable variation in performance, highlighting the limitations of LLMs in detecting populist discourse. We find that a fine-tuned RoBERTa classifier vastly outperforms all new-era instruction-tuned LLMs, unless fine-tuned. Additionally, we apply our best-performing model to analyze campaign speeches by Donald Trump, extracting valuable insights into his strategic use of populist rhetoric. Finally, we assess the generalizability of these models by benchmarking them on campaign speeches by European politicians, offering a lens into cross-context transferability in political discourse analysis. In this setting, we find that instruction-tuned LLMs exhibit greater robustness on out-of-domain data.
摘要：大型语言模型（LLM）在广泛的指导遵守任务中表现出了出色的功能，但他们对细微的社会科学概念的掌握仍然没有得到充实。本文探讨了LLM是否可以识别和分类精细的民粹主义形式，这是学术和媒体辩论中的复杂且有争议的概念。为此，我们策划并发布专门设计用于捕捉民粹主义话语的新颖数据集。我们在多个提示范式中评估了一系列的开放权重和专有语言模型。我们的分析揭示了性能的显着差异，突出了LLM在检测民粹主义话语时的局限性。我们发现，除非经过微调，否则一个微调的Roberta分类器大大胜过所有新时代的指导调整LLM。此外，我们运用最佳表现模型来分析唐纳德·特朗普（Donald Trump）的竞选演讲，从而对他对民粹主义言论的战略使用提出了宝贵的见解。最后，我们通过对欧洲政客的竞选演讲进行基准测试来评估这些模型的概括性，从而在政治话语分析中提供了跨文化转移性。在这种情况下，我们发现指导调整的LLM在室外数据上表现出更大的鲁棒性。

Title: AutoPCR: Automated Phenotype Concept Recognition by Prompting

Authors: Yicheng Tao, Yuanhao Huang, Jie Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.19315
Pdf URL: https://arxiv.org/pdf/2507.19315
Copy Paste: [[2507.19315]] AutoPCR: Automated Phenotype Concept Recognition by Prompting(https://arxiv.org/abs/2507.19315)
Keywords: language model, prompt
Abstract: Phenotype concept recognition (CR) is a fundamental task in biomedical text mining, enabling applications such as clinical diagnostics and knowledge graph construction. However, existing methods often require ontology-specific training and struggle to generalize across diverse text types and evolving biomedical terminology. We present AutoPCR, a prompt-based phenotype CR method that does not require ontology-specific training. AutoPCR performs CR in three stages: entity extraction using a hybrid of rule-based and neural tagging strategies, candidate retrieval via SapBERT, and entity linking through prompting a large language model. Experiments on four benchmark datasets show that AutoPCR achieves the best average and most robust performance across both mention-level and document-level evaluations, surpassing prior state-of-the-art methods. Further ablation and transfer studies demonstrate its inductive capability and generalizability to new ontologies.
摘要：表型概念识别（CR）是生物医学文本挖掘中的一项基本任务，从而使应用程序（例如临床诊断和知识图构造）。但是，现有方法通常需要特定于本体的培训，并努力跨越各种文本类型并不断发展生物医学术语。我们提出AutoPCR，这是一种基于及时的表型CR方法，不需要特定于本体的培训。 AUTOPCR在三个阶段进行CR：使用基于规则和神经标记策略的混合体，通过Sapbert检索候选的实体提取，以及通过提示大型语言模型链接的实体。四个基准数据集的实验表明，AUTOPCR在提及级别和文档级别的评估中都达到了最佳平均和最强的性能，从而超过了先前的最新方法。进一步的消融和转移研究表明了其对新本体论的感应能力和普遍性。

Title: Smooth Reading: Bridging the Gap of Recurrent LLM to Self-Attention LLM on Long-Context Tasks

Authors: Kai Liu, Zhan Su, Peijie Dong, Fengran Mo, Jianfei Gao, ShaoTing Zhang, Kai Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.19353
Pdf URL: https://arxiv.org/pdf/2507.19353
Copy Paste: [[2507.19353]] Smooth Reading: Bridging the Gap of Recurrent LLM to Self-Attention LLM on Long-Context Tasks(https://arxiv.org/abs/2507.19353)
Keywords: language model, llm
Abstract: Recently, recurrent large language models (Recurrent LLMs) with linear computational complexity have re-emerged as efficient alternatives to self-attention-based LLMs (Self-Attention LLMs), which have quadratic complexity. However, Recurrent LLMs often underperform on long-context tasks due to their limited fixed-size memory. Previous research has primarily focused on enhancing the memory capacity of Recurrent LLMs through architectural innovations, but these approaches have not yet enabled Recurrent LLMs to match the performance of Self-Attention LLMs on long-context tasks. We argue that this limitation arises because processing the entire context at once is not well-suited for Recurrent LLMs. In this paper, we propose Smooth Reading, a chunk-wise inference method inspired by human reading strategies. Smooth Reading processes context in chunks and iteratively summarizes the contextual information, thereby reducing memory demands and making the approach more compatible with Recurrent LLMs. Our experimental results show that this method substantially narrows the performance gap between Recurrent and Self-Attention LLMs on long-context tasks, while preserving the efficiency advantages of Recurrent LLMs. Our Smooth Reading boosts SWA-3B-4k (a Recurrent LLM) from 5.68% lower to 3.61% higher performance than Self-Attention LLMs on LongBench. Besides, our method maintains the high efficiency, training 3x faster and inferring 2x faster at 64k context compared to Self-Attention LLMs. To our knowledge, this is the first work to achieve comparable performance using Recurrent LLMs compared with Self-Attention LLMs on long-context tasks. We hope our method will inspire future research in this area. To facilitate further progress, we will release code and dataset.
摘要：最近，具有线性计算复杂性的经常性大语言模型（经常性LLM）已重新出现为具有二次复杂性的基于自我注意力的LLM（自我发项LLM）的有效替代方案。但是，由于其固定尺寸有限的内存，经常性LLM在长篇小说任务上通常表现不佳。先前的研究主要集中在通过架构创新来增强经常性LLM的记忆能力，但是这些方法尚未使经常性的LLM能够匹配自我发挥的LLMS在长篇文章任务上的性能。我们认为，出现了这种限制，因为一次处理整个上下文并不适合复发性LLM。在本文中，我们提出了流畅的阅读，这是一种受人阅读策略启发的块推理方法。平滑的阅读过程块中的上下文并迭代地总结了上下文信息，从而减少了内存需求并使该方法与经常性LLMS更兼容。我们的实验结果表明，这种方法显着缩小了长篇文化任务上经常性和自我发项率LLM之间的性能差距，同时保留了经常性LLM的效率优势。我们的平滑阅读将SWA-3B-4K（复发性LLM）从长达5.68％降至3.61％的性能，比Longbench上的自我发挥的LLMS提高了3.61％。此外，与自我发作的LLM相比，我们的方法保持高效率，在64K环境下更快地训练3倍，并在64K环境中推断2倍。据我们所知，这是与长篇小说任务上的自我发作LLM相比，使用Recurrent LLMS实现可比性能的第一项工作。我们希望我们的方法能激发该领域的未来研究。为了促进进一步的进步，我们将发布代码和数据集。
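
The chunk-wise reading loop described above can be sketched as follows. This is a minimal illustration that assumes a generic `summarize(summary, chunk)` callback; the toy `keep_capitalized` function stands in for the model and is not the authors' method. The point is only that the carried state stays bounded while the context is consumed piece by piece.

```python
from typing import Callable, List


def chunk_text(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def read_in_chunks(
    tokens: List[str],
    chunk_size: int,
    summarize: Callable[[List[str], List[str]], List[str]],
) -> List[str]:
    """Fold the context chunk by chunk into one running summary."""
    summary: List[str] = []
    for chunk in chunk_text(tokens, chunk_size):
        summary = summarize(summary, chunk)
    return summary


def keep_capitalized(summary: List[str], chunk: List[str], limit: int = 4) -> List[str]:
    """Toy summarizer (assumption): keep the latest capitalized tokens, bounded by limit."""
    merged = summary + [t for t in chunk if t[:1].isupper()]
    return merged[-limit:]


tokens = "Alice met Bob in Paris then flew to Rome with Carol".split()
final_summary = read_in_chunks(tokens, chunk_size=4, summarize=keep_capitalized)
print(final_summary)  # ['Bob', 'Paris', 'Rome', 'Carol']
```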

Title: SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models

Authors: Zhen Wan, Chao-Han Huck Yang, Yahan Yu, Jinchuan Tian, Sheng Li, Ke Hu, Zhehuai Chen, Shinji Watanabe, Fei Cheng, Chenhui Chu, Sadao Kurohashi
Subjects: cs.CL, cs.AI, cs.SC, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2507.19361
Pdf URL: https://arxiv.org/pdf/2507.19361
Copy Paste: [[2507.19361]] SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models(https://arxiv.org/abs/2507.19361)
Keywords: language model, llm, hallucination
Abstract: We introduce Speech-based Intelligence Quotient (SIQ) as a new form of human cognition-inspired evaluation pipeline for voice understanding large language models, LLM Voice, designed to assess their voice understanding ability. Moving beyond popular voice understanding metrics such as word error rate (WER), SIQ examines LLM Voice across three cognitive levels motivated by Bloom's Taxonomy: (1) Remembering (i.e., WER for verbatim accuracy); (2) Understanding (i.e., similarity of LLM's interpretations); and (3) Application (i.e., QA accuracy for simulating downstream tasks). We demonstrate that SIQ not only quantifies voice understanding abilities but also provides unified comparisons between cascaded methods (e.g., ASR LLM) and end-to-end models, identifies annotation errors in existing benchmarks, and detects hallucinations in LLM Voice. Our framework represents a first-of-its-kind intelligence examination that bridges cognitive principles with voice-oriented benchmarks, while exposing overlooked challenges in multi-modal training.
摘要：我们将基于语音的智能商（SIQ）介绍为人类认知启发的评估管道的一种新形式，以了解大型语言模型，LLM语音，旨在评估他们的语音理解能力。 SIQ超越了理解诸如单词错误率（WER）之类的普遍语音，还检查了Bloom分类法所激发的三种认知水平的LLM语音：（1）记住（即，为逐字准确性）；（2）理解（即LLM解释的相似性）；（3）应用程序（即，质量为QA准确性用于模拟下游任务）。我们证明，SIQ不仅量化了语音理解能力，而且还提供了级联方法（例如ASR LLM）和端到端模型之间的统一比较，可以确定现有基准的注释错误，并在LLM语音中检测到幻觉。我们的框架代表了一项首要的情报考试，它将认知原理带有面向配音的基准，同时揭示了多模式训练中被忽视的挑战。
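
For reference, the word error rate used at the Remembering level is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below implements that standard definition; the function name and the whitespace tokenization are assumptions, and SIQ's own scoring pipeline may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and hyp[:j]; it starts as the cost of inserting j words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3))  # 0.333
```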

Title: Detection of Adverse Drug Events in Dutch clinical free text documents using Transformer Models: benchmark study

Authors: Rachel M. Murphy (1), Nishant Mishra (1), Nicolette F. de Keizer (1), Dave A. Dongelmans (2), Kitty J. Jager (1), Ameen Abu-Hanna (1), Joanna E. Klopotowska (1), Iacer Calixto (1) ((1) Amsterdam UMC location University of Amsterdam, Department of Medical Informatics, Amsterdam, The Netherlands, (2) Amsterdam UMC location University of Amsterdam, Department of Intensive Care Medicine, Amsterdam, the Netherlands)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.19396
Pdf URL: https://arxiv.org/pdf/2507.19396
Copy Paste: [[2507.19396]] Detection of Adverse Drug Events in Dutch clinical free text documents using Transformer Models: benchmark study(https://arxiv.org/abs/2507.19396)
Keywords: language model
Abstract: In this study, we set a benchmark for adverse drug event (ADE) detection in Dutch clinical free text documents using several transformer models, clinical scenarios and fit-for-purpose performance measures. We trained a Bidirectional Long Short-Term Memory (Bi-LSTM) model and four transformer-based Dutch and/or multilingual encoder models (BERTje, RobBERT, [...], and NuNER) for the tasks of named entity recognition (NER) and relation classification (RC) using 102 richly annotated Dutch ICU clinical progress notes. Anonymized free text clinical progress notes of patients admitted to the intensive care unit (ICU) of one academic hospital and discharge letters of patients admitted to Internal Medicine wards of two non-academic hospitals were reused. We evaluated our ADE RC models internally using gold standard (two-step task) and predicted entities (end-to-end task). In addition, all models were externally validated on detecting ADEs at the document level. We report both micro- and macro-averaged F1 scores, given the imbalance of ADEs in the datasets. Although differences for the ADE RC task between the models were small, [...] was the best performing model with a macro-averaged F1 score of 0.63 using gold standard and 0.62 using predicted entities. The [...] models also performed the best in our external validation and achieved recall between 0.67 and 0.74 using predicted entities, meaning between 67% and 74% of discharge letters with ADEs were detected. Our benchmark study presents a robust and clinically meaningful approach for evaluating language models for ADE detection in clinical free text documents. Our study highlights the need to use appropriate performance measures fit for the task of ADE detection in clinical free-text documents and envisioned future clinical use.
摘要：在这项研究中，我们使用多种变压器模型，临床场景和适合用途的绩效指标为荷兰临床免费文本文档中的不良药物事件（ADE）设定了基准。我们培训了双向长期记忆（BISTM）模型和四个基于变压器的荷兰语和/或多语言编码器模型（Bertje，Robbert，The HTTP URL和NUNER），用于使用102个富裕的荷兰荷兰荷兰荷兰语ICU ICU Cline Cliental Progress Progement in Nite nuntity Entity识别（NER）的任务（NER）和关系分类（RC）。匿名自由文本临床进度临床进度注释被重新使用了一家学术医院重症监护病房（ICU）的患者，并重新使用了两家非学术医院内科医学病房的出院证书。我们使用黄金标准（两步任务）和预测的实体（端到端任务）在内部评估了ADE RC模型。此外，所有模型均在在文档级别检测ADE时进行外部验证。考虑到数据集中的ADE不平衡，我们报告了微型和宏观平均的F1分数。尽管模型之间的ADE RC任务的差异很小，但该HTTP URL是最佳性能模型，使用Gold Standard使用宏观平均的F1得分为0.63，使用预测实体为0.62。该HTTP URL模型在我们的外部验证中也表现出了最佳状态，并使用预测的实体在0.67至0.74之间进行了回忆，这意味着检测到67％至74％的带有ADE的排放字母。我们的基准研究提出了一种评估临床免费文本文档中ADE检测语言模型的强大和临床意义的方法。我们的研究强调，需要在临床自由文本文档中使用适合ADE检测任务的适当性能指标，并设想将来的临床用途。
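
To make the reason for reporting both averages concrete, here is a small sketch of micro- and macro-averaged F1 on an imbalanced toy label set. The labels and counts are invented for illustration and are not taken from the study; the setting is single-label classification, where micro-F1 coincides with accuracy.

```python
from collections import Counter
from typing import List, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold: List[str], pred: List[str]) -> Tuple[float, float]:
    """Return (micro-F1, macro-F1) for single-label predictions."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[g] += 1  # gold label gains a false negative
    # Micro: pool counts over all labels. Macro: average per-label F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Toy data (assumption): 8 negatives, 2 ADE cases, one ADE case missed.
gold = ["none"] * 8 + ["ADE"] * 2
pred = ["none"] * 8 + ["none", "ADE"]
micro, macro = micro_macro_f1(gold, pred)
print(round(micro, 3), round(macro, 3))  # 0.9 0.804
```

The missed case barely moves the micro average, which is dominated by the frequent class, but pulls the macro average down because the rare class counts equally.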

Title: TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability

Authors: Mohammad Aflah Khan, Ameya Godbole, Johnny Tian-Zheng Wei, Ryan Wang, James Flemings, Krishna Gummadi, Willie Neiswanger, Robin Jia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.19419
Pdf URL: https://arxiv.org/pdf/2507.19419
Copy Paste: [[2507.19419]] TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability(https://arxiv.org/abs/2507.19419)
Keywords: language model, gpt
Abstract: Understanding the relationship between training data and model behavior during pretraining is crucial, but existing workflows make this process cumbersome, fragmented, and often inaccessible to researchers. We present TokenSmith, an open-source library for interactive editing, inspection, and analysis of datasets used in Megatron-style pretraining frameworks such as GPT-NeoX, Megatron, and NVIDIA NeMo. TokenSmith supports a wide range of operations including searching, viewing, ingesting, exporting, inspecting, and sampling data, all accessible through a simple user interface and a modular backend. It also enables structured editing of pretraining data without requiring changes to training code, simplifying dataset debugging, validation, and experimentation. TokenSmith is designed as a plug-and-play addition to existing large language model pretraining workflows, thereby democratizing access to production-grade dataset tooling. TokenSmith is hosted on GitHub, with accompanying documentation and tutorials. A demonstration video is also available on YouTube.
摘要：了解训练数据和训练过程中的模型行为之间的关系至关重要，但是现有的工作流程使研究人员繁琐，分散且常常无法访问。我们提出了Tokensmith，这是一个开源库，用于互动编辑，检查和分析超级风格的预科框架中使用的数据集，例如GPT-Neox，Megatron和Nvidia Nemo。 Tokensmith支持广泛的操作，包括搜索，查看，摄入，导出，检查和采样数据，所有这些都可以通过简单的用户界面和模块化后端访问。它还可以实现预读取数据的结构化编辑，而无需更改培训代码，简化数据集调试，验证和实验。 Tokensmith被设计为现有的大型语言模型预处理工作流的插件，从而使访问生产级数据集工具的访问民主化。 Tokensmith托管在GitHub1上，并附有文档和教程。 YouTube上还提供了演示视频。

Title: GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Authors: Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, Omar Khattab
Subjects: cs.CL, cs.AI, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2507.19457
Pdf URL: https://arxiv.org/pdf/2507.19457
Copy Paste: [[2507.19457]] GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning(https://arxiv.org/abs/2507.19457)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language can often provide a much richer learning medium for LLMs, compared with policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containing one or more LLM prompts, GEPA samples system-level trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA's design, it can often turn even just a few rollouts into a large quality gain. Across four tasks, GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% across two LLMs, and demonstrates promising results as an inference-time search strategy for code optimization.
摘要：大型语言模型（LLMS）越来越多地通过加强学习（RL）方法（例如小组相对策略优化（GRPO））来适应下游任务，这些方法通常需要数千个推出才能学习新任务。我们认为，与从稀疏，标量奖励获得的政策梯度相比，语言的可解释性质通常可以为LLM提供更丰富的学习媒介。为了测试这一点，我们介绍了GEPA（遗传 - pareto），这是一个及时的优化器，彻底结合了自然语言反思，以从反复试验中学习高级规则。给定包含一个或多个LLM提示的任何AI系统，GEPA样品系统级轨迹（例如，推理，工具呼叫和工具输出），并以自然语言反思它们以诊断问题，提出和测试提示更新，并结合互补的经验教训。由于GEPA的设计，它通常甚至可以将几次推广变成质量巨大的增益。在四个任务中，GEPA的表现平均优于GRPO，平均比GRPO高达20％，同时减少了35倍。 GEPA还优于领先的提示优化器MIPROV2，在两个LLM中的表现超过10％，并证明了有希望的结果，作为代码优化的推理时间搜索策略。
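
The abstract mentions combining lessons from the Pareto frontier of the optimizer's own attempts without giving the selection rule. As a generic sketch only, a Pareto filter over per-task scores keeps every candidate that no other candidate beats on all tasks at once; the candidate names and scores below are made up, and GEPA's actual procedure may differ.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names of candidates not dominated by any other candidate."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(o, s) for other, o in scores.items() if other != name)
    ]


# Hypothetical per-task scores for four candidate prompts (higher is better).
scores = {
    "prompt_a": (0.9, 0.2),
    "prompt_b": (0.5, 0.8),
    "prompt_c": (0.4, 0.7),  # dominated by prompt_b
    "prompt_d": (0.9, 0.1),  # dominated by prompt_a
}
front = pareto_front(scores)
print(front)  # ['prompt_a', 'prompt_b']
```

Keeping the whole front rather than the single best average preserves candidates that are specialists on one task, which is what makes later combination of complementary lessons possible.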

Title: Conversations Gone Awry, But Then? Evaluating Conversational Forecasting Models

Authors: Son Quoc Tran, Tushaar Gangavarapu, Nicholas Chernogor, Jonathan P. Chang, Cristian Danescu-Niculescu-Mizil
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2507.19470
Pdf URL: https://arxiv.org/pdf/2507.19470
Copy Paste: [[2507.19470]] Conversations Gone Awry, But Then? Evaluating Conversational Forecasting Models(https://arxiv.org/abs/2507.19470)
Keywords: language model
Abstract: We often rely on our intuition to anticipate the direction of a conversation. Endowing automated systems with similar foresight can enable them to assist human-human interactions. Recent work on developing models with this predictive capacity has focused on the Conversations Gone Awry (CGA) task: forecasting whether an ongoing conversation will derail. In this work, we revisit this task and introduce the first uniform evaluation framework, creating a benchmark that enables direct and reliable comparisons between different architectures. This allows us to present an up-to-date overview of the current progress in CGA models, in light of recent advancements in language modeling. Our framework also introduces a novel metric that captures a model's ability to revise its forecast as the conversation progresses.
摘要：我们经常依靠直觉来预测对话的方向。赋予具有相似预见的自动化系统可以使它们能够协助人类的互动。以这种预测能力开发模型的最新工作集中在对话中的问题（CGA）任务：预测正在进行的对话是否会脱轨。在这项工作中，我们重新审视此任务，并介绍了第一个统一的评估框架，创建了一个基准，该基准可以在不同体系结构之间进行直接可靠的比较。这使我们能够根据语言建模的最新进步，对CGA模型中当前的进度进行最新概述。我们的框架还引入了一个新颖的指标，该指标捕获了模型随着对话的进行修改的预测能力。