2025-07-10

Title: Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Authors: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Vaishakh Keshava, Ying Jian, Xiaofan Zhang, Raluca Ada Popa, Kedar Dhamdhere, Blaž Bratanič, Kyuyeun Kim, Terry Koo, Ferran Alet, Yi-ting Chen, Arsha Nagrani, Hannah Muckenhirn, Zhiyuan Zhang, Corbin Quick, Filip Pavetić, Duc Dung Nguyen, Joao Carreira, Michael Elabd, Haroon Qureshi, Fabian Mentzer, Yao-Yuan Yang, Danielle Eisenbud, Anmol Gulati, Ellie Talius, Eric Ni, Sahra Ghalebikesabi, Edouard Yvinec, Alaa Saade, Thatcher Ulrich, Lorenzo Blanco, Dan A. Calian, Muhuan Huang, Aäron van den Oord, Naman Goyal, Terry Chen, Praynaa Rawlani, Christian Schallhart, Swachhand Lokhande, Xianghong Luo, Jyn Shan, Ceslee Montgomery, Victoria Krakovna, Federico Piccinini, Omer Barak, Jingyu Cui, Yiling Jia, Mikhail Dektiarev, Alexey Kolganov, Shiyu Huang, Zhe Chen, Xingyu Wang, Jessica Austin, Peter de Boursac, Evgeny Sluzhaev, Frank Ding, Huijian Li, Surya Bhupatiraju
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.06261
Pdf URL: https://arxiv.org/pdf/2507.06261
Copy Paste: [[2507.06261]] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities(https://arxiv.org/abs/2507.06261)
Keywords: long context, agent
Abstract: In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

Title: Humans overrely on overconfident language models, across languages

Authors: Neil Rathi, Dan Jurafsky, Kaitlyn Zhou
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2507.06306
Pdf URL: https://arxiv.org/pdf/2507.06306
Copy Paste: [[2507.06306]] Humans overrely on overconfident language models, across languages(https://arxiv.org/abs/2507.06306)
Keywords: language model, llm
Abstract: As large language models (LLMs) are deployed globally, it is crucial that their responses are calibrated across languages to accurately convey uncertainty and limitations. Previous work has shown that LLMs are linguistically overconfident in English, leading users to overrely on confident generations. However, the usage and interpretation of epistemic markers (e.g., 'It's definitely,' 'I think') can differ sharply across languages. Here, we study the risks of multilingual linguistic (mis)calibration, overconfidence, and overreliance across five languages to evaluate the safety of LLMs in a global context. We find that overreliance risks are high across all languages. We first analyze the distribution of LLM-generated epistemic markers, and observe that while LLMs are cross-linguistically overconfident, they are also sensitive to documented linguistic variation. For example, models generate the most markers of uncertainty in Japanese and the most markers of certainty in German and Mandarin. We then measure human reliance rates across languages, finding that while users strongly rely on confident LLM generations in all languages, reliance behaviors differ cross-linguistically: for example, users rely significantly more on expressions of uncertainty in Japanese than in English. Taken together, these results indicate high risk of reliance on overconfident model generations across languages. Our findings highlight the challenges of multilingual linguistic calibration and stress the importance of culturally and linguistically contextualized model safety evaluations.

Title: ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time

Authors: Kiarash Zahirnia, Zahra Golpayegani, Walid Ahmad, Yang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06313
Pdf URL: https://arxiv.org/pdf/2507.06313
Copy Paste: [[2507.06313]] ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time(https://arxiv.org/abs/2507.06313)
Keywords: language model, gpt, llm, long context
Abstract: Transformer-based language models' computation and memory overhead increase quadratically as a function of sequence length. The quadratic cost poses challenges when employing LLMs for processing long sequences. In this work, we introduce ETT (Extend at Test-Time), a method for extending the context length of short-context Transformer-based LLMs with constant memory requirement and linear computation overhead. ETT enables the extension of the context length at test time by efficiently fine-tuning the model's parameters on the input context, chunked into small overlapping subsequences. We evaluate ETT on LongBench by extending the context length of GPT-Large and Phi-2 up to 32 times, increasing from 1k to 32k tokens. This results in up to a 30 percent improvement in the model's accuracy. We also study how context can be stored in an LLM's weights effectively and efficiently. Through a detailed ablation study, we examine which Transformer modules are most beneficial to fine-tune at test time. Interestingly, we find that fine-tuning the second layer of the FFNs is more effective than full fine-tuning, leading to a further improvement in the models' accuracy.
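The ETT abstract describes chunking the input context into small overlapping subsequences before fine-tuning on them. A minimal sketch of that chunking step alone is given here in Python. The function name `chunk_with_overlap` and the parameters `chunk_size` and `overlap` are hypothetical, and the fine-tuning itself is not shown, since the abstract does not specify it.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into chunks of at most `chunk_size` tokens,
    where consecutive chunks share `overlap` tokens.

    Illustrative only: the chunk size and overlap used by ETT are not
    given in the abstract, so both are free parameters here.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not tokens:
        return []
    stride = chunk_size - overlap  # how far the window advances each step
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
        start += stride
    return chunks


if __name__ == "__main__":
    for chunk in chunk_with_overlap(list(range(10)), chunk_size=4, overlap=2):
        print(chunk)
```

Each chunk stays within a short model's native context window, which is consistent with the constant-memory claim in the abstract; how the chunks are then used as training examples is left open here.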

Title: Could the Road to Grounded, Neuro-symbolic AI be Paved with Words-as-Classifiers?

Authors: Casey Kennington, David Schlangen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06335
Pdf URL: https://arxiv.org/pdf/2507.06335
Copy Paste: [[2507.06335]] Could the Road to Grounded, Neuro-symbolic AI be Paved with Words-as-Classifiers?(https://arxiv.org/abs/2507.06335)
Keywords: language model
Abstract: Formal, Distributional, and Grounded theories of computational semantics each have their uses and their drawbacks. There has been a shift to ground models of language by adding visual knowledge, and there has been a call to enrich models of language with symbolic methods to gain the benefits from formal, distributional, and grounded theories. In this paper, we attempt to make the case that one potential path forward in unifying all three semantic fields is paved with the words-as-classifiers model, a model of word-level grounded semantics that has been incorporated into formalisms and distributional language models in the literature and has been well tested within interactive dialogue settings. We review that literature, motivate the words-as-classifiers model with an appeal to recent work in cognitive science, and describe a small experiment. Finally, we sketch a model of semantics unified through words-as-classifiers.

Title: Evaluating Morphological Alignment of Tokenizers in 70 Languages

Authors: Catherine Arnett, Marisa Hudspeth, Brendan O'Connor
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06378
Pdf URL: https://arxiv.org/pdf/2507.06378
Copy Paste: [[2507.06378]] Evaluating Morphological Alignment of Tokenizers in 70 Languages(https://arxiv.org/abs/2507.06378)
Keywords: language model
Abstract: While tokenization is a key step in language modeling, with effects on model training and performance, it remains unclear how to effectively evaluate tokenizer quality. One proposed dimension of tokenizer quality is the extent to which tokenizers preserve linguistically meaningful subwords, aligning token boundaries with morphological boundaries within a word. We expand MorphScore (Arnett & Bergen, 2025), which previously covered 22 languages, to support a total of 70 languages. The updated MorphScore offers more flexibility in evaluation and addresses some of the limitations of the original version. We then correlate our alignment scores with downstream task performance for five pre-trained language models on seven tasks, with at least one task in each of the languages in our sample. We find that morphological alignment does not explain very much variance in model performance, suggesting that morphological alignment alone does not measure dimensions of tokenization quality relevant to model performance.
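To make "aligning token boundaries with morphological boundaries within a word" concrete, the following Python sketch computes the fraction of a word's morpheme boundaries that also appear as token boundaries. This is an illustrative measure only, not MorphScore's actual scoring rule, which the abstract does not define. The names `internal_boundaries` and `boundary_recall` are hypothetical.

```python
def internal_boundaries(pieces):
    """Return the character offsets at which a word is split into `pieces`."""
    offsets = set()
    position = 0
    for piece in pieces[:-1]:  # no boundary after the final piece
        position += len(piece)
        offsets.add(position)
    return offsets


def boundary_recall(token_pieces, morpheme_pieces):
    """Fraction of morpheme boundaries that coincide with a token boundary.

    Returns None for monomorphemic words, which have no boundary to recover.
    """
    if "".join(token_pieces) != "".join(morpheme_pieces):
        raise ValueError("both segmentations must spell the same word")
    gold = internal_boundaries(morpheme_pieces)
    if not gold:
        return None
    predicted = internal_boundaries(token_pieces)
    return len(gold & predicted) / len(gold)


if __name__ == "__main__":
    # Tokenizer splits "unbelievable" as un|believable; morphemes are un|believ|able.
    print(boundary_recall(["un", "believable"], ["un", "believ", "able"]))
```

A recall-style measure is used here because it ignores extra token splits inside a morpheme; a stricter measure would also penalize those.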

Title: PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning

Authors: Zeming Chen, Angelika Romanou, Gail Weiss, Antoine Bosselut
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.06415
Pdf URL: https://arxiv.org/pdf/2507.06415
Copy Paste: [[2507.06415]] PERK: Long-Context Reasoning as Parameter-Efficient Test-Time Learning(https://arxiv.org/abs/2507.06415)
Keywords: gpt, long context, prompt
Abstract: Long-context reasoning requires accurately identifying relevant information in extensive, noisy input contexts. Previous research shows that using test-time learning to encode context directly into model parameters can effectively enable reasoning over noisy information. However, meta-learning methods for enabling test-time learning are prohibitively memory-intensive, preventing their application to long context settings. In this work, we propose PERK (Parameter Efficient Reasoning over Knowledge), a scalable approach for learning to encode long input contexts using gradient updates to a lightweight model adapter at test time. Specifically, PERK employs two nested optimization loops in a meta-training phase. The inner loop rapidly encodes contexts into a low-rank adapter (LoRA) that serves as a parameter-efficient memory module for the base model. Concurrently, the outer loop learns to use the updated adapter to accurately recall and reason over relevant information from the encoded long context. Our evaluations on several long-context reasoning tasks show that PERK significantly outperforms the standard prompt-based long-context baseline, achieving average absolute performance gains of up to 90% for smaller models (GPT-2) and up to 27% for our largest evaluated model, Qwen-2.5-0.5B. In general, PERK is more robust to reasoning complexity, length extrapolation, and the locations of relevant information in contexts. Finally, we show that while PERK is memory-intensive during training, it scales more efficiently at inference time than prompt-based long-context inference.

Title: Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling

Authors: Pankayaraj Pathmanathan, Furong Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06419
Pdf URL: https://arxiv.org/pdf/2507.06419
Copy Paste: [[2507.06419]] Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling(https://arxiv.org/abs/2507.06419)
Keywords: language model, llm
Abstract: Reward modeling (RM), which captures human preferences to align large language models (LLMs), is increasingly employed in tasks such as model finetuning, response filtering, and ranking. However, due to the inherent complexity of human preferences and the limited coverage of available datasets, reward models often fail under distributional shifts or adversarial perturbations. Existing approaches for identifying such failure modes typically rely on prior knowledge about preference distributions or failure attributes, limiting their practicality in real-world settings where such information is unavailable. In this work, we propose a tractable, preference-distribution-agnostic method for discovering reward model failure modes via reward-guided controlled decoding. Building on this, we introduce REFORM, a self-improving reward modeling framework that enhances robustness by using the reward model itself to guide the generation of falsely scored responses. These adversarial examples are then used to augment the training data and patch the reward model's misaligned behavior. We evaluate REFORM on two widely used preference datasets, Anthropic Helpful Harmless (HH) and PKU Beavertails, and demonstrate that it significantly improves robustness without sacrificing reward quality. Notably, REFORM preserves performance both in direct evaluation and in downstream policy training, and further improves alignment quality by removing spurious correlations.

Title: Exploring Task Performance with Interpretable Models via Sparse Auto-Encoders

Authors: Shun Wang, Tyler Loakman, Youbo Lei, Yi Liu, Bohao Yang, Yuting Zhao, Dong Yang, Chenghua Lin
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.06427
Pdf URL: https://arxiv.org/pdf/2507.06427
Copy Paste: [[2507.06427]] Exploring Task Performance with Interpretable Models via Sparse Auto-Encoders(https://arxiv.org/abs/2507.06427)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are traditionally viewed as black-box algorithms, which reduces trustworthiness and obscures potential approaches to increasing performance on downstream tasks. In this work, we apply an effective LLM decomposition method using a dictionary-learning approach with sparse autoencoders. This helps extract monosemantic features from polysemantic LLM neurons. Remarkably, our work identifies model-internal misunderstanding, allowing the automatic reformulation of the prompts with additional annotations to improve the interpretation by LLMs. Moreover, this approach demonstrates a significant performance improvement in downstream tasks, such as mathematical reasoning and metaphor detection.

Title: Perception-Aware Policy Optimization for Multimodal Reasoning

Authors: Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06448
Pdf URL: https://arxiv.org/pdf/2507.06448
Copy Paste: [[2507.06448]] Perception-Aware Policy Optimization for Multimodal Reasoning(https://arxiv.org/abs/2507.06448)
Keywords: language model, llm
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning.

Title: A Semantic Parsing Framework for End-to-End Time Normalization

Authors: Xin Su, Sungduk Yu, Phillip Howard, Steven Bethard
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06450
Pdf URL: https://arxiv.org/pdf/2507.06450
Copy Paste: [[2507.06450]] A Semantic Parsing Framework for End-to-End Time Normalization(https://arxiv.org/abs/2507.06450)
Keywords: language model, llm
Abstract: Time normalization is the task of converting natural language temporal expressions into machine-readable representations. It underpins many downstream applications in information retrieval, question answering, and clinical decision-making. Traditional systems based on the ISO-TimeML schema limit expressivity and struggle with complex constructs such as compositional, event-relative, and multi-span time expressions. In this work, we introduce a novel formulation of time normalization as a code generation task grounded in the SCATE framework, which defines temporal semantics through symbolic and compositional operators. We implement a fully executable SCATE Python library and demonstrate that large language models (LLMs) can generate executable SCATE code. Leveraging this capability, we develop an automatic data augmentation pipeline using LLMs to synthesize large-scale annotated data with code-level validation. Our experiments show that small, locally deployable models trained on this augmented data can achieve strong performance, outperforming even their LLM parents and enabling practical, accurate, and interpretable time normalization.

Title: A Systematic Analysis of Hybrid Linear Attention

Authors: Dustin Wang, Rui-Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, Jason Eshraghian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06457
Pdf URL: https://arxiv.org/pdf/2507.06457
Copy Paste: [[2507.06457]] A Systematic Analysis of Hybrid Linear Attention(https://arxiv.org/abs/2507.06457)
Keywords: language model, prompt
Abstract: Transformers face quadratic complexity and memory issues with long sequences, prompting the adoption of linear attention mechanisms using fixed-size hidden states. However, linear models often suffer from limited recall performance, leading to hybrid architectures that combine linear and full attention layers. Despite extensive hybrid architecture research, the choice of linear attention component has not been deeply explored. We systematically evaluate various linear attention models across generations - vector recurrences to advanced gating mechanisms - both standalone and hybridized. To enable this comprehensive analysis, we trained and open-sourced 72 models: 36 at 340M parameters (20B tokens) and 36 at 1.3B parameters (100B tokens), covering six linear attention variants across five hybridization ratios. Benchmarking on standard language modeling and recall tasks reveals that superior standalone linear models do not necessarily excel in hybrids. While language modeling remains stable across linear-to-full attention ratios, recall significantly improves with increased full attention layers, particularly below a 3:1 ratio. Our study highlights selective gating, hierarchical recurrence, and controlled forgetting as critical for effective hybrid models. We recommend architectures such as HGRN-2 or GatedDeltaNet with a linear-to-full ratio between 3:1 and 6:1 to achieve Transformer-level recall efficiently. Our models are open-sourced.
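As a rough illustration of what a linear-to-full attention ratio means for a layer stack, the Python sketch below lays out layer types for a given ratio. The periodic interleaving used here (one full-attention layer after every `linear_per_full` linear layers) is an assumption; the abstract does not say how the paper places its full-attention layers. The function name `hybrid_layout` is hypothetical.

```python
def hybrid_layout(num_layers, linear_per_full):
    """Return a list of layer types for a hybrid stack.

    A ratio of `linear_per_full`:1 is realized by making every
    (linear_per_full + 1)-th layer a full-attention layer. The placement
    pattern is an assumption made for illustration.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    if linear_per_full < 1:
        raise ValueError("linear_per_full must be at least 1")
    period = linear_per_full + 1
    return [
        "full" if (index + 1) % period == 0 else "linear"
        for index in range(num_layers)
    ]


if __name__ == "__main__":
    # The abstract recommends ratios between 3:1 and 6:1.
    for ratio in (3, 6):
        layout = hybrid_layout(24, ratio)
        print(f"{ratio}:1 ->", layout.count("linear"), "linear,",
              layout.count("full"), "full")
```

Under this layout a lower ratio yields more full-attention layers for the same depth, which is the direction in which the abstract reports recall improving.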

Title: On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks

Authors: Stephen Obadinma, Xiaodan Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06489
Pdf URL: https://arxiv.org/pdf/2507.06489
Copy Paste: [[2507.06489]] On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks(https://arxiv.org/abs/2507.06489)
Keywords: language model, llm, prompt
Abstract: Robust verbal confidence generated by large language models (LLMs) is crucial for the deployment of LLMs to ensure transparency, trust, and safety in human-AI interactions across many high-stakes applications. In this paper, we present the first comprehensive study on the robustness of verbal confidence under adversarial attacks. We introduce a novel framework for attacking verbal confidence scores through both perturbation and jailbreak-based methods, and show that these attacks can significantly jeopardize verbal confidence estimates and lead to frequent answer changes. We examine a variety of prompting strategies, model sizes, and application domains, revealing that current confidence elicitation methods are vulnerable and that commonly used defence techniques are largely ineffective or counterproductive. Our findings underscore the urgent need to design more robust mechanisms for confidence expression in LLMs, as even subtle semantic-preserving modifications can lead to misleading confidence in responses.

Title: Pun Intended: Multi-Agent Translation of Wordplay with Contrastive Learning and Phonetic-Semantic Embeddings

Authors: Russell Taylor, Benjamin Herbert, Michael Sana
Subjects: cs.CL, cs.AI, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2507.06506
Pdf URL: https://arxiv.org/pdf/2507.06506
Copy Paste: [[2507.06506]] Pun Intended: Multi-Agent Translation of Wordplay with Contrastive Learning and Phonetic-Semantic Embeddings(https://arxiv.org/abs/2507.06506)
Keywords: language model, chain-of-thought, agent
Abstract: Translating wordplay across languages presents unique challenges that have long confounded both professional human translators and machine translation systems. This research proposes a novel approach for translating puns from English to French by combining state-of-the-art large language models with specialized techniques for wordplay generation. Our methodology employs a three-stage approach. First, we establish a baseline using multiple frontier large language models with feedback based on a new contrastive learning dataset. Second, we implement a guided chain-of-thought pipeline with combined phonetic-semantic embeddings. Third, we implement a multi-agent generator-discriminator framework for evaluating and regenerating puns with feedback. Moving beyond the limitations of literal translation, our methodology's primary objective is to capture the linguistic creativity and humor of the source text wordplay, rather than simply duplicating its vocabulary. Our best runs earned first and second place in the CLEF JOKER 2025 Task 2 competition where they were evaluated manually by expert native French speakers. This research addresses a gap between translation studies and computational linguistics by implementing linguistically-informed techniques for wordplay translation, advancing our understanding of how language models can be leveraged to handle the complex interplay between semantic ambiguity, phonetic similarity, and the implicit cultural and linguistic awareness needed for successful humor.

Title: SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers

Authors: Zicong Tang, Shi Luohe, Zuchao Li, Baoyuan Qi, Guoming Liu, Lefei Zhang, Ping Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06517
Pdf URL: https://arxiv.org/pdf/2507.06517
Copy Paste: [[2507.06517]] SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers(https://arxiv.org/abs/2507.06517)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved impressive accomplishments in recent years. However, the increasing memory consumption of the KV cache has posed a significant challenge to the inference system. Eviction methods have revealed the inherent redundancy within the KV cache, demonstrating its potential for reduction, particularly in deeper layers. However, KV cache reduction for shallower layers has been found to be insufficient. We observe that the KV cache exhibits a high degree of similarity. Based on this observation, we propose a novel KV cache reduction method, SpindleKV, which balances both shallow and deep layers. For deep layers, we employ an attention-weight-based eviction method, while for shallow layers, we apply a codebook-based replacement approach that is learnt by a similarity and merging policy. Moreover, SpindleKV addresses the Grouped-Query Attention (GQA) dilemma faced by other attention-based eviction methods. Experiments on two common benchmarks with three different LLMs show that SpindleKV obtains a better KV cache reduction effect than baseline methods, while preserving similar or even better model performance.
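The abstract mentions attention-weight-based eviction for deep layers. A generic sketch of that idea, keeping only the cached positions that have received the most attention, is shown below in Python. This is not SpindleKV's algorithm: the scoring input, the fixed budget `keep`, and the function name `evict_by_attention` are all assumptions made for illustration, and the codebook-based replacement for shallow layers is not covered.

```python
def evict_by_attention(attention_scores, keep):
    """Select which cached positions to retain under a fixed budget.

    attention_scores[i] is assumed to be the accumulated attention weight
    that cached position i has received. Returns the retained position
    indices in their original order.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # Rank positions from most to least attended.
    ranked = sorted(
        range(len(attention_scores)),
        key=lambda index: attention_scores[index],
        reverse=True,
    )
    # Restore positional order so the retained cache stays sequential.
    return sorted(ranked[:keep])


if __name__ == "__main__":
    scores = [0.10, 0.50, 0.05, 0.30, 0.05]
    print(evict_by_attention(scores, keep=2))
```

In a real inference system the same selection would index into the key and value tensors of each layer and head; plain lists are used here to keep the example dependency-free.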

Title: InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes under Herd Behavior

Authors: Huisheng Wang, Zhuoshi Pan, Hangjing Zhang, Mingxiao Liu, Hanqing Gao, H. Vicky Zhao
Subjects: cs.CL, cs.AI, cs.ET, cs.LG
Abstract URL: https://arxiv.org/abs/2507.06528
Pdf URL: https://arxiv.org/pdf/2507.06528
Copy Paste: [[2507.06528]] InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes under Herd Behavior(https://arxiv.org/abs/2507.06528)
Keywords: language model, llm, agent
Abstract: Aligning Large Language Models (LLMs) with investor decision-making processes under herd behavior is a critical challenge in behavioral finance, which grapples with a fundamental limitation: the scarcity of real-user data needed for Supervised Fine-Tuning (SFT). While SFT can bridge the gap between LLM outputs and human behavioral patterns, its reliance on massive authentic data imposes substantial collection costs and privacy risks. We propose InvestAlign, a novel framework that constructs high-quality SFT datasets by leveraging theoretical solutions to similar and simple optimal investment problems rather than complex scenarios. Our theoretical analysis demonstrates that training LLMs with InvestAlign-generated data achieves faster parameter convergence than using real-user data, suggesting superior learning efficiency. Furthermore, we develop InvestAgent, an LLM agent fine-tuned with InvestAlign, which demonstrates significantly closer alignment to real-user data than pre-SFT models in both simple and complex investment problems. This highlights our proposed InvestAlign as a promising approach with the potential to address complex optimal investment problems and align LLMs with investor decision-making processes under herd behavior. Our code is publicly available.

Title: Large Language Model for Extracting Complex Contract Information in Industrial Scenes

Authors: Yunyang Cao, Yanjun Li, Silong Dai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06539
Pdf URL: https://arxiv.org/pdf/2507.06539
Copy Paste: [[2507.06539]] Large Language Model for Extracting Complex Contract Information in Industrial Scenes(https://arxiv.org/abs/2507.06539)
Keywords: language model, gpt
Abstract: This paper proposes a high-quality dataset construction method for complex contract information extraction tasks in industrial scenarios and fine-tunes a large language model based on this dataset. Firstly, cluster analysis is performed on industrial contract texts, and GPT-4 and GPT-3.5 are used to extract key information from the original contract data, obtaining high-quality data annotations. Secondly, data augmentation is achieved by constructing new texts, and GPT-3.5 generates unstructured contract texts from randomly combined keywords, improving model robustness. Finally, the large language model is fine-tuned based on the high-quality dataset. Experimental results show that the model achieves excellent overall performance while ensuring high field recall and precision and considering parsing efficiency. LoRA, data balancing, and data augmentation effectively enhance model accuracy and robustness. The proposed method provides a novel and efficient solution for industrial contract information extraction tasks.

Title: The Flaws of Others: An LLM-driven Framework for Scientific Knowledge Production

Authors: Juan B. Gutiérrez
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.06565
Pdf URL: https://arxiv.org/pdf/2507.06565
Copy Paste: [[2507.06565]] The Flaws of Others: An LLM-driven Framework for Scientific Knowledge Production(https://arxiv.org/abs/2507.06565)
Keywords: language model, llm, hallucination, agent
Abstract: Large-language models turn writing into a live exchange between humans and software. We capture this new medium with a discursive-network model that treats people and LLMs as equal nodes and tracks how their statements circulate. Broadening the focus from isolated hallucinations, we define invalidation (any factual, logical, or structural breach) and show it follows four hazards: drift from truth, self-repair, fresh fabrication, and external detection. A general mathematical model of discursive networks is developed to provide valuable insights: A network governed only by drift and self-repair stabilizes at a modest error rate; adding fabrication reproduces the high rates seen in current LLMs. Giving each false claim even a small chance of peer review shifts the system to a truth-dominant state. We operationalize peer review with the open-source Flaws-of-Others (FOO) algorithm: a configurable loop in which any set of agents critique one another while a harmoniser merges their verdicts. The takeaway is practical and cultural: reliability in this new medium comes not from perfecting single models but from wiring imperfect ones into networks that keep each other honest.

Title: Enhancing Food-Domain Question Answering with a Multimodal Knowledge Graph: Hybrid QA Generation and Diversity Analysis

Authors: Srihari K B, Pushpak Bhattacharyya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06571
Pdf URL: https://arxiv.org/pdf/2507.06571
Copy Paste: [[2507.06571]] Enhancing Food-Domain Question Answering with a Multimodal Knowledge Graph: Hybrid QA Generation and Diversity Analysis(https://arxiv.org/abs/2507.06571)
Keywords: hallucination
Abstract: We propose a unified food-domain QA framework that combines a large-scale multimodal knowledge graph (MMKG) with generative AI. Our MMKG links 13,000 recipes, 3,000 ingredients, 140,000 relations, and 14,000 images. We generate 40,000 QA pairs using 40 templates and LLaVA/DeepSeek augmentation. Joint fine-tuning of Meta LLaMA 3.1-8B and Stable Diffusion 3.5-Large improves BERTScore by 16.2%, reduces FID by 37.8%, and boosts CLIP alignment by 31.1%. Diagnostic analyses (CLIP-based mismatch detection, 35.2% to 7.3%, and LLaVA-driven hallucination checks) ensure factual and visual fidelity. A hybrid retrieval-generation strategy achieves 94.1% accurate image reuse and 85% adequacy in synthesis. Our results demonstrate that structured knowledge and multimodal generation together enhance reliability and diversity in food QA.

Title: Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation

Authors: Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, Hao Cheng, Jianfeng Gao, Weizhu Chen, Yelong Shen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.06607
Pdf URL: https://arxiv.org/pdf/2507.06607
Copy Paste: [[2507.06607]] Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation(https://arxiv.org/abs/2507.06607)
Keywords: language model, llm, prompt
Abstract: Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture, YOCO, have shown promising performance gains over Transformers, prior works have not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss compared to a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10x higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework. We release our training codebase on open-source data at this https URL.

Title: FuDoBa: Fusing Document and Knowledge Graph-based Representations with Bayesian Optimisation

Authors: Boshko Koloski, Senja Pollak, Roberto Navigli, Blaž Škrlj
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06622
Pdf URL: https://arxiv.org/pdf/2507.06622
Copy Paste: [[2507.06622]] FuDoBa: Fusing Document and Knowledge Graph-based Representations with Bayesian Optimisation(https://arxiv.org/abs/2507.06622)
Keywords: language model, llm
Abstract: Building on the success of Large Language Models (LLMs), LLM-based representations have dominated the document representation landscape, achieving great performance on the document embedding benchmarks. However, the high-dimensional, computationally expensive embeddings from LLMs tend to be either too generic or inefficient for domain-specific applications. To address these limitations, we introduce FuDoBa, a Bayesian optimisation-based method that integrates LLM-based embeddings with domain-specific structured knowledge, sourced both locally and from external repositories like WikiData. This fusion produces low-dimensional, task-relevant representations while reducing training complexity and yielding interpretable early-fusion weights for enhanced classification performance. We demonstrate the effectiveness of our approach on six datasets in two domains, showing that when paired with robust AutoML-based classifiers, our proposed representation learning approach performs on par with, or surpasses, those produced solely by the proprietary LLM-based embedding baselines.

Title: Expediting data extraction using a large language model (LLM) and scoping review protocol: a methodological study within a complex scoping review

Authors: James Stewart-Evans, Emma Wilson, Tessa Langley, Andrew Prayle, Angela Hands, Karen Exley, Jo Leonardi-Bee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.06623
Pdf URL: https://arxiv.org/pdf/2507.06623
Copy Paste: [[2507.06623]] Expediting data extraction using a large language model (LLM) and scoping review protocol: a methodological study within a complex scoping review(https://arxiv.org/abs/2507.06623)
Keywords: language model, llm, prompt
Abstract: The data extraction stages of reviews are resource-intensive, and researchers may seek to expedite data extraction using online large language models (LLMs) and review protocols. Claude 3.5 Sonnet was used to trial two approaches that used a review protocol to prompt data extraction from 10 evidence sources included in a case study scoping review. A protocol-based approach was also used to review extracted data. Limited performance evaluation was undertaken which found high accuracy for the two extraction approaches (83.3% and 100%) when extracting simple, well-defined citation details; accuracy was lower (9.6% and 15.8%) when extracting more complex, subjective data items. Considering all data items, both approaches had precision >90% but low recall (<25%) and F1 scores (<40%). The context of a complex scoping review, open response types and methodological approach likely impacted performance due to missed and misattributed data. LLM feedback considered the baseline extraction accurate and suggested minor amendments: four of 15 (26.7%) to citation details and 8 of 38 (21.1%) to key findings data items were considered to potentially add value. However, when repeating the process with a dataset featuring deliberate errors, only 2 of 39 (5%) errors were detected. Review-protocol-based methods used for expediency require more robust performance evaluation across a range of LLMs and review contexts with comparison to conventional prompt engineering approaches. We recommend researchers evaluate and report LLM performance if using them similarly to conduct data extraction or review extracted data. LLM feedback contributed to protocol adaptation and may assist future review protocol drafting.
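Editor's note: the abstract above reports precision above 90% alongside recall below 25% and F1 below 40%. These figures fit together because F1 is the harmonic mean of precision and recall, so the weaker of the two dominates. The sketch below uses the standard F1 formula; the inputs 0.90 and 0.25 are illustrative values at the reported bounds, not measurements from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Illustrative bounds only (assumed, not taken from the paper's tables):
# high precision cannot compensate for low recall.
bound = f1_score(0.90, 0.25)  # roughly 0.39, i.e. below 40%
```

Even with perfect precision, a recall of 0.25 would cap F1 at 0.4 under this formula.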

Title: Elite Polarization in European Parliamentary Speeches: a Novel Measurement Approach Using Large Language Models

Authors: Gennadii Iakovlev
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.06658
Pdf URL: https://arxiv.org/pdf/2507.06658
Copy Paste: [[2507.06658]] Elite Polarization in European Parliamentary Speeches: a Novel Measurement Approach Using Large Language Models(https://arxiv.org/abs/2507.06658)
Keywords: language model
Abstract: This project introduces a new measure of elite polarization via actor and subject detection using artificial intelligence. I identify when politicians mention one another in parliamentary speeches, note who is speaking and who is being addressed, and assess the emotional temperature behind these evaluations. This maps how elites evaluate their various out-parties, allowing us to create an index of mutual out-party hostility, that is, elite polarization. While I analyzed polarization data over the past four decades for the UK, and two decades for Hungary and Italy, my approach lays the groundwork for a twenty-year, EU-wide time-series dataset on elite polarization. I obtain results that can be aggregated by party and quarter. The resulting index demonstrates good face validity: it reacts to events such as electoral campaigns, country- and party-level crises, and to parties losing and assuming power.

Title: CLI-RAG: A Retrieval-Augmented Framework for Clinically Structured and Context Aware Text Generation with LLMs

Authors: Garapati Keerthana, Manik Gupta
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2507.06715
Pdf URL: https://arxiv.org/pdf/2507.06715
Copy Paste: [[2507.06715]] CLI-RAG: A Retrieval-Augmented Framework for Clinically Structured and Context Aware Text Generation with LLMs(https://arxiv.org/abs/2507.06715)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Large language models (LLMs), including zero-shot and few-shot paradigms, have shown promising capabilities in clinical text generation. However, real-world applications face two key challenges: (1) patient data is highly unstructured, heterogeneous, and scattered across multiple note types and (2) clinical notes are often long and semantically dense, making naive prompting infeasible due to context length constraints and the risk of omitting clinically relevant information. We introduce CLI-RAG (Clinically Informed Retrieval-Augmented Generation), a domain-specific framework for structured and clinically grounded text generation using LLMs. It incorporates a novel hierarchical chunking strategy that respects clinical document structure and introduces a task-specific dual-stage retrieval mechanism. The global stage identifies relevant note types using evidence-based queries, while the local stage extracts high-value content within those notes creating relevance at both document and section levels. We apply the system to generate structured progress notes for individual hospital visits using 15 clinical note types from the MIMIC-III dataset. Experiments show that it preserves temporal and semantic alignment across visits, achieving an average alignment score of 87.7%, surpassing the 80.7% baseline from real clinician-authored notes. The generated outputs also demonstrate high consistency across LLMs, reinforcing deterministic behavior essential for reproducibility, reliability, and clinical trust.

Title: On the Effect of Uncertainty on Layer-wise Inference Dynamics

Authors: Sunwoo Kim, Haneul Yoo, Alice Oh
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.06722
Pdf URL: https://arxiv.org/pdf/2507.06722
Copy Paste: [[2507.06722]] On the Effect of Uncertainty on Layer-wise Inference Dynamics(https://arxiv.org/abs/2507.06722)
Keywords: language model, llm, hallucination
Abstract: Understanding how large language models (LLMs) internally represent and process their predictions is central to detecting uncertainty and preventing hallucinations. While several studies have shown that models encode uncertainty in their hidden states, it is underexplored how this affects the way they process such hidden states. In this work, we demonstrate that the dynamics of output token probabilities across layers for certain and uncertain outputs are largely aligned, revealing that uncertainty does not seem to affect inference dynamics. Specifically, we use the Tuned Lens, a variant of the Logit Lens, to analyze the layer-wise probability trajectories of final prediction tokens across 11 datasets and 5 models. Using incorrect predictions as those with higher epistemic uncertainty, our results show aligned trajectories for certain and uncertain predictions that both observe abrupt increases in confidence at similar layers. We balance this finding by showing evidence that more competent models may learn to process uncertainty differently. Our findings challenge the feasibility of leveraging simplistic methods for detecting uncertainty at inference. More broadly, our work demonstrates how interpretability methods may be used to investigate the way uncertainty affects inference.

Title: Checklist Engineering Empowers Multilingual LLM Judges

Authors: Mohammad Ghiasvand Mohammadkhani, Hamid Beigy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06774
Pdf URL: https://arxiv.org/pdf/2507.06774
Copy Paste: [[2507.06774]] Checklist Engineering Empowers Multilingual LLM Judges(https://arxiv.org/abs/2507.06774)
Keywords: language model, gpt, llm
Abstract: Automated text evaluation has long been a central issue in Natural Language Processing (NLP). Recently, the field has shifted toward using Large Language Models (LLMs) as evaluators, a trend known as the LLM-as-a-Judge paradigm. While promising and easily adaptable across tasks, this approach has seen limited exploration in multilingual contexts. Existing multilingual studies often rely on proprietary models or require extensive training data for fine-tuning, raising concerns about cost, time, and efficiency. In this paper, we propose Checklist Engineering based LLM-as-a-Judge (CE-Judge), a training-free framework that uses checklist intuition for multilingual evaluation with an open-source model. Experiments across multiple languages and three benchmark datasets, under both pointwise and pairwise settings, show that our method generally surpasses the baselines and performs on par with the GPT-4o model.

Title: Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining: Method, Evaluation and Applications

Authors: Seonwu Kim, Yohan Na, Kihun Kim, Hanhee Cho, Geun Lim, Mintae Kim, Seongik Park, Ki Hyun Kim, Youngsub Han, Byoung-Ki Jeon
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.06795
Pdf URL: https://arxiv.org/pdf/2507.06795
Copy Paste: [[2507.06795]] Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining: Method, Evaluation and Applications(https://arxiv.org/abs/2507.06795)
Keywords: language model, llm
Abstract: The emergence of open-source large language models (LLMs) has expanded opportunities for enterprise applications; however, many organizations still lack the infrastructure to deploy and maintain large-scale models. As a result, small LLMs (sLLMs) have become a practical alternative, despite their inherent performance limitations. While Domain Adaptive Continual Pretraining (DACP) has been previously explored as a method for domain adaptation, its utility in commercial applications remains under-examined. In this study, we validate the effectiveness of applying a DACP-based recipe across diverse foundation models and service domains. Through extensive experiments and real-world evaluations, we demonstrate that DACP-applied sLLMs achieve substantial gains in target domain performance while preserving general capabilities, offering a cost-efficient and scalable solution for enterprise-level deployment.

Title: Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams

Authors: Matthew Anderson Hendricks, Alice Cicirello
Subjects: cs.CL, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2507.06803
Pdf URL: https://arxiv.org/pdf/2507.06803
Copy Paste: [[2507.06803]] Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams(https://arxiv.org/abs/2507.06803)
Keywords: language model, llm
Abstract: This paper contributes to speeding up the design and deployment of engineering dynamical systems by proposing a strategy for exploiting domain and expert knowledge for the automated generation of dynamical system computational models, starting from a corpus of documents relevant to the dynamical system of interest and an input document describing the specific system. This strategy is implemented in five steps and, crucially, it uses System Modeling Language (SysML) diagrams to extract accurate information about the dependencies, attributes, and operations of components. Natural Language Processing (NLP) strategies and Large Language Models (LLMs) are employed in specific tasks to improve intermediate outputs of the automated SysML diagram generation, such as: list of key nouns; list of extracted relationships; list of key phrases and key relationships; block attribute values; block relationships; and BDD diagram generation. The applicability of automated SysML diagram generation is illustrated with different case studies. The computational models of complex dynamical systems from SysML diagrams are then obtained via code generation and computational model generation steps. In the code generation step, NLP strategies are used for summarization, while LLMs are used for validation only. The proposed approach is not limited to a specific system, domain, or computational software. The applicability of the proposed approach is shown via an end-to-end example from text to model of a simple pendulum, showing improved performance compared to results yielded by LLMs only.

Title: Adaptive Termination for Multi-round Parallel Reasoning: An Universal Semantic Entropy-Guided Framework

Authors: Zenan Xu, Zexuan Qiu, Guanhua Huang, Kun Li, Siheng Li, Chenchen Zhang, Kejiao Li, Qi Yi, Yuhao Jiang, Bo Zhou, Fengzong Lian, Zhanhui Kang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06829
Pdf URL: https://arxiv.org/pdf/2507.06829
Copy Paste: [[2507.06829]] Adaptive Termination for Multi-round Parallel Reasoning: An Universal Semantic Entropy-Guided Framework(https://arxiv.org/abs/2507.06829)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) have accelerated progress toward artificial general intelligence, with inference-time scaling emerging as a key technique. Contemporary approaches leverage either sequential reasoning (iteratively extending chains of thought) or parallel reasoning (generating multiple solutions simultaneously) to scale inference. However, both paradigms face fundamental limitations: sequential scaling typically relies on arbitrary token budgets for termination, leading to inefficiency or premature cutoff; while parallel scaling often lacks coordination among parallel branches and requires intrusive fine-tuning to perform effectively. In light of these challenges, we aim to design a flexible test-time collaborative inference framework that exploits the complementary strengths of both sequential and parallel reasoning paradigms. Towards this goal, the core challenge lies in developing an efficient and accurate intrinsic quality metric to assess model responses during collaborative inference, enabling dynamic control and early termination of the reasoning trace. To address this challenge, we introduce semantic entropy (SE), which quantifies the semantic diversity of parallel model responses and serves as a robust indicator of reasoning quality due to its strong negative correlation with accuracy...
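Editor's note: as a rough illustration of the semantic entropy (SE) signal described above, the sketch below groups parallel responses into meaning clusters and computes the Shannon entropy of the cluster distribution. It is a minimal sketch under stated assumptions, not the authors' implementation: the `canonicalize` function (string normalization standing in for a real semantic-equivalence check), the `should_stop` helper, and the `threshold` value are all hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def semantic_entropy(
    responses: Sequence[str],
    canonicalize: Callable[[str], str] = lambda s: s.strip().lower(),
) -> float:
    """Shannon entropy (in nats) over clusters of equivalent responses.

    `canonicalize` is a hypothetical stand-in for a semantic-equivalence
    check; responses mapping to the same key share a cluster.
    """
    if not responses:
        raise ValueError("need at least one response")
    clusters = Counter(canonicalize(r) for r in responses)
    total = sum(clusters.values())
    probs = [n / total for n in clusters.values()]
    return sum(-p * math.log(p) for p in probs)


def should_stop(responses: Sequence[str], threshold: float = 0.5) -> bool:
    """Hypothetical early-termination rule: stop once branches agree enough."""
    return semantic_entropy(responses) <= threshold
```

When all parallel branches give the same answer the entropy is zero, and the hypothetical rule stops; when every branch disagrees the entropy reaches its maximum, log of the number of branches, and reasoning would continue for another round.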

Title: Shifting from Ranking to Set Selection for Retrieval Augmented Generation

Authors: Dahyun Lee, Yongrae Jo, Haeju Park, Moontae Lee
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2507.06838
Pdf URL: https://arxiv.org/pdf/2507.06838
Copy Paste: [[2507.06838]] Shifting from Ranking to Set Selection for Retrieval Augmented Generation(https://arxiv.org/abs/2507.06838)
Keywords: llm, retrieval augmented generation, retrieval-augmented generation, chain-of-thought
Abstract: Retrieval in Retrieval-Augmented Generation (RAG) must ensure that retrieved passages are not only individually relevant but also collectively form a comprehensive set. Existing approaches primarily rerank top-k passages based on their individual relevance, often failing to meet the information needs of complex queries in multi-hop question answering. In this work, we propose a set-wise passage selection approach and introduce SETR, which explicitly identifies the information requirements of a query through Chain-of-Thought reasoning and selects an optimal set of passages that collectively satisfy those requirements. Experiments on multi-hop RAG benchmarks show that SETR outperforms both proprietary LLM-based rerankers and open-source baselines in terms of answer correctness and retrieval quality, providing an effective and efficient alternative to traditional rerankers in RAG systems. The code is available at this https URL

Title: Developing and Maintaining an Open-Source Repository of AI Evaluations: Challenges and Insights

Authors: Alexandra Abbas, Celia Waggoner, Justin Olive
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.06893
Pdf URL: https://arxiv.org/pdf/2507.06893
Copy Paste: [[2507.06893]] Developing and Maintaining an Open-Source Repository of AI Evaluations: Challenges and Insights(https://arxiv.org/abs/2507.06893)
Keywords: language model
Abstract: AI evaluations have become critical tools for assessing large language model capabilities and safety. This paper presents practical insights from eight months of maintaining inspect_evals, an open-source repository of 70+ community-contributed AI evaluations. We identify key challenges in implementing and maintaining AI evaluations and develop solutions including: (1) a structured cohort management framework for scaling community contributions, (2) statistical methodologies for optimal resampling and cross-model comparison with uncertainty quantification, and (3) systematic quality control processes for reproducibility. Our analysis reveals that AI evaluation requires specialized infrastructure, statistical rigor, and community coordination beyond traditional software development practices.

Title: SCoRE: Streamlined Corpus-based Relation Extraction using Multi-Label Contrastive Learning and Bayesian kNN

Authors: Luca Mariotti, Veronica Guidetti, Federica Mandreoli
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2507.06895
Pdf URL: https://arxiv.org/pdf/2507.06895
Copy Paste: [[2507.06895]] SCoRE: Streamlined Corpus-based Relation Extraction using Multi-Label Contrastive Learning and Bayesian kNN(https://arxiv.org/abs/2507.06895)
Keywords: language model
Abstract: The growing demand for efficient knowledge graph (KG) enrichment leveraging external corpora has intensified interest in relation extraction (RE), particularly under low-supervision settings. To address the need for adaptable and noise-resilient RE solutions that integrate seamlessly with pre-trained large language models (PLMs), we introduce SCoRE, a modular and cost-effective sentence-level RE system. SCoRE enables easy PLM switching, requires no finetuning, and adapts smoothly to diverse corpora and KGs. By combining supervised contrastive learning with a Bayesian k-Nearest Neighbors (kNN) classifier for multi-label classification, it delivers robust performance despite the noisy annotations of distantly supervised corpora. To improve RE evaluation, we propose two novel metrics: Correlation Structure Distance (CSD), measuring the alignment between learned relational patterns and KG structures, and Precision at R (P@R), assessing utility as a recommender system. We also release Wiki20d, a benchmark dataset replicating real-world RE conditions where only KG-derived annotations are available. Experiments on five benchmarks show that SCoRE matches or surpasses state-of-the-art methods while significantly reducing energy consumption. Further analyses reveal that increasing model complexity, as seen in prior work, degrades performance, highlighting the advantages of SCoRE's minimal design. Combining efficiency, modularity, and scalability, SCoRE stands as an optimal choice for real-world RE applications.

Title: VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation

Authors: Ziang Ye, Yang Zhang, Wentao Shi, Xiaoyu You, Fuli Feng, Tat-Seng Chua
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.06899
Pdf URL: https://arxiv.org/pdf/2507.06899
Copy Paste: [[2507.06899]] VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation(https://arxiv.org/abs/2507.06899)
Keywords: language model, agent
Abstract: Graphical User Interface (GUI) agents powered by Large Vision-Language Models (LVLMs) have emerged as a revolutionary approach to automating human-machine interactions, capable of autonomously operating personal devices (e.g., mobile phones) or applications within the device to perform complex real-world tasks in a human-like manner. However, their close integration with personal devices raises significant security concerns, with many threats, including backdoor attacks, remaining largely unexplored. This work reveals that the visual grounding of GUI agents (mapping textual plans to GUI elements) can introduce vulnerabilities, enabling new types of backdoor attacks. With backdoor attacks targeting visual grounding, the agent's behavior can be compromised even when given correct task-solving plans. To validate this vulnerability, we propose VisualTrap, a method that can hijack the grounding by misleading the agent into mapping textual plans to trigger locations instead of the intended targets. VisualTrap uses the common method of injecting poisoned data for attacks, and does so during the pre-training of visual grounding to ensure practical feasibility of attacking. Empirical results show that VisualTrap can effectively hijack visual grounding with as little as 5% poisoned data and highly stealthy visual triggers (invisible to the human eye); and the attack can be generalized to downstream tasks, even after clean fine-tuning. Moreover, the injected trigger can remain effective across different GUI environments, e.g., being trained on mobile/web and generalizing to desktop environments. These findings underscore the urgent need for further research on backdoor attack risks in GUI agents.

Title: MIND: A Multi-agent Framework for Zero-shot Harmful Meme Detection

Authors: Ziyan Liu, Chunxiao Fan, Haoran Lou, Yuexin Wu, Kaiwei Deng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.06908
Pdf URL: https://arxiv.org/pdf/2507.06908
Copy Paste: [[2507.06908]] MIND: A Multi-agent Framework for Zero-shot Harmful Meme Detection(https://arxiv.org/abs/2507.06908)
Keywords: agent
Abstract: The rapid expansion of memes on social media has highlighted the urgent need for effective approaches to detect harmful content. However, traditional data-driven approaches struggle to detect new memes due to their evolving nature and the lack of up-to-date annotated data. To address this issue, we propose MIND, a multi-agent framework for zero-shot harmful meme detection that does not rely on annotated data. MIND implements three key strategies: 1) We retrieve similar memes from an unannotated reference set to provide contextual information. 2) We propose a bi-directional insight derivation mechanism to extract a comprehensive understanding of similar memes. 3) We then employ a multi-agent debate mechanism to ensure robust decision-making through reasoned arbitration. Extensive experiments on three meme datasets demonstrate that our proposed framework not only outperforms existing zero-shot approaches but also shows strong generalization across different model architectures and parameter scales, providing a scalable solution for harmful meme detection. The code is available at this https URL.

Title: MultiJustice: A Chinese Dataset for Multi-Party, Multi-Charge Legal Prediction

Authors: Xiao Wang, Jiahuan Pei, Diancheng Shui, Zhiguang Han, Xin Sun, Dawei Zhu, Xiaoyu Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.06909
Pdf URL: https://arxiv.org/pdf/2507.06909
Copy Paste: [[2507.06909]] MultiJustice: A Chinese Dataset for Multi-Party, Multi-Charge Legal Prediction(https://arxiv.org/abs/2507.06909)
Keywords: language model, llm
Abstract: Legal judgment prediction (LJP) offers a compelling method to aid legal practitioners and researchers. However, one research question remains relatively under-explored: should multiple defendants and charges be treated separately in LJP? To address this, we introduce a new dataset, multi-person multi-charge prediction (MPMCP), and seek the answer by evaluating the performance of several prevailing legal large language models (LLMs) on four practical legal judgment scenarios: (S1) single defendant with a single charge, (S2) single defendant with multiple charges, (S3) multiple defendants with a single charge, and (S4) multiple defendants with multiple charges. We evaluate the dataset across two LJP tasks, i.e., charge prediction and penalty term prediction. We have conducted extensive experiments and found that the scenario involving multiple defendants and multiple charges (S4) poses the greatest challenges, followed by S2, S3, and S1. The impact varies significantly depending on the model. For example, in S4 compared to S1, InternLM2 achieves approximately 4.5% lower F1-score and 2.8% higher LogD, while Lawformer demonstrates around 19.7% lower F1-score and 19.0% higher LogD. Our dataset and code are available at this https URL.

Title: Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues

Authors: Fareya Ikram, Alexander Scarlatos, Andrew Lan
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2507.06910
Pdf URL: https://arxiv.org/pdf/2507.06910
Copy Paste: [[2507.06910]] Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues(https://arxiv.org/abs/2507.06910)
Keywords: language model, gpt, llm, agent
Abstract: Tutoring dialogues have gained significant attention in recent years, given the prominence of online learning and the emerging tutoring abilities of artificial intelligence (AI) agents powered by large language models (LLMs). Recent studies have shown that the strategies used by tutors can have significant effects on student outcomes, necessitating methods to predict how tutors will behave and how their actions impact students. However, few works have studied predicting tutor strategy in dialogues. Therefore, in this work we investigate the ability of modern LLMs, particularly Llama 3 and GPT-4o, to predict both future tutor moves and student outcomes in dialogues, using two math tutoring dialogue datasets. We find that even state-of-the-art LLMs struggle to predict future tutor strategy while tutor strategy is highly indicative of student outcomes, outlining a need for more powerful methods to approach this task.

Title: Rethinking Verification for LLM Code Generation: From Generation to Testing

Authors: Zihan Ma, Taolin Zhang, Maosong Cao, Wenwei Zhang, Minnan Luo, Songyang Zhang, Kai Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06920
Pdf URL: https://arxiv.org/pdf/2507.06920
Copy Paste: [[2507.06920]] Rethinking Verification for LLM Code Generation: From Generation to Testing(https://arxiv.org/abs/2507.06920)
Keywords: language model, llm
Abstract: Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, resulting in subtle faults going undetected. This not only artificially inflates measured performance but also compromises accurate reward estimation in reinforcement learning frameworks utilizing verifiable rewards (RLVR). To address these critical shortcomings, we systematically investigate the test-case generation (TCG) task by proposing multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Furthermore, we introduce a human-LLM collaborative method (SAGA), leveraging human programming expertise with LLM reasoning capability, aimed at significantly enhancing both the coverage and the quality of generated test cases. In addition, we develop TCGBench to facilitate the study of the TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc) of the code generation evaluation benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6. These results demonstrate the effectiveness of our proposed method. We hope this work contributes to building a scalable foundation for reliable LLM code evaluation, further advancing RLVR in code generation, and paving the way for automated adversarial test synthesis and adaptive benchmark integration.
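The abstract reports a detection rate and a verifier accuracy without defining either. The sketch below shows one plausible reading, stated here purely as an assumption: detection rate is the share of known-faulty solutions that fail at least one generated test, while verifier accuracy is the share of problems whose test suite rejects every known-faulty solution. All names and data are hypothetical.

```python
from typing import Dict, List

# For each problem, one flag per known-faulty solution:
# True means the generated test suite rejected (detected) that solution.
Rejections = Dict[str, List[bool]]


def detection_rate(rejected: Rejections) -> float:
    """Fraction of faulty solutions, pooled over problems, that were detected."""
    flags = [flag for per_problem in rejected.values() for flag in per_problem]
    return sum(flags) / len(flags) if flags else 0.0


def verifier_accuracy(rejected: Rejections) -> float:
    """Fraction of problems where every faulty solution was detected.

    Note: a problem with no faulty solutions counts as verified, since all([]) is True.
    """
    if not rejected:
        return 0.0
    return sum(all(per_problem) for per_problem in rejected.values()) / len(rejected)


# Hypothetical results for three problems.
results = {"p1": [True, True], "p2": [True, False], "p3": [True]}
print(detection_rate(results), verifier_accuracy(results))
```

Under these assumed definitions the per-problem metric is the stricter one, which would be consistent with a high detection rate sitting alongside a much lower verifier accuracy.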

Title: Investigating the Robustness of Retrieval-Augmented Generation at the Query Level

Authors: Sezen Perçin, Xin Su, Qutub Sha Syed, Phillip Howard, Aleksei Kuvshinov, Leo Schwinn, Kay-Ulrich Scholl
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.06956
Pdf URL: https://arxiv.org/pdf/2507.06956
Copy Paste: [[2507.06956]] Investigating the Robustness of Retrieval-Augmented Generation at the Query Level(https://arxiv.org/abs/2507.06956)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large language models (LLMs) are very costly and inefficient to update with new information. To address this limitation, retrieval-augmented generation (RAG) has been proposed as a solution that dynamically incorporates external knowledge during inference, improving factual consistency and reducing hallucinations. Despite their promise, RAG systems face practical challenges, most notably a strong dependence on the quality of the input query for accurate retrieval. In this paper, we investigate the sensitivity of different components in the RAG pipeline to various types of query perturbations. Our analysis reveals that the performance of commonly used retrievers can degrade significantly even under minor query variations. We study each module in isolation as well as their combined effect in an end-to-end question answering setting, using both general-domain and domain-specific datasets. Additionally, we propose an evaluation framework to systematically assess the query-level robustness of RAG pipelines and offer actionable recommendations for practitioners based on the results of more than 1092 experiments we performed.
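To make the idea of query-level robustness concrete, here is a minimal sketch that perturbs a query with a single adjacent-character swap and compares the retrieved top-k sets before and after. The toy word-overlap retriever, the perturbation type, and the overlap score are all illustrative assumptions, not the retrievers, perturbations, or framework evaluated in the paper.

```python
import random
from typing import List


def swap_typo(query: str, rng: random.Random) -> str:
    """Return the query with two adjacent characters swapped at a random position."""
    if len(query) < 2:
        return query
    i = rng.randrange(len(query) - 1)
    chars = list(query)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def retrieve(query: str, docs: List[str], k: int) -> List[int]:
    """Toy retriever: rank documents by shared lowercase words, ties broken by index."""
    q_words = set(query.lower().split())
    order = sorted(
        range(len(docs)),
        key=lambda i: (-len(q_words & set(docs[i].lower().split())), i),
    )
    return order[:k]


def overlap_at_k(original: List[int], perturbed: List[int]) -> float:
    """Share of the original top-k documents that survive the perturbation."""
    if not original:
        return 1.0
    return len(set(original) & set(perturbed)) / len(original)


docs = [
    "retrieval augmented generation reduces hallucination",
    "large language models are costly to update",
    "dense retrieval for question answering",
]
query = "retrieval augmented generation"
noisy = swap_typo(query, random.Random(0))
print(noisy, overlap_at_k(retrieve(query, docs, 2), retrieve(noisy, docs, 2)))
```

A score of 1.0 means the typo left the top-k set unchanged; lower values indicate that the retriever is sensitive to that perturbation.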

Title: FlexOlmo: Open Language Models for Flexible Data Use

Authors: Weijia Shi, Akshita Bhagia, Kevin Farhat, Niklas Muennighoff, Pete Walsh, Jacob Morrison, Dustin Schwenk, Shayne Longpre, Jake Poznanski, Allyson Ettinger, Daogao Liu, Margaret Li, Dirk Groeneveld, Mike Lewis, Wen-tau Yih, Luca Soldaini, Kyle Lo, Noah A. Smith, Luke Zettlemoyer, Pang Wei Koh, Hannaneh Hajishirzi, Ali Farhadi, Sewon Min
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.07024
Pdf URL: https://arxiv.org/pdf/2507.07024
Copy Paste: [[2507.07024]] FlexOlmo: Open Language Models for Flexible Data Use(https://arxiv.org/abs/2507.07024)
Keywords: language model
Abstract: We introduce FlexOlmo, a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where these parameters along with their associated data can be flexibly included or excluded from model inferences with no further training. FlexOlmo employs a mixture-of-experts (MoE) architecture where each expert is trained independently on closed datasets and later integrated through a new domain-informed routing without any joint training. FlexOlmo is trained on FlexMix, a corpus we curate comprising publicly available datasets alongside seven domain-specific sets, representing realistic approximations of closed sets. We evaluate models with up to 37 billion parameters (20 billion active) on 31 diverse downstream tasks. We show that a general expert trained on public data can be effectively combined with independently trained experts from other data owners, leading to an average 41% relative improvement while allowing users to opt out of certain data based on data licensing or permission requirements. Our approach also outperforms prior model merging methods by 10.1% on average and surpasses the standard MoE trained without data restrictions using the same training FLOPs. Altogether, this research presents a solution for both data owners and researchers in regulated industries with sensitive or protected data. FlexOlmo enables benefiting from closed data while respecting data owners' preferences by keeping their data local and supporting fine-grained control of data access during inference.
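The data-flexible inference described above can be pictured as routing over only the experts whose data owners have not opted out. The sketch below is a generic masked-softmax mixture written under that assumption; it is not FlexOlmo's domain-informed router, and the expert names, logits, and outputs are hypothetical.

```python
import math
from typing import Dict, List, Set


def route(logits: Dict[str, float], excluded: Set[str]) -> Dict[str, float]:
    """Softmax over router logits, restricted to experts that are not excluded."""
    active = {name: z for name, z in logits.items() if name not in excluded}
    if not active:
        raise ValueError("at least one expert must remain active")
    peak = max(active.values())  # subtract the max for numerical stability
    exps = {name: math.exp(z - peak) for name, z in active.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def mix(outputs: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Weighted sum of expert output vectors, using only the experts that have weights."""
    dim = len(next(iter(outputs.values())))
    return [sum(weights[name] * outputs[name][d] for name in weights) for d in range(dim)]


# Hypothetical experts: one trained on public data, two on closed domain data.
router_logits = {"public": 1.0, "news": 1.0, "code": 3.0}
expert_outputs = {"public": [1.0, 0.0], "news": [0.0, 1.0], "code": [5.0, 5.0]}

# The "code" data owner opts out, so its expert is dropped with no retraining.
weights = route(router_logits, excluded={"code"})
print(weights, mix(expert_outputs, weights))
```

Because exclusion happens only at routing time, the remaining experts' parameters are untouched, which mirrors the abstract's point that data can be included or excluded with no further training.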

Title: UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations

Authors: Fengran Mo, Yifan Gao, Chuan Meng, Xin Liu, Zhuofeng Wu, Kelong Mao, Zhengyang Wang, Pei Chen, Zheng Li, Xian Li, Bing Yin, Meng Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.07030
Pdf URL: https://arxiv.org/pdf/2507.07030
Copy Paste: [[2507.07030]] UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations(https://arxiv.org/abs/2507.07030)
Keywords: language model
Abstract: The rapid advancement of conversational search systems revolutionizes how information is accessed by enabling multi-turn interaction between the user and the system. Existing conversational search systems are usually built with two different models. This separation prevents the system from leveraging the intrinsic knowledge of both models simultaneously, so there is no guarantee that retrieval effectively benefits generation. Existing studies on developing unified models cannot fully address the aspects of understanding conversational context, managing retrieval independently, and generating responses. In this paper, we explore how to unify dense retrieval and response generation for large language models in conversation. We conduct joint fine-tuning with different objectives and design two mechanisms to reduce the inconsistency risks while mitigating data discrepancy. The evaluations on five conversational search datasets demonstrate that our unified model can mutually improve both tasks and outperform the existing baselines.

Title: Discrete Diffusion Models for Language Generation

Authors: Ashen Weligalle
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.07050
Pdf URL: https://arxiv.org/pdf/2507.07050
Copy Paste: [[2507.07050]] Discrete Diffusion Models for Language Generation(https://arxiv.org/abs/2507.07050)
Keywords: language model
Abstract: Diffusion models have emerged as a powerful class of generative models, achieving state-of-the-art results in continuous data domains such as image and video generation. Their core mechanism involves a forward diffusion process that gradually transforms structured data into a Gaussian-like distribution, followed by a learned reverse process to reconstruct the data. While successful in continuous modalities, applying this framework to discrete data, particularly natural language, remains challenging due to token dependency complexities and the lack of a defined generation this http URL thesis investigates the feasibility and performance of discrete diffusion models for natural language generation. Specifically, we evaluate the Discrete Denoising Diffusion Probabilistic Model (D3PM) and compare it with traditional autoregressive (AR) language models. To assess generative performance, we use Bits Per Token (BPT), Negative Log-Likelihood (NLL), Perplexity (PPL), and Batch Processing Speed. Results show the best-performing D3PM model achieves a BPT of 5.72, with a mean of 8.05. The AR model outperforms in compression with a lower mean BPT of 4.59, but D3PM achieves higher processing speed, reaching up to 3.97 batches per sec., indicating potential for parallel this http URL evaluations were conducted under consistent conditions (generating 100,000 tokens per model with a fixed batch size of four) for fair comparison. This research presents a detailed analysis of diffusion-based vs. autoregressive models, highlighting trade-offs in generative quality and efficiency. Findings emphasize both the promise and limitations of diffusion models for discrete data, supporting future work in non-autoregressive language generation.
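The three quality metrics listed above are closely related, which the short sketch below tries to make explicit. It assumes the usual conventions: NLL is the mean negative log-likelihood per token in nats, BPT is the same quantity in bits, and perplexity is the per-token exponential of the NLL. Whether the thesis uses exactly these conventions is not stated in the abstract, so the derived values are illustrative rather than reported results.

```python
import math


def bpt_to_nll(bpt: float) -> float:
    """Convert bits per token to mean negative log-likelihood in nats per token."""
    return bpt * math.log(2)


def nll_to_ppl(nll: float) -> float:
    """Per-token perplexity is the exponential of the mean NLL in nats."""
    return math.exp(nll)


def bpt_to_ppl(bpt: float) -> float:
    """Equivalent shortcut: perplexity equals 2 raised to the bits per token."""
    return 2.0 ** bpt


# BPT figures quoted in the abstract; the conversions rest on the assumptions above.
for label, bpt in [("D3PM best", 5.72), ("D3PM mean", 8.05), ("AR mean", 4.59)]:
    print(f"{label}: BPT={bpt:.2f}  NLL={bpt_to_nll(bpt):.3f} nats  PPL={bpt_to_ppl(bpt):.1f}")
```

Under these conventions a gap of one bit per token corresponds to a factor of two in perplexity, so the differences between the quoted BPT values are larger than they may first appear.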