2025-05-21

Title: Optimal Control for Transformer Architectures: Enhancing Generalization, Robustness and Efficiency

Authors: Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Markos A. Katsoulakis
Subjects: cs.LG, cs.AI, math.OC
Abstract URL: https://arxiv.org/abs/2505.13499
Pdf URL: https://arxiv.org/pdf/2505.13499
Copy Paste: [[2505.13499]] Optimal Control for Transformer Architectures: Enhancing Generalization, Robustness and Efficiency(https://arxiv.org/abs/2505.13499)
Keywords: generation
Abstract: We study Transformers through the perspective of optimal control theory, using tools from continuous-time formulations to derive actionable insights into training and architecture design. This framework improves the performance of existing Transformer models while providing desirable theoretical guarantees, including generalization and robustness. Our framework is designed to be plug-and-play, enabling seamless integration with established Transformer models and requiring only slight changes to the implementation. We conduct seven extensive experiments on tasks motivated by text generation, sentiment analysis, image classification, and point cloud classification. Experimental results show that the framework improves the test performance of the baselines, while being more parameter-efficient. On character-level text generation with nanoGPT, our framework achieves a 46% reduction in final test loss while using 42% fewer parameters. On GPT-2, our framework achieves a 5.6% reduction in final test loss, demonstrating scalability to larger models. To the best of our knowledge, this is the first work that applies optimal control theory to both the training and architecture of Transformers. It offers a new foundation for systematic, theory-driven improvements and moves beyond costly trial-and-error approaches.

Title: OMGPT: A Sequence Modeling Framework for Data-driven Operational Decision Making

Authors: Hanzhao Wang, Guanting Chen, Kalyan Talluri, Xiaocheng Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13580
Pdf URL: https://arxiv.org/pdf/2505.13580
Copy Paste: [[2505.13580]] OMGPT: A Sequence Modeling Framework for Data-driven Operational Decision Making(https://arxiv.org/abs/2505.13580)
Keywords: generative
Abstract: We build a Generative Pre-trained Transformer (GPT) model from scratch, which we call OMGPT, to solve sequential decision-making tasks arising in operations research and management science. We first propose a general sequence modeling framework that covers several operational decision-making tasks as special cases, such as dynamic pricing, inventory management, resource allocation, and queueing control. Under the framework, all these tasks can be viewed as a sequential prediction problem where the goal is to predict the optimal future action given all the historical information. Then we train a transformer-based neural network model (OMGPT) as a natural and powerful architecture for sequential modeling. This marks a paradigm shift compared to the existing methods for these OR/OM tasks in that (i) the OMGPT model can take advantage of the huge amount of pre-training data; (ii) when tackling these problems, OMGPT does not assume any analytical model structure and enables a direct and rich mapping from the history to the future actions. To the best of our knowledge, neither of these two aspects is achieved by any existing method. We establish a Bayesian perspective to theoretically understand the working mechanism of OMGPT on these tasks, which relates its performance to the pre-training task diversity and the divergence between the testing task and the pre-training tasks. Numerically, we observe a surprising performance of the proposed model across all the above tasks.

Title: Incentivizing Truthful Language Models via Peer Elicitation Games

Authors: Baiting Chen, Tong Zhu, Jiale Han, Lexin Li, Gang Li, Xiaowu Dai
Subjects: cs.LG, cs.GT
Abstract URL: https://arxiv.org/abs/2505.13636
Pdf URL: https://arxiv.org/pdf/2505.13636
Copy Paste: [[2505.13636]] Incentivizing Truthful Language Models via Peer Elicitation Games(https://arxiv.org/abs/2505.13636)
Keywords: generative
Abstract: Large Language Models (LLMs) have demonstrated strong generative capabilities but remain prone to inconsistencies and hallucinations. We introduce Peer Elicitation Games (PEG), a training-free, game-theoretic framework for aligning LLMs through a peer elicitation mechanism involving a generator and multiple discriminators instantiated from distinct base models. Discriminators interact in a peer evaluation setting, where rewards are computed using a determinant-based mutual information score that provably incentivizes truthful reporting without requiring ground-truth labels. We establish theoretical guarantees showing that each agent, via online learning, achieves sublinear regret, in the sense that its cumulative performance approaches that of the best fixed truthful strategy in hindsight. Moreover, we prove last-iterate convergence to a truthful Nash equilibrium, ensuring that the actual policies used by agents converge to stable and truthful behavior over time. Empirical evaluations across multiple benchmarks demonstrate significant improvements in factual accuracy. These results position PEG as a practical approach for eliciting truthful behavior from LLMs without supervision or fine-tuning.

Title: Improving Compositional Generation with Diffusion Models Using Lift Scores

Authors: Chenning Yu, Sicun Gao
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.13740
Pdf URL: https://arxiv.org/pdf/2505.13740
Copy Paste: [[2505.13740]] Improving Compositional Generation with Diffusion Models Using Lift Scores(https://arxiv.org/abs/2505.13740)
Keywords: generation
Abstract: We introduce a novel resampling criterion using lift scores, for improving compositional generation in diffusion models. By leveraging the lift scores, we evaluate whether generated samples align with each single condition and then compose the results to determine whether the composed prompt is satisfied. Our key insight is that lift scores can be efficiently approximated using only the original diffusion model, requiring no additional training or external modules. We develop an optimized variant that achieves relatively lower computational overhead during inference while maintaining effectiveness. Through extensive experiments, we demonstrate that lift scores significantly improved the condition alignment for compositional generation across 2D synthetic data, CLEVR position tasks, and text-to-image synthesis. Our code is available at this http URL.

Title: Synthetic Non-stationary Data Streams for Recognition of the Unknown

Authors: Joanna Komorniczak
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.13745
Pdf URL: https://arxiv.org/pdf/2505.13745
Copy Paste: [[2505.13745]] Synthetic Non-stationary Data Streams for Recognition of the Unknown(https://arxiv.org/abs/2505.13745)
Keywords: generation
Abstract: The problem of data non-stationarity is commonly addressed in data stream processing. In a dynamic environment, methods should continuously be ready to analyze time-varying data -- hence, they should enable incremental training and respond to concept drifts. An equally important variability typical for non-stationary data stream environments is the emergence of new, previously unknown classes. Often, methods focus on one of these two phenomena -- detection of concept drifts or detection of novel classes -- while both difficulties can be observed in data streams. Additionally, concerning previously unknown observations, the topic of open set of classes has become particularly important in recent years, where the goal of methods is to efficiently classify within known classes and recognize objects outside the model competence. This article presents a strategy for synthetic data stream generation in which both concept drifts and the emergence of new classes representing unknown objects occur. The presented research shows how unsupervised drift detectors address the task of detecting novelty and concept drifts and demonstrates how the generated data streams can be utilized in the open set recognition task.

Title: Scalable Autoregressive 3D Molecule Generation

Authors: Austin H. Cheng, Chong Sun, Alán Aspuru-Guzik
Subjects: cs.LG, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2505.13791
Pdf URL: https://arxiv.org/pdf/2505.13791
Copy Paste: [[2505.13791]] Scalable Autoregressive 3D Molecule Generation(https://arxiv.org/abs/2505.13791)
Keywords: generation, generative
Abstract: Generative models of 3D molecular structure play a rapidly growing role in the design and simulation of molecules. Diffusion models currently dominate the space of 3D molecule generation, while autoregressive models have trailed behind. In this work, we present Quetzal, a simple but scalable autoregressive model that builds molecules atom-by-atom in 3D. Treating each molecule as an ordered sequence of atoms, Quetzal combines a causal transformer that predicts the next atom's discrete type with a smaller Diffusion MLP that models the continuous next-position distribution. Compared to existing autoregressive baselines, Quetzal achieves substantial improvements in generation quality and is competitive with the performance of state-of-the-art diffusion models. In addition, by reducing the number of expensive forward passes through a dense transformer, Quetzal enables significantly faster generation speed, as well as exact divergence-based likelihood computation. Finally, without any architectural changes, Quetzal natively handles variable-size tasks like hydrogen decoration and scaffold completion. We hope that our work motivates a perspective on scalability and generality for generative modelling of 3D molecules.

Title: Context-Free Synthetic Data Mitigates Forgetting

Authors: Parikshit Bansal, Sujay Sanghavi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.13811
Pdf URL: https://arxiv.org/pdf/2505.13811
Copy Paste: [[2505.13811]] Context-Free Synthetic Data Mitigates Forgetting(https://arxiv.org/abs/2505.13811)
Keywords: generation
Abstract: Fine-tuning a language model often results in a degradation of its existing performance on other tasks, due to a shift in the model parameters; this phenomenon is often referred to as (catastrophic) forgetting. We are interested in mitigating this, in settings where we only have access to the model weights but no access to its training data/recipe. A natural approach is to penalize the KL divergence between the original model and the new one. Our main realization is that a simple process - which we term context-free generation - allows for an approximate unbiased estimation of this KL divergence. We show that augmenting a fine-tuning dataset with context-free generations mitigates forgetting, in two settings: (a) preserving the zero-shot performance of pretrained-only models, and (b) preserving the reasoning performance of thinking models. We show that contextual synthetic data, and even a portion of the pretraining data, are less effective. We also investigate the effect of choices like generation temperature, data ratios etc. We present our results for OLMo-1B for pretrained-only setting and R1-Distill-Llama-8B for the reasoning setting.
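The KL-penalty idea in the abstract above can be illustrated with a toy Monte Carlo estimate. This is a minimal sketch and not the authors' implementation: two small categorical distributions stand in for the original and fine-tuned models, samples drawn from the original distribution play the role of context-free generations, and every name (`p_original`, `q_finetuned`, `mc_kl`, `exact_kl`) is hypothetical.

```python
import math
import random


def exact_kl(p, q):
    """Exact KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mc_kl(p, q, n_samples, seed=0):
    """Monte Carlo estimate of KL(p || q) using samples drawn from p only.

    Each sample x ~ p contributes log p(x) - log q(x); the mean of these
    terms is an unbiased estimate of the divergence.
    """
    rng = random.Random(seed)
    outcomes = range(len(p))
    total = 0.0
    for _ in range(n_samples):
        x = rng.choices(outcomes, weights=p, k=1)[0]
        total += math.log(p[x] / q[x])
    return total / n_samples


# Toy stand-ins (assumptions): the "original" and "fine-tuned" models are
# reduced to distributions over three possible outputs.
p_original = [0.5, 0.3, 0.2]
q_finetuned = [0.4, 0.4, 0.2]

if __name__ == "__main__":
    print("exact KL:   ", exact_kl(p_original, q_finetuned))
    print("sampled KL: ", mc_kl(p_original, q_finetuned, n_samples=20000))
```

In this toy setting the sampled estimate approaches the exact value as the number of samples grows. With a real language model the sum over outputs is intractable, which is why an estimate built from the model's own generations is attractive.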

Title: SuperMapNet for Long-Range and High-Accuracy Vectorized HD Map Construction

Authors: Ruqin Zhou, San Jiang, Wanshou Jiang, Yongsheng Zhang, Chenguang Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.13856
Pdf URL: https://arxiv.org/pdf/2505.13856
Copy Paste: [[2505.13856]] SuperMapNet for Long-Range and High-Accuracy Vectorized HD Map Construction(https://arxiv.org/abs/2505.13856)
Keywords: generation
Abstract: Vectorized HD maps are essential for autonomous driving. Remarkable progress has been made in recent years, but major issues remain: (1) in the generation of BEV features, single-modality methods have limited perception capability, while multi-modal methods based on direct concatenation fail to capture synergies and disparities between different modalities, resulting in limited ranges with feature holes; (2) in the classification and localization of map elements, only point information is used, without considering element information or the interaction between point information and element information, leading to erroneous shapes and element entanglement with low accuracy. To address the above issues, we introduce SuperMapNet for long-range and high-accuracy vectorized HD map construction. It takes both camera images and LiDAR point clouds as input, and first tightly couples semantic information from camera images and geometric information from LiDAR point clouds through a cross-attention-based synergy enhancement module and a flow-based disparity alignment module for long-range BEV feature generation. Then, local features from point queries and global features from element queries are tightly coupled by three-level interactions for high-accuracy classification and localization: Point2Point interaction learns local geometric information between points of the same element and of each point, Element2Element interaction learns relation constraints between different elements and semantic information of each element, and Point2Element interaction learns complementary element information for its constituent points. Experiments on the nuScenes and Argoverse2 datasets demonstrate superior performance, surpassing SOTAs by over 14.9/8.8 mAP and 18.5/3.1 mAP under hard/easy settings, respectively. The code is publicly available.

Title: Exploring Causes of Representational Similarity in Machine Learning Models

Authors: Zeyu Michael Li, Hung Anh Vu, Damilola Awofisayo, Emily Wenger
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.13899
Pdf URL: https://arxiv.org/pdf/2505.13899
Copy Paste: [[2505.13899]] Exploring Causes of Representational Similarity in Machine Learning Models(https://arxiv.org/abs/2505.13899)
Keywords: generative
Abstract: Numerous works have noted significant similarities in how machine learning models represent the world, even across modalities. Although much effort has been devoted to uncovering properties and metrics on which these models align, surprisingly little work has explored causes of this similarity. To advance this line of inquiry, this work explores how two possible causal factors -- dataset overlap and task overlap -- influence downstream model similarity. The exploration of dataset overlap is motivated by the reality that large-scale generative AI models are often trained on overlapping datasets of scraped internet data, while the exploration of task overlap seeks to substantiate claims from a recent work, the Platonic Representation Hypothesis, that task similarity may drive model similarity. We evaluate the effects of both factors through a broad set of experiments. We find that both positively correlate with higher representational similarity and that combining them provides the strongest effect. Our code and dataset are published.

Title: Blind Restoration of High-Resolution Ultrasound Video

Authors: Chu Chen, Kangning Cui, Pasquale Cascarano, Wei Tang, Elena Loli Piccolomini, Raymond H. Chan
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.13915
Pdf URL: https://arxiv.org/pdf/2505.13915
Copy Paste: [[2505.13915]] Blind Restoration of High-Resolution Ultrasound Video(https://arxiv.org/abs/2505.13915)
Keywords: restoration, super-resolution
Abstract: Ultrasound imaging is widely applied in clinical practice, yet ultrasound videos often suffer from low signal-to-noise ratios (SNR) and limited resolutions, posing challenges for diagnosis and analysis. Variations in equipment and acquisition settings can further exacerbate differences in data distribution and noise levels, reducing the generalizability of pre-trained models. This work presents a self-supervised ultrasound video super-resolution algorithm called Deep Ultrasound Prior (DUP). DUP employs a video-adaptive optimization process of a neural network that enhances the resolution of given ultrasound videos without requiring paired training data while simultaneously removing noise. Quantitative and visual evaluations demonstrate that DUP outperforms existing super-resolution algorithms, leading to substantial improvements for downstream applications.

Title: LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts

Authors: Qifeng Cai, Hao Liang, Hejun Dong, Meiyi Qiang, Ruichuan An, Zhaoyang Han, Zhengzhou Zhu, Bin Cui, Wentao Zhang
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2505.13928
Pdf URL: https://arxiv.org/pdf/2505.13928
Copy Paste: [[2505.13928]] LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts(https://arxiv.org/abs/2505.13928)
Keywords: generation
Abstract: Long videos contain a vast amount of information, making video-text retrieval an essential and challenging task in multimodal learning. However, existing benchmarks suffer from limited video duration, low-quality captions, and coarse annotation granularity, which hinder the evaluation of advanced video-text retrieval methods. To address these limitations, we introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and over 40,804 fine-grained clips with high-quality captions. To overcome the issue of poor machine-generated annotations, we propose an efficient caption generation framework that integrates VLM automatic generation, caption quality scoring, and dynamic refinement. This pipeline improves annotation accuracy while maintaining scalability. Furthermore, we introduce a semantic fusion method to generate coherent full-video captions without losing important contextual information. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset, presenting new challenges for video understanding and retrieval. Extensive experiments on various advanced embedding models demonstrate that LoVR is a challenging benchmark, revealing the limitations of current approaches and providing valuable insights for future research. We release the code and dataset link at this https URL

Title: RLVR-World: Training World Models with Reinforcement Learning

Authors: Jialong Wu, Shaofeng Yin, Ningya Feng, Mingsheng Long
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.13934
Pdf URL: https://arxiv.org/pdf/2505.13934
Copy Paste: [[2505.13934]] RLVR-World: Training World Models with Reinforcement Learning(https://arxiv.org/abs/2505.13934)
Keywords: generative
Abstract: World models predict state transitions in response to actions and are increasingly developed across diverse modalities. However, standard training objectives such as maximum likelihood estimation (MLE) often misalign with task-specific goals of world models, i.e., transition prediction metrics like accuracy or perceptual quality. In this paper, we present RLVR-World, a unified framework that leverages reinforcement learning with verifiable rewards (RLVR) to directly optimize world models for such metrics. Despite formulating world modeling as autoregressive prediction of tokenized sequences, RLVR-World evaluates metrics of decoded predictions as verifiable rewards. We demonstrate substantial performance gains on both language- and video-based world models across domains, including text games, web navigation, and robot manipulation. Our work indicates that, beyond recent advances in reasoning language models, RLVR offers a promising post-training paradigm for enhancing the utility of generative models more broadly.

Title: CLEVER: A Curated Benchmark for Formally Verified Code Generation

Authors: Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzche, Greg Durrett, Yisong Yue, Swarat Chaudhuri
Subjects: cs.LG, cs.AI, cs.LO, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2505.13938
Pdf URL: https://arxiv.org/pdf/2505.13938
Copy Paste: [[2505.13938]] CLEVER: A Curated Benchmark for Formally Verified Code Generation(https://arxiv.org/abs/2505.13938)
Keywords: generation
Abstract: We introduce CLEVER, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, CLEVER avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use CLEVER to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on GitHub (this https URL) as well as HuggingFace (this https URL). All our evaluation code is also available online (this https URL).

Title: Every Pixel Tells a Story: End-to-End Urdu Newspaper OCR

Authors: Samee Arif, Sualeha Farid
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.13943
Pdf URL: https://arxiv.org/pdf/2505.13943
Copy Paste: [[2505.13943]] Every Pixel Tells a Story: End-to-End Urdu Newspaper OCR(https://arxiv.org/abs/2505.13943)
Keywords: super-resolution
Abstract: This paper introduces a comprehensive end-to-end pipeline for Optical Character Recognition (OCR) on Urdu newspapers. In our approach, we address the unique challenges of complex multi-column layouts, low-resolution archival scans, and diverse font styles. Our process decomposes the OCR task into four key modules: (1) article segmentation, (2) image super-resolution, (3) column segmentation, and (4) text recognition. For article segmentation, we fine-tune and evaluate YOLOv11x to identify and separate individual articles from cluttered layouts. Our model achieves a precision of 0.963 and mAP@50 of 0.975. For super-resolution, we fine-tune and benchmark the SwinIR model (reaching 32.71 dB PSNR) to enhance the quality of degraded newspaper scans. To do our column segmentation, we use YOLOv11x to separate columns in text to further enhance performance - this model reaches a precision of 0.970 and mAP@50 of 0.975. In the text recognition stage, we benchmark a range of LLMs from different families, including Gemini, GPT, Llama, and Claude. The lowest WER of 0.133 is achieved by Gemini-2.5-Pro.
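The text recognition stage above is scored by word error rate (WER). As a reminder of how that metric is usually defined, here is a minimal sketch: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration; the paper's exact evaluation script and text normalization are not given in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition a WER of 0.133 corresponds to roughly 13 word-level edits per 100 reference words. WER can exceed 1.0 when the hypothesis contains many inserted words.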

Title: UHD Image Dehazing via anDehazeFormer with Atmospheric-aware KV Cache

Authors: Pu Wang, Pengwen Dai, Chen Wu, Yeying Jin, Dianjie Lu, Guijuan Zhang, Youshan Zhang, Zhuoran Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.14010
Pdf URL: https://arxiv.org/pdf/2505.14010
Copy Paste: [[2505.14010]] UHD Image Dehazing via anDehazeFormer with Atmospheric-aware KV Cache(https://arxiv.org/abs/2505.14010)
Keywords: restoration
Abstract: In this paper, we propose an efficient visual transformer framework for ultra-high-definition (UHD) image dehazing that addresses the key challenges of slow training speed and high memory consumption in existing methods. Our approach introduces two key innovations: 1) an adaptive normalization mechanism, inspired by the nGPT architecture, that enables ultra-fast and stable training with a network with a restricted range of parameter expressions; and 2) an atmospheric scattering-aware KV caching mechanism that dynamically optimizes feature preservation based on the physical haze formation model. The proposed architecture improves the training convergence speed by 5× while reducing memory overhead, enabling real-time processing of 50 high-resolution images per second on an RTX4090 GPU. Experimental results show that our approach maintains state-of-the-art dehazing quality while significantly improving computational efficiency for 4K/8K image restoration tasks. Furthermore, we provide a new interpretability method for image dehazing with the help of an integrated gradient attribution map. Our code can be found here: this https URL.

Title: OmniStyle: Filtering High Quality Style Transfer Data at Scale

Authors: Ye Wang, Ruiqi Liu, Jiang Lin, Fei Liu, Zili Yi, Yilin Wang, Rui Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.14028
Pdf URL: https://arxiv.org/pdf/2505.14028
Copy Paste: [[2505.14028]] OmniStyle: Filtering High Quality Style Transfer Data at Scale(https://arxiv.org/abs/2505.14028)
Keywords: quality assessment
Abstract: In this paper, we introduce OmniStyle-1M, a large-scale paired style transfer dataset comprising over one million content-style-stylized image triplets across 1,000 diverse style categories, each enhanced with textual descriptions and instruction prompts. We show that OmniStyle-1M can not only enable efficient and scalable style transfer models through supervised training but also facilitate precise control over target stylization. In particular, to ensure the quality of the dataset, we introduce OmniFilter, a comprehensive style transfer quality assessment framework, which filters high-quality triplets based on content preservation, style consistency, and aesthetic appeal. Building upon this foundation, we propose OmniStyle, a framework based on the Diffusion Transformer (DiT) architecture designed for high-quality and efficient style transfer. This framework supports both instruction-guided and image-guided style transfer, generating high-resolution outputs with exceptional detail. Extensive qualitative and quantitative evaluations demonstrate OmniStyle's superior performance compared to existing approaches, highlighting its efficiency and versatility. OmniStyle-1M and its accompanying methodologies provide a significant contribution to advancing high-quality style transfer, offering a valuable resource for the research community.

Title: Adaptive Cyclic Diffusion for Inference Scaling

Authors: Gyubin Lee, Truong Nhat Nguyen Bao, Jaesik Yoon, Dongwoo Lee, Minsu Kim, Yoshua Bengio, Sungjin Ahn
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14036
Pdf URL: https://arxiv.org/pdf/2505.14036
Copy Paste: [[2505.14036]] Adaptive Cyclic Diffusion for Inference Scaling(https://arxiv.org/abs/2505.14036)
Keywords: generative
Abstract: Diffusion models have demonstrated strong generative capabilities across domains ranging from image synthesis to complex reasoning tasks. However, most inference-time scaling methods rely on fixed denoising schedules, limiting their ability to adaptively allocate computation based on instance difficulty or task-specific demands. We introduce the challenge of adaptive inference-time scaling (dynamically adjusting computational effort during inference) and propose Adaptive Bi-directional Cyclic Diffusion (ABCD), a flexible, search-based inference framework. ABCD refines outputs through bi-directional diffusion cycles while adaptively controlling exploration depth and termination. It comprises three components: Cyclic Diffusion Search, Automatic Exploration-Exploitation Balancing, and Adaptive Thinking Time. Experiments show that ABCD improves performance across diverse tasks while maintaining computational efficiency.
摘要：扩散模型已经证明了从图像综合到复杂的推理任务的域之间强大的生成能力。但是，大多数推理时间缩放方法都依赖于固定的denoising时间表，从而限制了基于实例难度或特定于任务的需求的计算能力。我们介绍了自适应推理时间缩放量表的挑战，可以在推理过程中调整计算工作，并提出自适应双向循环扩散（ABCD），这是一个灵活的，基于搜索的推理框架。 ABCD通过双向扩散周期来完善输出，同时自适应控制勘探深度和终止。它包括三个组成部分：循环扩散搜索，自动探索 - 探索平衡和自适应思维时间。实验表明，ABCD在维持计算效率的同时提高了各种任务的性能。

Title: MAS-KCL: Knowledge component graph structure learning with large language model-based agentic workflow

Authors: Yuan-Hao Jiang, Kezong Tang, Zi-Wei Chen, Yuang Wei, Tian-Yi Liu, Jiayi Wu
Subjects: cs.LG, cs.CY, cs.HC, cs.MA
Abstract URL: https://arxiv.org/abs/2505.14126
Pdf URL: https://arxiv.org/pdf/2505.14126
Copy Paste: [[2505.14126]] MAS-KCL: Knowledge component graph structure learning with large language model-based agentic workflow(https://arxiv.org/abs/2505.14126)
Keywords: generation
Abstract: Knowledge components (KCs) are the fundamental units of knowledge in the field of education. A KC graph illustrates the relationships and dependencies between KCs. An accurate KC graph can assist educators in identifying the root causes of learners' poor performance on specific KCs, thereby enabling targeted instructional interventions. To achieve this, we have developed a KC graph structure learning algorithm, named MAS-KCL, which employs a multi-agent system driven by large language models for adaptive modification and optimization of the KC graph. Additionally, a bidirectional feedback mechanism is integrated into the algorithm: AI agents use it to assess the value of edges within the KC graph and adjust the distribution of generation probabilities for different edges, thereby improving the efficiency of structure learning. We applied the proposed algorithm to 5 synthetic datasets and 4 real-world educational datasets, and experimental results validate its effectiveness in learning path recognition. By accurately identifying learners' learning paths, teachers are able to design more comprehensive learning plans, enabling learners to achieve their educational goals more effectively, thus promoting the sustainable development of education.
摘要：知识组成部分（KC）是教育领域知识的基本单位。 KC图说明了KC之间的关系和依赖关系。准确的KC图可以帮助教育工作者确定学习者在特定KC上表现不佳的根本原因，从而实现目标的教学干预措施。为了实现这一目标，我们开发了一种名为MAS-KCL的KC图结构学习算法，该算法采用了由大语言模型驱动的多机构系统，以自适应修改和优化KC图。此外，将双向反馈机制集成到该算法中，其中AI代理利用该机制评估KC图内边缘的值并调整不同边缘的发电概率的分布，从而加速了结构学习的效率。我们将所提出的算法应用于5个合成数据集和4个现实世界的教育数据集，实验结果验证了其在学习路径识别方面的有效性。通过准确地识别学习者的学习路径，教师能够设计更全面的学习计划，使学习者能够更有效地实现其教育目标，从而促进教育的可持续发展。

Title: Hunyuan-Game: Industrial-grade Intelligent Game Creation Model

Authors: Ruihuang Li, Caijin Zhou, Shoujian Zheng, Jianxiang Lu, Jiabin Huang, Comi Chen, Junshu Tang, Guangzheng Xu, Jiale Tao, Hongmei Wang, Donghao Li, Wenqing Yu, Senbo Wang, Zhimin Li, Yetshuan Shi, Haoyu Yang, Yukun Wang, Wenxun Dai, Jiaqi Li, Linqing Wang, Qixun Wang, Zhiyong Xu, Yingfang Zhang, Jiangfeng Xiong, Weijie Kong, Chao Zhang, Hongxin Zhang, Qiaoling Zheng, Weiting Guo, Xinchi Deng, Yixuan Li, Renjia Wei, Yulin Jian, Duojun Huang, Xuhua Ren, Sihuan Lin, Yifu Sun, Yuan Zhou, Joey Wang, Qin Lin, Jingmiao Yu, Jihong Zhang, Caesar Zhong, Di Wang, Yuhong Liu, Linus, Jie Jiang, Longhuang Wu, Shuai Shao, Qinglin Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.14135
Pdf URL: https://arxiv.org/pdf/2505.14135
Copy Paste: [[2505.14135]] Hunyuan-Game: Industrial-grade Intelligent Game Creation Model(https://arxiv.org/abs/2505.14135)
Keywords: super-resolution, generation, generative
Abstract: Intelligent game creation represents a transformative advancement in game development, utilizing generative artificial intelligence to dynamically generate and enhance game content. Despite notable progress in generative models, the comprehensive synthesis of high-quality game assets, including both images and videos, remains a challenging frontier. To create high-fidelity game content that simultaneously aligns with player preferences and significantly boosts designer efficiency, we present Hunyuan-Game, an innovative project designed to revolutionize intelligent game production. Hunyuan-Game encompasses two primary branches: image generation and video generation. The image generation component is built upon a vast dataset comprising billions of game images, leading to the development of a group of customized image generation models tailored for game scenarios: (1) General Text-to-Image Generation. (2) Game Visual Effects Generation, involving text-to-effect and reference image-based game visual effect generation. (3) Transparent Image Generation for characters, scenes, and game visual effects. (4) Game Character Generation based on sketches, black-and-white images, and white models. The video generation component is built upon a comprehensive dataset of millions of game and anime videos, leading to the development of five core algorithmic models, each targeting critical pain points in game development and having robust adaptation to diverse game video scenarios: (1) Image-to-Video Generation. (2) 360 A/T Pose Avatar Video Synthesis. (3) Dynamic Illustration Generation. (4) Generative Video Super-Resolution. (5) Interactive Game Video Generation. These image and video generation models not only exhibit high-level aesthetic expression but also deeply integrate domain-specific knowledge, establishing a systematic understanding of diverse game and anime art styles.
摘要：智能游戏创建代表了游戏开发中的变革性进步，利用生成人工智能来动态生成和增强游戏内容。尽管生成模型取得了显着进展，但包括图像和视频在内的高质量游戏资产的全面综合仍然是一个具有挑战性的领域。为了创建高保真的游戏内容，与玩家的喜好同时保持一致并显着提高了设计师的效率，我们展示了Hunyuan-Game，这是一个创新的项目，旨在彻底改变智能游戏的生产。 Hunyuan-Game包括两个主要分支：图像生成和视频生成。图像生成组件建立在一个庞大的数据集上，其中包括数十亿个游戏图像，从而开发了一组针对游戏场景的自定义图像生成模型：（1）一般文本到图像生成。（2）游戏视觉效果生成，涉及文本对效应和基于参考图像的游戏视觉效果的生成。（3）角色，场景和游戏视觉效果的透明图像生成。（4）基于草图，黑白图像和白色模型的游戏角色生成。视频生成组件建立在数百万游戏和动漫视频的综合数据集上，从而开发了五种核心算法模型，每个模型都针对游戏开发中的关键疼痛点，并且对多样化的游戏视频方案具有强烈的适应性：（1）图像到视频生成。（2）360 A/T构成头像视频综合。（3）动态插图生成。（4）生成视频超分辨率。（5）互动游戏视频生成。这些图像和视频生成模型不仅表现出高级美学的表达，而且还深入整合了特定于领域的知识，从而对各种游戏和动漫艺术风格建立了系统的理解。

Title: FlowQ: Energy-Guided Flow Policies for Offline Reinforcement Learning

Authors: Marvin Alles, Nutan Chen, Patrick van der Smagt, Botond Cseke
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2505.14139
Pdf URL: https://arxiv.org/pdf/2505.14139
Copy Paste: [[2505.14139]] FlowQ: Energy-Guided Flow Policies for Offline Reinforcement Learning(https://arxiv.org/abs/2505.14139)
Keywords: generation
Abstract: The use of guidance to steer sampling toward desired outcomes has been widely explored within diffusion models, especially in applications such as image and trajectory generation. However, incorporating guidance during training remains relatively underexplored. In this work, we introduce energy-guided flow matching, a novel approach that enhances the training of flow models and eliminates the need for guidance at inference time. We learn a conditional velocity field corresponding to the flow policy by approximating an energy-guided probability path as a Gaussian path. Learning guided trajectories is appealing for tasks where the target distribution is defined by a combination of data and an energy function, as in reinforcement learning. Diffusion-based policies have recently attracted attention for their expressive power and ability to capture multi-modal action distributions. Typically, these policies are optimized using weighted objectives or by back-propagating gradients through actions sampled by the policy. As an alternative, we propose FlowQ, an offline reinforcement learning algorithm based on energy-guided flow matching. Our method achieves competitive performance while the policy training time is constant in the number of flow sampling steps.
摘要：在扩散模型中广泛探讨了将指导转向预期结果的使用，尤其是在图像和轨迹生成等应用中。但是，在培训期间合并指导仍然相对不受欢迎。在这项工作中，我们引入了能源引导的流匹配，这是一种新型方法，可增强流动模型的训练并消除了推理时需要指导的需求。我们通过将能源引导的概率路径近似为高斯路径来学习与流策略相对应的条件速度场。学习引导的轨迹吸引了目标分布由数据和能量功能组合定义的任务，例如增强学习。基于扩散的政策最近引起了人们对其表达能力和捕获多模式作用分布的能力的关注。通常，这些策略是使用加权目标或通过策略所采样的动作来优化这些策略的。作为替代方案，我们提出了FlowQ，这是一种基于能源引导的流量匹配的离线增强学习算法。我们的方法可以实现竞争性能，而政策培训时间在流动步骤的数量中持续不变。

Title: ReactDiff: Latent Diffusion for Facial Reaction Generation

Authors: Jiaming Li, Sheng Wang, Xin Wang, Yitao Zhu, Honglin Xiong, Zixu Zhuang, Qian Wang
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.14151
Pdf URL: https://arxiv.org/pdf/2505.14151
Copy Paste: [[2505.14151]] ReactDiff: Latent Diffusion for Facial Reaction Generation(https://arxiv.org/abs/2505.14151)
Keywords: generation
Abstract: Given the audio-visual clip of the speaker, facial reaction generation aims to predict the listener's facial reactions. The challenge lies in capturing the relevance between video and audio while balancing appropriateness, realism, and diversity. While prior works have mostly focused on uni-modal inputs or simplified reaction mappings, recent approaches such as PerFRDiff have explored multi-modal inputs and the one-to-many nature of appropriate reaction mappings. In this work, we propose the Facial Reaction Diffusion (ReactDiff) framework that uniquely integrates a Multi-Modality Transformer with conditional diffusion in the latent space for enhanced reaction generation. Unlike existing methods, ReactDiff leverages intra- and inter-class attention for fine-grained multi-modal interaction, while the latent diffusion process between the encoder and decoder enables diverse yet contextually appropriate outputs. Experimental results demonstrate that ReactDiff significantly outperforms existing approaches, achieving a facial reaction correlation of 0.26 and diversity score of 0.094 while maintaining competitive realism. The code is open-sourced on GitHub (this https URL).
摘要：鉴于说话者的视听剪辑，面部反应产生旨在预测听众的面部反应。挑战在于捕获视频和音频之间的相关性，同时平衡适当性，现实主义和多样性。虽然先前的工作主要集中在单模式输入或简化的反应映射上，但诸如Perfrdiff之类的最新方法探索了多模式输入和适当反应映射的一对一本质。在这项工作中，我们提出了面部反应扩散（ReactDiff）框架，该框架独特地整合了多模式变压器与潜在空间中有条件扩散的多模式变压器，以增强反应的产生。与现有方法不同，ReactDiff利用阶层内和类间的注意力来获得细粒度的多模式相互作用，而编码器和解码器之间的潜在扩散过程可实现多种而适当的输出。实验结果表明，ReactDiff显着胜过现有方法，在维持竞争现实主义的同时，达到0.26的面部反应相关性和多样性得分为0.094。该代码在\ href {this https url} {github}的开源。

Title: Unify Graph Learning with Text: Unleashing LLM Potentials for Session Search

Authors: Songhao Wu, Quan Tu, Hong Liu, Jia Xu, Zhongyi Liu, Guannan Zhang, Ran Wang, Xiuying Chen, Rui Yan
Subjects: cs.CV, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2505.14156
Pdf URL: https://arxiv.org/pdf/2505.14156
Copy Paste: [[2505.14156]] Unify Graph Learning with Text: Unleashing LLM Potentials for Session Search(https://arxiv.org/abs/2505.14156)
Keywords: generation, generative
Abstract: Session search involves a series of interactive queries and actions to fulfill a user's complex information needs. Current strategies typically prioritize sequential modeling for deep semantic understanding, overlooking the graph structure in interactions. While some approaches focus on capturing structural information, they use a generalized representation for documents, neglecting word-level semantic modeling. In this paper, we propose Symbolic Graph Ranker (SGR), which aims to take advantage of both text-based and graph-based approaches by leveraging the power of recent Large Language Models (LLMs). Concretely, we first introduce a set of symbolic grammar rules to convert the session graph into text. This allows session history, interaction process, and task instruction to be integrated seamlessly as inputs for the LLM. Moreover, given the natural discrepancy between LLMs pre-trained on textual corpora and the symbolic language we produce using our graph-to-text grammar, our objective is to enhance LLMs' ability to capture graph structures within a textual format. To achieve this, we introduce a set of self-supervised symbolic learning tasks, including link prediction, node content generation, and generative contrastive learning, to enable LLMs to capture topological information from coarse-grained to fine-grained. Experimental results and comprehensive analysis on two benchmark datasets, AOL and Tiangong-ST, confirm the superiority of our approach. Our paradigm also offers a novel and effective methodology that bridges the gap between traditional search strategies and modern LLMs.
摘要：会话搜索涉及一系列交互式查询和操作，以满足用户的复杂信息需求。当前的策略通常优先考虑顺序建模，以进行深入的语义理解，从而忽略了交互中的图形结构。尽管某些方法着重于捕获结构信息，但他们使用广义表示文档，忽略了单词级别的语义建模。在本文中，我们提出了符号图排名（SGR），该符号图排名符（SGR）旨在通过利用最近的大语言模型（LLMS）的力量来利用基于文本的方法和基于图形的方法。具体而言，我们首先引入了一组符号语法规则，将会话图转换为文本。这允许将会话历史记录，交互过程和任务指令无缝作为LLM的输入。此外，鉴于在文本语料库中预先训练的LLM与使用图形至文本语法产生的符号语言之间的自然差异，我们的目标是增强LLMS在文本格式中捕获图形结构的能力。为了实现这一目标，我们介绍了一组自制的符号学习任务，包括链接预测，节点内容产生和生成性对比学习，以使LLMS能够捕获从粗粒到细粒度的拓扑信息。实验结果和对两个基准数据集AOL和Tiangong-St的全面分析证实了我们方法的优势。我们的范式还提供了一种新颖有效的方法，可以弥合传统搜索策略与现代LLM之间的差距。

Title: LMP: Leveraging Motion Prior in Zero-Shot Video Generation with Diffusion Transformer

Authors: Changgu Chen, Xiaoyan Yang, Junwei Shu, Changbo Wang, Yang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.14167
Pdf URL: https://arxiv.org/pdf/2505.14167
Copy Paste: [[2505.14167]] LMP: Leveraging Motion Prior in Zero-Shot Video Generation with Diffusion Transformer(https://arxiv.org/abs/2505.14167)
Keywords: generation, generative
Abstract: In recent years, large-scale pre-trained diffusion transformer models have made significant progress in video generation. While current DiT models can produce high-definition, high-frame-rate, and highly diverse videos, there is a lack of fine-grained control over the video content. Controlling the motion of subjects in videos using only prompts is challenging, especially when it comes to describing complex movements. Further, existing methods fail to control the motion in image-to-video generation, as the subject in the reference image often differs from the subject in the reference video in terms of initial position, size, and shape. To address this, we propose the Leveraging Motion Prior (LMP) framework for zero-shot video generation. Our framework harnesses the powerful generative capabilities of pre-trained diffusion transformers to enable motion in the generated videos to reference user-provided motion videos in both text-to-video and image-to-video generation. To this end, we first introduce a foreground-background disentangle module to distinguish between moving subjects and backgrounds in the reference video, preventing interference in the target video generation. A reweighted motion transfer module is designed to allow the target video to reference the motion from the reference video. To avoid interference from the subject in the reference video, we propose an appearance separation module to suppress the appearance of the reference subject in the target video. We annotate the DAVIS dataset with detailed prompts for our experiments and design evaluation metrics to validate the effectiveness of our method. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in generation quality, prompt-video consistency, and control capability. Our homepage is available at this https URL
摘要：近年来，大规模训练的扩散变压器模型在视频生成方面取得了重大进展。尽管当前的DIT模型可以产生高清，高框架速率和高度多样化的视频，但对视频内容缺乏细粒度的控制。仅使用提示控制视频中的主题运动是具有挑战性的，尤其是在描述复杂运动时。此外，现有方法无法控制图像到视频生成中的运动，因为参考图像中的主题通常与参考视频中的主题在初始位置，大小和形状方面有所不同。为了解决这个问题，我们提出了为零拍摄视频生成的利用运动先验（LMP）框架。我们的框架利用了预训练的扩散变压器的强大生成能力，以使生成的视频中的运动能够在文本到视频和图像到视频生成中参考用户提供的运动视频。为此，我们首先引入了一个前景 - 背景的删除模块，以区分参考视频中的移动主题和背景，从而防止了目标视频生成的干扰。重新加权的运动传输模块旨在允许目标视频从参考视频引用运动。为了避免参考视频中的主题干扰，我们提出了一个外观分离模块，以抑制目标视频中参考主题的外观。我们用详细提示为我们的实验和设计评估指标提供详细提示，以验证我们方法的有效性。广泛的实验表明，我们的方法在发电质量，及时视频一致性和控制能力方面实现了最先进的性能。我们的首页可在此HTTPS URL上找到

Title: $α$-GAN by Rényi Cross Entropy

Authors: Ni Ding, Miao Qiao, Jiaxing Xu, Yiping Ke, Xiaoyu Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14190
Pdf URL: https://arxiv.org/pdf/2505.14190
Copy Paste: [[2505.14190]] $α$-GAN by Rényi Cross Entropy(https://arxiv.org/abs/2505.14190)
Keywords: generative
Abstract: This paper proposes $\alpha$-GAN, a generative adversarial network using Rényi measures. The value function is formulated, via Rényi cross entropy, as an expected certainty measure incurred by the discriminator's soft decision as to whether the sample comes from the true population or the generator. The discriminator tries to maximize the Rényi certainty about the sample source, while the generator wants to reduce it by injecting fake samples. This forms a min-max problem with the solution parameterized by the Rényi order $\alpha$. This $\alpha$-GAN reduces to vanilla GAN at $\alpha = 1$, where the value function is exactly the binary cross entropy. The optimization of $\alpha$-GAN is over the probability (vector) space. It is shown that the gradient is exponentially enlarged when the Rényi order is in the range $\alpha \in (0,1)$. This makes convergence faster, which is verified by experimental results. A discussion shows that choosing $\alpha \in (0,1)$ may be able to solve some common problems, e.g., vanishing gradient. A subsequent observation reveals that this range has not been fully explored in the existing Rényi version GANs.
摘要：本文提出了$ \ alpha $ -gan，这是一种使用Rényi措施的生成对抗网络。 Rényi交叉熵的价值函数是由歧视者对样本来自何处（真实种群或发电机的位置）所产生的预期确定性度量提出的。鉴别器试图最大程度地提高对样本源的重新确定性，而发电机则希望通过注入假样品来减少它。这形成了最小的最大问题，该解决方案由rényi订单$ \ alpha $参数化的解决方案。此$ \ alpha $ -gan以$ \ alpha = 1 $减少了Vanilla gan，其中值函数正是二进制交叉熵。 $ \ alpha $ gan的优化是超过概率（向量）空间。结果表明，当rényi订单在（0,1）$ in（0,1）$中的范围内时，梯度将呈指数扩大。这使收敛速度更快，这通过实验结果验证。讨论表明，在（0,1）$中选择$ \ alpha \可能能够解决一些常见问题，例如消失的梯度。以下观察结果表明，在现有的Rényi版本Gans中尚未完全探索此范围。
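The claim that the value function becomes the ordinary cross entropy at $\alpha = 1$ can be checked numerically. The sketch below is illustrative only: the abstract does not state the paper's exact definition, so it assumes one common form of the Rényi cross entropy, $H_\alpha(p; q) = \frac{1}{1-\alpha} \log \sum_i p_i q_i^{\alpha-1}$, and the function names are hypothetical.

```python
import math

def renyi_cross_entropy(p, q, alpha):
    """Rényi cross entropy of order alpha between discrete distributions p and q.

    Assumed definition (not taken from the paper):
        H_alpha(p; q) = log(sum_i p_i * q_i**(alpha - 1)) / (1 - alpha)
    """
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    s = sum(pi * qi ** (alpha - 1) for pi, qi in zip(p, q))
    return math.log(s) / (1 - alpha)

def shannon_cross_entropy(p, q):
    """Ordinary cross entropy -sum_i p_i * log(q_i), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Binary example: p is the true label distribution, q the discriminator's soft decision.
p = [0.7, 0.3]
q = [0.6, 0.4]
near_one = renyi_cross_entropy(p, q, 1 + 1e-6)
limit = shannon_cross_entropy(p, q)
```

Under this assumed definition, `near_one` and `limit` agree closely, which matches the stated reduction to binary cross entropy as $\alpha \to 1$.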

Title: MSDformer: Multi-scale Discrete Transformer For Time Series Generation

Authors: Zhicheng Chen, Shibo Feng, Xi Xiao, Zhong Zhang, Qing Li, Xingyu Gao, Peilin Zhao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.14202
Pdf URL: https://arxiv.org/pdf/2505.14202
Copy Paste: [[2505.14202]] MSDformer: Multi-scale Discrete Transformer For Time Series Generation(https://arxiv.org/abs/2505.14202)
Keywords: generation
Abstract: Discrete Token Modeling (DTM), which employs vector quantization techniques, has demonstrated remarkable success in modeling non-natural language modalities, particularly in time series generation. While our prior work SDformer established the first DTM-based framework to achieve state-of-the-art performance in this domain, two critical limitations persist in existing DTM approaches: 1) their inability to capture multi-scale temporal patterns inherent to complex time series data, and 2) the absence of theoretical foundations to guide model optimization. To address these challenges, we propose a novel multi-scale DTM-based time series generation method, called Multi-Scale Discrete Transformer (MSDformer). MSDformer employs a multi-scale time series tokenizer to learn discrete token representations at multiple scales, which jointly characterize the complex nature of time series data. Subsequently, MSDformer applies a multi-scale autoregressive token modeling technique to capture the multi-scale patterns of time series within the discrete latent space. Theoretically, we validate the effectiveness of the DTM method and the rationality of MSDformer through the rate-distortion theorem. Comprehensive experiments demonstrate that MSDformer significantly outperforms state-of-the-art methods. Both theoretical analysis and experimental results demonstrate that incorporating multi-scale information and modeling multi-scale patterns can substantially enhance the quality of generated time series in DTM-based approaches. The code will be released upon acceptance.
摘要：采用矢量量化技术的离散令牌建模（DTM）在建模非天然语言模式方面取得了显着的成功，尤其是在时间序列的生成中。尽管我们先前的工作SDFormer建立了第一个基于DTM的框架来实现该领域的最新性能，但现有的DTM方法中存在两个临界局限性：1）它们无法捕获复杂时间序列数据固有的多规模时间模式，而2）缺乏理论基础来指导模型优化。为了应对这些挑战，我们提出了一种新型的基于DTM的多尺度时间序列生成方法，称为多尺度离散变压器（MSDFormer）。 MSDFormer采用多尺度时间序列令牌来学习多个尺度的离散令牌表示，这共同表征了时间序列数据的复杂性质。随后，MSDFormer应用多尺度自回归令牌建模技术来捕获离散潜在空间内时间序列的多尺度模式。从理论上讲，我们验证了DTM方法的有效性以及MSDFormer通过速率延伸定理的合理性。全面的实验表明，MSDFormer的表现明显胜过最先进的方法。理论分析和实验结果都表明，合并多尺度信息和建模多尺度模式可以大大提高基于DTM的方法中生成的时间序列的质量。该代码将在接受后发布。

Title: Challenges and Limitations in the Synthetic Generation of mHealth Sensor Data

Authors: Flavio Di Martino, Franca Delmastro
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14206
Pdf URL: https://arxiv.org/pdf/2505.14206
Copy Paste: [[2505.14206]] Challenges and Limitations in the Synthetic Generation of mHealth Sensor Data(https://arxiv.org/abs/2505.14206)
Keywords: generation, generative
Abstract: The widespread adoption of mobile sensors has the potential to provide massive and heterogeneous time series data, driving Artificial Intelligence applications in mHealth. However, data collection remains limited due to stringent ethical regulations, privacy concerns, and other constraints, hindering progress in the field. Synthetic data generation, particularly through Generative Adversarial Networks and Diffusion Models, has emerged as a promising solution to address both data scarcity and privacy issues. Yet, these models are often limited to short-term, unimodal signal patterns. This paper presents a systematic evaluation of state-of-the-art generative models for time series synthesis, with a focus on their ability to jointly handle multi-modality, long-range dependencies, and conditional generation, which are key challenges in the mHealth domain. To ensure a fair comparison, we introduce a novel evaluation framework designed to measure both the intrinsic quality of synthetic data and its utility in downstream predictive tasks. Our findings reveal critical limitations in the existing approaches, particularly in maintaining cross-modal consistency, preserving temporal coherence, and ensuring robust performance in train-on-synthetic, test-on-real, and data augmentation scenarios. Finally, we present our future research directions to enhance synthetic time series generation and improve the applicability of generative models in mHealth.
摘要：移动传感器的广泛采用有可能提供大量和异质的时间序列数据，并在MHealth中推动人工智能应用。但是，由于严格的道德法规，隐私问题和其他限制，数据收集仍然限制，从而阻碍了该领域的进展。合成数据生成，特别是通过生成的对抗网络和扩散模型，已成为解决数据稀缺和隐私问题的有前途解决方案。但是，这些模型通常仅限于短期的单峰信号模式。本文对时间序列合成的最新生成模型进行了系统的评估，重点是他们共同处理多模式，远程依赖性以及MHealth领域中有条件的生成键挑战的能力。为了确保进行公平的比较，我们引入了一个新颖的评估框架，旨在衡量合成数据的内在质量及其在下游预测任务中的效用。我们的发现揭示了现有方法的临界局限性，尤其是在保持跨模式一致性，保持时间连贯性以及确保在训练机上，测试和数据增强方案中的稳健性能。最后，我们提出了未来的研究方向，以增强合成时间序列的生成并提高生成模型在MHealth中的适用性。

Title: Flexible-weighted Chamfer Distance: Enhanced Objective Function for Point Cloud Completion

Authors: Jie Li, Shengwei Tian, Long Yu, Xin Ning
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.14218
Pdf URL: https://arxiv.org/pdf/2505.14218
Copy Paste: [[2505.14218]] Flexible-weighted Chamfer Distance: Enhanced Objective Function for Point Cloud Completion(https://arxiv.org/abs/2505.14218)
Keywords: generation
Abstract: Chamfer Distance (CD) comprises two components that can evaluate the global distribution and local performance of generated point clouds, making it widely utilized as a similarity measure between generated and target point clouds in point cloud completion tasks. Additionally, CD's computational efficiency has led to its frequent application as an objective function for guiding point cloud generation. However, using CD directly as an objective function with fixed equal weights for its two components can often result in seemingly high overall performance (i.e., low CD score), while failing to achieve a good global distribution. This is typically reflected in high Earth Mover's Distance (EMD) and Decomposed Chamfer Distance (DCD) scores, alongside poor human assessments. To address this issue, we propose a Flexible-Weighted Chamfer Distance (FCD) to guide point cloud generation. FCD assigns a higher weight to the global distribution component of CD and incorporates a flexible weighting strategy to adjust the balance between the two components, aiming to improve global distribution while maintaining robust overall performance. Experimental results on two state-of-the-art networks demonstrate that our method achieves superior results across multiple evaluation metrics, including CD, EMD, DCD, and F-Score, as well as in human evaluations.
摘要：倒角距离（CD）包括两个组件，可以评估生成点云的全局分布和本地性能，从而广泛用作点云完成任务中生成点云和目标点云之间的相似度度量。此外，CD的计算效率已导致其经常应用作为指导点云生成的目标函数。但是，将CD直接用作其两个组件固定相等权重的目标函数通常会导致总体性能（即CD得分低），同时未能实现良好的全球分布。这通常反映在高地球搬运工的距离（EMD）和分解的倒角距离（DCD）分数以及人类评估差。为了解决这个问题，我们提出了一个灵活的加权倒角距离（FCD），以指导点云的生成。 FCD将更高的权重分配给CD的全球分布部分，并结合了灵活的加权策略，以调整两个组件之间的平衡，旨在改善全球分布，同时保持强大的整体性能。两个最先进的网络的实验结果表明，我们的方法在包括CD，EMD，DCD和F-SCORE以及人类评估中的多个评估指标之间取得了卓越的结果。
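The two-component structure of Chamfer Distance described above can be sketched in a few lines. This is a minimal illustration, not the authors' FCD: it uses squared Euclidean distances (one common CD convention), and the weights `w_gen` and `w_tgt` are hypothetical fixed parameters, whereas the abstract describes a flexible weighting strategy whose schedule is not given here.

```python
def _sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def chamfer_components(generated, target):
    """Return the two one-directional terms of Chamfer Distance.

    gen_to_tgt: mean over generated points of the squared distance to the nearest target point.
    tgt_to_gen: mean over target points of the squared distance to the nearest generated point.
    """
    gen_to_tgt = sum(min(_sq_dist(g, t) for t in target) for g in generated) / len(generated)
    tgt_to_gen = sum(min(_sq_dist(t, g) for g in generated) for t in target) / len(target)
    return gen_to_tgt, tgt_to_gen

def weighted_chamfer(generated, target, w_gen=0.5, w_tgt=0.5):
    """Weighted sum of the two components; equal weights give the usual averaged CD.

    w_gen and w_tgt are illustrative names, not taken from the paper.
    """
    gen_to_tgt, tgt_to_gen = chamfer_components(generated, target)
    return w_gen * gen_to_tgt + w_tgt * tgt_to_gen

# A generated cloud that covers only half of the target.
generated = [(0.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
equal = weighted_chamfer(generated, target)                        # 0.25
coverage_heavy = weighted_chamfer(generated, target, 0.2, 0.8)     # 0.4
```

In this toy case the generated-to-target term is zero even though half of the target is uncovered, so shifting weight toward the target-to-generated term raises the penalty. This is the kind of imbalance that reweighting the two components is meant to expose.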

Title: Instructing Text-to-Image Diffusion Models via Classifier-Guided Semantic Optimization

Authors: Yuanyuan Chang, Yinghua Yao, Tao Qin, Mengmeng Wang, Ivor Tsang, Guang Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.14254
Pdf URL: https://arxiv.org/pdf/2505.14254
Copy Paste: [[2505.14254]] Instructing Text-to-Image Diffusion Models via Classifier-Guided Semantic Optimization(https://arxiv.org/abs/2505.14254)
Keywords: generation
Abstract: Text-to-image diffusion models have emerged as powerful tools for high-quality image generation and editing. Many existing approaches rely on text prompts as editing guidance. However, these methods are constrained by the need for manual prompt crafting, which can be time-consuming, introduce irrelevant details, and significantly limit editing performance. In this work, we propose optimizing semantic embeddings guided by attribute classifiers to steer text-to-image models toward desired edits, without relying on text prompts or requiring any training or fine-tuning of the diffusion model. We utilize classifiers to learn precise semantic embeddings at the dataset level. The learned embeddings are theoretically justified as the optimal representation of attribute semantics, enabling disentangled and accurate edits. Experiments further demonstrate that our method achieves high levels of disentanglement and strong generalization across different domains of data.
摘要：文本到图像扩散模型已成为高质量图像生成和编辑的强大工具。许多现有方法依赖文本提示作为编辑指导。但是，这些方法受到需要手动提示制作的需求，这可能是耗时的，引入无关紧要的细节并大大限制了编辑性能。在这项工作中，我们建议优化以属性分类器为指导的语义嵌入到将文本对图像模型转向所需编辑的情况下，而无需依赖文本提示或需要对扩散模型进行任何培训或微调。我们利用分类器在数据集级别学习精确的语义嵌入。从理论上讲，学到的嵌入是有道理的，是属性语义的最佳表示，实现了分解和准确的编辑。实验进一步表明，我们的方法实现了不同数据领域的高水平分离和强有力的概括。

Title: Towards Generating Realistic Underwater Images

Authors: Abdul-Kazeem Shamba
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.14296
Pdf URL: https://arxiv.org/pdf/2505.14296
Copy Paste: [[2505.14296]] Towards Generating Realistic Underwater Images(https://arxiv.org/abs/2505.14296)
Keywords: generative
Abstract: This paper explores the use of contrastive learning and generative adversarial networks for generating realistic underwater images from synthetic images with uniform lighting. We investigate the performance of image translation models for generating realistic underwater images using the VAROS dataset. Two key evaluation metrics, Fréchet Inception Distance (FID) and Structural Similarity Index Measure (SSIM), provide insights into the trade-offs between perceptual quality and structural preservation. For paired image translation, pix2pix achieves the best FID scores due to its paired supervision and PatchGAN discriminator, while the autoencoder model attains the highest SSIM, suggesting better structural fidelity despite producing blurrier outputs. Among unpaired methods, CycleGAN achieves a competitive FID score by leveraging cycle-consistency loss, whereas CUT, which replaces cycle-consistency with contrastive learning, attains higher SSIM, indicating improved spatial similarity retention. Notably, incorporating depth information into CUT results in the lowest overall FID score, demonstrating that depth cues enhance realism. However, the slight decrease in SSIM suggests that depth-aware learning may introduce structural variations.
摘要：本文探讨了使用对比度学习和生成的对抗网络来从具有均匀照明的合成图像中生成逼真的水下图像。我们研究了图像翻译模型的性能，用于使用VAROS数据集生成现实的水下图像。两个关键的评估指标，即FréchetInception距离（FID）和结构相似性指数量度（SSIM），为感知质量和结构保存之间的权衡提供了见解。对于配对的图像翻译，Pix2Pix由于其配对的监督和PatchGAN歧视器而获得最佳的FID得分，而自动编码器模型达到了最高的SSIM，表明尽管产生了模糊的输出，但表明尽管产生了更明显的结构保真度。在未配对的方法中，Cyclegan通过利用周期矛盾的损失来达到竞争性的FID得分，而切割用对比度学习取代周期矛盾的削减率达到了更高的SSIM，这表明空间相似性保留的提高。值得注意的是，将深度信息纳入最低的总体FID得分中的剪切结果，这表明深度提示可以增强现实主义。但是，SSIM的略有下降表明深度感知学习可能引入结构变化。
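For reference, the two metrics named in the abstract have standard closed forms. These are the textbook definitions, not formulas quoted from the paper. The symbols are assumptions of this note: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of Inception features for real and generated images, and $\mu_x, \sigma_x^2, \sigma_{xy}$ are local image statistics with small stabilizing constants $C_1, C_2$.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)

\mathrm{SSIM}(x, y) =
  \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}
       {(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
```

Lower FID indicates that generated and real feature distributions are closer, while higher SSIM indicates closer structural agreement between two images. The two can therefore move in different directions, as the abstract reports.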

Title: RADAR: Enhancing Radiology Report Generation with Supplementary Knowledge Injection

Authors: Wenjun Hou, Yi Cheng, Kaishuai Xu, Heng Li, Yan Hu, Wenjie Li, Jiang Liu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.14318
Pdf URL: https://arxiv.org/pdf/2505.14318
Copy Paste: [[2505.14318]] RADAR: Enhancing Radiology Report Generation with Supplementary Knowledge Injection(https://arxiv.org/abs/2505.14318)
Keywords: generation
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in various domains, including radiology report generation. Previous approaches have attempted to utilize multimodal LLMs for this task, enhancing their performance through the integration of domain-specific knowledge retrieval. However, these approaches often overlook the knowledge already embedded within the LLMs, leading to redundant information integration and inefficient utilization of learned representations. To address this limitation, we propose RADAR, a framework for enhancing radiology report generation with supplementary knowledge injection. RADAR improves report generation by systematically leveraging both the internal knowledge of an LLM and externally retrieved information. Specifically, it first extracts the model's acquired knowledge that aligns with expert image-based classification outputs. It then retrieves relevant supplementary knowledge to further enrich this information. Finally, by aggregating both sources, RADAR generates more accurate and informative radiology reports. Extensive experiments on MIMIC-CXR, CheXpert-Plus, and IU X-ray demonstrate that our model outperforms state-of-the-art LLMs in both language quality and clinical accuracy.
摘要：大型语言模型（LLMS）在包括放射学报告的各个领域都表现出了显着的功能。以前的方法试图将多模式LLM用于此任务，从而通过集成特定于领域的知识检索来增强其性能。但是，这些方法经常忽略已经嵌入在LLM中的知识，从而导致冗余信息集成和对学习表示的效率低下的利用。为了解决这一限制，我们提出了雷达，这是一种增强放射学报告并用补充知识注入的框架。雷达通过系统地利用LLM的内部知识和外部检索信息来改善报告的生成。具体而言，它首先提取模型的获取知识，该知识与基于图像的专家分类输出保持一致。然后，它检索相关的补充知识以进一步丰富此信息。最后，通过汇总这两个来源，雷达生成更准确和信息丰富的放射学报告。关于MIMIC-CXR，CHEXPERT-PLUS和IU X射线的广泛实验表明，我们的模型在语言质量和临床准确性方面都优于最先进的LLM

Title: Handloom Design Generation Using Generative Networks

Authors: Rajat Kanti Bhattacharjee, Meghali Nandi, Amrit Jha, Gunajit Kalita, Ferdous Ahmed Barbhuiya
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14330
Pdf URL: https://arxiv.org/pdf/2505.14330
Copy Paste: [[2505.14330]] Handloom Design Generation Using Generative Networks(https://arxiv.org/abs/2505.14330)
Keywords: generation, generative
Abstract: This paper proposes deep learning techniques for generating clothing designs, with a focus on handloom fabric, and discusses the associated challenges and applications. The capability of generative neural network models to understand and synthesize artistic designs has not yet been well explored. In this work, multiple methods incorporating current state-of-the-art generative models and style transfer algorithms are employed to study their performance on the task. The results are then evaluated through user scores. This work also provides a new dataset, NeuralLoom, for the task of design generation.
摘要：本文提出了为服装生成设计的深度学习技术，专注于手织机面料，并讨论了相关的挑战及其应用。生成神经网络模型在理解艺术设计和合成这些模型方面的能力尚未得到很好的探索。在这项工作中，采用了多种方法，结合了最新的生成模型和样式转移算法来研究和观察其任务的性能。然后通过用户得分评估结果。这项工作还为设计生成的任务提供了一个新的数据集NeuralLoom。

Title: Vid2World: Crafting Video Diffusion Models to Interactive World Models

Authors: Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, Mingsheng Long
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14357
Pdf URL: https://arxiv.org/pdf/2505.14357
Copy Paste: [[2505.14357]] Vid2World: Crafting Video Diffusion Models to Interactive World Models(https://arxiv.org/abs/2505.14357)
Keywords: generation
Abstract: World models, which predict transitions based on history observation and action sequences, have shown great promise in improving data efficiency for sequential decision making. However, existing world models often require extensive domain-specific training and still produce low-fidelity, coarse predictions, limiting their applicability in complex environments. In contrast, video diffusion models trained on large, internet-scale datasets have demonstrated impressive capabilities in generating high-quality videos that capture diverse real-world dynamics. In this work, we present Vid2World, a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models. To bridge the gap, Vid2World performs causalization of a pre-trained video diffusion model by crafting its architecture and training objective to enable autoregressive generation. Furthermore, it introduces a causal action guidance mechanism to enhance action controllability in the resulting interactive world model. Extensive experiments in robot manipulation and game simulation domains show that our method offers a scalable and effective approach for repurposing highly capable video diffusion models as interactive world models.
摘要：根据历史观察和动作序列预测过渡的世界模型在提高顺序决策的数据效率方面表现出了巨大的希望。但是，现有的世界模型通常需要广泛的特定领域训练，并且仍然产生低保真，粗略的预测，从而限制了它们在复杂环境中的适用性。相比之下，在大型的互联网规模数据集中训练的视频扩散模型在生成捕获各种现实世界动态的高质量视频方面表现出了令人印象深刻的功能。在这项工作中，我们提出了Vid2World，这是一种将预训练的视频扩散模型转移到交互式世界模型中的一般方法。为了弥合差距，Vid2World通过制定其体系结构和训练目标来实现自回归生成，对预训练的视频扩散模型进行因果化。此外，它引入了因果行动指导机制，以增强所得互动世界模型中的动作可控性。在机器人操纵和游戏模拟域中进行的广泛实验表明，我们的方法为将功能强大的视频扩散模型重新利用为交互式世界模型提供了可扩展且有效的方法。

Title: Dual Data Alignment Makes AI-Generated Image Detector Easier Generalizable

Authors: Ruoxin Chen, Junwei Xi, Zhiyuan Yan, Ke-Yue Zhang, Shuang Wu, Jingyi Xie, Xu Chen, Lei Xu, Isabel Guan, Taiping Yao, Shouhong Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.14359
Pdf URL: https://arxiv.org/pdf/2505.14359
Copy Paste: [[2505.14359]] Dual Data Alignment Makes AI-Generated Image Detector Easier Generalizable(https://arxiv.org/abs/2505.14359)
Keywords: generative
Abstract: Existing detectors are often trained on biased datasets, leading to the possibility of overfitting on non-causal image attributes that are spuriously correlated with real/synthetic labels. While these biased features enhance performance on the training data, they result in substantial performance degradation when applied to unbiased datasets. One common solution is to perform dataset alignment through generative reconstruction, matching the semantic content between real and synthetic images. However, we revisit this approach and show that pixel-level alignment alone is insufficient. The reconstructed images still suffer from frequency-level misalignment, which can perpetuate spurious correlations. To illustrate, we observe that reconstruction models tend to restore the high-frequency details lost in real images (possibly due to JPEG compression), inadvertently creating a frequency-level misalignment, where synthetic images appear to have richer high-frequency content than real ones. This misalignment leads to models associating high-frequency features with synthetic labels, further reinforcing biased cues. To resolve this, we propose Dual Data Alignment (DDA), which aligns both the pixel and frequency domains. Moreover, we introduce two new test sets: DDA-COCO, containing DDA-aligned synthetic images for testing detector performance on the most aligned dataset, and EvalGEN, featuring the latest generative models for assessing detectors under new generative architectures such as visual auto-regressive generators. Finally, our extensive evaluations demonstrate that a detector trained exclusively on DDA-aligned MSCOCO could improve across 8 diverse benchmarks by a non-trivial margin, including a +7.2% gain on in-the-wild benchmarks, highlighting the improved generalizability of unbiased detectors.
摘要：现有的检测器通常在有偏见的数据集上进行训练，从而导致有可能与真实/合成标签相关的非毒物图像属性过度拟合。尽管这些有偏见的功能增强了培训数据的性能，但当应用于无偏见的数据集时，它们会导致大量性能降解。一种常见的解决方案是通过生成重构执行数据集对齐，与真实图像和合成图像之间的语义内容匹配。但是，我们重新审视了这种方法，并表明仅像素级对齐是不够的。重建的图像仍然遭受频率级别的未对准，这可能会使虚假相关性永存。为了说明，我们观察到，重建模型倾向于恢复在真实图像中丢失的高频细节（可能是由于JPEG压缩引起的），无意间造成了频率级别的未对准，其中合成图像似乎比真实的图像更丰富的高频含量。这种未对准导致将高频特征与合成标签相关联的模型，从而进一步增强了有偏见的提示。为了解决此问题，我们提出了双数据对齐（DDA），该对齐均与像素和频域保持一致。此外，我们介绍了两个新的测试集：DDA-COCO，其中包含与DDA一致的合成图像，用于在最校准的数据集中测试检测器性能，以及Evalgen，其中包含最新的生成模型，用于评估探测器在新的生成架构中，例如视觉自动防潮生成器。最后，我们广泛的评估表明，仅在DDA一致的Mscoco上训练的检测器可以通过非平凡的边缘在8种不同的基准测试中改善，显示出野生基准测试的7.2％ +7.2％，突出了无偏见的检测器的提高概括性。

Title: Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives

Authors: Xingxing Weng, Chao Pang, Gui-Song Xia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.14361
Pdf URL: https://arxiv.org/pdf/2505.14361
Copy Paste: [[2505.14361]] Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives(https://arxiv.org/abs/2505.14361)
Keywords: generation, generative
Abstract: Vision-language modeling (VLM) aims to bridge the information gap between images and natural language. Under the new paradigm of first pre-training on massive image-text pairs and then fine-tuning on task-specific data, VLM in the remote sensing domain has made significant progress. The resulting models benefit from the absorption of extensive general knowledge and demonstrate strong performance across a variety of remote sensing data analysis tasks. Moreover, they are capable of interacting with users in a conversational manner. In this paper, we aim to provide the remote sensing community with a timely and comprehensive review of the developments in VLM using the two-stage paradigm. Specifically, we first cover a taxonomy of VLM in remote sensing: contrastive learning, visual instruction tuning, and text-conditioned image generation. For each category, we detail the commonly used network architecture and pre-training objectives. Second, we conduct a thorough review of existing works, examining foundation models and task-specific adaptation methods in contrastive-based VLM, architectural upgrades, training strategies and model capabilities in instruction-based VLM, as well as generative foundation models with their representative downstream applications. Third, we summarize datasets used for VLM pre-training, fine-tuning, and evaluation, with an analysis of their construction methodologies (including image sources and caption generation) and key properties, such as scale and task adaptability. Finally, we conclude this survey with insights and discussions on future research directions: cross-modal representation alignment, vague requirement comprehension, explanation-driven model reliability, continually scalable model capabilities, and large-scale datasets featuring richer modalities and greater challenges.
摘要：视觉建模（VLM）旨在弥合图像和自然语言之间的信息差距。在大量图像文本对的首次预训练的新范式下，然后对特定于任务的数据进行微调，遥感域中的VLM取得了重大进展。最终的模型受益于吸收广泛的常识，并在各种遥感数据分析任务中表现出强大的性能。此外，他们能够以对话方式与用户互动。在本文中，我们旨在为遥感社区提供使用两阶段范式对VLM中的发展的及时，全面审查。具体来说，我们首先介绍了遥感中VLM的分类学：对比度学习，视觉教学调整和文本条件图像生成。对于每个类别，我们详细介绍了常用的网络体系结构和预训练目标。其次，我们对现有作品进行了详尽的审查，研究基础模型和基于对比的VLM，建筑升级，培训策略和基于教学的VLM中的模型功能以及具有代表性下游应用的生成基础模型中的特定于任务适应方法。第三，我们总结了用于VLM预训练，微调和评估的数据集，并分析了它们的构造方法（包括图像源和字幕生成）和关键属性，例如规模和任务适应性。最后，我们通过有关未来研究方向的见解和讨论来结束这项调查：跨模式表示，模糊需求理解，解释驱动的模型可靠性，不断可扩展的模型能力以及具有更丰富方式和更大挑战的大规模数据集。

Title: ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations

Authors: Xuecheng Wu, Jiaxing Liu, Danlei Huang, Xiaoyu Li, Yifan Wang, Chen Chen, Liya Ma, Xuezhi Cao, Junxiao Xue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.14404
Pdf URL: https://arxiv.org/pdf/2505.14404
Copy Paste: [[2505.14404]] ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations(https://arxiv.org/abs/2505.14404)
Keywords: generation
Abstract: Visual-Interleaved Chain-of-Thought (VI-CoT) enables MLLMs to continually update their understanding and decisions based on step-wise intermediate visual states (IVS), much like a human would. It has demonstrated impressive success in various tasks, leading to advancements in related benchmarks. Despite promising progress, current benchmarks provide models with relatively fixed IVS rather than free-style IVS, which might forcibly distort the original thinking trajectories and fail to evaluate their intrinsic reasoning capabilities. More importantly, existing benchmarks neglect to systematically explore the factors through which IVS affects untamed reasoning performance. To tackle the above gaps, we introduce a specialized benchmark termed ViC-Bench, consisting of four representative tasks: maze navigation, jigsaw puzzle, embodied long-horizon planning, and complex counting, where each task has a dedicated free-style IVS generation pipeline supporting function calls. To systematically examine VI-CoT capability, we propose a thorough evaluation suite incorporating a progressive three-stage strategy with targeted new metrics. In addition, we establish an Incremental Prompting Information Injection (IPII) strategy to ablatively explore the prompting factors for VI-CoT. We conduct extensive evaluations of 18 advanced MLLMs, revealing key insights into their VI-CoT capability. Our proposed benchmark is publicly available on Huggingface.
摘要：视觉交错思维链（VI-CoT）使MLLM能够基于逐步的中间视觉状态（IVS）不断更新其理解和决策，就像人类一样，这在各种任务中都表现出了令人印象深刻的成功，从而推动了相关基准的进步。尽管进展可喜，但当前的基准为模型提供的是相对固定的IVS，而不是自由风格的IVS，这可能会强行扭曲原始思维轨迹，无法评估其内在的推理能力。更重要的是，现有的基准忽略了系统地探索IVS对未经约束的推理性能的影响因素。为了解决上述差距，我们引入了一个专门的基准，称为ViC-Bench，由四个有代表性的任务组成：迷宫导航，拼图，具身长程规划和复杂计数，每个任务都具有支持函数调用的专用自由式IVS生成管线。为了系统地检查VI-CoT能力，我们提出了一个全面的评估套件，该套件将渐进的三阶段策略与有针对性的新指标结合在一起。此外，我们建立了增量提示信息注入（IPII）策略，以消融方式探索VI-CoT的提示因素。我们对18个高级MLLM进行了广泛的评估，揭示了对其VI-CoT能力的关键见解。我们提出的基准已在Huggingface公开。

Title: Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models

Authors: Xuyang Liu, Yiyu Wang, Junpeng Ma, Linfeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.14454
Pdf URL: https://arxiv.org/pdf/2505.14454
Copy Paste: [[2505.14454]] Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models(https://arxiv.org/abs/2505.14454)
Keywords: generation
Abstract: Video large language models (VideoLLMs) excel at video understanding but face efficiency challenges due to the quadratic complexity of abundant visual tokens. Our systematic analysis of token compression methods for VideoLLMs reveals two critical issues: (i) overlooking distinctive visual signals across frames, leading to information loss; (ii) suffering from implementation constraints, causing incompatibility with modern architectures or efficient operators. To address these challenges, we distill three design principles for VideoLLM token compression and propose a plug-and-play inference acceleration framework, "Video Compression Commander" (VidCom2). By quantifying each frame's uniqueness, VidCom2 adaptively adjusts compression intensity across frames, effectively preserving essential information while reducing redundancy in video sequences. Extensive experiments across various VideoLLMs and benchmarks demonstrate the superior performance and efficiency of VidCom2. With only 25% of the visual tokens, VidCom2 achieves 99.6% of the original performance on LLaVA-OV while reducing LLM generation latency by 70.8%. Notably, our Frame Compression Adjustment strategy is compatible with other token compression methods to further improve their performance. Our code is available at this https URL.
摘要：视频大型语言模型（Videollm）在视频理解方面表现出色，但是由于丰富的视觉令牌的二次复杂性，面临效率挑战。我们对视频的令牌压缩方法的系统分析揭示了两个关键问题：（i）忽略跨帧的独特视觉信号，从而导致信息丢失；（ii）遭受实施限制的困扰，导致与现代建筑或有效运营商的不兼容。为了应对这些挑战，我们将三个设计原理提炼为Videollm令牌压缩，并提出一个插件推理加速框架“视频压缩指挥官”（VIDCOM2）。通过量化每个帧的唯一性，VIDCOM2可以自适应调整跨帧的压缩强度，从而有效地保留基本信息，同时减少视频序列中的冗余。各种视频和基准的广泛实验证明了我们的VIDCOM2的卓越性能和效率。 VIDCOM2只有25％的视觉令牌，可在Llava-ov上达到99.6％的原始性能，同时减少了LLM生成潜伏期的70.8％。值得注意的是，我们的框架压缩调整策略与其他令牌压缩方法兼容，以进一步提高其性能。我们的代码可在此HTTPS URL上找到。

Title: VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank

Authors: Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, Kede Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.14460
Pdf URL: https://arxiv.org/pdf/2505.14460
Copy Paste: [[2505.14460]] VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank(https://arxiv.org/abs/2505.14460)
Keywords: super-resolution, generation, quality assessment
Abstract: DeepSeek-R1 has demonstrated remarkable effectiveness in incentivizing reasoning and generalization capabilities of large language models (LLMs) through reinforcement learning. Nevertheless, the potential of reasoning-induced computational modeling has not been thoroughly explored in the context of image quality assessment (IQA), a task critically dependent on visual reasoning. In this paper, we introduce VisualQuality-R1, a reasoning-induced no-reference IQA (NR-IQA) model, and we train it with reinforcement learning to rank, a learning algorithm tailored to the intrinsically relative nature of visual quality. Specifically, for a pair of images, we employ group relative policy optimization to generate multiple quality scores for each image. These estimates are then used to compute comparative probabilities of one image having higher quality than the other under the Thurstone model. Rewards for each quality estimate are defined using continuous fidelity measures rather than discretized binary labels. Extensive experiments show that the proposed VisualQuality-R1 consistently outperforms discriminative deep learning-based NR-IQA models as well as a recent reasoning-induced quality regression method. Moreover, VisualQuality-R1 is capable of generating contextually rich, human-aligned quality descriptions, and supports multi-dataset training without requiring perceptual scale realignment. These features make VisualQuality-R1 especially well-suited for reliably measuring progress in a wide range of image processing tasks like super-resolution and image generation.
摘要：DeepSeek-R1通过强化学习在激励大语模型（LLM）的推理和概括能力方面表现出了出色的有效性。然而，在图像质量评估（IQA）的背景下，尚未对推理引起的计算建模的潜力进行彻底探讨，这是一项至关重要的任务。在本文中，我们介绍了VisualQuality-R1，这是一种推理引起的无参考IQA（NR-IQA）模型，并通过强化学习来训练它，这是一种针对视觉质量本质上相对性质的学习算法。具体来说，对于一对图像，我们采用组相对策略优化来为每个图像生成多个质量分数。然后，这些估计值用于计算一个图像的比较概率，其质量比Thurstone模型下的图像更高。每个质量估计值的奖励是使用连续的保真度度量而不是离散的二进制标签来定义的。广泛的实验表明，所提出的VisualQuality-R1始终优于歧视性深度学习的NR-IQA模型以及最近推理诱导的质量回归方法。此外，VisualQuality-R1能够生成上下文丰富，与人类一致的质量描述，并支持多数据集训练而无需进行感知规模的重新调整。这些功能使VisualQuality-R1特别适合可靠地衡量在超级分辨率和图像生成等各种图像处理任务中的进度。
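The VisualQuality-R1 abstract describes turning several sampled quality scores per image into a comparative probability under the Thurstone model. A minimal sketch of that step follows, assuming the scores for each image are summarized by their sample mean and variance and that the two images are treated as independent Gaussians; the function names and the `eps` stabilizer are hypothetical and not taken from the paper.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def mean_and_variance(scores):
    """Sample mean and (population) variance of a list of scores."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    return mean, var


def thurstone_preference(scores_a, scores_b, eps=1e-8):
    """Probability that image A has higher quality than image B.

    Assumption: each image's quality is Gaussian with the sample mean and
    variance of its scores, and the two images are independent, so the
    difference is Gaussian with variance var_a + var_b.
    """
    mean_a, var_a = mean_and_variance(scores_a)
    mean_b, var_b = mean_and_variance(scores_b)
    return normal_cdf((mean_a - mean_b) / math.sqrt(var_a + var_b + eps))


if __name__ == "__main__":
    # Hypothetical scores sampled for two images.
    print(thurstone_preference([4.0, 4.2, 3.8], [3.0, 3.1, 2.9]))
```

Because the output is a probability rather than a hard label, it can be compared against a ground-truth preference with a continuous measure, which is in the spirit of the continuous fidelity rewards the abstract mentions.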

Title: RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding

Authors: Jiaang Li, Yifei Yuan, Wenyan Li, Mohammad Aliannejadi, Daniel Hershcovich, Anders Søgaard, Ivan Vulić, Wenxuan Zhang, Paul Pu Liang, Yang Deng, Serge Belongie
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.14462
Pdf URL: https://arxiv.org/pdf/2505.14462
Copy Paste: [[2505.14462]] RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding(https://arxiv.org/abs/2505.14462)
Keywords: generation
Abstract: As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet, these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, while its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating over 10,000 Wikipedia documents curated and ranked by human annotators. With RAVENEA, we train and evaluate seven multimodal retrievers for each image query, and measure the downstream impact of retrieval-augmented inputs across fourteen state-of-the-art VLMs. Our results show that lightweight VLMs, when augmented with culture-aware retrieval, outperform their non-augmented counterparts (by at least 3.2% absolute on cVQA and 6.2% absolute on cIC). This highlights the value of retrieval-augmented methods and culturally inclusive benchmarks for multimodal understanding.
摘要：随着视觉模型（VLM）越来越多地整合到日常生活中，对准确的视觉文化理解的需求变得至关重要。然而，这些模型经常有效地解释文化细微差别。先前的工作证明了检索功能增强的一代（RAG）在增强仅文本设置中的文化理解方面的有效性，而其在多模式场景中的应用仍未得到充实。为了弥合这一差距，我们介绍了Ravenea（检索型视觉文化理解），这是一种新的基准测试，旨在通过检索来推进视觉文化理解，重点关注两项任务：以文化为中心的视觉问题回答（CVQA）和文化形成的图像字幕（CIC）。 Ravenea通过整合了由人类注释者策划和排名的10,000多个Wikipedia文件来扩展现有数据集。使用Ravenea，我们为每个图像查询训练并评估七个多模式检索器，并测量14个最先进的VLMS中检索提示输入的下游影响。我们的结果表明，轻巧的VLMS在使用文化吸引的检索中增强时，优于其非表现出色的VLM（在CVQA上绝对至少为3.2％，而CIC的绝对绝对是6.2％）。这突出了以多模式理解的检索仪方法和文化包容的基准的价值。

Title: Enhancing Interpretability of Sparse Latent Representations with Class Information

Authors: Farshad Sangari Abiz, Reshad Hosseini, Babak N. Araabi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14476
Pdf URL: https://arxiv.org/pdf/2505.14476
Copy Paste: [[2505.14476]] Enhancing Interpretability of Sparse Latent Representations with Class Information(https://arxiv.org/abs/2505.14476)
Keywords: generative
Abstract: Variational Autoencoders (VAEs) are powerful generative models for learning latent representations. Standard VAEs generate dispersed and unstructured latent spaces by utilizing all dimensions, which limits their interpretability, especially in high-dimensional spaces. To address this challenge, Variational Sparse Coding (VSC) introduces a spike-and-slab prior distribution, resulting in sparse latent representations for each input. These sparse representations, characterized by a limited number of active dimensions, are inherently more interpretable. Despite this advantage, VSC falls short in providing structured interpretations across samples within the same class. Intuitively, samples from the same class are expected to share similar attributes while allowing for variations in those attributes. This expectation should manifest as consistent patterns of active dimensions in their latent representations, but VSC does not enforce such consistency. In this paper, we propose a novel approach to enhance the latent space interpretability by ensuring that the active dimensions in the latent space are consistent across samples within the same class. To achieve this, we introduce a new loss function that encourages samples from the same class to share similar active dimensions. This alignment creates a more structured and interpretable latent space, where each shared dimension corresponds to a high-level concept, or "factor." Unlike existing disentanglement-based methods that primarily focus on global factors shared across all classes, our method captures both global and class-specific factors, thereby enhancing the utility and interpretability of latent representations.
摘要：变分自动编码器（VAE）是学习潜在表示的强大生成模型。标准VAE通过利用所有维度来生成分散和非结构化的潜在空间，这限制了它们的可解释性，尤其是在高维空间中。为了应对这一挑战，变异稀疏编码（VSC）引入了尖峰和slab先验分布，从而导致每个输入的稀疏潜在表示。这些稀疏表示，其特征在于有限数量的主动维度，它本质上是更容易解释的。尽管有这一优势，VSC在同一类内的样本中提供结构化解释方面却差不多。直观地，同一类的样本有望共享相似的属性，同时允许这些属性变化。这种期望应表现为其潜在表示中主动维度的一致模式，但VSC并未强制执行这种一致性。在本文中，我们提出了一种新颖的方法来增强潜在空间可解释性，以确保在同一类中的样本中，潜在空间中的主动尺寸是一致的。为了实现这一目标，我们引入了一种新的损失功能，该功能鼓励同一类的样本共享相似的主动维度。这种对齐形成了一个更具结构化和可解释的潜在空间，每个共享维度都对应于高级概念或“因素”。与现有基于分解的方法不同，主要关注所有类别共享的全球因素，我们的方法捕获了全球和班级特定的因素，从而增强了潜在表示的效用和解释性。
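The spike-and-slab idea in the abstract above can be illustrated with a short sketch: each latent dimension is either switched off (the spike at zero) or drawn from a Gaussian (the slab). The helper `active_overlap` is a hypothetical Jaccard-style measure of how consistently two samples use the same active dimensions; it is only an illustration of the property the paper wants to encourage, not the authors' loss function.

```python
import random


def sample_spike_and_slab(mu, sigma, alpha, rng):
    """Draw one latent vector from a spike-and-slab distribution.

    Dimension j is active with probability alpha[j]; an active dimension is
    sampled from N(mu[j], sigma[j]^2), an inactive one is exactly 0.
    """
    z = []
    for m, s, a in zip(mu, sigma, alpha):
        active = rng.random() < a
        z.append(rng.gauss(m, s) if active else 0.0)
    return z


def active_overlap(z_a, z_b):
    """Jaccard overlap of the active (non-zero) dimensions of two latents.

    Hypothetical illustration only: 1.0 means identical active sets.
    """
    act_a = {i for i, v in enumerate(z_a) if v != 0.0}
    act_b = {i for i, v in enumerate(z_b) if v != 0.0}
    union = act_a | act_b
    if not union:
        return 1.0
    return len(act_a & act_b) / len(union)


if __name__ == "__main__":
    rng = random.Random(0)
    dims = 8
    z1 = sample_spike_and_slab([0.0] * dims, [1.0] * dims, [0.25] * dims, rng)
    z2 = sample_spike_and_slab([0.0] * dims, [1.0] * dims, [0.25] * dims, rng)
    print(z1, z2, active_overlap(z1, z2))
```

With independent sampling, two latents from the same class can have unrelated active sets, which is the inconsistency the proposed class-aware loss is meant to reduce.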

Title: Latent Flow Transformer

Authors: Yen-Chen Wu, Feng-Ting Liao, Meng-Hsi Chen, Pei-Chen Ho, Farhang Nabiei, Da-shan Shiu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14513
Pdf URL: https://arxiv.org/pdf/2505.14513
Copy Paste: [[2505.14513]] Latent Flow Transformer(https://arxiv.org/abs/2505.14513)
Keywords: generation
Abstract: Transformers, the standard implementation for large language models (LLMs), typically consist of tens to hundreds of discrete layers. While more layers can lead to better performance, this approach has been challenged as far from efficient, especially given the superiority of continuous layers demonstrated by diffusion and flow-based models for image generation. We propose the Latent Flow Transformer (LFT), which replaces a block of layers with a single learned transport operator trained via flow matching, offering significant compression while maintaining compatibility with the original architecture. Additionally, we address the limitations of existing flow-based methods in preserving coupling by introducing the Flow Walking (FW) algorithm. On the Pythia-410M model, LFT trained with flow matching compresses 6 of 24 layers and outperforms directly skipping 2 layers (KL divergence of LM logits at 0.407 vs. 0.529), demonstrating the feasibility of this design. When trained with FW, LFT further distills 12 layers into one while reducing the KL to 0.736, surpassing that from skipping 3 layers (0.932), significantly narrowing the gap between autoregressive and flow-based generation paradigms.
摘要：Transformer是大型语言模型（LLM）的标准实现，通常由数十到数百个离散层组成。尽管更多的层可以带来更好的性能，但这种方法被质疑远非高效，尤其是考虑到扩散和基于流的图像生成模型所展示的连续层的优越性。我们提出了潜在流Transformer（LFT），它用通过流匹配训练的单个学习传输算子代替一组层，在保持与原始体系结构兼容的同时提供显著的压缩。此外，我们通过引入Flow Walking（FW）算法来解决现有基于流的方法在保持耦合方面的局限性。在Pythia-410M模型上，经过流匹配训练的LFT压缩了24层中的6层，并且优于直接跳过2层（LM logits的KL散度为0.407 vs. 0.529），证明了该设计的可行性。当使用FW训练时，LFT进一步将12层提炼成一层，同时将KL降低到0.736，优于跳过3层的结果（0.932），从而大大缩小了自回归和基于流的生成范式之间的差距。
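The Latent Flow Transformer abstract reports the KL divergence of LM logits between the original and the compressed model. A minimal sketch of how such a number can be computed for a single token position is shown below; the direction of the divergence (reference model first) and the function names are assumptions, since the abstract does not state them.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_from_logits(ref_logits, test_logits):
    """KL(p_ref || p_test) for the next-token distributions of two models.

    Assumption: the reference (uncompressed) model is the first argument.
    """
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


if __name__ == "__main__":
    # Hypothetical logits over a 4-token vocabulary.
    original = [2.0, 0.5, -1.0, 0.1]
    compressed = [1.8, 0.7, -0.9, 0.0]
    print(kl_from_logits(original, compressed))
```

In practice the per-position values would be averaged over many tokens of an evaluation set; a value of 0 means the compressed model reproduces the original next-token distribution exactly.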

Title: SparC: Sparse Representation and Construction for High-Resolution 3D Shapes Modeling

Authors: Zhihao Li, Yufei Wang, Heliang Zheng, Yihao Luo, Bihan Wen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.14521
Pdf URL: https://arxiv.org/pdf/2505.14521
Copy Paste: [[2505.14521]] SparC: Sparse Representation and Construction for High-Resolution 3D Shapes Modeling(https://arxiv.org/abs/2505.14521)
Keywords: generation, generative
Abstract: High-fidelity 3D object synthesis remains significantly more challenging than 2D image generation due to the unstructured nature of mesh data and the cubic complexity of dense volumetric grids. Existing two-stage pipelines, which compress meshes with a VAE (using either 2D or 3D supervision) and then apply latent diffusion sampling, often suffer from severe detail loss caused by inefficient representations and modality mismatches introduced in the VAE. We introduce SparC, a unified framework that combines a sparse deformable marching cubes representation, SparseCubes, with a novel encoder, SparConv-VAE. SparseCubes converts raw meshes into high-resolution ($1024^3$) surfaces with arbitrary topology by scattering signed distance and deformation fields onto a sparse cube, allowing differentiable optimization. SparConv-VAE is the first modality-consistent variational autoencoder built entirely upon sparse convolutional networks, enabling efficient and near-lossless 3D reconstruction suitable for high-resolution generative modeling through latent diffusion. SparC achieves state-of-the-art reconstruction fidelity on challenging inputs, including open surfaces, disconnected components, and intricate geometry. It preserves fine-grained shape details, reduces training and inference cost, and integrates naturally with latent diffusion models for scalable, high-resolution 3D generation.
摘要：高保真3D对象的合成比2D图像生成更具挑战性，这是由于网格数据的非结构化性质和密集体积网格的立方复杂性。现有的两阶段管道压缩的网格与VAE（使用2D或3D监督），然后是潜在扩散采样，通常受到VAE中引入的效率低下的表述和模态失配造成的严重细节损失。我们介绍了SPARC，这是一个统一的框架，结合了稀疏的可变形立方体表示Sparsecubes和新颖的编码器Sparconv-Vae。 Sparsecubes通过将签名距离和变形字段散射到稀疏的立方体上，将原始网格转换为具有任意拓扑表面的高分辨率（$ 1024^3 $）表面，从而允许可区分的优化。 SPARCONV-VAE是完全基于稀疏卷积网络构建的第一个模态偶然的变异自动编码器，实现了适用于通过潜扩散的高分辨率生成建模的高效且近乎无情的3D重建。 SPARC在具有挑战性的输入上实现了最新的重建保真度，包括开放表面，断开的组件和复杂的几何形状。它保留了细粒的细节，降低了训练和推理成本，并与可扩展的高分辨率3D生成的潜在扩散模型自然整合。

Title: Personalize Your Gaussian: Consistent 3D Scene Personalization from a Single Image

Authors: Yuxuan Wang, Xuanyu Yi, Qingshan Xu, Yuan Zhou, Long Chen, Hanwang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.14537
Pdf URL: https://arxiv.org/pdf/2505.14537
Copy Paste: [[2505.14537]] Personalize Your Gaussian: Consistent 3D Scene Personalization from a Single Image(https://arxiv.org/abs/2505.14537)
Keywords: generation
Abstract: Personalizing 3D scenes from a single reference image enables intuitive user-guided editing, which requires achieving both multi-view consistency across perspectives and referential consistency with the input image. However, these goals are particularly challenging due to the viewpoint bias caused by the limited perspective provided in a single image. Lacking the mechanisms to effectively expand reference information beyond the original view, existing methods of image-conditioned 3DGS personalization often suffer from this viewpoint bias and struggle to produce consistent results. Therefore, in this paper, we present Consistent Personalization for 3D Gaussian Splatting (CP-GS), a framework that progressively propagates the single-view reference appearance to novel perspectives. In particular, CP-GS integrates pre-trained image-to-3D generation and iterative LoRA fine-tuning to extract and extend the reference appearance, and finally produces faithful multi-view guidance images and the personalized 3DGS outputs through a view-consistent generation process guided by geometric cues. Extensive experiments on real-world scenes show that our CP-GS effectively mitigates the viewpoint bias, achieving high-quality personalization that significantly outperforms existing methods. The code will be released at this https URL.
摘要：从单个参考图像中个性化的3D场景启用了直观的用户指导编辑，这需要在跨视角上实现多视图的一致性，并且与输入图像具有参考性一致性。但是，由于单个图像中提供的有限观点引起的观点偏见，这些目标尤其具有挑战性。缺乏有效扩展参考信息超出原始视图的机制，现有的图像条件3DGS个性化方法通常会遭受这种观点偏见的困扰，并难以产生一致的结果。因此，在本文中，我们介绍了3D高斯脱落（CP-GS）的一致个性化，该框架逐渐传播了单视图的参考外观。特别是，CP-GS集成了预训练的图像到3D生成和迭代lora微调以提取和扩展参考外观，并最终通过以几何学线索为指导的视图一致的生成过程产生忠实的多视图指导图像以及个性化的3DGS输出。在现实世界中进行的广泛实验表明，我们的CP-GS有效地减轻了观点偏见，实现了高质量的个性化，从而极大地表现了现有方法。该代码将在此HTTPS URL上发布。

Title: Dynadiff: Single-stage Decoding of Images from Continuously Evolving fMRI

Authors: Marlène Careil, Yohann Benchetrit, Jean-Rémi King
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.14556
Pdf URL: https://arxiv.org/pdf/2505.14556
Copy Paste: [[2505.14556]] Dynadiff: Single-stage Decoding of Images from Continuously Evolving fMRI(https://arxiv.org/abs/2505.14556)
Keywords: generative
Abstract: Brain-to-image decoding has been recently propelled by the progress in generative AI models and the availability of large ultra-high field functional Magnetic Resonance Imaging (fMRI). However, current approaches depend on complicated multi-stage pipelines and preprocessing steps that typically collapse the temporal dimension of brain recordings, thereby limiting time-resolved brain decoders. Here, we introduce Dynadiff (Dynamic Neural Activity Diffusion for Image Reconstruction), a new single-stage diffusion model designed for reconstructing images from dynamically evolving fMRI recordings. Our approach offers three main contributions. First, Dynadiff simplifies training as compared to existing approaches. Second, our model outperforms state-of-the-art models on time-resolved fMRI signals, especially on high-level semantic image reconstruction metrics, while remaining competitive on preprocessed fMRI data that collapse time. Third, this approach allows a precise characterization of the evolution of image representations in brain activity. Overall, this work lays the foundation for time-resolved brain-to-image decoding.
摘要：脑对图像解码最近已被生成AI模型的进展以及大型超高场功能性磁共振成像（fMRI）的可用性推动。然而，当前的方法取决于复杂的多阶段管道和通常崩溃的大脑记录时间维度的预处理步骤，从而限制了时间分辨的脑解码器。在这里，我们介绍了Dynadiff（图像重建的动态神经活动扩散），这是一种新的单阶段扩散模型，旨在从动态发展的fMRI记录中重建图像。我们的方法提供了三个主要贡献。首先，与现有方法相比，Dynadiff简化了训练。其次，我们的模型在时间分辨的fMRI信号上的表现优于最先进的模型，尤其是在高级语义图像重建指标上，同时在崩溃时间的预处理fMRI数据上保持竞争力。第三，这种方法允许精确表征大脑活动中图像表示的演变。总体而言，这项工作为时间分辨的大脑对图像解码奠定了基础。

Title: CSTS: A Benchmark for the Discovery of Correlation Structures in Time Series Clustering

Authors: Isabella Degen, Zahraa S Abdallah, Henry W J Reeve, Kate Robson Brown
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.14596
Pdf URL: https://arxiv.org/pdf/2505.14596
Copy Paste: [[2505.14596]] CSTS: A Benchmark for the Discovery of Correlation Structures in Time Series Clustering(https://arxiv.org/abs/2505.14596)
Keywords: generation
Abstract: Time series clustering promises to uncover hidden structural patterns in data with applications across healthcare, finance, industrial systems, and other critical domains. However, without validated ground truth information, researchers cannot objectively assess clustering quality or determine whether poor results stem from absent structures in the data, algorithmic limitations, or inappropriate validation methods, raising the question whether clustering is "more art than science" (Guyon et al., 2009). To address these challenges, we introduce CSTS (Correlation Structures in Time Series), a synthetic benchmark for evaluating the discovery of correlation structures in multivariate time series data. CSTS provides a clean benchmark that enables researchers to isolate and identify specific causes of clustering failures by differentiating between correlation structure deterioration and limitations of clustering algorithms and validation methods. Our contributions are: (1) a comprehensive benchmark for correlation structure discovery with distinct correlation structures, systematically varied data conditions, established performance thresholds, and recommended evaluation protocols; (2) empirical validation of correlation structure preservation showing moderate distortion from downsampling and minimal effects from distribution shifts and sparsification; and (3) an extensible data generation framework enabling structure-first clustering evaluation. A case study demonstrates CSTS's practical utility by identifying an algorithm's previously undocumented sensitivity to non-normal distributions, illustrating how the benchmark enables precise diagnosis of methodological limitations. CSTS advances rigorous evaluation standards for correlation-based time series clustering.
摘要：时间序列聚类有望在数据中揭示数据中隐藏的结构模式，并在医疗保健，金融，工业系统和其他关键领域的应用中进行了应用。但是，如果没有经过验证的地面真相信息，研究人员将无法客观地评估聚类质量，或者确定差的结果是否源于数据中缺乏结构，算法限制或不适当的验证方法，从而提出了一个问题，即聚类是“比科学更多的艺术”（Guyon等，2009）。为了应对这些挑战，我们引入了CST（时间序列中的相关结构），这是评估多元时间序列数据中相关结构的合成基准。 CST提供了一个干净的基准测试，使研究人员能够通过区分相关结构恶化和聚类算法和验证方法的局限性来隔离和识别聚类失败的特定原因。我们的贡献是：（1）具有不同相关结构，系统变化的数据条件，既定的性能阈值和建议的评估协议的相关结构发现的综合基准；（2）相关结构保存的经验验证表明，分布变化和稀疏性的下采样和最小效应中等变形；（3）可扩展的数据生成框架实现结构优先群集评估。一项案例研究通过识别算法先前对非正常分布的敏感性来证明CST的实用性，这说明了基准如何实现方法论限制的精确诊断。 CST提高了基于相关时间序列聚类的严格评估标准。

Title: KERL: Knowledge-Enhanced Personalized Recipe Recommendation using Large Language Models

Authors: Fnu Mohbat, Mohammed J Zaki
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.14629
Pdf URL: https://arxiv.org/pdf/2505.14629
Copy Paste: [[2505.14629]] KERL: Knowledge-Enhanced Personalized Recipe Recommendation using Large Language Models(https://arxiv.org/abs/2505.14629)
Keywords: generation
Abstract: Recent advances in large language models (LLMs) and the abundance of food data have resulted in studies to improve food understanding using LLMs. Although several recommendation systems utilize LLMs and Knowledge Graphs (KGs), there has been limited research on integrating food-related KGs with LLMs. We introduce KERL, a unified system that leverages food KGs and LLMs to provide personalized food recommendations and generate recipes with associated micro-nutritional information. Given a natural language question, KERL extracts entities and retrieves subgraphs from the KG, which are then fed into the LLM as context to select the recipes that satisfy the constraints. Next, our system generates the cooking steps and nutritional information for each recipe. To evaluate our approach, we also develop a benchmark dataset by curating recipe-related questions, combined with constraints and personal preferences. Through extensive experiments, we show that our proposed KG-augmented LLM significantly outperforms existing approaches, offering a complete and coherent solution for food recommendation, recipe generation, and nutritional analysis. Our code and benchmark datasets are publicly available at this https URL.

Title: CAD-Coder: An Open-Source Vision-Language Model for Computer-Aided Design Code Generation

Authors: Anna C. Doris, Md Ferdous Alam, Amin Heyrani Nobari, Faez Ahmed
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14646
Pdf URL: https://arxiv.org/pdf/2505.14646
Copy Paste: [[2505.14646]] CAD-Coder: An Open-Source Vision-Language Model for Computer-Aided Design Code Generation(https://arxiv.org/abs/2505.14646)
Keywords: generation
Abstract: Efficient creation of accurate and editable 3D CAD models is critical in engineering design, significantly impacting cost and time-to-market in product innovation. Current manual workflows remain highly time-consuming and demand extensive user expertise. While recent developments in AI-driven CAD generation show promise, existing models are limited by incomplete representations of CAD operations, inability to generalize to real-world images, and low output accuracy. This paper introduces CAD-Coder, an open-source Vision-Language Model (VLM) explicitly fine-tuned to generate editable CAD code (CadQuery Python) directly from visual input. Leveraging a novel dataset that we created--GenCAD-Code, consisting of over 163k CAD-model image and code pairs--CAD-Coder outperforms state-of-the-art VLM baselines such as GPT-4.5 and Qwen2.5-VL-72B, achieving a 100% valid syntax rate and the highest accuracy in 3D solid similarity. Notably, our VLM demonstrates some signs of generalizability, successfully generating CAD code from real-world images and executing CAD operations unseen during fine-tuning. The performance and adaptability of CAD-Coder highlight the potential of VLMs fine-tuned on code to streamline CAD workflows for engineers and designers. CAD-Coder is publicly available at: this https URL.

Title: UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens

Authors: Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, Bocheng Zou, Chaoqun Yang, Wentao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.14671
Pdf URL: https://arxiv.org/pdf/2505.14671
Copy Paste: [[2505.14671]] UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens(https://arxiv.org/abs/2505.14671)
Keywords: generation
Abstract: Personalized models have demonstrated remarkable success in understanding and generating concepts provided by users. However, existing methods use separate concept tokens for understanding and generation, treating these tasks in isolation. This may result in limitations for generating images with complex prompts. For example, given the concept $\langle bo\rangle$, generating "$\langle bo\rangle$ wearing its hat" without additional textual descriptions of its hat. We call this kind of generation personalized knowledge-driven generation. To address the limitation, we present UniCTokens, a novel framework that effectively integrates personalized information into a unified vision language model (VLM) for understanding and generation. UniCTokens trains a set of unified concept tokens to leverage complementary semantics, boosting two personalized tasks. Moreover, we propose a progressive training strategy with three stages: understanding warm-up, bootstrapping generation from understanding, and deepening understanding from generation to enhance mutual benefits between both tasks. To quantitatively evaluate the unified VLM personalization, we present UnifyBench, the first benchmark for assessing concept understanding, concept generation, and knowledge-driven generation. Experimental results on UnifyBench indicate that UniCTokens shows competitive performance compared to leading methods in concept understanding and concept generation, and achieves state-of-the-art results in personalized knowledge-driven generation. Our research demonstrates that enhanced understanding improves generation, and the generation process can yield valuable insights into understanding. Our code and dataset will be released at: this https URL.

Title: Training-Free Watermarking for Autoregressive Image Generation

Authors: Yu Tong, Zihao Pan, Shuai Yang, Kaiyang Zhou
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2505.14673
Pdf URL: https://arxiv.org/pdf/2505.14673
Copy Paste: [[2505.14673]] Training-Free Watermarking for Autoregressive Image Generation(https://arxiv.org/abs/2505.14673)
Keywords: generation, generative
Abstract: Invisible image watermarking can protect image ownership and prevent malicious misuse of visual generative models. However, existing generative watermarking methods are mainly designed for diffusion models while watermarking for autoregressive image generation models remains largely underexplored. We propose IndexMark, a training-free watermarking framework for autoregressive image generation models. IndexMark is inspired by the redundancy property of the codebook: replacing autoregressively generated indices with similar indices produces negligible visual differences. The core component in IndexMark is a simple yet effective match-then-replace method, which carefully selects watermark tokens from the codebook based on token similarity, and promotes the use of watermark tokens through token replacement, thereby embedding the watermark without affecting the image quality. Watermark verification is achieved by calculating the proportion of watermark tokens in generated images, with precision further improved by an Index Encoder. Furthermore, we introduce an auxiliary validation scheme to enhance robustness against cropping attacks. Experiments demonstrate that IndexMark achieves state-of-the-art performance in terms of image quality and verification accuracy, and exhibits robustness against various perturbations, including cropping, noises, Gaussian blur, random erasing, color jittering, and JPEG compression.
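The match-then-replace idea and the proportion-based verification described above can be sketched on a toy example. This is a minimal illustration under stated assumptions, not the IndexMark implementation: the codebook is one-dimensional, "similar" indices are simply neighbours in sorted order, and the Index Encoder and the cropping-robust validation scheme are not modelled. All function names are hypothetical.

```python
import random

def build_pairs(codebook):
    """Pair up codebook indices whose (toy, 1-D) entries are adjacent in sorted order."""
    order = sorted(range(len(codebook)), key=lambda i: codebook[i])
    return [(order[k], order[k + 1]) for k in range(0, len(order) - 1, 2)]

def assign_watermark(pairs, seed):
    """Pick one index per pair as the watermark token; map the other index to it."""
    rng = random.Random(seed)
    watermark, partner = set(), {}
    for a, b in pairs:
        chosen, other = (a, b) if rng.random() < 0.5 else (b, a)
        watermark.add(chosen)
        partner[other] = chosen
    return watermark, partner

def embed(indices, partner):
    """Replace every non-watermark index with its similar watermark partner."""
    return [partner.get(i, i) for i in indices]

def watermark_rate(indices, watermark):
    """Verification statistic: proportion of indices that are watermark tokens."""
    return sum(i in watermark for i in indices) / len(indices)

codebook = [0.10, 0.90, 0.12, 0.50, 0.88, 0.52, 0.30, 0.31]
pairs = build_pairs(codebook)
watermark, partner = assign_watermark(pairs, seed=0)
tokens = list(range(len(codebook))) * 4   # each index appears equally often
marked = embed(tokens, partner)
print(watermark_rate(tokens, watermark))  # 0.5 for this balanced input
print(watermark_rate(marked, watermark))  # 1.0 after replacement
```

A detector would compare the rate against a threshold between the two values; where that threshold sits in practice is a design choice that this sketch does not address.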

Title: UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation

Authors: Rui Tian, Mingfei Gao, Mingze Xu, Jiaming Hu, Jiasen Lu, Zuxuan Wu, Yinfei Yang, Afshin Dehghan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.14682
Pdf URL: https://arxiv.org/pdf/2505.14682
Copy Paste: [[2505.14682]] UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation(https://arxiv.org/abs/2505.14682)
Keywords: generation
Abstract: We introduce UniGen, a unified multimodal large language model (MLLM) capable of image understanding and generation. We study the full training pipeline of UniGen from a data-centric perspective, including multi-stage pre-training, supervised fine-tuning, and direct preference optimization. More importantly, we propose a new Chain-of-Thought Verification (CoT-V) strategy for test-time scaling, which significantly boosts UniGen's image generation quality using a simple Best-of-N test-time strategy. Specifically, CoT-V enables UniGen to act as both image generator and verifier at test time, assessing the semantic alignment between a text prompt and its generated image in a step-by-step CoT manner. Trained entirely on open-source datasets across all stages, UniGen achieves state-of-the-art performance on a range of image understanding and generation benchmarks, with a final score of 0.78 on GenEval and 85.19 on DPG-Bench. Through extensive ablation studies, our work provides actionable insights and addresses key challenges in the full life cycle of building unified MLLMs, contributing meaningful directions to the future research.
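The Best-of-N selection with a step-by-step verifier can be sketched as follows. This is a generic illustration, assuming the verifier breaks a prompt into individual checks and scores a candidate by the fraction it passes; in UniGen the same model acts as generator and verifier, whereas here both are stubbed with hypothetical tag sets, and the names `stepwise_score` and `best_of_n` are not from the paper.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def stepwise_score(prompt_checks, candidate_tags):
    """Score a candidate by checking each prompt requirement one at a time."""
    passed = [check in candidate_tags for check in prompt_checks]
    return sum(passed) / len(passed)

def best_of_n(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the candidate with the highest verifier score (first one on ties)."""
    return max(candidates, key=score)

# Stand-ins for N generated images, each described by the attributes it shows.
prompt_checks = ["red", "cube", "on table"]
candidates = [
    {"red", "sphere"},
    {"red", "cube", "on table"},
    {"blue", "cube"},
]
best = best_of_n(candidates, lambda c: stepwise_score(prompt_checks, c))
print(best)
```

Separating the scoring function from the selection keeps the selection step independent of how the verifier is implemented.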

Title: Emerging Properties in Unified Multimodal Pretraining

Authors: Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.14683
Pdf URL: https://arxiv.org/pdf/2505.14683
Copy Paste: [[2505.14683]] Emerging Properties in Unified Multimodal Pretraining(https://arxiv.org/abs/2505.14683)
Keywords: generation
Abstract: Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, and data creation protocol, and release our code and checkpoints to the community. The project page is at this https URL.

Title: Grouping First, Attending Smartly: Training-Free Acceleration for Diffusion Transformers

Authors: Sucheng Ren, Qihang Yu, Ju He, Alan Yuille, Liang-Chieh Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.14687
Pdf URL: https://arxiv.org/pdf/2505.14687
Copy Paste: [[2505.14687]] Grouping First, Attending Smartly: Training-Free Acceleration for Diffusion Transformers(https://arxiv.org/abs/2505.14687)
Keywords: generation, generative
Abstract: Diffusion-based Transformers have demonstrated impressive generative capabilities, but their high computational costs hinder practical deployment; for example, generating an $8192\times 8192$ image can take over an hour on an A100 GPU. In this work, we propose GRAT (GRouping first, ATtending smartly), a training-free attention acceleration strategy for fast image and video generation without compromising output quality. The key insight is to exploit the inherent sparsity in learned attention maps (which tend to be locally focused) in pretrained Diffusion Transformers and leverage better GPU parallelism. Specifically, GRAT first partitions contiguous tokens into non-overlapping groups, aligning both with GPU execution patterns and the local attention structures learned in pretrained generative Transformers. It then accelerates attention by having all query tokens within the same group share a common set of attendable key and value tokens. These key and value tokens are further restricted to structured regions, such as surrounding blocks or criss-cross regions, significantly reducing computational overhead (e.g., attaining a $35.8\times$ speedup over full attention when generating $8192\times 8192$ images) while preserving essential attention patterns and long-range context. We validate GRAT on pretrained Flux and HunyuanVideo for image and video generation, respectively. In both cases, GRAT achieves substantially faster inference without any fine-tuning, while maintaining the performance of full attention. We hope GRAT will inspire future research on accelerating Diffusion Transformers for scalable visual generation.
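The grouping idea can be illustrated with a small attention-mask sketch. It is a simplification under stated assumptions: tokens lie on a 1-D sequence rather than the 2-D or 3-D grids of image and video models, the "surrounding blocks" region is a window of neighbouring groups, and the mask is materialized densely, so the GPU kernel and the reported speedup are not reproduced. The function names and the `radius` parameter are hypothetical.

```python
def grouped_local_mask(num_tokens, group_size, radius):
    """Boolean mask where all queries in a group share the same attendable keys.

    A query in group g may attend to every key whose group index lies within
    `radius` groups of g (a 1-D stand-in for "surrounding blocks").
    """
    mask = [[False] * num_tokens for _ in range(num_tokens)]
    for q in range(num_tokens):
        q_group = q // group_size
        for k in range(num_tokens):
            k_group = k // group_size
            mask[q][k] = abs(q_group - k_group) <= radius
    return mask

def density(mask):
    """Fraction of query-key pairs that are kept relative to full attention."""
    n = len(mask)
    return sum(map(sum, mask)) / (n * n)

mask = grouped_local_mask(num_tokens=16, group_size=4, radius=1)
print(mask[0] == mask[3])   # queries 0 and 3 are in the same group
print(density(mask))        # share of full-attention pairs that remain
```

Because every row within a group is identical, a real kernel could load the shared keys and values once per group instead of once per query.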