2025-04-24

Title: Multimodal Large Language Models for Enhanced Traffic Safety: A Comprehensive Review and Future Trends

Authors: Mohammad Abu Tami, Mohammed Elhenawy, Huthaifa I. Ashqar
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2504.16134
Pdf URL: https://arxiv.org/pdf/2504.16134
Copy Paste: [[2504.16134]] Multimodal Large Language Models for Enhanced Traffic Safety: A Comprehensive Review and Future Trends(https://arxiv.org/abs/2504.16134)
Keywords: generation
Abstract: Traffic safety remains a critical global challenge, with traditional Advanced Driver-Assistance Systems (ADAS) often struggling in dynamic real-world scenarios due to fragmented sensor processing and susceptibility to adversarial conditions. This paper reviews the transformative potential of Multimodal Large Language Models (MLLMs) in addressing these limitations by integrating cross-modal data such as visual, spatial, and environmental inputs to enable holistic scene understanding. Through a comprehensive analysis of MLLM-based approaches, we highlight their capabilities in enhancing perception, decision-making, and adversarial robustness, while also examining the role of key datasets (e.g., KITTI, DRAMA, ML4RoadSafety) in advancing research. Furthermore, we outline future directions, including real-time edge deployment, causality-driven reasoning, and human-AI collaboration. By positioning MLLMs as a cornerstone for next-generation traffic safety systems, this review underscores their potential to revolutionize the field, offering scalable, context-aware solutions that proactively mitigate risks and improve overall road safety.
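Summary (translated from Chinese): Traffic safety remains a critical global challenge. Traditional Advanced Driver-Assistance Systems (ADAS) often struggle in dynamic real-world scenarios because of fragmented sensor processing and susceptibility to adversarial conditions. This review surveys how Multimodal Large Language Models (MLLMs) address these limits by integrating visual, spatial, and environmental inputs, examines key datasets (KITTI, DRAMA, ML4RoadSafety), and outlines future directions including real-time edge deployment, causality-driven reasoning, and human-AI collaboration.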

Title: Learning Energy-Based Generative Models via Potential Flow: A Variational Principle Approach to Probability Density Homotopy Matching

Authors: Junn Yong Loo, Michelle Adeline, Julia Kaiwen Lau, Fang Yu Leong, Hwa Hui Tew, Arghya Pal, Vishnu Monn Baskaran, Chee-Ming Ting, Raphaël C.-W. Phan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.16262
Pdf URL: https://arxiv.org/pdf/2504.16262
Copy Paste: [[2504.16262]] Learning Energy-Based Generative Models via Potential Flow: A Variational Principle Approach to Probability Density Homotopy Matching(https://arxiv.org/abs/2504.16262)
Keywords: generation, generative
Abstract: Energy-based models (EBMs) are a powerful class of probabilistic generative models due to their flexibility and interpretability. However, relationships between potential flows and explicit EBMs remain underexplored, while contrastive divergence training via implicit Markov chain Monte Carlo (MCMC) sampling is often unstable and expensive in high-dimensional settings. In this paper, we propose Variational Potential Flow Bayes (VPFB), a new energy-based generative framework that eliminates the need for implicit MCMC sampling and does not rely on auxiliary networks or cooperative training. VPFB learns an energy-parameterized potential flow by constructing a flow-driven density homotopy that is matched to the data distribution through a variational loss minimizing the Kullback-Leibler divergence between the flow-driven and marginal homotopies. This principled formulation enables robust and efficient generative modeling while preserving the interpretability of EBMs. Experimental results on image generation, interpolation, out-of-distribution detection, and compositional generation confirm the effectiveness of VPFB, showing that our method performs competitively with existing approaches in terms of sample quality and versatility across diverse generative modeling tasks.
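Summary (translated from Chinese): Energy-based models (EBMs) are flexible and interpretable probabilistic generative models, but their relationship to potential flows is underexplored, and contrastive divergence training with implicit MCMC sampling is often unstable and expensive in high dimensions. The paper proposes Variational Potential Flow Bayes (VPFB), which needs neither implicit MCMC sampling nor auxiliary networks: it learns an energy-parameterized potential flow whose flow-driven density homotopy is matched to the data distribution through a variational loss that minimizes the Kullback-Leibler divergence. Experiments on image generation, interpolation, out-of-distribution detection, and compositional generation show performance competitive with existing approaches.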

Title: PixelWeb: The First Web GUI Dataset with Pixel-Wise Labels

Authors: Qi Yang, Weichen Bi, Haiyang Shen, Yaoqi Guo, Yun Ma
Subjects: cs.CV, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2504.16419
Pdf URL: https://arxiv.org/pdf/2504.16419
Copy Paste: [[2504.16419]] PixelWeb: The First Web GUI Dataset with Pixel-Wise Labels(https://arxiv.org/abs/2504.16419)
Keywords: generation
Abstract: Graphical User Interface (GUI) datasets are crucial for various downstream tasks. However, GUI datasets often generate annotation information through automatic labeling, which commonly results in inaccurate GUI element BBox annotations, including missing, duplicate, or meaningless BBoxes. These issues can degrade the performance of models trained on these datasets, limiting their effectiveness in real-world applications. Additionally, existing GUI datasets only provide BBox annotations visually, which restricts the development of visually related GUI downstream tasks. To address these issues, we introduce PixelWeb, a large-scale GUI dataset containing over 100,000 annotated web pages. PixelWeb is constructed using a novel automatic annotation approach that integrates visual feature extraction and Document Object Model (DOM) structure analysis through two core modules: channel derivation and layer analysis. Channel derivation ensures accurate localization of GUI elements in cases of occlusion and overlapping elements by extracting BGRA four-channel bitmap annotations. Layer analysis uses the DOM to determine the visibility and stacking order of elements, providing precise BBox annotations. Additionally, PixelWeb includes comprehensive metadata such as element images, contours, and mask annotations. Manual verification by three independent annotators confirms the high quality and accuracy of PixelWeb annotations. Experimental results on GUI element detection tasks show that PixelWeb achieves performance on the mAP95 metric that is 3-7 times better than existing datasets. We believe that PixelWeb has great potential for performance improvement in downstream tasks such as GUI generation and automated user interaction.
Summary (translated from Chinese): Automatically labeled GUI datasets often contain inaccurate BBox annotations (missing, duplicate, or meaningless boxes), which degrades models trained on them. PixelWeb is a large-scale GUI dataset of over 100,000 annotated web pages, built with an automatic annotation approach that combines visual feature extraction with DOM structure analysis. Channel derivation extracts BGRA four-channel bitmap annotations to localize occluded and overlapping elements, and layer analysis uses the DOM to determine visibility and stacking order. Three independent annotators verified annotation quality, and GUI element detection results on the mAP95 metric are 3-7 times better than with existing datasets.
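To illustrate why a four-channel bitmap is useful for localization, the following is a minimal sketch, not the PixelWeb pipeline itself. It assumes each element has been rendered alone on a transparent canvas, so that its pixels are exactly those with non-zero alpha. The function name `bbox_from_bgra` and the tiny sample layer are hypothetical.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int, int]  # (B, G, R, A), each 0-255


def bbox_from_bgra(bitmap: List[List[Pixel]]) -> Optional[Tuple[int, int, int, int]]:
    """Return (x_min, y_min, x_max, y_max) over pixels with alpha > 0.

    Returns None when the layer is fully transparent, i.e. the element
    is not visible at all in this rendering.
    """
    xs: List[int] = []
    ys: List[int] = []
    for y, row in enumerate(bitmap):
        for x, (_b, _g, _r, alpha) in enumerate(row):
            if alpha > 0:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical 4x3 layer: T is transparent, P is an opaque element pixel.
T: Pixel = (0, 0, 0, 0)
P: Pixel = (255, 128, 0, 255)
element_layer = [
    [T, T, T, T],
    [T, P, P, T],
    [T, P, T, T],
]

print(bbox_from_bgra(element_layer))  # (1, 1, 2, 2)
```

The same alpha mask could also serve as a pixel-wise label, which a plain bounding box cannot express for non-rectangular or partially occluded elements.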

Title: Cross Paradigm Representation and Alignment Transformer for Image Deraining

Authors: Shun Zou, Yi Zou, Juncheng Li, Guangwei Gao, Guojun Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.16455
Pdf URL: https://arxiv.org/pdf/2504.16455
Copy Paste: [[2504.16455]] Cross Paradigm Representation and Alignment Transformer for Image Deraining(https://arxiv.org/abs/2504.16455)
Keywords: restoration
Abstract: Transformer-based networks have achieved strong performance in low-level vision tasks like image deraining by utilizing spatial or channel-wise self-attention. However, irregular rain patterns and complex geometric overlaps challenge single-paradigm architectures, necessitating a unified framework to integrate complementary global-local and spatial-channel representations. To address this, we propose a novel Cross Paradigm Representation and Alignment Transformer (CPRAformer). Its core idea is the hierarchical representation and alignment, leveraging the strengths of both paradigms (spatial-channel and global-local) to aid image reconstruction. It bridges the gap within and between paradigms, aligning and coordinating them to enable deep interaction and fusion of features. Specifically, we use two types of self-attention in the Transformer blocks: sparse prompt channel self-attention (SPC-SA) and spatial pixel refinement self-attention (SPR-SA). SPC-SA enhances global channel dependencies through dynamic sparsity, while SPR-SA focuses on spatial rain distribution and fine-grained texture recovery. To address the feature misalignment and knowledge differences between them, we introduce the Adaptive Alignment Frequency Module (AAFM), which aligns and interacts with features in a two-stage progressive manner, enabling adaptive guidance and complementarity. This reduces the information gap within and between paradigms. Through this unified cross-paradigm dynamic interaction framework, we achieve the extraction of the most valuable interactive fusion information from the two paradigms. Extensive experiments demonstrate that our model achieves state-of-the-art performance on eight benchmark datasets and further validates CPRAformer's robustness in other image restoration tasks and downstream applications.
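Summary (translated from Chinese): Transformer-based deraining networks rely on spatial or channel-wise self-attention, but irregular rain patterns and complex geometric overlaps challenge single-paradigm designs. CPRAformer (Cross Paradigm Representation and Alignment Transformer) combines the spatial-channel and global-local paradigms through hierarchical representation and alignment. It uses sparse prompt channel self-attention (SPC-SA) for global channel dependencies and spatial pixel refinement self-attention (SPR-SA) for rain distribution and fine-grained texture recovery, and aligns the two with the Adaptive Alignment Frequency Module (AAFM) in a two-stage progressive manner. The model reports state-of-the-art performance on eight benchmark datasets and robustness on other image restoration tasks and downstream applications.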

Title: A Comprehensive Survey of Synthetic Tabular Data Generation

Authors: Ruxue Shi, Yili Wang, Mengnan Du, Xu Shen, Xin Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.16506
Pdf URL: https://arxiv.org/pdf/2504.16506
Copy Paste: [[2504.16506]] A Comprehensive Survey of Synthetic Tabular Data Generation(https://arxiv.org/abs/2504.16506)
Keywords: generation, generative
Abstract: Tabular data remains one of the most prevalent and critical data formats across diverse real-world applications. However, its effective use in machine learning (ML) is often constrained by challenges such as data scarcity, privacy concerns, and class imbalance. Synthetic data generation has emerged as a promising solution, leveraging generative models to learn the distribution of real datasets and produce high-fidelity, privacy-preserving samples. Various generative paradigms have been explored, including energy-based models (EBMs), variational autoencoders (VAEs), generative adversarial networks (GANs), large language models (LLMs), and diffusion models. While several surveys have investigated synthetic tabular data generation, most focus on narrow subdomains or specific generative methods, such as GANs, diffusion models, or privacy-preserving techniques. This limited scope often results in fragmented insights, lacking a comprehensive synthesis that bridges diverse approaches. In particular, recent advances driven by LLMs and diffusion-based models remain underexplored. This gap hinders a holistic understanding of the field's evolution, methodological interplay, and open challenges. To address this, our survey provides a unified and systematic review of synthetic tabular data generation. Our contributions are threefold: (1) we propose a comprehensive taxonomy that organizes existing methods into traditional approaches, diffusion-based methods, and LLM-based models, and provide an in-depth comparative analysis; (2) we detail the complete pipeline for synthetic tabular data generation, including data synthesis, post-processing, and evaluation; (3) we identify major challenges, explore real-world applications, and outline open research questions and future directions to guide future work in this rapidly evolving area.
Summary (translated from Chinese): Tabular data is among the most common data formats, but its use in machine learning is limited by data scarcity, privacy concerns, and class imbalance. Synthetic data generation addresses this with generative models such as EBMs, VAEs, GANs, LLMs, and diffusion models. Existing surveys mostly cover narrow subdomains, leaving recent LLM-based and diffusion-based advances underexplored. This survey contributes (1) a taxonomy spanning traditional, diffusion-based, and LLM-based methods with a comparative analysis, (2) a description of the full pipeline of data synthesis, post-processing, and evaluation, and (3) a discussion of challenges, applications, open questions, and future directions.

Title: Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes

Authors: Joan Perez (1), Giovanni Fusco (2) ((1) Urban Geo Analytics, France (2) Universite Cote-Azur-CNRS-AMU-Avignon Universite, ESPACE, France)
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.16538
Pdf URL: https://arxiv.org/pdf/2504.16538
Copy Paste: [[2504.16538]] Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes(https://arxiv.org/abs/2504.16538)
Keywords: generative
Abstract: Streetscapes are an essential component of urban space. Their assessment is presently either limited to morphometric properties of their mass skeleton or requires labor-intensive qualitative evaluations of visually perceived qualities. This paper introduces SAGAI: Streetscape Analysis with Generative Artificial Intelligence, a modular workflow for scoring street-level urban scenes using open-access data and vision-language models. SAGAI integrates OpenStreetMap geometries, Google Street View imagery, and a lightweight version of the LLaVA model to generate structured spatial indicators from images via customizable natural language prompts. The pipeline includes an automated mapping module that aggregates visual scores at both the point and street levels, enabling direct cartographic interpretation. It operates without task-specific training or proprietary software dependencies, supporting scalable and interpretable analysis of urban environments. Two exploratory case studies in Nice and Vienna illustrate SAGAI's capacity to produce geospatial outputs from vision-language inference. The initial results show strong performance for binary urban-rural scene classification, moderate precision in commercial feature detection, and lower estimates, but still informative, of sidewalk width. Fully deployable by any user, SAGAI can be easily adapted to a wide range of urban research themes, such as walkability, safety, or urban design, through prompt modification alone.
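Summary (translated from Chinese): Streetscape assessment is currently limited either to morphometric properties or to labor-intensive qualitative evaluation. SAGAI (Streetscape Analysis with Generative Artificial Intelligence) is a modular workflow that scores street-level urban scenes using OpenStreetMap geometries, Google Street View imagery, and a lightweight version of the LLaVA model driven by customizable natural language prompts. An automated mapping module aggregates visual scores at both the point and street levels. The pipeline needs no task-specific training or proprietary software. Case studies in Nice and Vienna show strong performance for binary urban-rural scene classification, moderate precision in commercial feature detection, and lower but still informative estimates of sidewalk width.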

Title: Beyond Anonymization: Object Scrubbing for Privacy-Preserving 2D and 3D Vision Tasks

Authors: Murat Bilgehan Ertan, Ronak Sahu, Phuong Ha Nguyen, Kaleel Mahmood, Marten van Dijk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.16557
Pdf URL: https://arxiv.org/pdf/2504.16557
Copy Paste: [[2504.16557]] Beyond Anonymization: Object Scrubbing for Privacy-Preserving 2D and 3D Vision Tasks(https://arxiv.org/abs/2504.16557)
Keywords: generative
Abstract: We introduce ROAR (Robust Object Removal and Re-annotation), a scalable framework for privacy-preserving dataset obfuscation that eliminates sensitive objects instead of modifying them. Our method integrates instance segmentation with generative inpainting to remove identifiable entities while preserving scene integrity. Extensive evaluations on 2D COCO-based object detection show that ROAR achieves 87.5% of the baseline detection average precision (AP), whereas image dropping achieves only 74.2% of the baseline AP, highlighting the advantage of scrubbing in preserving dataset utility. The degradation is even more severe for small objects due to occlusion and loss of fine-grained details. Furthermore, in NeRF-based 3D reconstruction, our method incurs a PSNR loss of at most 1.66 dB while maintaining SSIM and improving LPIPS, demonstrating superior perceptual quality. Our findings establish object removal as an effective privacy framework, achieving strong privacy guarantees with minimal performance trade-offs. The results highlight key challenges in generative inpainting, occlusion-robust segmentation, and task-specific scrubbing, setting the foundation for future advancements in privacy-preserving vision systems.
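Summary (translated from Chinese): ROAR (Robust Object Removal and Re-annotation) is a scalable framework for privacy-preserving dataset obfuscation that removes sensitive objects instead of modifying them, combining instance segmentation with generative inpainting. On 2D COCO-based object detection, ROAR retains 87.5% of the baseline average precision (AP), whereas image dropping retains only 74.2%. Degradation is more severe for small objects because of occlusion and loss of fine detail. In NeRF-based 3D reconstruction, the method incurs a PSNR loss of at most 1.66 dB while maintaining SSIM and improving LPIPS. The results highlight open challenges in generative inpainting, occlusion-robust segmentation, and task-specific scrubbing.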
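As a rough aid for reading the "at most 1.66 dB" PSNR figure, the standard definition PSNR = 10 * log10(MAX^2 / MSE) implies that a drop of d dB corresponds to the mean squared error growing by a factor of 10^(d/10). The sketch below only applies that textbook formula; the function names and the example MSE value are illustrative assumptions, not numbers from the paper.

```python
import math


def psnr(mse: float, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_psnr_drop(drop_db: float) -> float:
    """Factor by which MSE grows when PSNR falls by `drop_db` decibels."""
    return 10.0 ** (drop_db / 10.0)


ratio = mse_ratio_for_psnr_drop(1.66)

# Illustrative baseline error (assumed, not from the paper).
baseline_mse = 100.0
drop = psnr(baseline_mse) - psnr(baseline_mse * ratio)
print(f"MSE grows by a factor of about {ratio:.2f} for a {drop:.2f} dB drop")
```

Under this definition, a 1.66 dB loss means the reconstruction's mean squared error is roughly 1.47 times the baseline's, regardless of the baseline value.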

Title: Unified Molecule Generation and Property Prediction

Authors: Adam Izdebski, Jan Olszewski, Pankhil Gawade, Krzysztof Koras, Serra Korkmaz, Valentin Rauscher, Jakub M. Tomczak, Ewa Szczurek
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2504.16559
Pdf URL: https://arxiv.org/pdf/2504.16559
Copy Paste: [[2504.16559]] Unified Molecule Generation and Property Prediction(https://arxiv.org/abs/2504.16559)
Keywords: generation, generative
Abstract: Modeling the joint distribution of data samples and their properties makes it possible to construct a single model for both data generation and property prediction, with synergistic capabilities reaching beyond purely generative or predictive models. However, training joint models presents daunting architectural and optimization challenges. Here, we propose Hyformer, a transformer-based joint model that successfully blends the generative and predictive functionalities, using an alternating attention mask together with a unified pre-training scheme. We show that Hyformer rivals other joint models, as well as state-of-the-art molecule generation and property prediction models. Additionally, we show the benefits of joint modeling in downstream tasks of molecular representation learning, hit identification and antimicrobial peptide design.
Summary (translated from Chinese): Modeling the joint distribution of data samples and their properties allows one model to handle both generation and property prediction, but training such joint models is architecturally and computationally difficult. Hyformer is a transformer-based joint model that blends generative and predictive functionality using an alternating attention mask and a unified pre-training scheme. It rivals other joint models and state-of-the-art molecule generation and property prediction models, with benefits shown in molecular representation learning, hit identification, and antimicrobial peptide design.

Title: Hyper-Transforming Latent Diffusion Models

Authors: Ignacio Peis, Batuhan Koyuncu, Isabel Valera, Jes Frellsen
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.16580
Pdf URL: https://arxiv.org/pdf/2504.16580
Copy Paste: [[2504.16580]] Hyper-Transforming Latent Diffusion Models(https://arxiv.org/abs/2504.16580)
Keywords: generation, generative
Abstract: We introduce a novel generative framework for functions by integrating Implicit Neural Representations (INRs) and Transformer-based hypernetworks into latent variable models. Unlike prior approaches that rely on MLP-based hypernetworks with scalability limitations, our method employs a Transformer-based decoder to generate INR parameters from latent variables, addressing both representation capacity and computational efficiency. Our framework extends latent diffusion models (LDMs) to INR generation by replacing standard decoders with a Transformer-based hypernetwork, which can be trained either from scratch or via hyper-transforming, a strategy that fine-tunes only the decoder while freezing the pre-trained latent space. This enables efficient adaptation of existing generative models to INR-based representations without requiring full retraining.
Summary (translated from Chinese): The paper introduces a generative framework for functions that integrates Implicit Neural Representations (INRs) and Transformer-based hypernetworks into latent variable models. Instead of MLP-based hypernetworks, which scale poorly, a Transformer-based decoder generates INR parameters from latent variables. The framework extends latent diffusion models (LDMs) to INR generation and can be trained from scratch or via hyper-transforming, which fine-tunes only the decoder while freezing the pre-trained latent space, so existing generative models can be adapted without full retraining.

Title: EHGCN: Hierarchical Euclidean-Hyperbolic Fusion via Motion-Aware GCN for Hybrid Event Stream Perception

Authors: Haosheng Chen, Lian Luo, Mengjingcheng Mo, Zhanjie Wu, Guobao Xiao, Ji Gan, Jiaxu Leng, Xinbo Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.16616
Pdf URL: https://arxiv.org/pdf/2504.16616
Copy Paste: [[2504.16616]] EHGCN: Hierarchical Euclidean-Hyperbolic Fusion via Motion-Aware GCN for Hybrid Event Stream Perception(https://arxiv.org/abs/2504.16616)
Keywords: generation
Abstract: Event cameras, with microsecond temporal resolution and high dynamic range (HDR) characteristics, emit high-speed event streams for perception tasks. Despite recent advances in GNN-based perception methods, these methods tend to use straightforward pairwise connectivity mechanisms in pure Euclidean space, where they struggle to capture long-range dependencies and fail to effectively characterize the inherent hierarchical structures of non-uniformly distributed event streams. To this end, we propose a novel approach named EHGCN, a pioneering method that perceives event streams in both Euclidean and hyperbolic spaces for event vision. In EHGCN, we introduce an adaptive sampling strategy to dynamically regulate sampling rates, retaining discriminative events while attenuating chaotic noise. Then we present a Markov Vector Field (MVF)-driven motion-aware hyperedge generation method based on motion state transition probabilities, thereby eliminating cross-target spurious associations and providing critical topological priors while capturing long-range dependencies between events. Finally, we propose a Euclidean-Hyperbolic GCN to fuse the information locally aggregated and globally hierarchically modeled in Euclidean and hyperbolic spaces, respectively, to achieve hybrid event perception. Experimental results on event perception tasks such as object detection and recognition validate the effectiveness of our approach.
Summary (translated from Chinese): Event cameras emit high-speed event streams with microsecond temporal resolution and high dynamic range. Existing GNN-based methods use simple pairwise connectivity in Euclidean space, so they struggle with long-range dependencies and with the hierarchical structure of non-uniformly distributed events. EHGCN perceives event streams in both Euclidean and hyperbolic spaces. It uses adaptive sampling to retain discriminative events while attenuating noise, a Markov Vector Field (MVF)-driven motion-aware hyperedge generation method based on motion state transition probabilities, and a Euclidean-Hyperbolic GCN that fuses local aggregation with global hierarchical modeling. Experiments on object detection and recognition validate the approach.

Title: Dual-Camera All-in-Focus Neural Radiance Fields

Authors: Xianrui Luo, Zijin Wu, Juewen Peng, Huiqiang Sun, Zhiguo Cao, Guosheng Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.16636
Pdf URL: https://arxiv.org/pdf/2504.16636
Copy Paste: [[2504.16636]] Dual-Camera All-in-Focus Neural Radiance Fields(https://arxiv.org/abs/2504.16636)
Keywords: restoration
Abstract: We present the first framework capable of synthesizing the all-in-focus neural radiance field (NeRF) from inputs without manual refocusing. Without refocusing, the camera will automatically focus on the fixed object for all views, and current NeRF methods typically using one camera fail due to the consistent defocus blur and a lack of sharp reference. To restore the all-in-focus NeRF, we introduce the dual-camera from smartphones, where the ultra-wide camera has a wider depth-of-field (DoF) and the main camera possesses a higher resolution. The dual camera pair saves the high-fidelity details from the main camera and uses the ultra-wide camera's deep DoF as reference for all-in-focus restoration. To this end, we first implement spatial warping and color matching to align the dual camera, followed by a defocus-aware fusion module with learnable defocus parameters to predict a defocus map and fuse the aligned camera pair. We also build a multi-view dataset that includes image pairs of the main and ultra-wide cameras in a smartphone. Extensive experiments on this dataset verify that our solution, termed DC-NeRF, can produce high-quality all-in-focus novel views and compares favorably against strong baselines quantitatively and qualitatively. We further show DoF applications of DC-NeRF with adjustable blur intensity and focal plane, including refocusing and split diopter.
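Summary (translated from Chinese): The paper presents the first framework that synthesizes an all-in-focus neural radiance field (NeRF) from inputs captured without manual refocusing, a setting in which single-camera NeRF methods fail because of consistent defocus blur and the lack of a sharp reference. It uses a smartphone dual-camera pair: the main camera supplies high-resolution detail and the ultra-wide camera's deeper depth-of-field (DoF) serves as the reference. Spatial warping and color matching align the pair, and a defocus-aware fusion module with learnable defocus parameters predicts a defocus map and fuses the images. On a new multi-view dataset, the method, termed DC-NeRF, compares favorably against strong baselines and supports DoF applications such as refocusing and split diopter.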

Title: RouteWinFormer: A Route-Window Transformer for Middle-range Attention in Image Restoration

Authors: Qifan Li, Tianyi Liang, Xingtao Wang, Xiaopeng Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.16637
Pdf URL: https://arxiv.org/pdf/2504.16637
Copy Paste: [[2504.16637]] RouteWinFormer: A Route-Window Transformer for Middle-range Attention in Image Restoration(https://arxiv.org/abs/2504.16637)
Keywords: restoration
Abstract: Transformer models have recently garnered significant attention in image restoration due to their ability to capture long-range pixel dependencies. However, long-range attention often results in computational overhead without practical necessity, as degradation and context are typically localized. Normalized average attention distance across various degradation datasets shows that middle-range attention is enough for image restoration. Building on this insight, we propose RouteWinFormer, a novel window-based Transformer that models middle-range context for image restoration. RouteWinFormer incorporates a Route-Windows Attention Module, which dynamically selects relevant nearby windows based on regional similarity for attention aggregation, extending the receptive field to a mid-range size efficiently. In addition, we introduce Multi-Scale Structure Regularization during training, enabling the sub-scale of the U-shaped network to focus on structural information, while the original scale learns degradation patterns based on generalized image structure priors. Extensive experiments demonstrate that RouteWinFormer outperforms state-of-the-art methods across 9 datasets in various image restoration tasks.
Summary (translated from Chinese): Long-range attention in restoration Transformers often adds computational overhead without practical benefit, because degradation and context are typically localized; normalized average attention distances across degradation datasets indicate that middle-range attention is sufficient. RouteWinFormer is a window-based Transformer whose Route-Windows Attention Module dynamically selects relevant nearby windows by regional similarity, extending the receptive field to a mid-range size efficiently. Multi-Scale Structure Regularization during training lets the sub-scale of the U-shaped network focus on structure while the original scale learns degradation patterns. The model outperforms state-of-the-art methods across 9 datasets.

Title: Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning

Authors: Chris, Yichen Wei, Yi Peng, Xiaokun Wang, Weijie Qiu, Wei Shen, Tianyidan Xie, Jiangbo Pei, Jianhao Zhang, Yunzhuo Hao, Xuchen Song, Yang Liu, Yahui Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.16656
Pdf URL: https://arxiv.org/pdf/2504.16656
Copy Paste: [[2504.16656]] Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning(https://arxiv.org/abs/2504.16656)
Keywords: generation
Abstract: We present Skywork R1V2, a next-generation multimodal reasoning model and a major leap forward from its predecessor, Skywork R1V. At its core, R1V2 introduces a hybrid reinforcement learning paradigm that harmonizes reward-model guidance with rule-based strategies, thereby addressing the long-standing challenge of balancing sophisticated reasoning capabilities with broad generalization. To further enhance training efficiency, we propose the Selective Sample Buffer (SSB) mechanism, which effectively counters the "Vanishing Advantages" dilemma inherent in Group Relative Policy Optimization (GRPO) by prioritizing high-value samples throughout the optimization process. Notably, we observe that excessive reinforcement signals can induce visual hallucinations, a phenomenon we systematically monitor and mitigate through calibrated reward thresholds throughout the training process. Empirical results affirm the exceptional capability of R1V2, with benchmark-leading performances such as 62.6 on OlympiadBench, 79.0 on AIME2024, 63.6 on LiveCodeBench, and 74.0 on MMMU. These results underscore R1V2's superiority over existing open-source models and demonstrate significant progress in closing the performance gap with premier proprietary systems, including Gemini 2.5 and OpenAI o4-mini. The Skywork R1V2 model weights have been publicly released to promote openness and reproducibility.
Summary (translated from Chinese): Skywork R1V2 is a multimodal reasoning model that succeeds Skywork R1V. It uses a hybrid reinforcement learning paradigm that combines reward-model guidance with rule-based strategies to balance reasoning capability and generalization. The Selective Sample Buffer (SSB) counters the "Vanishing Advantages" problem in Group Relative Policy Optimization (GRPO) by prioritizing high-value samples. The authors observe that excessive reinforcement signals can induce visual hallucinations and mitigate this with calibrated reward thresholds. Reported scores include 62.6 on OlympiadBench, 79.0 on AIME2024, 63.6 on LiveCodeBench, and 74.0 on MMMU, and the model weights are publicly released.
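The "Vanishing Advantages" issue can be seen in a small sketch of group-relative advantage normalization as commonly described for GRPO: each sampled response's reward is centered on the group mean and scaled by the group standard deviation, so a group whose responses all receive the same reward contributes no learning signal. The filtering function below is only a hypothetical stand-in for the idea of prioritizing informative samples; the abstract does not specify how the Selective Sample Buffer works, and the use of the population standard deviation and the `eps` and `tol` values are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Center rewards on the group mean and scale by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_informative(rewards: List[float], tol: float = 1e-6) -> List[Tuple[int, float]]:
    """Keep (index, advantage) pairs whose advantage is non-negligible.

    Hypothetical filter illustrating sample prioritization; not the
    paper's Selective Sample Buffer.
    """
    advantages = group_relative_advantages(rewards)
    return [(i, a) for i, a in enumerate(advantages) if abs(a) > tol]


uniform_group = [1.0, 1.0, 1.0, 1.0]  # every response scored the same
mixed_group = [1.0, 0.0, 0.0, 1.0]    # responses disagree

print(group_relative_advantages(uniform_group))  # all zeros: no gradient signal
print(len(select_informative(uniform_group)))    # 0
print(len(select_informative(mixed_group)))      # 4
```

As training progresses and more groups become uniformly correct, a growing share of samples falls into the first case, which is the motivation the abstract gives for prioritizing high-value samples.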

Title: PMG: Progressive Motion Generation via Sparse Anchor Postures Curriculum Learning

Authors: Yingjie Xi, Jian Jun Zhang, Xiaosong Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.16722
Pdf URL: https://arxiv.org/pdf/2504.16722
Copy Paste: [[2504.16722]] PMG: Progressive Motion Generation via Sparse Anchor Postures Curriculum Learning(https://arxiv.org/abs/2504.16722)
Keywords: generation
Abstract: In computer animation, game design, and human-computer interaction, synthesizing human motion that aligns with user intent remains a significant challenge. Existing methods have notable limitations: textual approaches offer high-level semantic guidance but struggle to describe complex actions accurately; trajectory-based techniques provide intuitive global motion direction yet often fall short in generating precise or customized character movements; and anchor-pose-guided methods are typically confined to synthesizing only simple motion patterns. To generate more controllable and precise human motions, we propose ProMoGen (Progressive Motion Generation), a novel framework that integrates trajectory guidance with sparse anchor motion control. Global trajectories ensure consistency in spatial direction and displacement, while sparse anchor motions only deliver precise action guidance without displacement. This decoupling enables independent refinement of both aspects, resulting in a more controllable, high-fidelity, and sophisticated motion synthesis. ProMoGen supports both dual and single control paradigms within a unified training process. Moreover, recognizing that direct learning from sparse motions is inherently unstable, we introduce SAP-CL (Sparse Anchor Posture Curriculum Learning), a curriculum learning strategy that progressively adjusts the number of anchors used for guidance, thereby enabling more precise and stable convergence. Extensive experiments demonstrate that ProMoGen excels in synthesizing vivid and diverse motions guided by predefined trajectory and arbitrary anchor frames. Our approach seamlessly integrates personalized motion with structured guidance, significantly outperforming state-of-the-art methods across multiple control scenarios.
Summary (translated from Chinese): Synthesizing human motion that matches user intent is difficult: text guidance cannot describe complex actions precisely, trajectory guidance lacks fine control over character movement, and anchor-pose guidance is usually limited to simple motion patterns. ProMoGen (Progressive Motion Generation) combines trajectory guidance with sparse anchor motion control. Global trajectories fix spatial direction and displacement, while sparse anchor motions supply precise action guidance without displacement, and both dual and single control are supported in one training process. Because learning directly from sparse motions is unstable, SAP-CL (Sparse Anchor Posture Curriculum Learning) progressively adjusts the number of anchors. Experiments show that it significantly outperforms state-of-the-art methods across multiple control scenarios.

Title: V$^2$R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations

Authors: Zhiyuan Fan, Yumeng Wang, Sandeep Polisetty, Yi R. (May) Fung
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.16727
Pdf URL: https://arxiv.org/pdf/2504.16727
Copy Paste: [[2504.16727]] V$^2$R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations(https://arxiv.org/abs/2504.16727)
Keywords: generation
Abstract: Large Vision Language Models (LVLMs) excel in various vision-language tasks. Yet, their robustness to visual variations in position, scale, orientation, and context that objects in natural scenes inevitably exhibit due to changes in viewpoint and environment remains largely underexplored. To bridge this gap, we introduce V$^2$R-Bench, a comprehensive benchmark framework for evaluating Visual Variation Robustness of LVLMs, which encompasses automated evaluation dataset generation and principled metrics for thorough robustness assessment. Through extensive evaluation on 21 LVLMs, we reveal a surprising vulnerability to visual variations, in which even advanced models that excel at complex vision-language tasks significantly underperform on simple tasks such as object recognition. Interestingly, these models exhibit a distinct visual position bias that contradicts theories of effective receptive fields, and demonstrate a human-like visual acuity threshold. To identify the source of these vulnerabilities, we present a systematic framework for component-level analysis, featuring a novel visualization approach for aligned visual features. Results show that these vulnerabilities stem from error accumulation in the pipeline architecture and inadequate multimodal alignment. Complementary experiments with synthetic data further demonstrate that these limitations are fundamentally architectural deficiencies, underscoring the need for architectural innovations in future LVLM designs.

Title: Feature Mixing Approach for Detecting Intraoperative Adverse Events in Laparoscopic Roux-en-Y Gastric Bypass Surgery

Authors: Rupak Bose, Chinedu Innocent Nwoye, Jorge Lazo, Joël Lukas Lavanchy, Nicolas Padoy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.16749
Pdf URL: https://arxiv.org/pdf/2504.16749
Copy Paste: [[2504.16749]] Feature Mixing Approach for Detecting Intraoperative Adverse Events in Laparoscopic Roux-en-Y Gastric Bypass Surgery(https://arxiv.org/abs/2504.16749)
Keywords: generative
Abstract: Intraoperative adverse events (IAEs), such as bleeding or thermal injury, can lead to severe postoperative complications if undetected. However, their rarity results in highly imbalanced datasets, posing challenges for AI-based detection and severity quantification. We propose BetaMixer, a novel deep learning model that addresses these challenges through a Beta distribution-based mixing approach, converting discrete IAE severity scores into continuous values for precise severity regression (0-5 scale). BetaMixer employs Beta distribution-based sampling to enhance underrepresented classes and regularizes intermediate embeddings to maintain a structured feature space. A generative approach aligns the feature space with sampled IAE severity, enabling robust classification and severity regression via a transformer. Evaluated on the MultiBypass140 dataset, which we extended with IAE labels, BetaMixer achieves a weighted F1 score of 0.76, recall of 0.81, PPV of 0.73, and NPV of 0.84, demonstrating strong performance on imbalanced data. By integrating Beta distribution-based sampling, feature mixing, and generative modeling, BetaMixer offers a robust solution for IAE detection and quantification in clinical settings.
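Illustrative sketch (not the BetaMixer implementation): the abstract describes converting discrete IAE severity scores on a 0-5 scale into continuous values via Beta distribution-based sampling. One simple way to do that, assumed here, is to draw from a Beta distribution whose mean sits at the normalized score. The function name, the `concentration` parameter, and the clipping constant are hypothetical.

```python
import random


def soft_severity(score, max_score=5, concentration=20.0, rng=random):
    """Draw a continuous severity in [0, max_score] centred near `score`.

    Assumption: Beta(alpha, beta) with mean score / max_score; a larger
    `concentration` keeps samples closer to the discrete label.
    """
    eps = 0.02  # keep the mean strictly inside (0, 1) so alpha, beta > 0
    mean = min(max(score / max_score, eps), 1.0 - eps)
    alpha = mean * concentration
    beta = (1.0 - mean) * concentration
    return max_score * rng.betavariate(alpha, beta)


if __name__ == "__main__":
    rng = random.Random(0)
    for label in range(6):
        draws = [soft_severity(label, rng=rng) for _ in range(3)]
        print(label, [f"{d:.2f}" for d in draws])
```

Because neighbouring labels produce overlapping continuous targets, rare severity levels share signal with their neighbours, which is one plausible reason such sampling helps on imbalanced data.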

Title: Tri-FusionNet: Enhancing Image Description Generation with Transformer-based Fusion Network and Dual Attention Mechanism

Authors: Lakshita Agarwal, Bindu Verma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.16761
Pdf URL: https://arxiv.org/pdf/2504.16761
Copy Paste: [[2504.16761]] Tri-FusionNet: Enhancing Image Description Generation with Transformer-based Fusion Network and Dual Attention Mechanism(https://arxiv.org/abs/2504.16761)
Keywords: generation
Abstract: Image description generation is essential for accessibility and AI understanding of visual content. Recent advancements in deep learning have significantly improved natural language processing and computer vision. In this work, we propose Tri-FusionNet, a novel image description generation model that integrates transformer modules: a Vision Transformer (ViT) encoder module with dual-attention mechanism, a Robustly Optimized BERT Approach (RoBERTa) decoder module, and a Contrastive Language-Image Pre-Training (CLIP) integrating module. The ViT encoder, enhanced with dual attention, focuses on relevant spatial regions and linguistic context, improving image feature extraction. The RoBERTa decoder is employed to generate precise textual descriptions. CLIP's integrating module aligns visual and textual data through contrastive learning, ensuring effective combination of both modalities. This fusion of ViT, RoBERTa, and CLIP, along with dual attention, enables the model to produce more accurate, contextually rich, and flexible descriptions. The proposed framework demonstrated competitive performance on the Flickr30k and Flickr8k datasets, with BLEU scores ranging from 0.767 to 0.456 and 0.784 to 0.479, CIDEr scores of 1.679 and 1.483, METEOR scores of 0.478 and 0.358, and ROUGE-L scores of 0.567 and 0.789, respectively. On MS-COCO, the framework obtained BLEU scores of 0.893 (B-1), 0.821 (B-2), 0.794 (B-3), and 0.725 (B-4). The results demonstrate the effectiveness of Tri-FusionNet in generating high-quality image descriptions.
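Illustrative sketch: the abstract reports BLEU scores from B-1 to B-4. For readers unfamiliar with the metric, the snippet below computes a single-reference, sentence-level BLEU (clipped n-gram precision, geometric mean, brevity penalty) without smoothing. It is a teaching aid under those simplifying assumptions and not the evaluation script used in the paper; published scores are normally corpus-level and use multiple references.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it appears in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU-max_n for one candidate and one reference."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_mean)


if __name__ == "__main__":
    ref = "a man is riding a horse on the beach".split()
    hyp = "a man is riding a horse near the sea".split()
    print([round(bleu(hyp, ref, n), 3) for n in (1, 2, 3, 4)])
```

Scores fall as the n-gram order grows because longer n-grams are harder to match.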

Title: Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation

Authors: Lakshita Agarwal, Bindu Verma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.16788
Pdf URL: https://arxiv.org/pdf/2504.16788
Copy Paste: [[2504.16788]] Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation(https://arxiv.org/abs/2504.16788)
Keywords: generation, generative
Abstract: Understanding and analyzing video actions are essential for producing insightful and contextualized descriptions, especially for video-based applications like intelligent monitoring and autonomous systems. The proposed work introduces a novel framework for generating natural language descriptions from video datasets by combining textual and visual modalities. The suggested architecture makes use of ResNet50 to extract visual features from video frames that are taken from the Microsoft Research Video Description Corpus (MSVD), and Berkeley DeepDrive eXplanation (BDD-X) datasets. The extracted visual characteristics are converted into patch embeddings and then run through an encoder-decoder model based on Generative Pre-trained Transformer-2 (GPT-2). In order to align textual and visual representations and guarantee high-quality description production, the system uses multi-head self-attention and cross-attention techniques. The model's efficacy is demonstrated by performance evaluation using BLEU (1-4), CIDEr, METEOR, and ROUGE-L. The suggested framework outperforms traditional methods with BLEU-4 scores of 0.755 (BDD-X) and 0.778 (MSVD), CIDEr scores of 1.235 (BDD-X) and 1.315 (MSVD), METEOR scores of 0.312 (BDD-X) and 0.329 (MSVD), and ROUGE-L scores of 0.782 (BDD-X) and 0.795 (MSVD). By producing human-like, contextually relevant descriptions, strengthening interpretability, and improving real-world applications, this research advances explainable AI.

Title: Process Reward Models That Think

Authors: Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2504.16828
Pdf URL: https://arxiv.org/pdf/2504.16828
Copy Paste: [[2504.16828]] Process Reward Models That Think(https://arxiv.org/abs/2504.16828)
Keywords: generative
Abstract: Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers -- using only 1% of the process labels in PRM800K -- across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models will be released at this https URL.
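Illustrative sketch: the abstract evaluates ThinkPRM under best-of-N selection, where a verifier scores each candidate solution and the highest-scoring one is kept. The snippet below shows that selection loop with a stand-in verifier. How step-level scores are aggregated into one solution score is not stated in the abstract, so the `min` and `prod` options are assumptions, and all names are hypothetical.

```python
import math


def solution_score(step_scores, aggregate="min"):
    """Collapse per-step correctness scores into one score per solution."""
    if not step_scores:
        return 0.0
    if aggregate == "min":
        return min(step_scores)        # a chain is as good as its weakest step
    if aggregate == "prod":
        return math.prod(step_scores)  # treat steps as independent events
    raise ValueError(f"unknown aggregate: {aggregate}")


def best_of_n(candidates, verifier, aggregate="min"):
    """Return the candidate whose verifier score is highest.

    `verifier(candidate)` must return a list of per-step scores in [0, 1].
    """
    return max(candidates,
               key=lambda cand: solution_score(verifier(cand), aggregate))


if __name__ == "__main__":
    # Stand-in verifier: a lookup table instead of a real PRM.
    toy_scores = {"solution A": [0.9, 0.2, 0.9], "solution B": [0.7, 0.8, 0.6]}
    print(best_of_n(list(toy_scores), toy_scores.__getitem__))
```

In practice the verifier would be a model call; the selection logic stays the same.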

Title: Evaluating Autoencoders for Parametric and Invertible Multidimensional Projections

Authors: Frederik L. Dennig, Nina Geyer, Daniela Blumberg, Yannick Metz, Daniel A. Keim
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.16831
Pdf URL: https://arxiv.org/pdf/2504.16831
Copy Paste: [[2504.16831]] Evaluating Autoencoders for Parametric and Invertible Multidimensional Projections(https://arxiv.org/abs/2504.16831)
Keywords: generation
Abstract: Recently, neural networks have gained attention for creating parametric and invertible multidimensional data projections. Parametric projections allow for embedding previously unseen data without recomputing the projection as a whole, while invertible projections enable the generation of new data points. However, these properties have never been explored simultaneously for arbitrary projection methods. We evaluate three autoencoder (AE) architectures for creating parametric and invertible projections. Based on a given projection, we train AEs to learn a mapping into 2D space and an inverse mapping into the original space. We perform a quantitative and qualitative comparison on four datasets of varying dimensionality and pattern complexity using t-SNE. Our results indicate that AEs with a customized loss function can create smoother parametric and inverse projections than feed-forward neural networks while giving users control over the strength of the smoothing effect.

Title: Exploring How LLMs Capture and Represent Domain-Specific Knowledge

Authors: Mirian Hipolito Garcia, Camille Couturier, Daniel Madrigal Diaz, Ankur Mallick, Anastasios Kyrillidis, Robert Sim, Victor Ruhle, Saravan Rajmohan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.16871
Pdf URL: https://arxiv.org/pdf/2504.16871
Copy Paste: [[2504.16871]] Exploring How LLMs Capture and Represent Domain-Specific Knowledge(https://arxiv.org/abs/2504.16871)
Keywords: generative
Abstract: We study whether Large Language Models (LLMs) inherently capture domain-specific nuances in natural language. Our experiments probe the domain sensitivity of LLMs by examining their ability to distinguish queries from different domains using hidden states generated during the prefill phase. We reveal latent domain-related trajectories that indicate the model's internal recognition of query domains. We also study the robustness of these domain representations to variations in prompt styles and sources. Our approach leverages these representations for model selection, mapping the LLM that best matches the domain trace of the input query (i.e., the model with the highest performance on similar traces). Our findings show that LLMs can differentiate queries for related domains, and that the fine-tuned model is not always the most accurate. Unlike previous work, our interpretations apply to both closed and open-ended generative tasks.
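Illustrative sketch: the abstract describes using prefill hidden states as a "domain trace" and selecting the model that performs best on similar traces. A minimal stand-in for that idea, assumed here, is nearest-centroid routing with cosine similarity. The actual trace representation and similarity measure are not given in the abstract, and every name below is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_model(query_state, domain_traces, best_model_for_domain):
    """Route a query to the model recorded as best for its closest domain."""
    domain = max(domain_traces,
                 key=lambda d: cosine(query_state, domain_traces[d]))
    return best_model_for_domain[domain]


if __name__ == "__main__":
    # Toy 2-D "hidden states"; real ones would come from an LLM's prefill pass.
    traces = {
        "medical": centroid([[1.0, 0.0], [0.9, 0.1]]),
        "code": centroid([[0.0, 1.0], [0.1, 0.9]]),
    }
    models = {"medical": "model_a", "code": "model_b"}
    print(select_model([0.8, 0.2], traces, models))
```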

Title: BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation

Authors: Ruotong Wang, Mingli Zhu, Jiarong Ou, Rui Chen, Xin Tao, Pengfei Wan, Baoyuan Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.16907
Pdf URL: https://arxiv.org/pdf/2504.16907
Copy Paste: [[2504.16907]] BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation(https://arxiv.org/abs/2504.16907)
Keywords: generation, generative
Abstract: Text-to-video (T2V) generative models have rapidly advanced and found widespread applications across fields like entertainment, education, and marketing. However, the adversarial vulnerabilities of these models remain rarely explored. We observe that in T2V generation tasks, the generated videos often contain substantial redundant information not explicitly specified in the text prompts, such as environmental elements, secondary objects, and additional details, providing opportunities for malicious attackers to embed hidden harmful content. Exploiting this inherent redundancy, we introduce BadVideo, the first backdoor attack framework tailored for T2V generation. Our attack focuses on designing target adversarial outputs through two key strategies: (1) Spatio-Temporal Composition, which combines different spatiotemporal features to encode malicious information; (2) Dynamic Element Transformation, which introduces transformations in redundant elements over time to convey malicious information. Based on these strategies, the attacker's malicious target seamlessly integrates with the user's textual instructions, providing high stealthiness. Moreover, by exploiting the temporal dimension of videos, our attack successfully evades traditional content moderation systems that primarily analyze spatial information within individual frames. Extensive experiments demonstrate that BadVideo achieves high attack success rates while preserving original semantics and maintaining excellent performance on clean inputs. Overall, our work reveals the adversarial vulnerability of T2V models, calling attention to potential risks and misuse. Our project page is at this https URL.

Title: DreamO: A Unified Framework for Image Customization

Authors: Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, Mengtian Li, Songtao Zhao, Jian Zhang, Qian He, Xinglong Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.16915
Pdf URL: https://arxiv.org/pdf/2504.16915
Copy Paste: [[2504.16915]] DreamO: A Unified Framework for Image Customization(https://arxiv.org/abs/2504.16915)
Keywords: generative
Abstract: Recently, extensive research on image customization (e.g., identity, subject, style, background, etc.) has demonstrated strong customization capabilities in large-scale generative models. However, most approaches are designed for specific tasks, which limits their ability to combine different types of conditions. Developing a unified framework for image customization remains an open challenge. In this paper, we present DreamO, an image customization framework designed to support a wide range of tasks while facilitating seamless integration of multiple conditions. Specifically, DreamO utilizes a diffusion transformer (DiT) framework to uniformly process input of different types. During training, we construct a large-scale training dataset that includes various customization tasks, and we introduce a feature routing constraint to facilitate the precise querying of relevant information from reference images. Additionally, we design a placeholder strategy that associates specific placeholders with conditions at particular positions, enabling control over the placement of conditions in the generated results. Moreover, we employ a progressive training strategy consisting of three stages: an initial stage focused on simple tasks with limited data to establish baseline consistency, a full-scale training stage to comprehensively enhance the customization capabilities, and a final quality alignment stage to correct quality biases introduced by low-quality data. Extensive experiments demonstrate that the proposed DreamO can effectively perform various image customization tasks with high quality and flexibly integrate different types of control conditions.

Title: Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light

Authors: Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, Min Shi, Steven Walton, Markus Hoehnerbach, Vijay Thakkar, Michael Isaev, Qinsheng Zhang, Bing Xu, Haicheng Wu, Wen-mei Hwu, Ming-Yu Liu, Humphrey Shi
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.16922
Pdf URL: https://arxiv.org/pdf/2504.16922
Copy Paste: [[2504.16922]] Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light(https://arxiv.org/abs/2504.16922)
Keywords: generative
Abstract: Many sparse attention mechanisms such as Neighborhood Attention have typically failed to consistently deliver speedup over the self attention baseline. This is largely due to the level of complexity in attention infrastructure, and the rapid evolution of AI hardware architecture. At the same time, many state-of-the-art foundational models, particularly in computer vision, are heavily bound by attention, and need reliable sparsity to escape the O(n^2) complexity. In this paper, we study a class of promising sparse attention mechanisms that focus on locality, and aim to develop a better analytical model of their performance improvements. We first introduce Generalized Neighborhood Attention (GNA), which can describe sliding window, strided sliding window, and blocked attention. We then consider possible design choices in implementing these approaches, and create a simulator that can provide much more realistic speedup upper bounds for any given setting. Finally, we implement GNA on top of a state-of-the-art fused multi-headed attention (FMHA) kernel designed for the NVIDIA Blackwell architecture in CUTLASS. Our implementation can fully realize the maximum speedup theoretically possible in many perfectly block-sparse cases, and achieves an effective utilization of 1.3 petaFLOPs/second in FP16. In addition, we plug various GNA configurations into off-the-shelf generative models, such as Cosmos-7B, HunyuanVideo, and FLUX, and show that it can deliver 28% to 46% end-to-end speedup on B200 without any fine-tuning. We will open source our simulator and Blackwell kernels directly through the NATTEN project.
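Illustrative sketch: the abstract states that GNA can describe sliding window, strided sliding window, and blocked attention. The snippet below builds 1-D boolean attention masks for those three patterns from a single function. It is a simplified reading of the idea, not the NATTEN API or the CUTLASS kernel: the function names and the exact rule for choosing each stride group's window are assumptions.

```python
def gna_mask_1d(n, window, stride=1):
    """Return an n x n boolean mask; mask[i][j] is True if query i sees key j.

    stride == 1      -> sliding window (window clamped at the borders)
    1 < stride < w   -> strided sliding window (queries in a group share one)
    stride == window -> blocked attention (when n is a multiple of window)
    """
    if not 1 <= window <= n or stride < 1:
        raise ValueError("need 1 <= window <= n and stride >= 1")
    mask = []
    for i in range(n):
        # All queries in the same stride group use the group's centre token.
        leader = min((i // stride) * stride + stride // 2, n - 1)
        # Clamp so every query always sees exactly `window` keys.
        start = min(max(leader - window // 2, 0), n - window)
        mask.append([start <= j < start + window for j in range(n)])
    return mask


def density(mask):
    """Fraction of query-key pairs that are attended to."""
    n = len(mask)
    return sum(sum(row) for row in mask) / (n * n)


if __name__ == "__main__":
    for stride in (1, 2, 4):
        m = gna_mask_1d(8, window=4, stride=stride)
        print(f"stride={stride} density={density(m):.2f}")
        for row in m:
            print("".join("#" if v else "." for v in row))
```

Every row keeps exactly `window` keys, so the mask density is `window / n` regardless of stride. What changes with stride is how block-aligned the pattern is, which is the property the abstract ties to realizable speedup.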

Title: Procedural Dataset Generation for Zero-Shot Stereo Matching

Authors: David Yan, Alexander Raistrick, Jia Deng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.16930
Pdf URL: https://arxiv.org/pdf/2504.16930
Copy Paste: [[2504.16930]] Procedural Dataset Generation for Zero-Shot Stereo Matching(https://arxiv.org/abs/2504.16930)
Keywords: generation
Abstract: Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains largely unexplored. We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stereo matching performance using standard benchmarks. We collect the best settings to produce Infinigen-Stereo, a procedural generator specifically optimized for zero-shot stereo datasets. Models trained only on data from our system outperform robust baselines trained on a combination of existing synthetic datasets and have stronger zero-shot stereo matching performance than public checkpoints from prior works. We open source our system at this https URL to enable further research on procedural stereo datasets.