2025-05-22

Title: DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance

Authors: Xuan Shen, Chenxia Han, Yufa Zhou, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jiuxiang Gu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14708
Pdf URL: https://arxiv.org/pdf/2505.14708
Copy Paste: [[2505.14708]] DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance(https://arxiv.org/abs/2505.14708)
Keywords: generation
Abstract: Diffusion transformer-based video generation models (DiTs) have recently attracted widespread attention for their excellent generation quality. However, their computational cost remains a major bottleneck: attention alone accounts for over 80% of total latency, and generating just 8 seconds of 720p video takes tens of minutes, posing serious challenges to practical application and scalability. To address this, we propose DraftAttention, a training-free framework for the acceleration of video diffusion transformers with dynamic sparse attention on GPUs. We apply down-sampling to each feature map across frames in the compressed latent space, enabling a higher-level receptive field over the latent composed of hundreds of thousands of tokens. The low-resolution draft attention map, derived from draft query and key, exposes redundancy both spatially within each feature map and temporally across frames. We reorder the query, key, and value based on the draft attention map to guide the sparse attention computation in full resolution, and subsequently restore their original order after the attention computation. This reordering enables structured sparsity that aligns with hardware-optimized execution. Our theoretical analysis demonstrates that the low-resolution draft attention closely approximates the full attention, providing reliable guidance for constructing accurate sparse attention. Experimental results show that our method outperforms existing sparse attention approaches in video generation quality and achieves up to 1.75x end-to-end speedup on GPUs. Code: this https URL
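The draft-then-sparse idea in this abstract can be illustrated with a small standard-library sketch. It is not the authors' implementation: it pools consecutive tokens in one dimension instead of down-sampling each latent feature map, it skips the reordering step and any GPU kernel, and the names `mean_pool`, `draft_block_mask`, `sparse_attention`, `block`, and `keep` are hypothetical.

```python
import math
import random


def mean_pool(rows, block):
    """Average consecutive groups of `block` token vectors into one draft vector."""
    pooled = []
    for start in range(0, len(rows), block):
        group = rows[start:start + block]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled


def draft_block_mask(q, k, block, keep):
    """For each query block, return the sorted indices of the key blocks to keep."""
    draft_q = mean_pool(q, block)
    draft_k = mean_pool(k, block)
    scale = 1.0 / math.sqrt(len(q[0]))
    mask = []
    for qv in draft_q:
        # Low-resolution (draft) attention scores between pooled queries and keys.
        scores = [scale * sum(a * b for a, b in zip(qv, kv)) for kv in draft_k]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        mask.append(sorted(ranked[:keep]))
    return mask


def sparse_attention(q, k, v, block, keep):
    """Full-resolution attention restricted to the key blocks chosen by the draft map."""
    mask = draft_block_mask(q, k, block, keep)
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qv in enumerate(q):
        # Expand the kept key blocks of this query's block into token indices.
        cols = [j for b in mask[i // block]
                for j in range(b * block, min((b + 1) * block, len(k)))]
        logits = [scale * sum(a * b for a, b in zip(qv, k[j])) for j in cols]
        top = max(logits)
        weights = [math.exp(x - top) for x in logits]  # numerically stable softmax
        total = sum(weights)
        out.append([sum(w * v[j][d] for w, j in zip(weights, cols)) / total
                    for d in range(len(v[0]))])
    return out


random.seed(0)
tokens, dim, block = 16, 4, 4
q = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(tokens)]
k = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(tokens)]
v = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(tokens)]

mask = draft_block_mask(q, k, block, keep=2)
result = sparse_attention(q, k, v, block, keep=2)
```

With `keep` equal to the number of key blocks, every block is selected and the sketch reduces to ordinary dense attention; smaller values trade accuracy for fewer score computations.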

Title: FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge

Authors: Xuan Shen, Weize Ma, Yufa Zhou, Enhao Tang, Yanyue Xie, Zhengang Li, Yifan Gong, Quanyi Wang, Henghui Ding, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jun Lin, Jiuxiang Gu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14709
Pdf URL: https://arxiv.org/pdf/2505.14709
Copy Paste: [[2505.14709]] FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge(https://arxiv.org/abs/2505.14709)
Keywords: generation
Abstract: Auto-regressive (AR) models, initially successful in language generation, have recently shown promise in visual generation tasks due to their superior sampling efficiency. Unlike image generation, video generation requires a substantially larger number of tokens to produce coherent temporal frames, resulting in significant overhead during the decoding phase. Our key observations are: (i) MLP modules in the decode phase dominate the inference latency, and (ii) there exists high temporal redundancy in MLP outputs of adjacent frames. In this paper, we propose the FastCar framework to accelerate the decode phase for the AR video generation by exploring the temporal redundancy. The Temporal Attention Score (TAS) is proposed to determine whether to apply the replay strategy (i.e., reusing cached MLP outputs from the previous frame to reduce redundant computations) with detailed theoretical analysis and justification. Also, we develop a hardware accelerator on FPGA with Dynamic Resource Scheduling (DRS) based on TAS to enable better resource utilization and faster inference. Experimental results demonstrate the effectiveness of our method, which outperforms traditional sparse attention approaches with more than 2.1x decoding speedup and higher energy efficiency on the edge. Furthermore, by combining FastCar and sparse attention, FastCar can boost the performance of sparse attention with alleviated drifting, demonstrating our unique advantages for high-resolution and long-duration video generation. Code: this https URL
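The replay strategy described here (reusing the previous frame's cached MLP output when little has changed) can be sketched as follows. The abstract does not define the Temporal Attention Score, so this sketch substitutes a plain cosine similarity between consecutive MLP inputs as a stand-in; the class `ReplayMLP`, the `threshold` value, and `toy_mlp` are hypothetical and not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ReplayMLP:
    """Wraps an MLP-like callable and replays its cached output when inputs barely change."""

    def __init__(self, mlp, threshold=0.99):
        self.mlp = mlp
        self.threshold = threshold  # hypothetical cutoff, not a value from the paper
        self.prev_input = None
        self.prev_output = None
        self.replayed = 0
        self.computed = 0

    def __call__(self, x):
        # Stand-in decision rule: the paper uses TAS, which is not specified here.
        if self.prev_input is not None and cosine(x, self.prev_input) >= self.threshold:
            self.replayed += 1
            return self.prev_output  # reuse the cached output instead of recomputing
        self.computed += 1
        self.prev_input = list(x)
        self.prev_output = self.mlp(x)
        return self.prev_output


def toy_mlp(x):
    """Placeholder for a real MLP block: elementwise ReLU of 2x - 1."""
    return [max(0.0, 2.0 * v - 1.0) for v in x]


layer = ReplayMLP(toy_mlp, threshold=0.99)
frames = [
    [1.0, 2.0, 3.0],
    [1.01, 2.0, 3.0],   # nearly identical to the first frame
    [-3.0, 0.5, 1.0],   # a clearly different frame
]
outputs = [layer(f) for f in frames]
```

In this toy run the second frame is served from the cache and the third triggers a fresh computation, so the counters `layer.replayed` and `layer.computed` show how much work the cutoff saves.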

Title: The Evolution of Alpha in Finance Harnessing Human Insight and LLM Agents

Authors: Mohammad Rubyet Islam
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.14727
Pdf URL: https://arxiv.org/pdf/2505.14727
Copy Paste: [[2505.14727]] The Evolution of Alpha in Finance Harnessing Human Insight and LLM Agents(https://arxiv.org/abs/2505.14727)
Keywords: generation
Abstract: The pursuit of alpha, returns that exceed market benchmarks, has undergone a profound transformation, evolving from intuition-driven investing to autonomous, AI-powered systems. This paper introduces a comprehensive five-stage taxonomy that traces this progression across manual strategies, statistical models, classical machine learning, deep learning, and agentic architectures powered by large language models (LLMs). Unlike prior surveys focused narrowly on modeling techniques, this review adopts a system-level lens, integrating advances in representation learning, multimodal data fusion, and tool-augmented LLM agents. The strategic shift from static predictors to context-aware financial agents capable of real-time reasoning, scenario simulation, and cross-modal decision making is emphasized. Key challenges in interpretability, data fragility, governance, and regulatory compliance, areas critical to production deployment, are examined. The proposed taxonomy offers a unified framework for evaluating maturity, aligning infrastructure, and guiding the responsible development of next-generation alpha systems.

Title: Time Series Similarity Score Functions to Monitor and Interact with the Training and Denoising Process of a Time Series Diffusion Model applied to a Human Activity Recognition Dataset based on IMUs

Authors: Heiko Oppel, Andreas Spilz, Michael Munz
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14739
Pdf URL: https://arxiv.org/pdf/2505.14739
Copy Paste: [[2505.14739]] Time Series Similarity Score Functions to Monitor and Interact with the Training and Denoising Process of a Time Series Diffusion Model applied to a Human Activity Recognition Dataset based on IMUs(https://arxiv.org/abs/2505.14739)
Keywords: generation, generative
Abstract: Denoising diffusion probabilistic models are able to generate synthetic sensor signals. The training process of such a model is controlled by a loss function which measures the difference between the noise that was added in the forward process and the noise that was predicted by the diffusion model. This enables the generation of realistic data. However, the randomness within the process and the loss function itself make it difficult to estimate the quality of the data. Therefore, we examine multiple similarity metrics and adapt an existing metric to overcome this issue by monitoring the training and synthetisation process using those metrics. The adapted metric can even be fine-tuned on the input data to comply with the requirements of an underlying classification task. We were able to significantly reduce the number of training epochs without a performance reduction in the classification task. An optimized training process not only saves resources, but also reduces the time for training generative models.
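The loss described in this abstract compares the noise added in the forward process with the noise the model predicts. A minimal sketch of that quantity, assuming the standard DDPM forward step x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps and a mean-squared error, is shown below. The sine wave standing in for an IMU channel, the value `alpha_bar=0.5`, and the two toy "predictors" are illustrative assumptions, not part of the paper.

```python
import math
import random


def forward_noise(x0, alpha_bar, rng):
    """Apply one forward-process step: mix the clean signal with Gaussian noise."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps


def noise_mse(eps_true, eps_pred):
    """Mean-squared error between the added noise and the predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred)) / len(eps_true)


rng = random.Random(0)
signal = [math.sin(0.1 * t) for t in range(50)]   # stand-in for one IMU channel
noisy, eps = forward_noise(signal, alpha_bar=0.5, rng=rng)

# An oracle that inverts the forward step recovers eps, so its loss is near zero.
oracle_pred = [(xt - math.sqrt(0.5) * x0) / math.sqrt(0.5)
               for xt, x0 in zip(noisy, signal)]
# A trivial predictor that always outputs zero noise.
zero_pred = [0.0] * len(signal)

loss_oracle = noise_mse(eps, oracle_pred)
loss_zero = noise_mse(eps, zero_pred)
```

The loss is computed on noise rather than on the generated signal, which is one reason the abstract turns to separate similarity metrics to judge the quality of the synthetic data.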

Title: Communication-Efficient Diffusion Denoising Parallelization via Reuse-then-Predict Mechanism

Authors: Kunyun Wang, Bohan Li, Kai Yu, Minyi Guo, Jieru Zhao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14741
Pdf URL: https://arxiv.org/pdf/2505.14741
Copy Paste: [[2505.14741]] Communication-Efficient Diffusion Denoising Parallelization via Reuse-then-Predict Mechanism(https://arxiv.org/abs/2505.14741)
Keywords: generation, generative
Abstract: Diffusion models have emerged as a powerful class of generative models across various modalities, including image, video, and audio synthesis. However, their deployment is often limited by significant inference latency, primarily due to the inherently sequential nature of the denoising process. While existing parallelization strategies attempt to accelerate inference by distributing computation across multiple devices, they typically incur high communication overhead, hindering deployment on commercial hardware. To address this challenge, we propose ParaStep, a novel parallelization method based on a reuse-then-predict mechanism that parallelizes diffusion inference by exploiting similarity between adjacent denoising steps. Unlike prior approaches that rely on layer-wise or stage-wise communication, ParaStep employs lightweight, step-wise communication, substantially reducing overhead. ParaStep achieves end-to-end speedups of up to 3.88x on SVD, 2.43x on CogVideoX-2b, and 6.56x on AudioLDM2-large, while maintaining generation quality. These results highlight ParaStep as a scalable and communication-efficient solution for accelerating diffusion inference, particularly in bandwidth-constrained environments.

Title: Large Language Models for Data Synthesis

Authors: Yihong Tang, Menglin Kong, Lijun Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.14752
Pdf URL: https://arxiv.org/pdf/2505.14752
Copy Paste: [[2505.14752]] Large Language Models for Data Synthesis(https://arxiv.org/abs/2505.14752)
Keywords: generative
Abstract: Generating synthetic data that faithfully captures the statistical structure of real-world distributions is a fundamental challenge in data modeling. Classical approaches often depend on strong parametric assumptions or manual structural design and struggle in high-dimensional or heterogeneous domains. Recent progress in Large Language Models (LLMs) reveals their potential as flexible, high-dimensional priors over real-world distributions. However, when applied to data synthesis, standard LLM-based sampling is inefficient, constrained by fixed context limits, and fails to ensure statistical alignment. Given this, we introduce LLMSynthor, a general framework for data synthesis that transforms LLMs into structure-aware simulators guided by distributional feedback. LLMSynthor treats the LLM as a nonparametric copula simulator for modeling high-order dependencies and introduces LLM Proposal Sampling to generate grounded proposal distributions that improve sampling efficiency without requiring rejection. By minimizing discrepancies in the summary statistics space, the iterative synthesis loop aligns real and synthetic data while gradually uncovering and refining the latent generative structure. We evaluate LLMSynthor in both controlled and real-world settings using heterogeneous datasets in privacy-sensitive domains (e.g., e-commerce, population, and mobility) that encompass both structured and unstructured formats. The synthetic data produced by LLMSynthor shows high statistical fidelity, practical utility, and cross-data adaptability, positioning it as a valuable tool across economics, social science, urban studies, and beyond.

Title: Leveraging Generative AI Models to Explore Human Identity

Authors: Yunha Yeo, Daeho Um
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14843
Pdf URL: https://arxiv.org/pdf/2505.14843
Copy Paste: [[2505.14843]] Leveraging Generative AI Models to Explore Human Identity(https://arxiv.org/abs/2505.14843)
Keywords: generation, generative
Abstract: This paper attempts to explore human identity by utilizing neural networks in an indirect manner. For this exploration, we adopt diffusion models, state-of-the-art AI generative models trained to create human face images. By relating the generated human face to human identity, we establish a correspondence between the face image generation process of the diffusion model and the process of human identity formation. Through experiments with the diffusion model, we observe that changes in its external input result in significant changes in the generated face image. Based on the correspondence, we indirectly confirm the dependence of human identity on external factors in the process of human identity formation. Furthermore, we introduce "Fluidity of Human Identity", a video artwork that expresses the fluid nature of human identity affected by varying external factors. The video is available at this https URL.

Title: A self-regulated convolutional neural network for classifying variable stars

Authors: Francisco Pérez-Galarce, Jorge Martínez-Palomera, Karim Pichara, Pablo Huijse, Márcio Catelan
Subjects: cs.LG, astro-ph.SR
Abstract URL: https://arxiv.org/abs/2505.14877
Pdf URL: https://arxiv.org/pdf/2505.14877
Copy Paste: [[2505.14877]] A self-regulated convolutional neural network for classifying variable stars(https://arxiv.org/abs/2505.14877)
Keywords: generative
Abstract: Over the last two decades, machine learning models have been widely applied and have proven effective in classifying variable stars, particularly with the adoption of deep learning architectures such as convolutional neural networks, recurrent neural networks, and transformer models. While these models have achieved high accuracy, they require high-quality, representative data and a large number of labelled samples for each star type to generalise well, which can be challenging in time-domain surveys. This challenge often leads to models learning and reinforcing biases inherent in the training data, an issue that is not easily detectable when validation is performed on subsamples from the same catalogue. The problem of biases in variable star data has been largely overlooked, and a definitive solution has yet to be established. In this paper, we propose a new approach to improve the reliability of classifiers in variable star classification by introducing a self-regulated training process. This process utilises synthetic samples generated by a physics-enhanced latent space variational autoencoder, incorporating six physical parameters from Gaia Data Release 3. Our method features a dynamic interaction between a classifier and a generative model, where the generative model produces ad-hoc synthetic light curves to reduce confusion during classifier training and populate underrepresented regions in the physical parameter space. Experiments conducted under various scenarios demonstrate that our self-regulated training approach outperforms traditional training methods for classifying variable stars on biased datasets, showing statistically significant improvements.

Title: Programmatic Video Prediction Using Large Language Models

Authors: Hao Tang, Kevin Ellis, Suhas Lohit, Michael J. Jones, Moitreya Chatterjee
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14948
Pdf URL: https://arxiv.org/pdf/2505.14948
Copy Paste: [[2505.14948]] Programmatic Video Prediction Using Large Language Models(https://arxiv.org/abs/2505.14948)
Keywords: generation
Abstract: The task of estimating the world model describing the dynamics of a real-world process assumes immense importance for anticipating and preparing for future outcomes. For applications such as video surveillance, robotics, and autonomous driving, this objective entails synthesizing plausible visual futures, given a few frames of a video to set the visual context. Towards this end, we propose ProgGen, which undertakes the task of video frame prediction by representing the dynamics of the video using a neuro-symbolic, human-interpretable set of states (one per frame) by leveraging the inductive biases of Large (Vision) Language Models (LLM/VLM). In particular, ProgGen utilizes LLM/VLM to synthesize programs: (i) to estimate the states of the video, given the visual context (i.e. the frames); (ii) to predict the states corresponding to future time steps by estimating the transition dynamics; (iii) to render the predicted states as visual RGB-frames. Empirical evaluations reveal that our proposed method outperforms competing techniques at the task of video frame prediction in two challenging environments: (i) PhyWorld and (ii) Cart Pole. Additionally, ProgGen permits counter-factual reasoning and interpretable video generation, attesting to its effectiveness and generalizability for video generation tasks.

Title: STree: Speculative Tree Decoding for Hybrid State-Space Models

Authors: Yangchao Wu, Zongyue Qin, Alex Wong, Stefano Soatto
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14969
Pdf URL: https://arxiv.org/pdf/2505.14969
Copy Paste: [[2505.14969]] STree: Speculative Tree Decoding for Hybrid State-Space Models(https://arxiv.org/abs/2505.14969)
Keywords: generation
Abstract: Speculative decoding is a technique to leverage hardware concurrency to improve the efficiency of large-scale autoregressive (AR) Transformer models by enabling multiple steps of token generation in a single forward pass. State-space models (SSMs) are already more efficient than AR Transformers, since their state summarizes all past data with no need to cache or re-process tokens in the sliding window context. However, their state can also comprise thousands of tokens; so, speculative decoding has recently been extended to SSMs. Existing approaches, however, do not leverage the tree-based verification methods, since current SSMs lack the means to compute a token tree efficiently. We propose the first scalable algorithm to perform tree-based speculative decoding in state-space models (SSMs) and hybrid architectures of SSMs and Transformer layers. We exploit the structure of accumulated state transition matrices to facilitate tree-based speculative decoding with minimal overhead to current SSM state update implementations. With the algorithm, we describe a hardware-aware implementation that improves naive application of AR Transformer tree-based speculative decoding methods to SSMs. Furthermore, we outperform vanilla speculative decoding with SSMs even with a baseline drafting model and tree structure on three different benchmarks, opening up opportunities for further speed up with SSM and hybrid model inference. Code will be released upon paper acceptance.

Title: Flattening Hierarchies with Policy Bootstrapping

Authors: John L. Zhou, Jonathan C. Kao
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2505.14975
Pdf URL: https://arxiv.org/pdf/2505.14975
Copy Paste: [[2505.14975]] Flattening Hierarchies with Policy Bootstrapping(https://arxiv.org/abs/2505.14975)
Keywords: generation, generative
Abstract: Offline goal-conditioned reinforcement learning (GCRL) is a promising approach for pretraining generalist policies on large datasets of reward-free trajectories, akin to the self-supervised objectives used to train foundation models for computer vision and natural language processing. However, scaling GCRL to longer horizons remains challenging due to the combination of sparse rewards and discounting, which obscures the comparative advantages of primitive actions with respect to distant goals. Hierarchical RL methods achieve strong empirical results on long-horizon goal-reaching tasks, but their reliance on modular, timescale-specific policies and subgoal generation introduces significant additional complexity and hinders scaling to high-dimensional goal spaces. In this work, we introduce an algorithm to train a flat (non-hierarchical) goal-conditioned policy by bootstrapping on subgoal-conditioned policies with advantage-weighted importance sampling. Our approach eliminates the need for a generative model over the (sub)goal space, which we find is key for scaling to high-dimensional control in large state spaces. We further show that existing hierarchical and bootstrapping-based approaches correspond to specific design choices within our derivation. Across a comprehensive suite of state- and pixel-based locomotion and manipulation benchmarks, our method matches or surpasses state-of-the-art offline GCRL algorithms and scales to complex, long-horizon tasks where prior approaches fail.

Title: RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning

Authors: Kaiwen Zha, Zhengqi Gao, Maohao Shen, Zhang-Wei Hong, Duane S. Boning, Dina Katabi
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.15034
Pdf URL: https://arxiv.org/pdf/2505.15034
Copy Paste: [[2505.15034]] RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning(https://arxiv.org/abs/2505.15034)
Keywords: generative
Abstract: Reinforcement learning (RL) has recently emerged as a compelling approach for enhancing the reasoning capabilities of large language models (LLMs), where an LLM generator serves as a policy guided by a verifier (reward model). However, current RL post-training methods for LLMs typically use verifiers that are fixed (rule-based or frozen pretrained) or trained discriminatively via supervised fine-tuning (SFT). Such designs are susceptible to reward hacking and generalize poorly beyond their training distributions. To overcome these limitations, we propose Tango, a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems. Code is at: this https URL.

Title: Agentic Feature Augmentation: Unifying Selection and Generation with Teaming, Planning, and Memories

Authors: Nanxu Gong, Sixun Dong, Haoyue Bai, Xinyuan Wang, Wangyang Ying, Yanjie Fu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.15076
Pdf URL: https://arxiv.org/pdf/2505.15076
Copy Paste: [[2505.15076]] Agentic Feature Augmentation: Unifying Selection and Generation with Teaming, Planning, and Memories(https://arxiv.org/abs/2505.15076)
Keywords: generation
Abstract: As a widely used and practical tool, feature engineering transforms raw data into discriminative features to advance AI model performance. However, existing methods usually apply feature selection and generation separately, failing to strike a balance between reducing redundancy and adding meaningful dimensions. To fill this gap, we propose an agentic feature augmentation concept, where the unification of feature generation and selection is modeled as agentic teaming and planning. Specifically, we develop a Multi-Agent System with Long and Short-Term Memory (MAGS), comprising a selector agent to eliminate redundant features, a generator agent to produce informative new dimensions, and a router agent that strategically coordinates their actions. We leverage in-context learning with short-term memory for immediate feedback refinement and long-term memory for globally optimal guidance. Additionally, we employ offline Proximal Policy Optimization (PPO) reinforcement fine-tuning to train the router agent for effective decision-making to navigate a vast discrete feature space. Extensive experiments demonstrate that this unified agentic framework consistently achieves superior task performance by intelligently orchestrating feature selection and generation.

Title: Khan-GCL: Kolmogorov-Arnold Network Based Graph Contrastive Learning with Hard Negatives

Authors: Zihu Wang, Boxun Xu, Hejia Geng, Peng Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.15103
Pdf URL: https://arxiv.org/pdf/2505.15103
Copy Paste: [[2505.15103]] Khan-GCL: Kolmogorov-Arnold Network Based Graph Contrastive Learning with Hard Negatives(https://arxiv.org/abs/2505.15103)
Keywords: generation
Abstract: Graph contrastive learning (GCL) has demonstrated great promise for learning generalizable graph representations from unlabeled data. However, conventional GCL approaches face two critical limitations: (1) the restricted expressive capacity of multilayer perceptron (MLP) based encoders, and (2) suboptimal negative samples that either come from random augmentations, which fail to provide effective 'hard negatives', or are generated hard negatives that do not address the semantic distinctions crucial for discriminating graph data. To this end, we propose Khan-GCL, a novel framework that integrates the Kolmogorov-Arnold Network (KAN) into the GCL encoder architecture, substantially enhancing its representational capacity. Furthermore, we exploit the rich information embedded within KAN coefficient parameters to develop two novel critical feature identification techniques that enable the generation of semantically meaningful hard negative samples for each graph representation. These strategically constructed hard negatives guide the encoder to learn more discriminative features by emphasizing critical semantic differences between graphs. Extensive experiments demonstrate that our approach achieves state-of-the-art performance compared to existing GCL methods across a variety of datasets and tasks.

Title: BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms

Authors: Yunlong Hou, Fengzhuo Zhang, Cunxiao Du, Xuan Zhang, Jiachun Pan, Tianyu Pang, Chao Du, Vincent Y. F. Tan, Zhuoran Yang
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2505.15141
Pdf URL: https://arxiv.org/pdf/2505.15141
Copy Paste: [[2505.15141]] BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms(https://arxiv.org/abs/2505.15141)
Keywords: generation
Abstract: Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free online learning framework to adaptively choose the configuration of the hyperparameters for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework BanditSpec. Furthermore, two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, are designed and analyzed in terms of a novel quantity, the stopping time regret. We upper bound this regret under both stochastic and adversarial reward settings. By deriving an information-theoretic impossibility result, it is shown that the regret performance of UCBSpec is optimal up to universal constants. Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and the throughput is close to the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts.
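The bandit framing in the abstract can be illustrated with a minimal sketch of the classic UCB1 rule. This is the textbook algorithm, not the paper's UCBSpec, whose details the abstract does not give. The arms stand in for hypothetical speculative decoding configurations, and the reward is an assumed per-round score in [0, 1] (for example, a normalized count of accepted tokens). The names `ucb1_select`, `run_bandit`, and `reward_fns` are made up for illustration.

```python
import math
import random


def ucb1_select(counts, sums, t, c=2.0):
    """Pick the arm with the highest UCB1 score at round t (1-indexed)."""
    # Play every arm once before relying on confidence bounds.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    # Empirical mean plus an exploration bonus that shrinks with more pulls.
    scores = [
        sums[a] / counts[a] + math.sqrt(c * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit(reward_fns, rounds, seed=0):
    """Run UCB1 for `rounds` steps; return how often each arm was chosen.

    reward_fns[i](rng) is a hypothetical stand-in for decoding one step
    with configuration i and observing a reward in [0, 1].
    """
    rng = random.Random(seed)
    counts = [0] * len(reward_fns)
    sums = [0.0] * len(reward_fns)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, sums, t)
        counts[arm] += 1
        sums[arm] += reward_fns[arm](rng)
    return counts
```

Calling `run_bandit` with one reward function per candidate configuration returns the pull counts, which concentrate on the arm with the highest average reward as the number of rounds grows.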

Title: CineTechBench: A Benchmark for Cinematographic Technique Understanding and Generation

Authors: Xinran Wang, Songyu Xu, Xiangxuan Shan, Yuxuan Zhang, Muxi Diao, Xueyan Duan, Yanhua Huang, Kongming Liang, Zhanyu Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15145
Pdf URL: https://arxiv.org/pdf/2505.15145
Copy Paste: [[2505.15145]] CineTechBench: A Benchmark for Cinematographic Technique Understanding and Generation(https://arxiv.org/abs/2505.15145)
Keywords: generation
Abstract: Cinematography is a cornerstone of film production and appreciation, shaping mood, emotion, and narrative through visual elements such as camera movement, shot composition, and lighting. Despite recent progress in multimodal large language models (MLLMs) and video generation models, the capacity of current models to grasp and reproduce cinematographic techniques remains largely uncharted, hindered by the scarcity of expert-annotated data. To bridge this gap, we present CineTechBench, a pioneering benchmark founded on precise, manual annotation by seasoned cinematography experts across key cinematography dimensions. Our benchmark covers seven essential aspects (shot scale, shot angle, composition, camera movement, lighting, color, and focal length) and includes over 600 annotated movie images and 120 movie clips with clear cinematographic techniques. For the understanding task, we design question-answer pairs and annotated descriptions to assess MLLMs' ability to interpret and explain cinematographic techniques. For the generation task, we assess advanced video generation models on their capacity to reconstruct cinema-quality camera movements given conditions such as textual prompts or keyframes. We conduct a large-scale evaluation on 15+ MLLMs and 5+ video generation models. Our results offer insights into the limitations of current models and future directions for cinematography understanding and generation in automatic film production and appreciation. The code and benchmark can be accessed at this https URL.

Title: Sculpting Features from Noise: Reward-Guided Hierarchical Diffusion for Task-Optimal Feature Transformation

Authors: Nanxu Gong, Zijun Li, Sixun Dong, Haoyue Bai, Wangyang Ying, Xinyuan Wang, Yanjie Fu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.15152
Pdf URL: https://arxiv.org/pdf/2505.15152
Copy Paste: [[2505.15152]] Sculpting Features from Noise: Reward-Guided Hierarchical Diffusion for Task-Optimal Feature Transformation(https://arxiv.org/abs/2505.15152)
Keywords: generation, generative
Abstract: Feature Transformation (FT) crafts new features from original ones via mathematical operations to enhance dataset expressiveness for downstream models. However, existing FT methods exhibit critical limitations: discrete search struggles with enormous combinatorial spaces, impeding practical use; and continuous search, being highly sensitive to initialization and step sizes, often becomes trapped in local optima, restricting global exploration. To overcome these limitations, DIFFT redefines FT as a reward-guided generative task. It first learns a compact and expressive latent space for feature sets using a Variational Auto-Encoder (VAE). A Latent Diffusion Model (LDM) then navigates this space to generate high-quality feature embeddings, its trajectory guided by a performance evaluator towards task-specific optima. This synthesis of global distribution learning (from LDM) and targeted optimization (reward guidance) produces potent embeddings, which a novel semi-autoregressive decoder efficiently converts into structured, discrete features, preserving intra-feature dependencies while allowing parallel inter-feature generation. Extensive experiments on 14 benchmark datasets show DIFFT consistently outperforms state-of-the-art baselines in predictive accuracy and robustness, with significantly lower training and inference times.

Title: Harnessing Caption Detailness for Data-Efficient Text-to-Image Generation

Authors: Xinran Wang, Muxi Diao, Yuanzhi Liu, Chunyu Wang, Kongming Liang, Zhanyu Ma, Jun Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15172
Pdf URL: https://arxiv.org/pdf/2505.15172
Copy Paste: [[2505.15172]] Harnessing Caption Detailness for Data-Efficient Text-to-Image Generation(https://arxiv.org/abs/2505.15172)
Keywords: generation
Abstract: Training text-to-image (T2I) models with detailed captions can significantly improve their generation quality. Existing methods often rely on simplistic metrics like caption length to represent the detailness of the captions in the T2I training set. In this paper, we propose a new metric to estimate caption detailness based on two aspects: image coverage rate (ICR), which evaluates whether the caption covers all regions/objects in the image, and average object detailness (AOD), which quantifies the detailness of each object's description. Through experiments on the COCO dataset using ShareGPT4V captions, we demonstrate that T2I models trained on high-ICR and high-AOD captions achieve superior performance on DPG and other benchmarks. Notably, our metric enables more effective data selection: training on only 20% of the full data surpasses both full-dataset training and a length-based selection method, improving alignment and reconstruction ability. These findings highlight the critical role of detail-aware metrics over length-based heuristics in caption selection for T2I tasks.

Title: AvatarShield: Visual Reinforcement Learning for Human-Centric Video Forgery Detection

Authors: Zhipei Xu, Xuanyu Zhang, Xing Zhou, Jian Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15173
Pdf URL: https://arxiv.org/pdf/2505.15173
Copy Paste: [[2505.15173]] AvatarShield: Visual Reinforcement Learning for Human-Centric Video Forgery Detection(https://arxiv.org/abs/2505.15173)
Keywords: generation
Abstract: The rapid advancement of Artificial Intelligence Generated Content (AIGC) technologies, particularly in video generation, has led to unprecedented creative capabilities but also increased threats to information integrity, identity security, and public trust. Existing detection methods, while effective in general scenarios, lack robust solutions for human-centric videos, which pose greater risks due to their realism and potential for legal and ethical misuse. Moreover, current detection approaches often suffer from poor generalization, limited scalability, and reliance on labor-intensive supervised fine-tuning. To address these challenges, we propose AvatarShield, the first interpretable MLLM-based framework for detecting human-centric fake videos, enhanced via Group Relative Policy Optimization (GRPO). Through our carefully designed accuracy detection reward and temporal compensation reward, it effectively avoids the use of high-cost text annotation data, enabling precise temporal modeling and forgery detection. Meanwhile, we design a dual-encoder architecture, combining high-level semantic reasoning and low-level artifact amplification to guide MLLMs in effective forgery detection. We further collect FakeHumanVid, a large-scale human-centric video benchmark that includes synthesis methods guided by pose, audio, and text inputs, enabling rigorous evaluation of detection methods in real-world scenes. Extensive experiments show that AvatarShield significantly outperforms existing approaches in both in-domain and cross-domain detection, setting a new standard for human-centric video forensics.

Title: MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models

Authors: Yifan Liu, Keyu Fan, Weihao Yu, Chenxin Li, Hao Lu, Yixuan Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15185
Pdf URL: https://arxiv.org/pdf/2505.15185
Copy Paste: [[2505.15185]] MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models(https://arxiv.org/abs/2505.15185)
Keywords: generation
Abstract: Recent advances in generalizable 3D Gaussian Splatting have demonstrated promising results in real-time high-fidelity rendering without per-scene optimization, yet existing approaches still struggle to handle unfamiliar visual content during inference on novel scenes due to limited generalizability. To address this challenge, we introduce MonoSplat, a novel framework that leverages rich visual priors from pre-trained monocular depth foundation models for robust Gaussian reconstruction. Our approach consists of two key components: a Mono-Multi Feature Adapter that transforms monocular features into multi-view representations, coupled with an Integrated Gaussian Prediction module that effectively fuses both feature types for precise Gaussian generation. Through the Adapter's lightweight attention mechanism, features are seamlessly aligned and aggregated across views while preserving valuable monocular priors, enabling the Prediction module to generate Gaussian primitives with accurate geometry and appearance. Through extensive experiments on diverse real-world datasets, we convincingly demonstrate that MonoSplat achieves superior reconstruction quality and generalization capability compared to existing methods while maintaining computational efficiency with minimal trainable parameters. Codes are available at this https URL.

Title: Intentional Gesture: Deliver Your Intentions with Gestures for Speech

Authors: Pinxin Liu, Haiyang Liu, Luchuan Song, Chenliang Xu
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2505.15197
Pdf URL: https://arxiv.org/pdf/2505.15197
Copy Paste: [[2505.15197]] Intentional Gesture: Deliver Your Intentions with Gestures for Speech(https://arxiv.org/abs/2505.15197)
Keywords: generation
Abstract: When humans speak, gestures help convey communicative intentions, such as adding emphasis or describing concepts. However, current co-speech gesture generation methods rely solely on superficial linguistic cues (e.g., speech audio or text transcripts), neglecting to understand and leverage the communicative intention that underpins human gestures. This results in outputs that are rhythmically synchronized with speech but are semantically shallow. To address this gap, we introduce Intentional-Gesture, a novel framework that casts gesture generation as an intention-reasoning task grounded in high-level communicative functions. First, we curate the InG dataset by augmenting BEAT-2 with gesture-intention annotations (i.e., text sentences summarizing intentions), which are automatically annotated using large vision-language models. Next, we introduce the Intentional Gesture Motion Tokenizer to leverage these intention annotations. It injects high-level communicative functions (e.g., intentions) into tokenized motion representations to enable intention-aware gesture synthesis that is both temporally aligned and semantically meaningful, achieving new state-of-the-art performance on the BEAT-2 benchmark. Our framework offers a modular foundation for expressive gesture generation in digital humans and embodied AI. Project Page: this https URL

Title: KernelOracle: Predicting the Linux Scheduler's Next Move with Deep Learning

Authors: Sampanna Yashwant Kahu
Subjects: cs.LG, cs.OS
Abstract URL: https://arxiv.org/abs/2505.15213
Pdf URL: https://arxiv.org/pdf/2505.15213
Copy Paste: [[2505.15213]] KernelOracle: Predicting the Linux Scheduler's Next Move with Deep Learning(https://arxiv.org/abs/2505.15213)
Keywords: generation
Abstract: Efficient task scheduling is paramount in the Linux kernel, where the Completely Fair Scheduler (CFS) meticulously manages CPU resources to balance high utilization with interactive responsiveness. This research pioneers the use of deep learning techniques to predict the sequence of tasks selected by CFS, aiming to evaluate the feasibility of a more generalized and potentially more adaptive task scheduler for diverse workloads. Our core contributions are twofold: first, the systematic generation and curation of a novel scheduling dataset from a running Linux kernel, capturing real-world CFS behavior; and second, the development, training, and evaluation of a Long Short-Term Memory (LSTM) network designed to accurately forecast the next task to be scheduled. This paper further discusses the practical pathways and implications of integrating such a predictive model into the kernel's scheduling framework. The findings and methodologies presented herein open avenues for data-driven advancements in kernel scheduling, with the full source code provided for reproducibility and further exploration.

Title: Multimodal Conditional Information Bottleneck for Generalizable AI-Generated Image Detection

Authors: Haotian Qin, Dongliang Chang, Yueying Gao, Bingyao Yu, Lei Chen, Zhanyu Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15217
Pdf URL: https://arxiv.org/pdf/2505.15217
Copy Paste: [[2505.15217]] Multimodal Conditional Information Bottleneck for Generalizable AI-Generated Image Detection(https://arxiv.org/abs/2505.15217)
Keywords: generative
Abstract: Although existing CLIP-based methods for detecting AI-generated images have achieved promising results, they are still limited by severe feature redundancy, which hinders their generalization ability. To address this issue, incorporating an information bottleneck network into the task presents a straightforward solution. However, relying solely on image-corresponding prompts results in suboptimal performance due to the inherent diversity of prompts. In this paper, we propose a multimodal conditional bottleneck network to reduce feature redundancy while enhancing the discriminative power of features extracted by CLIP, thereby improving the model's generalization ability. We begin with a semantic analysis experiment, where we observe that arbitrary text features exhibit lower cosine similarity with real image features than with fake image features in the CLIP feature space, a phenomenon we refer to as "bias". Therefore, we introduce InfoFD, a text-guided AI-generated image detection framework. InfoFD consists of two key components: the Text-Guided Conditional Information Bottleneck (TGCIB) and Dynamic Text Orthogonalization (DTO). TGCIB improves the generalizability of learned representations by conditioning on both text and class modalities. DTO dynamically updates weighted text features, preserving semantic information while leveraging the global "bias". Our model achieves exceptional generalization performance on the GenImage dataset and latest generative models. Our code is available at this https URL.

Title: Continuous Representation Methods, Theories, and Applications: An Overview and Perspectives

Authors: Yisi Luo, Xile Zhao, Deyu Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15222
Pdf URL: https://arxiv.org/pdf/2505.15222
Copy Paste: [[2505.15222]] Continuous Representation Methods, Theories, and Applications: An Overview and Perspectives(https://arxiv.org/abs/2505.15222)
Keywords: restoration
Abstract: Recently, continuous representation methods have emerged as novel paradigms that characterize the intrinsic structures of real-world data through function representations that map positional coordinates to their corresponding values in the continuous space. Compared with the traditional discrete framework, the continuous framework demonstrates inherent superiority for data representation and reconstruction (e.g., image restoration, novel view synthesis, and waveform inversion) by offering advantages including resolution flexibility, cross-modal adaptability, inherent smoothness, and parameter efficiency. In this review, we systematically examine recent advancements in continuous representation frameworks, focusing on three aspects: (i) Continuous representation method designs such as basis function representation, statistical modeling, tensor function decomposition, and implicit neural representation; (ii) Theoretical foundations of continuous representations such as approximation error analysis, convergence property, and implicit regularization; (iii) Real-world applications of continuous representations derived from computer vision, graphics, bioinformatics, and remote sensing. Furthermore, we outline future directions and perspectives to inspire exploration and deepen insights to facilitate continuous representation methods, theories, and applications. All referenced works are summarized in our open-source repository: this https URL.

Title: Loss-Guided Auxiliary Agents for Overcoming Mode Collapse in GFlowNets

Authors: Idriss Malek, Abhijit Sharma, Salem Lahlou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.15251
Pdf URL: https://arxiv.org/pdf/2505.15251
Copy Paste: [[2505.15251]] Loss-Guided Auxiliary Agents for Overcoming Mode Collapse in GFlowNets(https://arxiv.org/abs/2505.15251)
Keywords: generation, generative
Abstract: Although Generative Flow Networks (GFlowNets) are designed to capture multiple modes of a reward function, they often suffer from mode collapse in practice, getting trapped in early discovered modes and requiring prolonged training to find diverse solutions. Existing exploration techniques may rely on heuristic novelty signals. We propose Loss-Guided GFlowNets (LGGFN), a novel approach where an auxiliary GFlowNet's exploration is directly driven by the main GFlowNet's training loss. By prioritizing trajectories where the main model exhibits high loss, LGGFN focuses sampling on poorly understood regions of the state space. This targeted exploration significantly accelerates the discovery of diverse, high-reward samples. Empirically, across various benchmarks including grid environments, structured sequence generation, and Bayesian structure learning, LGGFN consistently enhances exploration efficiency and sample diversity compared to baselines. For instance, on a challenging sequence generation task, it discovered over 40 times more unique valid modes while simultaneously reducing the exploration error metric by approximately 99%.

Title: gen2seg: Generative Models Enable Generalizable Instance Segmentation

Authors: Om Khangaonkar, Hamed Pirsiavash
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15263
Pdf URL: https://arxiv.org/pdf/2505.15263
Copy Paste: [[2505.15263]] gen2seg: Generative Models Enable Generalizable Instance Segmentation(https://arxiv.org/abs/2505.15263)
Keywords: generative
Abstract: By pretraining to synthesize coherent images from perturbed inputs, generative models inherently learn to understand object boundaries and scene compositions. How can we repurpose these generative representations for general-purpose perceptual organization? We finetune Stable Diffusion and MAE (encoder+decoder) for category-agnostic instance segmentation using our instance coloring loss exclusively on a narrow set of object types (indoor furnishings and cars). Surprisingly, our models exhibit strong zero-shot generalization, accurately segmenting objects of types and styles unseen in finetuning (and in many cases, MAE's ImageNet-1K pretraining too). Our best-performing models closely approach the heavily supervised SAM when evaluated on unseen object types and styles, and outperform it when segmenting fine structures and ambiguous boundaries. In contrast, existing promptable segmentation architectures or discriminatively pretrained models fail to generalize. This suggests that generative models learn an inherent grouping mechanism that transfers across categories and domains, even without internet-scale pretraining. Code, pretrained models, and demos are available on our website.

Title: Scaling Diffusion Transformers Efficiently via $μ$P

Authors: Chenyu Zheng, Xinyu Zhang, Rongzhen Wang, Wei Huang, Zhi Tian, Weilin Huang, Jun Zhu, Chongxuan Li
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.15270
Pdf URL: https://arxiv.org/pdf/2505.15270
Copy Paste: [[2505.15270]] Scaling Diffusion Transformers Efficiently via $μ$P(https://arxiv.org/abs/2505.15270)
Keywords: generation, generative
Abstract: Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ architecturally and objectively. In this work, we generalize standard $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $\mu$P of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$\mu$P enjoys robust HP transferability. Notably, DiT-XL-2-$\mu$P with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $\mu$P on text-to-image generation by scaling PixArt-$\alpha$ from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under $\mu$P outperform their respective baselines while requiring small tuning cost, only 5.5% of one training run for PixArt-$\alpha$ and 3% of consumption by human experts for MMDiT-18B. These results establish $\mu$P as a principled and efficient framework for scaling diffusion Transformers.

Title: GS2E: Gaussian Splatting is an Effective Data Generator for Event Stream Generation

Authors: Yuchen Li, Chaoran Feng, Zhenyu Tang, Kaiyuan Deng, Wangbo Yu, Yonghong Tian, Li Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15287
Pdf URL: https://arxiv.org/pdf/2505.15287
Copy Paste: [[2505.15287]] GS2E: Gaussian Splatting is an Effective Data Generator for Event Stream Generation(https://arxiv.org/abs/2505.15287)
Keywords: generation
Abstract: We introduce GS2E (Gaussian Splatting to Event), a large-scale synthetic event dataset for high-fidelity event vision tasks, captured from real-world sparse multi-view RGB images. Existing event datasets are often synthesized from dense RGB videos, which typically lack viewpoint diversity and geometric consistency, or depend on expensive, difficult-to-scale hardware setups. GS2E overcomes these limitations by first reconstructing photorealistic static scenes using 3D Gaussian Splatting, and subsequently employing a novel, physically-informed event simulation pipeline. This pipeline generally integrates adaptive trajectory interpolation with physically-consistent event contrast threshold modeling. Such an approach yields temporally dense and geometrically consistent event streams under diverse motion and lighting conditions, while ensuring strong alignment with underlying scene structures. Experimental results on event-based 3D reconstruction demonstrate GS2E's superior generalization capabilities and its practical value as a benchmark for advancing event vision research.
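The contrast-threshold idea mentioned in the abstract can be illustrated with the standard idealized event camera model: a pixel emits an event whenever its log intensity moves by at least a threshold away from a reference level, and the reference then steps by that threshold. The sketch below is this generic model for a single pixel, not the GS2E pipeline. The function name `events_from_intensity`, the default threshold, and the choice to stamp events with the sample time (no sub-frame interpolation) are assumptions made for illustration.

```python
import math


def events_from_intensity(samples, threshold=0.25, eps=1e-6):
    """Generate (timestamp, polarity) events for one pixel.

    samples: list of (timestamp, intensity) pairs in time order.
    threshold: assumed contrast threshold on log intensity.
    eps: small offset that keeps log() defined at zero intensity.
    """
    events = []
    _, first_intensity = samples[0]
    ref = math.log(first_intensity + eps)  # reference log intensity
    for t, intensity in samples[1:]:
        diff = math.log(intensity + eps) - ref
        # A large change can trigger several events at the same sample time.
        while abs(diff) >= threshold:
            polarity = 1 if diff > 0 else -1
            events.append((t, polarity))
            ref += polarity * threshold  # step the reference toward the signal
            diff = math.log(intensity + eps) - ref
    return events
```

A brightening pixel yields positive-polarity events and a darkening pixel yields negative ones; changes smaller than the threshold produce no events at all.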

Title: BadSR: Stealthy Label Backdoor Attacks on Image Super-Resolution

Authors: Ji Guo, Xiaolei Wen, Wenbo Jiang, Cheng Huang, Jinjin Li, Hongwei Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15308
Pdf URL: https://arxiv.org/pdf/2505.15308
Copy Paste: [[2505.15308]] BadSR: Stealthy Label Backdoor Attacks on Image Super-Resolution(https://arxiv.org/abs/2505.15308)
Keywords: super-resolution
Abstract: With the widespread application of super-resolution (SR) in various fields, researchers have begun to investigate its security. Previous studies have demonstrated that SR models can also be subjected to backdoor attacks through data poisoning, affecting downstream tasks. A backdoor SR model generates an attacker-predefined target image when given a triggered image while producing a normal high-resolution (HR) output for clean images. However, prior backdoor attacks on SR models have primarily focused on the stealthiness of poisoned low-resolution (LR) images while ignoring the stealthiness of poisoned HR images, making it easy for users to detect anomalous data. To address this problem, we propose BadSR, which improves the stealthiness of poisoned HR images. The key idea of BadSR is to approximate the clean HR image and the pre-defined target image in the feature space while ensuring that modifications to the clean HR image remain within a constrained range. The poisoned HR images generated by BadSR can be integrated with existing triggers. To further improve the effectiveness of BadSR, we design an adversarially optimized trigger and a backdoor gradient-driven poisoned sample selection method based on a genetic algorithm. The experimental results show that BadSR achieves a high attack success rate in various models and data sets, significantly affecting downstream tasks.
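Summary: With the widespread application of super-resolution (SR) across many fields, researchers have begun to investigate its security. Previous studies have shown that SR models can also be subjected to backdoor attacks through data poisoning, which affects downstream tasks. A backdoored SR model generates an attacker-predefined target image when given a triggered image, while producing a normal high-resolution (HR) output for clean images. However, prior backdoor attacks on SR models have focused mainly on the stealthiness of poisoned low-resolution (LR) images and ignored the stealthiness of poisoned HR images, making anomalous data easy for users to detect. To address this problem, we propose BadSR, which improves the stealthiness of poisoned HR images. The key idea of BadSR is to bring the clean HR image and the predefined target image close together in feature space while keeping modifications to the clean HR image within a constrained range. The poisoned HR images generated by BadSR can be combined with existing triggers. To further improve BadSR's effectiveness, we design an adversarially optimized trigger and a backdoor gradient-driven poisoned sample selection method based on a genetic algorithm. Experimental results show that BadSR achieves a high attack success rate across various models and datasets, significantly affecting downstream tasks.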

Title: FaceCrafter: Identity-Conditional Diffusion with Disentangled Control over Facial Pose, Expression, and Emotion

Authors: Kazuaki Mishima, Antoni Bigata Casademunt, Stavros Petridis, Maja Pantic, Kenji Suzuki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15313
Pdf URL: https://arxiv.org/pdf/2505.15313
Copy Paste: [[2505.15313]] FaceCrafter: Identity-Conditional Diffusion with Disentangled Control over Facial Pose, Expression, and Emotion(https://arxiv.org/abs/2505.15313)
Keywords: generation, generative
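Abstract: Human facial images encode a rich spectrum of information, encompassing both stable identity-related traits and mutable attributes such as pose, expression, and emotion. While recent advances in image generation have enabled high-quality identity-conditional face synthesis, precise control over non-identity attributes remains challenging, and disentangling identity from these mutable factors is particularly difficult. To address these limitations, we propose a novel identity-conditional diffusion model that introduces two lightweight control modules designed to independently manipulate facial pose, expression, and emotion without compromising identity preservation. These modules are embedded within the cross-attention layers of the base diffusion model, enabling precise attribute control with minimal parameter overhead. Furthermore, our tailored training strategy, which leverages cross-attention between the identity feature and each non-identity control feature, encourages identity features to remain orthogonal to control signals, enhancing controllability and diversity. Quantitative and qualitative evaluations, along with perceptual user studies, demonstrate that our method surpasses existing approaches in terms of control accuracy over pose, expression, and emotion, while also improving generative diversity under identity-only conditioning.
Summary: Human facial images encode a rich range of information, including stable identity-related traits and mutable attributes such as pose, expression, and emotion. Although recent advances in image generation have enabled high-quality identity-conditional face synthesis, precise control over non-identity attributes remains challenging, and disentangling identity from these mutable factors is especially difficult. To address these limitations, we propose a novel identity-conditional diffusion model that introduces two lightweight control modules designed to independently manipulate facial pose, expression, and emotion without compromising identity preservation. These modules are embedded in the cross-attention layers of the base diffusion model, enabling precise attribute control with minimal parameter overhead. In addition, our tailored training strategy, which uses cross-attention between the identity feature and each non-identity control feature, encourages identity features to remain orthogonal to the control signals, enhancing controllability and diversity. Quantitative and qualitative evaluations, together with perceptual user studies, show that our method surpasses existing approaches in control accuracy over pose, expression, and emotion, while also improving generative diversity under identity-only conditioning.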

Title: My Face Is Mine, Not Yours: Facial Protection Against Diffusion Model Face Swapping

Authors: Hon Ming Yam, Zhongliang Guo, Chun Pong Lau
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15336
Pdf URL: https://arxiv.org/pdf/2505.15336
Copy Paste: [[2505.15336]] My Face Is Mine, Not Yours: Facial Protection Against Diffusion Model Face Swapping(https://arxiv.org/abs/2505.15336)
Keywords: generative
Abstract: The proliferation of diffusion-based deepfake technologies poses significant risks for unauthorized and unethical facial image manipulation. While traditional countermeasures have primarily focused on passive detection methods, this paper introduces a novel proactive defense strategy through adversarial attacks that preemptively protect facial images from being exploited by diffusion-based deepfake systems. Existing adversarial protection methods predominantly target conventional generative architectures (GANs, AEs, VAEs) and fail to address the unique challenges presented by diffusion models, which have become the predominant framework for high-quality facial deepfakes. Current diffusion-specific adversarial approaches are limited by their reliance on specific model architectures and weights, rendering them ineffective against the diverse landscape of diffusion-based deepfake implementations. Additionally, they typically employ global perturbation strategies that inadequately address the region-specific nature of facial manipulation in deepfakes.
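Summary: The proliferation of diffusion-based deepfake technologies poses significant risks of unauthorized and unethical facial image manipulation. While traditional countermeasures have focused mainly on passive detection methods, this paper introduces a novel proactive defense strategy that uses adversarial attacks to preemptively protect facial images from being exploited by diffusion-based deepfake systems. Existing adversarial protection methods predominantly target conventional generative architectures (GANs, AEs, VAEs) and fail to address the unique challenges posed by diffusion models, which have become the predominant framework for high-quality facial deepfakes. Current diffusion-specific adversarial approaches are limited by their reliance on specific model architectures and weights, which renders them ineffective against the diverse landscape of diffusion-based deepfake implementations. In addition, they typically employ global perturbation strategies that do not adequately address the region-specific nature of facial manipulation in deepfakes.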

Title: Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation

Authors: Jianyuan Guo, Peike Li, Trevor Cohn
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15438
Pdf URL: https://arxiv.org/pdf/2505.15438
Copy Paste: [[2505.15438]] Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation(https://arxiv.org/abs/2505.15438)
Keywords: generation
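Abstract: Sign Language Translation (SLT) aims to map sign language videos to spoken language text. A common approach relies on gloss annotations as an intermediate representation, decomposing SLT into two sub-tasks: video-to-gloss recognition and gloss-to-text translation. While effective, this paradigm depends on expert-annotated gloss labels, which are costly and rarely available in existing datasets, limiting its scalability. To address this challenge, we propose a gloss-free pseudo gloss generation framework that eliminates the need for human-annotated glosses while preserving the structured intermediate representation. Specifically, we prompt a Large Language Model (LLM) with a few example text-gloss pairs using in-context learning to produce draft sign glosses from spoken language text. To enhance the correspondence between LLM-generated pseudo glosses and the sign sequences in video, we correct the ordering in the pseudo glosses for better alignment via a weakly supervised learning process. This reordering facilitates the incorporation of auxiliary alignment objectives, and allows for the use of efficient supervision via a Connectionist Temporal Classification (CTC) loss. We train our SLT model, which consists of a vision encoder and a translator, through a three-stage pipeline, which progressively narrows the modality gap between sign language and spoken language. Despite its simplicity, our approach outperforms previous state-of-the-art gloss-free frameworks on two SLT benchmarks and achieves competitive results compared to gloss-based methods.
Summary: Sign Language Translation (SLT) aims to map sign language videos to spoken language text. A common approach relies on gloss annotations as an intermediate representation, decomposing SLT into two sub-tasks: video-to-gloss recognition and gloss-to-text translation. Although effective, this paradigm depends on expert-annotated gloss labels, which are costly and rarely available in existing datasets, and this limits its scalability. To address this challenge, we propose a gloss-free pseudo gloss generation framework that removes the need for human-annotated glosses while preserving the structured intermediate representation. Specifically, we prompt a Large Language Model (LLM) with a few example text-gloss pairs via in-context learning to produce draft sign glosses from spoken language text. To strengthen the correspondence between the LLM-generated pseudo glosses and the sign sequences in the video, we correct the ordering of the pseudo glosses for better alignment through a weakly supervised learning process. This reordering makes it easier to incorporate auxiliary alignment objectives and allows efficient supervision via a Connectionist Temporal Classification (CTC) loss. We train our SLT model, which consists of a vision encoder and a translator, through a three-stage pipeline that progressively narrows the modality gap between sign language and spoken language. Despite its simplicity, our approach outperforms previous state-of-the-art gloss-free frameworks on two SLT benchmarks and achieves results competitive with gloss-based methods.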

Title: FRN: Fractal-Based Recursive Spectral Reconstruction Network

Authors: Ge Meng, Zhongnan Cai, Ruizhe Chen, Jingyan Tu, Yingying Wang, Yue Huang, Xinghao Ding
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.15439
Pdf URL: https://arxiv.org/pdf/2505.15439
Copy Paste: [[2505.15439]] FRN: Fractal-Based Recursive Spectral Reconstruction Network(https://arxiv.org/abs/2505.15439)
Keywords: generation
Abstract: Generating hyperspectral images (HSIs) from RGB images through spectral reconstruction can significantly reduce the cost of HSI acquisition. In this paper, we propose a Fractal-Based Recursive Spectral Reconstruction Network (FRN), which differs from existing paradigms that attempt to directly integrate the full-spectrum information from the R, G, and B channels in a one-shot manner. Instead, it treats spectral reconstruction as a progressive process, predicting from broad to narrow bands or employing a coarse-to-fine approach for predicting the next wavelength. Inspired by fractals in mathematics, FRN establishes a novel spectral reconstruction paradigm by recursively invoking an atomic reconstruction module. In each invocation, only the spectral information from neighboring bands is used to provide clues for the generation of the image at the next wavelength, which follows the low-rank property of spectral data. Moreover, we design a band-aware state space model that employs a pixel-differentiated scanning strategy at different stages of the generation process, further suppressing interference from low-correlation regions caused by reflectance differences. Through extensive experimentation across different datasets, FRN achieves superior reconstruction performance compared to state-of-the-art methods in both quantitative and qualitative evaluations.
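Summary: Generating hyperspectral images (HSIs) from RGB images through spectral reconstruction can significantly reduce the cost of HSI acquisition. In this paper, we propose a Fractal-Based Recursive Spectral Reconstruction Network (FRN), which differs from existing paradigms that attempt to integrate the full-spectrum information from the R, G, and B channels directly in a one-shot manner. Instead, it treats spectral reconstruction as a progressive process, predicting from broad to narrow bands or using a coarse-to-fine approach to predict the next wavelength. Inspired by fractals in mathematics, FRN establishes a novel spectral reconstruction paradigm by recursively invoking an atomic reconstruction module. In each invocation, only the spectral information from neighboring bands is used to provide clues for generating the image at the next wavelength, which follows the low-rank property of spectral data. Moreover, we design a band-aware state space model that employs a pixel-differentiated scanning strategy at different stages of the generation process, further suppressing interference from low-correlation regions caused by reflectance differences. In extensive experiments across different datasets, FRN achieves superior reconstruction performance compared with state-of-the-art methods in both quantitative and qualitative evaluations.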

Title: Comprehensive Evaluation and Analysis for NSFW Concept Erasure in Text-to-Image Diffusion Models

Authors: Die Chen, Zhiwen Li, Cen Chen, Yuexiang Xie, Xiaodan Li, Jinyan Ye, Yingda Chen, Yaliang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15450
Pdf URL: https://arxiv.org/pdf/2505.15450
Copy Paste: [[2505.15450]] Comprehensive Evaluation and Analysis for NSFW Concept Erasure in Text-to-Image Diffusion Models(https://arxiv.org/abs/2505.15450)
Keywords: generation
Abstract: Text-to-image diffusion models have gained widespread application across various domains, demonstrating remarkable creative potential. However, the strong generalization capabilities of diffusion models can inadvertently lead to the generation of not-safe-for-work (NSFW) content, posing significant risks to their safe deployment. While several concept erasure methods have been proposed to mitigate the issue associated with NSFW content, a comprehensive evaluation of their effectiveness across various scenarios remains absent. To bridge this gap, we introduce a full-pipeline toolkit specifically designed for concept erasure and conduct the first systematic study of NSFW concept erasure methods. By examining the interplay between the underlying mechanisms and empirical observations, we provide in-depth insights and practical guidance for the effective application of concept erasure methods in various real-world scenarios, with the aim of advancing the understanding of content safety in diffusion models and establishing a solid foundation for future research and development in this critical area.
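Summary: Text-to-image diffusion models have been widely applied across various domains and show remarkable creative potential. However, the strong generalization capabilities of diffusion models can inadvertently lead to the generation of not-safe-for-work (NSFW) content, posing significant risks to their safe deployment. Although several concept erasure methods have been proposed to mitigate the issues associated with NSFW content, a comprehensive evaluation of their effectiveness across various scenarios is still missing. To bridge this gap, we introduce a full-pipeline toolkit designed specifically for concept erasure and conduct the first systematic study of NSFW concept erasure methods. By examining the interplay between the underlying mechanisms and empirical observations, we provide in-depth insights and practical guidance for applying concept erasure methods effectively in various real-world scenarios, with the aim of advancing the understanding of content safety in diffusion models and establishing a solid foundation for future research and development in this critical area.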

Title: NOMAD Projection

Authors: Brandon Duderstadt, Zach Nussbaum, Laurens van der Maaten
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.15511
Pdf URL: https://arxiv.org/pdf/2505.15511
Copy Paste: [[2505.15511]] NOMAD Projection(https://arxiv.org/abs/2505.15511)
Keywords: generative
Abstract: The rapid adoption of generative AI has driven an explosion in the size of datasets consumed and produced by AI models. Traditional methods for unstructured data visualization, such as t-SNE and UMAP, have not kept up with the pace of dataset scaling. This presents a significant challenge for AI explainability, which relies on methods such as t-SNE and UMAP for exploratory data analysis. In this paper, we introduce Negative Or Mean Affinity Discrimination (NOMAD) Projection, the first method for unstructured data visualization via nonlinear dimensionality reduction that can run on multiple GPUs at train time. We provide theory that situates NOMAD Projection as an approximate upper bound on the InfoNC-t-SNE loss, and empirical results that demonstrate NOMAD Projection's superior performance and speed profile compared to existing state-of-the-art methods. We demonstrate the scalability of NOMAD Projection by computing the first complete data map of Multilingual Wikipedia.
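Summary: The rapid adoption of generative AI has driven an explosion in the size of datasets consumed and produced by AI models. Traditional methods for unstructured data visualization, such as t-SNE and UMAP, have not kept pace with dataset scaling. This poses a significant challenge for AI explainability, which relies on methods such as t-SNE and UMAP for exploratory data analysis. In this paper, we introduce Negative Or Mean Affinity Discrimination (NOMAD) Projection, the first method for unstructured data visualization via nonlinear dimensionality reduction that can run on multiple GPUs at train time. We provide theory that situates NOMAD Projection as an approximate upper bound on the InfoNC-t-SNE loss, along with empirical results that demonstrate NOMAD Projection's superior performance and speed profile compared with existing state-of-the-art methods. We demonstrate the scalability of NOMAD Projection by computing the first complete data map of Multilingual Wikipedia.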

Title: PlantDreamer: Achieving Realistic 3D Plant Models with Diffusion-Guided Gaussian Splatting

Authors: Zane K J Hartley, Lewis A G Stuart, Andrew P French, Michael P Pound
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2505.15528
Pdf URL: https://arxiv.org/pdf/2505.15528
Copy Paste: [[2505.15528]] PlantDreamer: Achieving Realistic 3D Plant Models with Diffusion-Guided Gaussian Splatting(https://arxiv.org/abs/2505.15528)
Keywords: generation, generative
Abstract: Recent years have seen substantial improvements in the ability to generate synthetic 3D objects using AI. However, generating complex 3D objects, such as plants, remains a considerable challenge. Current generative 3D models struggle with plant generation compared to general objects, limiting their usability in plant analysis tools, which require fine detail and accurate geometry. We introduce PlantDreamer, a novel approach to 3D synthetic plant generation, which can achieve greater levels of realism for complex plant geometry and textures than available text-to-3D models. To achieve this, our new generation pipeline leverages a depth ControlNet, fine-tuned Low-Rank Adaptation and an adaptable Gaussian culling algorithm, which directly improve textural realism and geometric integrity of generated 3D plant models. Additionally, PlantDreamer enables both purely synthetic plant generation, by leveraging L-System-generated meshes, and the enhancement of real-world plant point clouds by converting them into 3D Gaussian Splats. We evaluate our approach by comparing its outputs with state-of-the-art text-to-3D models, demonstrating that PlantDreamer outperforms existing methods in producing high-fidelity synthetic plants. Our results indicate that our approach not only advances synthetic plant generation, but also facilitates the upgrading of legacy point cloud datasets, making it a valuable tool for 3D phenotyping applications.
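Summary: Recent years have seen substantial improvements in the ability to generate synthetic 3D objects using AI. However, generating complex 3D objects, such as plants, remains a considerable challenge. Current generative 3D models struggle with plant generation compared with general objects, which limits their usability in plant analysis tools that require fine detail and accurate geometry. We introduce PlantDreamer, a novel approach to 3D synthetic plant generation that achieves greater realism for complex plant geometry and textures than available text-to-3D models. To achieve this, our new generation pipeline leverages a depth ControlNet, fine-tuned Low-Rank Adaptation, and an adaptable Gaussian culling algorithm, which directly improve the textural realism and geometric integrity of the generated 3D plant models. In addition, PlantDreamer enables both purely synthetic plant generation, by leveraging L-System-generated meshes, and the enhancement of real-world plant point clouds, by converting them into 3D Gaussian Splats. We evaluate our approach by comparing its outputs with state-of-the-art text-to-3D models, showing that PlantDreamer outperforms existing methods in producing high-fidelity synthetic plants. Our results indicate that our approach not only advances synthetic plant generation but also facilitates the upgrading of legacy point cloud datasets, making it a valuable tool for 3D phenotyping applications.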

Title: seg_3D_by_PC2D: Multi-View Projection for Domain Generalization and Adaptation in 3D Semantic Segmentation

Authors: Andrew Caunes, Thierry Chateau, Vincent Fremont
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15545
Pdf URL: https://arxiv.org/pdf/2505.15545
Copy Paste: [[2505.15545]] seg_3D_by_PC2D: Multi-View Projection for Domain Generalization and Adaptation in 3D Semantic Segmentation(https://arxiv.org/abs/2505.15545)
Keywords: generation
Abstract: 3D semantic segmentation plays a pivotal role in autonomous driving and road infrastructure analysis, yet state-of-the-art 3D models are prone to severe domain shift when deployed across different datasets. We propose a novel multi-view projection framework that excels in both domain generalization (DG) and unsupervised domain adaptation (UDA). Our approach first aligns Lidar scans into coherent 3D scenes and renders them from multiple virtual camera poses to create a large-scale synthetic 2D dataset (PC2D). We then use it to train a 2D segmentation model in-domain. During inference, the model processes hundreds of views per scene; the resulting logits are back-projected to 3D with an occlusion-aware voting scheme to generate final point-wise labels. Our framework is modular and enables extensive exploration of key design parameters, such as view generation optimization (VGO), visualization modality optimization (MODO), and 2D model choice. We evaluate on the nuScenes and SemanticKITTI datasets under both the DG and UDA settings. We achieve state-of-the-art results in UDA and close to state-of-the-art in DG, with particularly large gains on large, static classes. Our code and dataset generation tools will be publicly available at this https URL
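Summary: 3D semantic segmentation plays a pivotal role in autonomous driving and road infrastructure analysis, yet state-of-the-art 3D models are prone to severe domain shift when deployed across different datasets. We propose a novel multi-view projection framework that excels in both domain generalization (DG) and unsupervised domain adaptation (UDA). Our approach first aligns Lidar scans into coherent 3D scenes and renders them from multiple virtual camera poses to create a large-scale synthetic 2D dataset (PC2D). We then use it to train a 2D segmentation model in-domain. During inference, the model processes hundreds of views per scene; the resulting logits are back-projected to 3D with an occlusion-aware voting scheme to generate the final point-wise labels. Our framework is modular and enables extensive exploration of key design parameters, such as view generation optimization (VGO), visualization modality optimization (MODO), and 2D model choice. We evaluate on the nuScenes and SemanticKITTI datasets under both the DG and UDA settings. We achieve state-of-the-art results in UDA and close to state-of-the-art results in DG, with particularly large gains on large, static classes. Our code and dataset generation tools will be publicly available at this https URL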
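The back-projection step in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the function name `vote_point_labels`, the per-view visibility mask, and the choice of summing logits before taking the argmax are all assumptions made for illustration. The abstract only states that logits from many views are combined with an occlusion-aware voting scheme.

```python
from typing import List, Sequence


def vote_point_labels(
    view_logits: Sequence[Sequence[Sequence[float]]],
    visible: Sequence[Sequence[bool]],
    num_classes: int,
) -> List[int]:
    """Combine per-view class logits into one label per 3D point.

    view_logits[v][p] is the logit vector that view v assigns to point p.
    visible[v][p] is True when point p is unoccluded in view v (assumed to be
    computed elsewhere, e.g. from a depth test). Points seen in no view get -1.
    """
    num_points = len(visible[0])
    totals = [[0.0] * num_classes for _ in range(num_points)]
    seen = [False] * num_points

    for v, logits_v in enumerate(view_logits):
        for p in range(num_points):
            if not visible[v][p]:
                continue  # occluded in this view: the view casts no vote
            seen[p] = True
            for c in range(num_classes):
                totals[p][c] += logits_v[p][c]

    labels = []
    for p in range(num_points):
        if not seen[p]:
            labels.append(-1)
        else:
            labels.append(max(range(num_classes), key=lambda c: totals[p][c]))
    return labels
```

Summing logits rather than counting hard votes lets confident views outweigh uncertain ones; a majority vote over per-view argmax labels would be an equally plausible reading of the abstract.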

Title: Impact of Data Sparsity on Machine Learning for Fault Detection in Power System Protection

Authors: Julian Oelhaf, Georg Kordowich, Changhun Kim, Paula Andrea Perez-Toro, Andreas Maier, Johann Jager, Siming Bayer
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2505.15560
Pdf URL: https://arxiv.org/pdf/2505.15560
Copy Paste: [[2505.15560]] Impact of Data Sparsity on Machine Learning for Fault Detection in Power System Protection(https://arxiv.org/abs/2505.15560)
Keywords: generation
Abstract: Germany's transition to a renewable energy-based power system is reshaping grid operations, requiring advanced monitoring and control to manage decentralized generation. Machine learning (ML) has emerged as a powerful tool for power system protection, particularly for fault detection (FD) and fault line identification (FLI) in transmission grids. However, ML model reliability depends on data quality and availability. Data sparsity resulting from sensor failures, communication disruptions, or reduced sampling rates poses a challenge to ML-based FD and FLI. Yet, its impact has not been systematically validated prior to this work. In response, we propose a framework to assess the impact of data sparsity on ML-based FD and FLI performance. We simulate realistic data sparsity scenarios, evaluate their impact, derive quantitative insights, and demonstrate the effectiveness of this evaluation strategy by applying it to an existing ML-based framework. Results show the ML model remains robust for FD, maintaining an F1-score of 0.999 $\pm$ 0.000 even after a 50x data reduction. In contrast, FLI is more sensitive, with performance decreasing by 55.61% for missing voltage measurements and 9.73% due to communication failures at critical network points. These findings offer actionable insights for optimizing ML models for real-world grid protection. This enables more efficient FD and supports targeted improvements in FLI.
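Summary: Germany's transition to a renewable energy-based power system is reshaping grid operations and requires advanced monitoring and control to manage decentralized generation. Machine learning (ML) has emerged as a powerful tool for power system protection, particularly for fault detection (FD) and fault line identification (FLI) in transmission grids. However, ML model reliability depends on data quality and availability. Data sparsity resulting from sensor failures, communication disruptions, or reduced sampling rates poses a challenge to ML-based FD and FLI, yet its impact had not been systematically validated prior to this work. In response, we propose a framework to assess the impact of data sparsity on ML-based FD and FLI performance. We simulate realistic data sparsity scenarios, evaluate their impact, derive quantitative insights, and demonstrate the effectiveness of this evaluation strategy by applying it to an existing ML-based framework. Results show that the ML model remains robust for FD, maintaining an F1-score of 0.999 $\pm$ 0.000 even after a 50x data reduction. In contrast, FLI is more sensitive, with performance decreasing by 55.61% for missing voltage measurements and by 9.73% due to communication failures at critical network points. These findings offer actionable insights for optimizing ML models for real-world grid protection. This enables more efficient FD and supports targeted improvements in FLI.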
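To make the evaluation idea above concrete, here is a minimal sketch of one sparsity scenario (a reduced sampling rate) and of the F1-score used to report FD robustness. The helper names `downsample` and `f1_score` are hypothetical, and keeping every `factor`-th sample is only one assumed way to realise a "50x data reduction"; the abstract does not specify the procedure.

```python
from typing import List, Sequence


def downsample(samples: Sequence[float], factor: int) -> List[float]:
    """Keep every `factor`-th sample to mimic a reduced sampling rate."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return list(samples[::factor])


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1-score, with label 1 meaning 'fault' and 0 meaning 'no fault'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In an assessment of this kind, a detector would be scored with `f1_score` on the original signals and again on the `downsample`d ones, and the difference between the two scores would quantify sensitivity to that sparsity scenario.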

Title: Bridging the Domain Gap in Equation Distillation with Reinforcement Feedback

Authors: Wangyang Ying, Haoyue Bai, Nanxu Gong, Xinyuan Wang, Sixun Dong, Haifeng Chen, Yanjie Fu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15572
Pdf URL: https://arxiv.org/pdf/2505.15572
Copy Paste: [[2505.15572]] Bridging the Domain Gap in Equation Distillation with Reinforcement Feedback(https://arxiv.org/abs/2505.15572)
Keywords: generation
Abstract: The data-to-equation (Data2Eqn) task aims to discover interpretable mathematical equations that map observed values to labels, offering physical insights and broad applicability across academic and industrial domains. Genetic programming and traditional deep learning-based approaches suffer from search inefficiency and poor generalization on small task-specific datasets. Foundation models showed promise in this area, but existing approaches suffer from: 1) They are pretrained on general-purpose data distributions, making them less effective for domain-specific tasks; and 2) their training objectives focus on token-level alignment, overlooking mathematical semantics, which can lead to inaccurate equations. To address these issues, we aim to enhance the domain adaptability of foundation models for Data2Eqn tasks. In this work, we propose a reinforcement learning-based finetuning framework that directly optimizes the generation policy of a pretrained model through reward signals derived from downstream numerical fitness. Our method allows the model to adapt to specific and complex data distributions and generate mathematically meaningful equations. Extensive experiments demonstrate that our approach improves both the accuracy and robustness of equation generation under complex distributions.
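Summary: The data-to-equation (Data2Eqn) task aims to discover interpretable mathematical equations that map observed values to labels, offering physical insights and broad applicability across academic and industrial domains. Genetic programming and traditional deep learning-based approaches suffer from search inefficiency and poor generalization on small task-specific datasets. Foundation models have shown promise in this area, but existing approaches have two shortcomings: 1) they are pretrained on general-purpose data distributions, which makes them less effective for domain-specific tasks; and 2) their training objectives focus on token-level alignment and overlook mathematical semantics, which can lead to inaccurate equations. To address these issues, we aim to enhance the domain adaptability of foundation models for Data2Eqn tasks. In this work, we propose a reinforcement learning-based finetuning framework that directly optimizes the generation policy of a pretrained model through reward signals derived from downstream numerical fitness. Our method allows the model to adapt to specific and complex data distributions and to generate mathematically meaningful equations. Extensive experiments show that our approach improves both the accuracy and the robustness of equation generation under complex distributions.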

Title: Guidelines for the Quality Assessment of Energy-Aware NAS Benchmarks

Authors: Nick Kocher, Christian Wassermann, Leona Hennig, Jonas Seng, Holger Hoos, Kristian Kersting, Marius Lindauer, Matthias Müller
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.15631
Pdf URL: https://arxiv.org/pdf/2505.15631
Copy Paste: [[2505.15631]] Guidelines for the Quality Assessment of Energy-Aware NAS Benchmarks(https://arxiv.org/abs/2505.15631)
Keywords: quality assessment
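Abstract: Neural Architecture Search (NAS) accelerates progress in deep learning through systematic refinement of model architectures. The downside is increasingly large energy consumption during the search process. Surrogate-based benchmarking mitigates the cost of full training by querying a pre-trained surrogate to obtain an estimate for the quality of the model. Specifically, energy-aware benchmarking aims to make it possible for NAS to favourably trade off model energy consumption against accuracy. Towards this end, we propose three design principles for such energy-aware benchmarks: (i) reliable power measurements, (ii) a wide range of GPU usage, and (iii) holistic cost reporting. We analyse EA-HAS-Bench based on these principles and find that the choice of GPU measurement API has a large impact on the quality of results. Using the Nvidia System Management Interface (SMI) on top of its underlying library influences the sampling rate during the initial data collection, returning faulty low-power estimations. This results in poor correlation with accurate measurements obtained from an external power meter. With this study, we bring to attention several key considerations when performing energy-aware surrogate-based benchmarking and derive first guidelines that can help design novel benchmarks. We show a narrow usage range of the four GPUs attached to our device, ranging from 146 W to 305 W in a single-GPU setting, and narrowing down even further when using all four GPUs. To improve holistic energy reporting, we propose calibration experiments over assumptions made in popular tools, such as Code Carbon, thus achieving reductions in the maximum inaccuracy from 10.3 % to 8.9 % without and to 6.6 % with prior estimation of the expected load on the device.
Summary: Neural Architecture Search (NAS) accelerates progress in deep learning through systematic refinement of model architectures. The downside is increasingly large energy consumption during the search process. Surrogate-based benchmarking reduces the cost of full training by querying a pre-trained surrogate to obtain an estimate of model quality. Specifically, energy-aware benchmarking aims to let NAS trade off model energy consumption against accuracy favourably. To this end, we propose three design principles for such energy-aware benchmarks: (i) reliable power measurements, (ii) a wide range of GPU usage, and (iii) holistic cost reporting. We analyse EA-HAS-Bench based on these principles and find that the choice of GPU measurement API has a large impact on the quality of results. Using the Nvidia System Management Interface (SMI) on top of its underlying library influences the sampling rate during the initial data collection, returning faulty low-power estimations. This results in poor correlation with accurate measurements obtained from an external power meter. With this study, we draw attention to several key considerations when performing energy-aware surrogate-based benchmarking and derive first guidelines that can help design novel benchmarks. We show a narrow usage range for the four GPUs attached to our device, from 146 W to 305 W in a single-GPU setting, which narrows even further when all four GPUs are used. To improve holistic energy reporting, we propose calibration experiments over the assumptions made in popular tools such as Code Carbon, reducing the maximum inaccuracy from 10.3 % to 8.9 % without, and to 6.6 % with, prior estimation of the expected load on the device.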

Title: FragFake: A Dataset for Fine-Grained Detection of Edited Images with Vision Language Models

Authors: Zhen Sun, Ziyi Zhang, Zeren Luo, Zeyang Sha, Tianshuo Cong, Zheng Li, Shiwen Cui, Weiqiang Wang, Jiaheng Wei, Xinlei He, Qi Li, Qian Wang
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2505.15644
Pdf URL: https://arxiv.org/pdf/2505.15644
Copy Paste: [[2505.15644]] FragFake: A Dataset for Fine-Grained Detection of Edited Images with Vision Language Models(https://arxiv.org/abs/2505.15644)
Keywords: generation
Abstract: Fine-grained edited image detection of localized edits in images is crucial for assessing content authenticity, especially given that modern diffusion models and image editing methods can produce highly realistic manipulations. However, this domain faces three challenges: (1) Binary classifiers yield only a global real-or-fake label without providing localization; (2) Traditional computer vision methods often rely on costly pixel-level annotations; and (3) No large-scale, high-quality dataset exists for modern image-editing detection techniques. To address these gaps, we develop an automated data-generation pipeline to create FragFake, the first dedicated benchmark dataset for edited image detection, which includes high-quality images from diverse editing models and a wide variety of edited objects. Based on FragFake, we utilize Vision Language Models (VLMs) for the first time in the task of edited image classification and edited region localization. Experimental results show that fine-tuned VLMs achieve higher average Object Precision across all datasets, significantly outperforming pretrained models. We further conduct ablation and transferability analyses to evaluate the detectors across various configurations and editing scenarios. To the best of our knowledge, this work is the first to reformulate localized image edit detection as a vision-language understanding task, establishing a new paradigm for the field. We anticipate that this work will establish a solid foundation to facilitate and inspire subsequent research endeavors in the domain of multimodal content authenticity.
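Summary: Fine-grained detection of localized edits in images is crucial for assessing content authenticity, especially given that modern diffusion models and image editing methods can produce highly realistic manipulations. However, this domain faces three challenges: (1) binary classifiers yield only a global real-or-fake label without providing localization; (2) traditional computer vision methods often rely on costly pixel-level annotations; and (3) no large-scale, high-quality dataset exists for modern image-editing detection techniques. To address these gaps, we develop an automated data-generation pipeline to create FragFake, the first dedicated benchmark dataset for edited image detection, which includes high-quality images from diverse editing models and a wide variety of edited objects. Based on FragFake, we use Vision Language Models (VLMs) for the first time in the tasks of edited image classification and edited region localization. Experimental results show that fine-tuned VLMs achieve higher average Object Precision across all datasets, significantly outperforming pretrained models. We further conduct ablation and transferability analyses to evaluate the detectors across various configurations and editing scenarios. To the best of our knowledge, this work is the first to reformulate localized image edit detection as a vision-language understanding task, establishing a new paradigm for the field. We anticipate that this work will lay a solid foundation to facilitate and inspire subsequent research in multimodal content authenticity.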

Title: Graph Conditional Flow Matching for Relational Data Generation

Authors: Davide Scassola, Sebastiano Saccani, Luca Bortolussi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.15668
Pdf URL: https://arxiv.org/pdf/2505.15668
Copy Paste: [[2505.15668]] Graph Conditional Flow Matching for Relational Data Generation(https://arxiv.org/abs/2505.15668)
Keywords: generation, generative
Abstract: Data synthesis is gaining momentum as a privacy-enhancing technology. While single-table tabular data generation has seen considerable progress, current methods for multi-table data often lack the flexibility and expressiveness needed to capture complex relational structures. In particular, they struggle with long-range dependencies and complex foreign-key relationships, such as tables with multiple parent tables or multiple types of links between the same pair of tables. We propose a generative model for relational data that generates the content of a relational dataset given the graph formed by the foreign-key relationships. We do this by learning a deep generative model of the content of the whole relational database by flow matching, where the neural network trained to denoise records leverages a graph neural network to obtain information from connected records. Our method is flexible, as it can support relational datasets with complex structures, and expressive, as the generation of each record can be influenced by any other record within the same connected component. We evaluate our method on several benchmark datasets and show that it achieves state-of-the-art performance in terms of synthetic data fidelity.
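Summary: Data synthesis is gaining momentum as a privacy-enhancing technology. While single-table tabular data generation has seen considerable progress, current methods for multi-table data often lack the flexibility and expressiveness needed to capture complex relational structures. In particular, they struggle with long-range dependencies and complex foreign-key relationships, such as tables with multiple parent tables or multiple types of links between the same pair of tables. We propose a generative model for relational data that generates the content of a relational dataset given the graph formed by the foreign-key relationships. We do this by learning a deep generative model of the content of the whole relational database through flow matching, where the neural network trained to denoise records uses a graph neural network to obtain information from connected records. Our method is flexible, as it can support relational datasets with complex structures, and expressive, as the generation of each record can be influenced by any other record within the same connected component. We evaluate our method on several benchmark datasets and show that it achieves state-of-the-art performance in terms of synthetic data fidelity.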

Title: RUSplatting: Robust 3D Gaussian Splatting for Sparse-View Underwater Scene Reconstruction

Authors: Zhuodong Jiang, Haoran Wang, Guoxi Huang, Brett Seymour, Nantheera Anantrasirichai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15737
Pdf URL: https://arxiv.org/pdf/2505.15737
Copy Paste: [[2505.15737]] RUSplatting: Robust 3D Gaussian Splatting for Sparse-View Underwater Scene Reconstruction(https://arxiv.org/abs/2505.15737)
Keywords: restoration
Abstract: Reconstructing high-fidelity underwater scenes remains a challenging task due to light absorption, scattering, and limited visibility inherent in aquatic environments. This paper presents an enhanced Gaussian Splatting-based framework that improves both the visual quality and geometric accuracy of deep underwater rendering. We propose decoupled learning for RGB channels, guided by the physics of underwater attenuation, to enable more accurate colour restoration. To address sparse-view limitations and improve view consistency, we introduce a frame interpolation strategy with a novel adaptive weighting scheme. Additionally, we introduce a new loss function aimed at reducing noise while preserving edges, which is essential for deep-sea content. We also release a newly collected dataset, Submerged3D, captured specifically in deep-sea environments. Experimental results demonstrate that our framework consistently outperforms state-of-the-art methods with PSNR gains up to 1.90dB, delivering superior perceptual quality and robustness, and offering promising directions for marine robotics and underwater visual analytics.
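Summary: Reconstructing high-fidelity underwater scenes remains a challenging task because of the light absorption, scattering, and limited visibility inherent in aquatic environments. This paper presents an enhanced Gaussian Splatting-based framework that improves both the visual quality and the geometric accuracy of deep underwater rendering. We propose decoupled learning for the RGB channels, guided by the physics of underwater attenuation, to enable more accurate colour restoration. To address sparse-view limitations and improve view consistency, we introduce a frame interpolation strategy with a novel adaptive weighting scheme. In addition, we introduce a new loss function aimed at reducing noise while preserving edges, which is essential for deep-sea content. We also release a newly collected dataset, Submerged3D, captured specifically in deep-sea environments. Experimental results show that our framework consistently outperforms state-of-the-art methods, with PSNR gains of up to 1.90dB, delivering superior perceptual quality and robustness and offering promising directions for marine robotics and underwater visual analytics.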

Title: Constructing a 3D Town from a Single Image

Authors: Kaizhi Zheng, Ruijian Zhang, Jing Gu, Jie Yang, Xin Eric Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15765
Pdf URL: https://arxiv.org/pdf/2505.15765
Copy Paste: [[2505.15765]] Constructing a 3D Town from a Single Image(https://arxiv.org/abs/2505.15765)
Keywords: generation, generative
Abstract: Acquiring detailed 3D scenes typically demands costly equipment, multi-view data, or labor-intensive modeling. Therefore, a lightweight alternative, generating complex 3D scenes from a single top-down image, plays an essential role in real-world applications. While recent 3D generative models have achieved remarkable results at the object level, their extension to full-scene generation often leads to inconsistent geometry, layout hallucinations, and low-quality meshes. In this work, we introduce 3DTown, a training-free framework designed to synthesize realistic and coherent 3D scenes from a single top-down view. Our method is grounded in two principles: region-based generation to improve image-to-3D alignment and resolution, and spatial-aware 3D inpainting to ensure global scene coherence and high-quality geometry generation. Specifically, we decompose the input image into overlapping regions and generate each using a pretrained 3D object generator, followed by a masked rectified flow inpainting process that fills in missing geometry while maintaining structural continuity. This modular design allows us to overcome resolution bottlenecks and preserve spatial structure without requiring 3D supervision or fine-tuning. Extensive experiments across diverse scenes show that 3DTown outperforms state-of-the-art baselines, including Trellis, Hunyuan3D-2, and TripoSG, in terms of geometry quality, spatial coherence, and texture fidelity. Our results demonstrate that high-quality 3D town generation is achievable from a single image using a principled, training-free approach.

Title: IA-T2I: Internet-Augmented Text-to-Image Generation

Authors: Chuanhao Li, Jianwen Sun, Yukang Feng, Mingliang Zhai, Yifan Chang, Kaipeng Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15779
Pdf URL: https://arxiv.org/pdf/2505.15779
Copy Paste: [[2505.15779]] IA-T2I: Internet-Augmented Text-to-Image Generation(https://arxiv.org/abs/2505.15779)
Keywords: generation
Abstract: Current text-to-image (T2I) generation models achieve promising results, but they fail in scenarios where the knowledge implied in the text prompt is uncertain. For example, a T2I model released in February would struggle to generate a suitable poster for a movie premiering in April, because the character designs and styles are uncertain to the model. To solve this problem, we propose an Internet-Augmented text-to-image generation (IA-T2I) framework that makes such uncertain knowledge clear to T2I models by providing them with reference images. Specifically, an active retrieval module is designed to determine whether a reference image is needed based on the given text prompt; a hierarchical image selection module is introduced to find the most suitable image returned by an image search engine to enhance the T2I model; and a self-reflection mechanism is presented to continuously evaluate and refine the generated image to ensure faithful alignment with the text prompt. To evaluate the proposed framework's performance, we collect a dataset named Img-Ref-T2I, where text prompts include three types of uncertain knowledge: (1) known but rare, (2) unknown, and (3) ambiguous. Moreover, we carefully craft a complex prompt to guide GPT-4o in making preference evaluations, which has been shown to have an evaluation accuracy similar to that of human preference evaluation. Experimental results demonstrate the effectiveness of our framework, outperforming GPT-4o by about 30% in human evaluation.
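
The three modules named in the abstract (active retrieval, image selection, self-reflection) suggest a control flow along the following lines. This is a rough sketch under stated assumptions, not the authors' implementation: every callable (`needs_reference`, `search_images`, `select_best`, `generate`, `critique`) and the `max_rounds` limit are hypothetical placeholders supplied by the caller.

```python
def generate_with_references(prompt, needs_reference, search_images, select_best,
                             generate, critique, max_rounds=3):
    """Sketch of a retrieve-then-generate loop with self-reflection.

    All callables are hypothetical stand-ins:
      needs_reference(prompt) -> bool
      search_images(prompt) -> list of candidate images
      select_best(prompt, candidates) -> one candidate
      generate(prompt, reference) -> image
      critique(prompt, image) -> feedback string, or None if the image is accepted
    """
    reference = None
    if needs_reference(prompt):  # active retrieval: only search when required
        candidates = search_images(prompt)
        if candidates:
            reference = select_best(prompt, candidates)

    image = generate(prompt, reference)
    for _ in range(max_rounds):  # self-reflection: refine until accepted
        feedback = critique(prompt, image)
        if feedback is None:
            break
        image = generate(prompt + " " + feedback, reference)
    return image
```

Passing the modules in as functions keeps the sketch independent of any particular search engine or T2I model.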

Title: VARD: Efficient and Dense Fine-Tuning for Diffusion Models with Value-based RL

Authors: Fengyuan Dai, Zifeng Zhuang, Yufei Huang, Siteng Huang, Bangyan Liao, Donglin Wang, Fajie Yuan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15791
Pdf URL: https://arxiv.org/pdf/2505.15791
Copy Paste: [[2505.15791]] VARD: Efficient and Dense Fine-Tuning for Diffusion Models with Value-based RL(https://arxiv.org/abs/2505.15791)
Keywords: generation, generative
Abstract: Diffusion models have emerged as powerful generative tools across various domains, yet tailoring pre-trained models to exhibit specific desirable properties remains challenging. While reinforcement learning (RL) offers a promising solution, current methods struggle to simultaneously achieve stable, efficient fine-tuning and support non-differentiable rewards. Furthermore, their reliance on sparse rewards provides inadequate supervision during intermediate steps, often resulting in suboptimal generation quality. To address these limitations, dense and differentiable signals are required throughout the diffusion process. Hence, we propose VAlue-based Reinforced Diffusion (VARD): a novel approach that first learns a value function predicting the expectation of rewards from intermediate states, and subsequently uses this value function with KL regularization to provide dense supervision throughout the generation process. Our method maintains proximity to the pretrained model while enabling effective and stable training via backpropagation. Experimental results demonstrate that our approach facilitates better trajectory guidance, improves training efficiency and extends the applicability of RL to diffusion models optimized for complex, non-differentiable reward functions.

Title: Interspatial Attention for Efficient 4D Human Video Generation

Authors: Ruizhi Shao, Yinghao Xu, Yujun Shen, Ceyuan Yang, Yang Zheng, Changan Chen, Yebin Liu, Gordon Wetzstein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15800
Pdf URL: https://arxiv.org/pdf/2505.15800
Copy Paste: [[2505.15800]] Interspatial Attention for Efficient 4D Human Video Generation(https://arxiv.org/abs/2505.15800)
Keywords: generation
Abstract: Generating photorealistic videos of digital humans in a controllable manner is crucial for a plethora of applications. Existing approaches either build on methods that employ template-based 3D representations or emerging video generation models but suffer from poor quality or limited consistency and identity preservation when generating individual or multiple digital humans. In this paper, we introduce a new interspatial attention (ISA) mechanism as a scalable building block for modern diffusion transformer (DiT)-based video generation models. ISA is a new type of cross attention that uses relative positional encodings tailored for the generation of human videos. Leveraging a custom-developed video variational autoencoder, we train a latent ISA-based diffusion model on a large corpus of video data. Our model achieves state-of-the-art performance for 4D human video synthesis, demonstrating remarkable motion consistency and identity preservation while providing precise control of the camera and body poses. Our code and model are publicly released at this https URL.

Title: Neural Conditional Transport Maps

Authors: Carlos Rodriguez-Pardo, Leonardo Chiani, Emanuele Borgonovo, Massimo Tavoni
Subjects: cs.LG, cs.AI, math.PR, stat.AP, stat.ML
Abstract URL: https://arxiv.org/abs/2505.15808
Pdf URL: https://arxiv.org/pdf/2505.15808
Copy Paste: [[2505.15808]] Neural Conditional Transport Maps(https://arxiv.org/abs/2505.15808)
Keywords: generative
Abstract: We present a neural framework for learning conditional optimal transport (OT) maps between probability distributions. Our approach introduces a conditioning mechanism capable of processing both categorical and continuous conditioning variables simultaneously. At the core of our method lies a hypernetwork that generates transport layer parameters based on these inputs, creating adaptive mappings that outperform simpler conditioning methods. Comprehensive ablation studies demonstrate the superior performance of our method over baseline configurations. Furthermore, we showcase an application to global sensitivity analysis, offering high performance in computing OT-based sensitivity indices. This work advances the state-of-the-art in conditional optimal transport, enabling broader application of optimal transport principles to complex, high-dimensional domains such as generative modeling and black-box model explainability.

Title: MMaDA: Multimodal Large Diffusion Language Models

Authors: Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15809
Pdf URL: https://arxiv.org/pdf/2505.15809
Copy Paste: [[2505.15809]] MMaDA: Multimodal Large Diffusion Language Models(https://arxiv.org/abs/2505.15809)
Keywords: generation
Abstract: We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: this https URL

Title: InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition

Authors: Yijie Zheng, Weijie Wu, Qingyun Li, Xuehui Wang, Xu Zhou, Aiai Ren, Jun Shen, Long Zhao, Guoqing Li, Xue Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15818
Pdf URL: https://arxiv.org/pdf/2505.15818
Copy Paste: [[2505.15818]] InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition(https://arxiv.org/abs/2505.15818)
Keywords: generation
Abstract: Language-guided object recognition in remote sensing imagery is crucial for large-scale mapping and automated data annotation. However, existing open-vocabulary and visual grounding methods rely on explicit category cues, limiting their ability to handle complex or implicit queries that require advanced reasoning. To address this issue, we introduce a new suite of tasks, including Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS), covering open-vocabulary, open-ended, and open-subclass scenarios. We further present EarthInstruct, the first InstructCDS benchmark for earth observation. It is constructed from two diverse remote sensing datasets with varying spatial resolutions and annotation rules across 20 categories, requiring models to interpret dataset-specific instructions. Given the scarcity of semantically rich labeled data in remote sensing, we propose InstructSAM, a training-free framework for instruction-driven object recognition. InstructSAM leverages large vision-language models to interpret user instructions and estimate object counts, employs SAM2 for mask proposal, and formulates mask-label assignment as a binary integer programming problem. By integrating semantic similarity with counting constraints, InstructSAM efficiently assigns categories to predicted masks without relying on confidence thresholds. Experiments demonstrate that InstructSAM matches or surpasses specialized baselines across multiple tasks while maintaining near-constant inference time regardless of object count, reducing output tokens by 89% and overall runtime by over 32% compared to direct generation approaches. We believe the contributions of the proposed tasks, benchmark, and effective approach will advance future research in developing versatile object recognition systems.
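
To make the idea of mask-label assignment as a binary integer program concrete, here is a toy sketch. It assumes one plausible form of the constraints (each mask receives at most one category, and category j receives at most `counts[j]` masks) and solves it by exhaustive search, which is only feasible for a handful of masks. The abstract does not give the exact formulation or solver, so the function name, constraints, and example scores below are all hypothetical.

```python
from itertools import product


def assign_labels(similarity, counts):
    """Brute-force a tiny mask-label assignment problem.

    similarity[i][j]: hypothetical semantic similarity of mask i to category j.
    counts[j]: number of objects of category j estimated upstream.
    Returns (labels, score), where labels[i] is a category index or None.
    """
    n_cats = len(counts)
    choices = [None] + list(range(n_cats))  # None means the mask is left unassigned
    best_labels, best_score = None, float("-inf")
    for labels in product(choices, repeat=len(similarity)):
        used = [0] * n_cats
        for label in labels:
            if label is not None:
                used[label] += 1
        # Counting constraint: never assign more masks than the estimated count.
        if any(u > c for u, c in zip(used, counts)):
            continue
        score = sum(similarity[i][label]
                    for i, label in enumerate(labels) if label is not None)
        if score > best_score:
            best_labels, best_score = list(labels), score
    return best_labels, best_score


# Three candidate masks, two categories, one object expected per category.
labels, score = assign_labels([[0.9, 0.1], [0.8, 0.3], [0.2, 0.7]], counts=[1, 1])
```

Because the counts cap how many masks each category can take, the weaker duplicate proposal is left unassigned without any confidence threshold. A real system would replace the exhaustive loop with an integer programming solver.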