2025-02-11

Title: Survey on AI-Generated Media Detection: From Non-MLLM to MLLM

Authors: Yueying Zou, Peipei Li, Zekun Li, Huaibo Huang, Xing Cui, Xuannan Liu, Chenghanyu Zhang, Ran He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.05240
Pdf URL: https://arxiv.org/pdf/2502.05240
Copy Paste: [[2502.05240]] Survey on AI-Generated Media Detection: From Non-MLLM to MLLM(https://arxiv.org/abs/2502.05240)
Keywords: generative
Abstract: The proliferation of AI-generated media poses significant challenges to information authenticity and social trust, making reliable detection methods highly demanded. Methods for detecting AI-generated media have evolved rapidly, paralleling the advancement of Multimodal Large Language Models (MLLMs). Current detection approaches can be categorized into two main groups: Non-MLLM-based and MLLM-based methods. The former employs high-precision, domain-specific detectors powered by deep learning techniques, while the latter utilizes general-purpose detectors based on MLLMs that integrate authenticity verification, explainability, and localization capabilities. Despite significant progress in this field, there remains a gap in literature regarding a comprehensive survey that examines the transition from domain-specific to general-purpose detection methods. This paper addresses this gap by providing a systematic review of both approaches, analyzing them from single-modal and multi-modal perspectives. We present a detailed comparative analysis of these categories, examining their methodological similarities and differences. Through this analysis, we explore potential hybrid approaches and identify key challenges in forgery detection, providing direction for future research. Additionally, as MLLMs become increasingly prevalent in detection tasks, ethical and security considerations have emerged as critical global concerns. We examine the regulatory landscape surrounding Generative AI (GenAI) across various jurisdictions, offering valuable insights for researchers and practitioners in this field.
Summary: This survey reviews AI-generated media detection across Non-MLLM-based methods (high-precision, domain-specific detectors) and MLLM-based methods (general-purpose detectors that integrate authenticity verification, explainability, and localization), from single-modal and multi-modal perspectives. It compares the two categories, explores hybrid approaches and open challenges, and examines GenAI regulation across jurisdictions.

Title: Parameter Symmetry Breaking and Restoration Determines the Hierarchical Learning in AI Systems

Authors: Liu Ziyin, Yizhou Xu, Tomaso Poggio, Isaac Chuang
Subjects: cs.LG, cond-mat.dis-nn, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2502.05300
Pdf URL: https://arxiv.org/pdf/2502.05300
Copy Paste: [[2502.05300]] Parameter Symmetry Breaking and Restoration Determines the Hierarchical Learning in AI Systems(https://arxiv.org/abs/2502.05300)
Keywords: restoration
Abstract: The dynamics of learning in modern large AI systems is hierarchical, often characterized by abrupt, qualitative shifts akin to phase transitions observed in physical systems. While these phenomena hold promise for uncovering the mechanisms behind neural networks and language models, existing theories remain fragmented, addressing specific cases. In this paper, we posit that parameter symmetry breaking and restoration serve as a unifying mechanism underlying these behaviors. We synthesize prior observations and show how this mechanism explains three distinct hierarchies in neural networks: learning dynamics, model complexity, and representation formation. By connecting these hierarchies, we highlight symmetry -- a cornerstone of theoretical physics -- as a potential fundamental principle in modern AI.
Summary: The paper posits parameter symmetry breaking and restoration as a unifying mechanism behind the hierarchical, phase-transition-like learning dynamics of large AI systems, and shows how it explains three hierarchies in neural networks: learning dynamics, model complexity, and representation formation.

Title: fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving

Authors: Hanfei Yu, Xingqi Cui, Hong Zhang, Hao Wang, Hao Wang
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2502.05370
Pdf URL: https://arxiv.org/pdf/2502.05370
Copy Paste: [[2502.05370]] fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving(https://arxiv.org/abs/2502.05370)
Keywords: generation
Abstract: Large Language Models (LLMs) have gained immense success in revolutionizing various applications, including content generation, search and recommendation, and AI-assisted operation. To reduce high training costs, the Mixture-of-Experts (MoE) architecture has become a popular backbone for modern LLMs. However, despite these benefits, serving MoE-based LLMs suffers from severe memory inefficiency due to sparsely activated experts. Recent studies propose offloading inactive experts from GPU memory to CPU memory to improve the serving efficiency of MoE models. However, they incur either high inference latency or high model memory footprints due to coarse-grained designs. To tame the latency-memory trade-off in MoE serving, we present fMoE, a fine-grained expert offloading system for MoE serving that achieves low inference latency with memory efficiency. We design fMoE to extract fine-grained expert selection patterns from MoE models and semantic hints from input prompts to efficiently guide expert prefetching, caching, and offloading decisions. fMoE is prototyped on top of HuggingFace Transformers and deployed on a six-GPU testbed. Experiments with open-source MoE models and real-world workloads show that fMoE reduces inference latency by 47% and improves expert hit rate by 36% over state-of-the-art solutions.
Summary: fMoE is a fine-grained expert offloading system for MoE serving that uses expert selection patterns and semantic hints from input prompts to guide expert prefetching, caching, and offloading. Prototyped on HuggingFace Transformers and deployed on a six-GPU testbed, it reduces inference latency by 47% and improves expert hit rate by 36% over state-of-the-art solutions.
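To make the "expert hit rate" metric concrete, here is a minimal sketch of a GPU-resident expert cache. It uses plain least-recently-used (LRU) eviction, which is a stand-in chosen for illustration and not fMoE's policy (the abstract describes guidance from expert selection patterns and semantic hints). The class name `ExpertCache` and its methods are hypothetical.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy LRU cache of expert ids kept in fast (e.g. GPU) memory.

    Assumption: LRU eviction is used purely for illustration; it is not
    the policy described in the fMoE abstract.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._resident = OrderedDict()  # expert_id -> True, oldest first
        self.hits = 0
        self.misses = 0

    def access(self, expert_id):
        """Record one expert activation; return True on a cache hit."""
        if expert_id in self._resident:
            self._resident.move_to_end(expert_id)  # mark as most recent
            self.hits += 1
            return True
        self.misses += 1
        if len(self._resident) >= self.capacity:
            self._resident.popitem(last=False)  # offload least recent expert
        self._resident[expert_id] = True
        return False

    def resident(self):
        """Expert ids currently cached, least recently used first."""
        return list(self._resident)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

A smarter policy would replace the eviction and insertion decisions in `access`; the hit-rate bookkeeping would stay the same.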

Title: Coarse-to-Fine Structure-Aware Artistic Style Transfer

Authors: Kunxiao Liu, Guowu Yuan, Hao Wu, Wenhua Qian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.05387
Pdf URL: https://arxiv.org/pdf/2502.05387
Copy Paste: [[2502.05387]] Coarse-to-Fine Structure-Aware Artistic Style Transfer(https://arxiv.org/abs/2502.05387)
Keywords: generation
Abstract: Artistic style transfer aims to use a style image and a content image to synthesize a target image that retains the same artistic expression as the style image while preserving the basic content of the content image. Many recently proposed style transfer methods have a common problem; that is, they simply transfer the texture and color of the style image to the global structure of the content image. As a result, the content image has a local structure that is not similar to the local structure of the style image. In this paper, we present an effective method that can be used to transfer style patterns while fusing the local style structure into the local content structure. In our method, different levels of coarse stylized features are first reconstructed at low resolution using a Coarse Network, in which style color distribution is roughly transferred, and the content structure is combined with the style structure. Then, the reconstructed features and the content features are adopted to synthesize high-quality structure-aware stylized images with high resolution using a Fine Network with three structural selective fusion (SSF) modules. The effectiveness of our method is demonstrated through the generation of appealing high-quality stylization results and a comparison with some state-of-the-art style transfer methods.
Summary: The method transfers style patterns while fusing the local style structure into the local content structure. A Coarse Network first reconstructs coarse stylized features at low resolution, and a Fine Network with three structural selective fusion (SSF) modules then synthesizes high-resolution, structure-aware stylized images.

Title: Beyond and Free from Diffusion: Invertible Guided Consistency Training

Authors: Chia-Hong Hsu, Shiu-hong Kao, Randall Balestriero
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.05391
Pdf URL: https://arxiv.org/pdf/2502.05391
Copy Paste: [[2502.05391]] Beyond and Free from Diffusion: Invertible Guided Consistency Training(https://arxiv.org/abs/2502.05391)
Keywords: generation
Abstract: Guidance in image generation steers models towards higher-quality or more targeted outputs, typically achieved in Diffusion Models (DMs) via Classifier-free Guidance (CFG). However, recent Consistency Models (CMs), which offer fewer function evaluations, rely on distilling CFG knowledge from pretrained DMs to achieve guidance, making them costly and inflexible. In this work, we propose invertible Guided Consistency Training (iGCT), a novel training framework for guided CMs that is entirely data-driven. iGCT, as a pioneering work, contributes to fast and guided image generation and editing without requiring the training and distillation of DMs, greatly reducing the overall compute requirements. iGCT addresses the saturation artifacts seen in CFG under high guidance scales. Our extensive experiments on CIFAR-10 and ImageNet64 show that iGCT significantly improves FID and precision compared to CFG. At a guidance of 13, iGCT improves precision to 0.8, while DM's drops to 0.47. Our work takes the first step toward enabling guidance and inversion for CMs without relying on DMs.
Summary: iGCT is a fully data-driven training framework for guided Consistency Models that enables fast, guided image generation and editing without training or distilling Diffusion Models. On CIFAR-10 and ImageNet64 it improves FID and precision compared to CFG; at a guidance of 13, iGCT improves precision to 0.8 while the DM's drops to 0.47.
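For readers unfamiliar with Classifier-free Guidance, the baseline that iGCT is compared against, the sketch below shows the usual way the conditional and unconditional predictions are blended. It assumes the common convention in which a scale of 1 returns the conditional prediction (some papers write the scale as 1 + w instead); the function name `cfg_combine` is hypothetical, and this is not iGCT itself.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and conditional predictions, element by element.

    Assumed convention: scale 0 -> unconditional, scale 1 -> conditional,
    scale > 1 -> extrapolate past the conditional prediction.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

Because the blend extrapolates linearly, a large scale can push values far outside the range of either prediction, which is one intuition for the saturation artifacts the abstract mentions at high guidance scales.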

Title: Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation

Authors: Chenkai Xu, Xu Wang, Zhenyi Liao, Yishun Li, Tianqi Hou, Zhijie Deng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.05415
Pdf URL: https://arxiv.org/pdf/2502.05415
Copy Paste: [[2502.05415]] Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation(https://arxiv.org/abs/2502.05415)
Keywords: generation
Abstract: There has been increasing research interest in building unified multimodal understanding and generation models, among which Show-o stands as a notable representative, demonstrating great promise for both text-to-image and image-to-text generation. The inference of Show-o involves progressively denoising image tokens and autoregressively decoding text tokens, and hence, unfortunately, suffers from inefficiency issues from both sides. This paper introduces Show-o Turbo to bridge the gap. We first identify a unified denoising perspective for the generation of images and text in Show-o based on the parallel decoding of text tokens. We then propose to extend consistency distillation (CD), a qualified approach for shortening the denoising process of diffusion models, to the multimodal denoising trajectories of Show-o. We introduce a trajectory segmentation strategy and a curriculum learning procedure to improve the training convergence. Empirically, in text-to-image generation, Show-o Turbo displays a GenEval score of 0.625 at 4 sampling steps without using classifier-free guidance (CFG), outperforming that of the original Show-o with 8 steps and CFG; in image-to-text generation, Show-o Turbo exhibits a 1.5x speedup without significantly sacrificing performance. The code is available at this https URL.
Summary: Show-o Turbo accelerates Show-o by adopting a unified denoising view of image and text generation, based on parallel decoding of text tokens, and by extending consistency distillation (CD) to Show-o's multimodal denoising trajectories with a trajectory segmentation strategy and curriculum learning. It reaches a GenEval score of 0.625 at 4 sampling steps without CFG and a 1.5x speedup in image-to-text generation.

Title: Deep Generative Models with Hard Linear Equality Constraints

Authors: Ruoyan Li, Dipti Ranjan Sahu, Guy Van den Broeck, Zhe Zeng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.05416
Pdf URL: https://arxiv.org/pdf/2502.05416
Copy Paste: [[2502.05416]] Deep Generative Models with Hard Linear Equality Constraints(https://arxiv.org/abs/2502.05416)
Keywords: generation, generative
Abstract: While deep generative models (DGMs) have demonstrated remarkable success in capturing complex data distributions, they consistently fail to learn constraints that encode domain knowledge and thus require constraint integration. Existing solutions to this challenge have primarily relied on heuristic methods and often ignore the underlying data distribution, harming the generative performance. In this work, we propose a probabilistically sound approach for enforcing hard constraints in DGMs to generate constraint-compliant and realistic data. This is achieved by our proposed gradient estimators that allow the constrained distribution, the data distribution conditioned on constraints, to be differentiably learned. We carry out extensive experiments with various DGM model architectures over five image datasets and three scientific applications in which domain knowledge is governed by linear equality constraints. We validate that the standard DGMs almost surely generate data violating the constraints. Among all the constraint integration strategies, ours not only guarantees the satisfaction of constraints in generation but also achieves generative performance superior to the other methods across every benchmark.
Summary: The paper proposes a probabilistically sound approach, built on gradient estimators, for enforcing hard linear equality constraints in deep generative models so that generated data is both constraint-compliant and realistic. Experiments cover five image datasets and three scientific applications.
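As a concrete picture of what a hard linear equality constraint means, the sketch below projects a sample onto a single constraint a·x = b. This is a textbook post-hoc orthogonal projection, shown only for illustration; it is not the gradient-estimator method the abstract proposes. The function name `project_onto_hyperplane` is hypothetical, and only one constraint is handled to keep the example dependency-free.

```python
def project_onto_hyperplane(x, a, b):
    """Return the closest point (in Euclidean distance) to x with a . x == b."""
    if len(x) != len(a):
        raise ValueError("x and a must have the same length")
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0:
        raise ValueError("constraint vector a must be non-zero")
    # How far x currently is from satisfying the constraint.
    residual = sum(ai * xi for ai, xi in zip(a, x)) - b
    # Move x along a by exactly the amount that cancels the residual.
    return [xi - ai * residual / norm_sq for xi, ai in zip(x, a)]
```

For example, projecting `[1.0, 2.0, 3.0]` onto the constraint "entries sum to 3" subtracts 1 from every entry.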

Title: APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding

Authors: Xinyu Yang, Tianqi Chen, Beidi Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.05431
Pdf URL: https://arxiv.org/pdf/2502.05431
Copy Paste: [[2502.05431]] APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding(https://arxiv.org/abs/2502.05431)
Keywords: generation
Abstract: Context-augmented generation (CAG) techniques, including RAG and ICL, require the efficient combination of multiple contexts to generate responses to user queries. Directly inputting these contexts as a sequence introduces a considerable computational burden by re-encoding the combined selection of contexts for every request. To address this, we explore the promising potential of parallel encoding to independently pre-compute and cache each context's KV states. This approach enables the direct loading of cached states during inference while accommodating more contexts through position reuse across contexts. However, due to misalignments in attention distribution, directly applying parallel encoding results in a significant performance drop. To enable effective and efficient CAG, we propose Adaptive Parallel Encoding (APE), which brings shared prefix, attention temperature, and scaling factor to align the distribution of parallel encoding with sequential encoding. Results on RAG and ICL tasks demonstrate that APE can preserve 98% and 93% of sequential encoding performance using the same inputs while outperforming parallel encoding by 3.6% and 7.9%, respectively. It also scales to many-shot CAG, effectively encoding hundreds of contexts in parallel. Efficiency evaluation shows that APE can achieve an end-to-end 4.5x speedup by reducing prefilling time by 28x for a 128K-length context.
Summary: APE (Adaptive Parallel Encoding) pre-computes and caches each context's KV states independently, then uses a shared prefix, attention temperature, and scaling factor to align parallel encoding with sequential encoding. On RAG and ICL tasks it preserves 98% and 93% of sequential encoding performance, and it achieves an end-to-end 4.5x speedup for a 128K-length context.

Title: AdaFlow: Efficient Long Video Editing via Adaptive Attention Slimming And Keyframe Selection

Authors: Shuheng Zhang, Yuqi Liu, Hongbo Zhou, Jun Peng, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.05433
Pdf URL: https://arxiv.org/pdf/2502.05433
Copy Paste: [[2502.05433]] AdaFlow: Efficient Long Video Editing via Adaptive Attention Slimming And Keyframe Selection(https://arxiv.org/abs/2502.05433)
Keywords: generation
Abstract: Despite great progress, text-driven long video editing is still notoriously challenging mainly due to excessive memory overhead. Although recent efforts have simplified this task into a two-step process of keyframe translation and interpolation generation, the token-wise keyframe translation still plagues the upper limit of video length. In this paper, we propose a novel and training-free approach towards efficient and effective long video editing, termed AdaFlow. We first reveal that not all tokens of video frames hold equal importance for keyframe translation, based on which we propose an Adaptive Attention Slimming scheme for AdaFlow to squeeze the KV sequence, thus increasing the number of keyframes for translations by an order of magnitude. In addition, an Adaptive Keyframe Selection scheme is also equipped to select the representative frames for joint editing, further improving generation quality. With these innovative designs, AdaFlow achieves high-quality long video editing of minutes in one inference, i.e., more than 1k frames on one A800 GPU, which is about ten times longer than the compared methods, e.g., TokenFlow. To validate AdaFlow, we also build a new benchmark for long video editing with high-quality annotations, termed LongV-EVAL. Our code is released at: this https URL.
Summary: AdaFlow is a training-free approach to long video editing that combines Adaptive Attention Slimming, which squeezes the KV sequence to allow an order of magnitude more keyframes, with Adaptive Keyframe Selection. It edits more than 1k frames in one inference on one A800 GPU, and the authors also build the LongV-EVAL benchmark.

Title: Stochastic Forward-Backward Deconvolution: Training Diffusion Models with Finite Noisy Datasets

Authors: Haoye Lu, Qifan Wu, Yaoliang Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.05446
Pdf URL: https://arxiv.org/pdf/2502.05446
Copy Paste: [[2502.05446]] Stochastic Forward-Backward Deconvolution: Training Diffusion Models with Finite Noisy Datasets(https://arxiv.org/abs/2502.05446)
Keywords: generative
Abstract: Recent diffusion-based generative models achieve remarkable results by training on massive datasets, yet this practice raises concerns about memorization and copyright infringement. A proposed remedy is to train exclusively on noisy data with potential copyright issues, ensuring the model never observes original content. However, through the lens of deconvolution theory, we show that although it is theoretically feasible to learn the data distribution from noisy samples, the practical challenge of collecting sufficient samples makes successful learning nearly unattainable. To overcome this limitation, we propose to pretrain the model with a small fraction of clean data to guide the deconvolution process. Combined with our Stochastic Forward-Backward Deconvolution (SFBD) method, we attain an FID of 6.31 on CIFAR-10 with just 4% clean images (and 3.58 with 10%). Theoretically, we prove that SFBD guides the model to learn the true data distribution. The result also highlights the importance of pretraining on limited but clean data or the alternative from similar datasets. Empirical studies further support these findings and offer additional insights.
Summary: Using deconvolution theory, the paper shows that learning a data distribution from noisy samples alone is theoretically feasible but nearly unattainable in practice, and proposes pretraining on a small fraction of clean data combined with Stochastic Forward-Backward Deconvolution (SFBD). SFBD attains an FID of 6.31 on CIFAR-10 with 4% clean images (3.58 with 10%).

Title: Block Graph Neural Networks for tumor heterogeneity prediction

Authors: Marianne Abémgnigni Njifon, Tobias Weber, Viktor Bezborodov, Tyll Krueger, Dominic Schuhmacher
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2502.05458
Pdf URL: https://arxiv.org/pdf/2502.05458
Copy Paste: [[2502.05458]] Block Graph Neural Networks for tumor heterogeneity prediction(https://arxiv.org/abs/2502.05458)
Keywords: generation
Abstract: Accurate tumor classification is essential for selecting effective treatments, but current methods have limitations. Standard tumor grading, which categorizes tumors based on cell differentiation, is not recommended as a stand-alone procedure, as some well-differentiated tumors can be malignant. Tumor heterogeneity assessment via single-cell sequencing offers profound insights but can be costly and may still require significant manual intervention. Many existing statistical machine learning methods for tumor data still require complex pre-processing of MRI and histopathological data. In this paper, we propose to build on a mathematical model that simulates tumor evolution (Ożański (2017)) and generate artificial datasets for tumor classification. Tumor heterogeneity is estimated using normalized entropy, with a threshold to classify tumors as having high or low heterogeneity. Our contributions are threefold: (1) the cut and graph generation processes from the artificial data, (2) the design of tumor features, and (3) the construction of Block Graph Neural Networks (BGNN), a Graph Neural Network-based approach to predict tumor heterogeneity. The experimental results reveal that the combination of the proposed features and models yields excellent results on artificially generated data (89.67% accuracy on the test data). In particular, in alignment with the emerging trends in AI-assisted grading and spatial transcriptomics, our results suggest that enriching traditional grading methods with birth (e.g., Ki-67 proliferation index) and death markers can improve heterogeneity prediction and enhance tumor classification.
Summary: The paper generates artificial tumor datasets from a mathematical model of tumor evolution, estimates heterogeneity using normalized entropy with a threshold separating high from low heterogeneity, and introduces Block Graph Neural Networks (BGNN) to predict tumor heterogeneity, reaching 89.67% accuracy on the test data.
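The heterogeneity label described above can be illustrated with a short sketch: Shannon entropy of the cell-type proportions, divided by its maximum so the score lies in [0, 1], then compared with a threshold. The abstract does not state the normalization or the threshold value, so the choice of dividing by the log of the number of possible types, the default threshold of 0.5, and the function names are all assumptions.

```python
import math
from collections import Counter


def normalized_entropy(cell_types, num_types):
    """Shannon entropy of the type proportions, scaled to [0, 1].

    Assumption: normalization divides by log(num_types), the entropy of a
    uniform mix over all possible types.
    """
    if not cell_types:
        raise ValueError("need at least one cell")
    if num_types < 2:
        return 0.0  # a single possible type cannot be heterogeneous
    total = len(cell_types)
    counts = Counter(cell_types)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_types)


def heterogeneity_label(cell_types, num_types, threshold=0.5):
    """Classify as 'high' or 'low'; the threshold value is hypothetical."""
    score = normalized_entropy(cell_types, num_types)
    return "high" if score >= threshold else "low"
```

A tumor made of a single type scores 0, and an even mix of all possible types scores 1.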

Title: Gen-DFL: Decision-Focused Generative Learning for Robust Decision Making

Authors: Prince Zizhuang Wang, Jinhao Liang, Shuyi Chen, Ferdinando Fioretto, Shixiang Zhu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.05468
Pdf URL: https://arxiv.org/pdf/2502.05468
Copy Paste: [[2502.05468]] Gen-DFL: Decision-Focused Generative Learning for Robust Decision Making(https://arxiv.org/abs/2502.05468)
Keywords: generative
Abstract: Decision-focused learning (DFL) integrates predictive models with downstream optimization, directly training machine learning models to minimize decision errors. While DFL has been shown to provide substantial advantages when compared to a counterpart that treats the predictive and prescriptive models separately, it has also been shown to struggle in high-dimensional and risk-sensitive settings, limiting its applicability in real-world settings. To address this limitation, this paper introduces decision-focused generative learning (Gen-DFL), a novel framework that leverages generative models to adaptively model uncertainty and improve decision quality. Instead of relying on fixed uncertainty sets, Gen-DFL learns a structured representation of the optimization parameters and samples from the tail regions of the learned distribution to enhance robustness against worst-case scenarios. This approach mitigates over-conservatism while capturing complex dependencies in the parameter space. The paper shows, theoretically, that Gen-DFL achieves improved worst-case performance bounds compared to traditional DFL. Empirically, it evaluates Gen-DFL on various scheduling and logistics problems, demonstrating its strong performance against existing DFL methods.
Summary: Gen-DFL is a decision-focused learning framework that uses generative models to model uncertainty adaptively, sampling from the tail regions of the learned distribution to improve robustness against worst-case scenarios. It achieves improved worst-case performance bounds compared to traditional DFL in theory and performs strongly on scheduling and logistics problems.

Title: A Physical Coherence Benchmark for Evaluating Video Generation Models via Optical Flow-guided Frame Prediction

Authors: Yongfan Chen, Xiuwen Zhu, Tianyu Li, Hao Chen, Chunhua Shen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.05503
Pdf URL: https://arxiv.org/pdf/2502.05503
Copy Paste: [[2502.05503]] A Physical Coherence Benchmark for Evaluating Video Generation Models via Optical Flow-guided Frame Prediction(https://arxiv.org/abs/2502.05503)
Keywords: generation
Abstract: Recent advances in video generation models demonstrate their potential as world simulators, but they often struggle with videos deviating from physical laws, a key concern overlooked by most text-to-video benchmarks. We introduce a benchmark designed specifically to assess the Physical Coherence of generated videos, PhyCoBench. Our benchmark includes 120 prompts covering 7 categories of physical principles, capturing key physical laws observable in video content. We evaluated four state-of-the-art (SoTA) T2V models on PhyCoBench and conducted manual assessments. Additionally, we propose an automated evaluation model: PhyCoPredictor, a diffusion model that generates optical flow and video frames in a cascade manner. Through a consistency evaluation comparing automated and manual sorting, the experimental results show that PhyCoPredictor currently aligns most closely with human evaluation. Therefore, it can effectively evaluate the physical coherence of videos, providing insights for future model optimization. Our benchmark, which includes physical coherence prompts, automatic evaluation tool PhyCoPredictor, and generated video dataset, will all be released on GitHub shortly.
Summary: PhyCoBench is a benchmark of 120 prompts covering 7 categories of physical principles for assessing the physical coherence of generated videos. The accompanying PhyCoPredictor, a diffusion model that generates optical flow and video frames in a cascade manner, is reported to align most closely with human evaluation.

Title: Do Spikes Protect Privacy? Investigating Black-Box Model Inversion Attacks in Spiking Neural Networks

Authors: Hamed Poursiami, Ayana Moshruba, Maryam Parsa
Subjects: cs.LG, cs.CR, cs.NE
Abstract URL: https://arxiv.org/abs/2502.05509
Pdf URL: https://arxiv.org/pdf/2502.05509
Copy Paste: [[2502.05509]] Do Spikes Protect Privacy? Investigating Black-Box Model Inversion Attacks in Spiking Neural Networks(https://arxiv.org/abs/2502.05509)
Keywords: generative
Abstract: As machine learning models become integral to security-sensitive applications, concerns over data leakage from adversarial attacks continue to rise. Model Inversion (MI) attacks pose a significant privacy threat by enabling adversaries to reconstruct training data from model outputs. While MI attacks on Artificial Neural Networks (ANNs) have been widely studied, Spiking Neural Networks (SNNs) remain largely unexplored in this context. Due to their event-driven and discrete computations, SNNs introduce fundamental differences in information processing that may offer inherent resistance to such attacks. A critical yet underexplored aspect of this threat lies in black-box settings, where attackers operate through queries without direct access to model parameters or gradients, representing a more realistic adversarial scenario in deployed systems. This work presents the first study of black-box MI attacks on SNNs. We adapt a generative adversarial MI framework to the spiking domain by incorporating rate-based encoding for input transformation and decoding mechanisms for output interpretation. Our results show that SNNs exhibit significantly greater resistance to MI attacks than ANNs, as demonstrated by degraded reconstructions, increased instability in attack convergence, and overall reduced attack effectiveness across multiple evaluation metrics. Further analysis suggests that the discrete and temporally distributed nature of SNN decision boundaries disrupts surrogate modeling, limiting the attacker's ability to approximate the target model.
Summary: This first study of black-box Model Inversion (MI) attacks on Spiking Neural Networks adapts a generative adversarial MI framework to the spiking domain using rate-based encoding for inputs and decoding mechanisms for outputs. SNNs show significantly greater resistance to MI attacks than ANNs.
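Rate-based encoding, mentioned above as the input transformation, can be sketched as follows: each input intensity in [0, 1] becomes a spike train in which the value fires at every time step with probability equal to its intensity. This Bernoulli scheme is one common form of rate coding and is assumed here, since the abstract does not give the exact variant; spike-count averaging as the decoder is likewise an assumption, and the function names are hypothetical.

```python
import random


def rate_encode(intensities, num_steps, seed=0):
    """Return num_steps binary frames; each value spikes with prob. = intensity."""
    if any(p < 0.0 or p > 1.0 for p in intensities):
        raise ValueError("intensities must lie in [0, 1]")
    rng = random.Random(seed)  # seeded so the example is reproducible
    return [
        [1 if rng.random() < p else 0 for p in intensities]
        for _ in range(num_steps)
    ]


def rate_decode(frames):
    """Estimate each intensity as the fraction of time steps with a spike."""
    if not frames:
        raise ValueError("need at least one frame")
    num_steps = len(frames)
    width = len(frames[0])
    return [sum(frame[i] for frame in frames) / num_steps for i in range(width)]
```

The decoded estimate only approaches the original intensity as the number of time steps grows, so the spike domain exposes a noisy, discretized view of the input.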

Title: Fg-T2M++: LLMs-Augmented Fine-Grained Text Driven Human Motion Generation

Authors: Yin Wang, Mu Li, Jiapeng Liu, Zhiying Leng, Frederick W. B. Li, Ziyao Zhang, Xiaohui Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.05534
Pdf URL: https://arxiv.org/pdf/2502.05534
Copy Paste: [[2502.05534]] Fg-T2M++: LLMs-Augmented Fine-Grained Text Driven Human Motion Generation(https://arxiv.org/abs/2502.05534)
Keywords: generation
Abstract: We address the challenging problem of fine-grained text-driven human motion generation. Existing works generate imprecise motions that fail to accurately capture relationships specified in text due to: (1) lack of effective text parsing for detailed semantic cues regarding body parts, (2) not fully modeling linguistic structures between words to comprehend text comprehensively. To tackle these limitations, we propose a novel fine-grained framework Fg-T2M++ that consists of: (1) an LLMs semantic parsing module to extract body part descriptions and semantics from text, (2) a hyperbolic text representation module to encode relational information between text units by embedding the syntactic dependency graph into hyperbolic space, and (3) a multi-modal fusion module to hierarchically fuse text and motion features. Extensive experiments on HumanML3D and KIT-ML datasets demonstrate that Fg-T2M++ outperforms SOTA methods, validating its ability to accurately generate motions adhering to comprehensive text semantics.
Summary: Fg-T2M++ is a fine-grained text-driven human motion generation framework with an LLM-based semantic parsing module, a hyperbolic text representation module that embeds the syntactic dependency graph into hyperbolic space, and a multi-modal fusion module. It outperforms SOTA methods on the HumanML3D and KIT-ML datasets.
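To give a feel for why hyperbolic space suits tree-like structures such as dependency graphs, the sketch below computes the geodesic distance in the Poincaré ball model. The abstract does not say which model of hyperbolic space Fg-T2M++ uses, so the Poincaré ball is an assumption, and `poincare_distance` is a hypothetical helper name.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    if len(u) != len(v):
        raise ValueError("points must have the same dimension")
    sq_u = sum(c * c for c in u)
    sq_v = sum(c * c for c in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # Distances blow up as either point approaches the boundary of the ball.
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))
```

Points near the boundary are far from everything else, which leaves room for the exponentially many leaves of a tree while parents sit closer to the origin.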

Title: SSH: Sparse Spectrum Adaptation via Discrete Hartley Transformation

Authors: Yixian Shen, Qi Bi, Jia-Hong Huang, Hongyi Zhu, Andy D. Pimentel, Anuj Pathania
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2502.05539
Pdf URL: https://arxiv.org/pdf/2502.05539
Copy Paste: [[2502.05539]] SSH: Sparse Spectrum Adaptation via Discrete Hartley Transformation(https://arxiv.org/abs/2502.05539)
Keywords: generation
Abstract: Low-rank adaptation (LoRA) has proven effective in reducing the number of trainable parameters when fine-tuning a large foundation model (LLM). However, it still encounters computational and memory challenges when scaling to larger models or addressing more complex task adaptation. In this work, we introduce Sparse Spectrum Adaptation via Discrete Hartley Transformation (SSH), a novel approach that significantly reduces the number of trainable parameters while enhancing model performance. It selects the most informative spectral components across all layers, guided by the initial weights after a discrete Hartley transformation (DHT). A lightweight inverse DHT then projects the spectrum back into the spatial domain for updates. Extensive experiments on both single-modality tasks, such as language understanding and generation, and multi-modality tasks, such as video-text understanding, demonstrate that SSH outperforms existing parameter-efficient fine-tuning (PEFT) methods while achieving substantial reductions in computational cost and memory requirements.
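
To make the spectral idea in the SSH abstract concrete, here is a minimal NumPy sketch of a discrete Hartley transform, a top-k selection of spectral positions guided by the initial weights, and the inverse projection back to the weight domain. It is an illustration under stated assumptions, not the authors' implementation: the separable row-column 2D transform, the magnitude-based top-k rule, and the names `dht2`, `idht2`, and `top_k_mask` are all hypothetical choices made for this example.

```python
import numpy as np

def dht(x, axis=-1):
    """1D discrete Hartley transform: H[k] = sum_n x[n] * cas(2*pi*n*k/N)."""
    f = np.fft.fft(x, axis=axis)
    # cas(t) = cos(t) + sin(t), so for real input H = Re(F) - Im(F).
    return f.real - f.imag

def dht2(w):
    """Separable 2D DHT (columns, then rows); one common convention."""
    return dht(dht(w, axis=0), axis=1)

def idht2(h):
    """Inverse of dht2: the DHT is its own inverse up to a 1/(M*N) factor."""
    return dht2(h) / h.size

def top_k_mask(spectrum, k):
    """Boolean mask marking the k largest-magnitude spectral entries."""
    flat = np.abs(spectrum).ravel()
    idx = np.argpartition(flat, -k)[-k:]
    mask = np.zeros(flat.shape, dtype=bool)
    mask[idx] = True
    return mask.reshape(spectrum.shape)

rng = np.random.default_rng(0)
w0 = rng.standard_normal((8, 8))             # stand-in for a pretrained weight matrix
mask = top_k_mask(dht2(w0), k=6)             # positions chosen from the initial weights
coeffs = np.zeros_like(w0)
coeffs[mask] = 0.1 * rng.standard_normal(6)  # the only trainable values in this sketch
delta_w = idht2(coeffs)                      # project the sparse spectrum back
w_adapted = w0 + delta_w                     # adapted weights
```

In this sketch only six spectral coefficients would be trained, compared with 64 entries for a dense update of the same matrix.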

Title: 4DR P2T: 4D Radar Tensor Synthesis with Point Clouds

Authors: Woo-Jin Jung, Dong-Hee Paek, Seung-Hyun Kong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.05550
Pdf URL: https://arxiv.org/pdf/2502.05550
Copy Paste: [[2502.05550]] 4DR P2T: 4D Radar Tensor Synthesis with Point Clouds(https://arxiv.org/abs/2502.05550)
Keywords: generation, generative
Abstract: In four-dimensional (4D) Radar-based point cloud generation, clutter removal is commonly performed using the constant false alarm rate (CFAR) algorithm. However, CFAR may not fully capture the spatial characteristics of objects. To address this limitation, this paper proposes the 4D Radar Point-to-Tensor (4DR P2T) model, which generates tensor data suitable for deep learning applications while minimizing measurement loss. Our method employs a conditional generative adversarial network (cGAN), modified to effectively process 4D Radar point cloud data and generate tensor data. Experimental results on the K-Radar dataset validate the effectiveness of the 4DR P2T model, achieving an average PSNR of 30.39 dB and SSIM of 0.96. Additionally, our analysis of different point cloud generation methods highlights that the 5% percentile method provides the best overall performance, while the 1% percentile method optimally balances data volume reduction and performance, making it well-suited for deep learning applications.
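
The abstract does not define its percentile method, so the following sketch assumes one plausible reading: keep the cells of a 4D radar power tensor whose power falls in the top p percent. It also shows the standard PSNR formula used as an evaluation metric. The tensor shape, axis order, and function names are hypothetical.

```python
import numpy as np

def percentile_point_cloud(power, top_percent):
    """Keep cells whose power lies in the top `top_percent` percent (assumed rule)."""
    threshold = np.percentile(power, 100.0 - top_percent)
    keep = power >= threshold
    # One row per kept cell: its tensor indices followed by its power value.
    return np.column_stack([np.argwhere(keep), power[keep]])

def psnr(reference, estimate, data_range):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

rng = np.random.default_rng(0)
tensor = rng.random((4, 8, 8, 4))            # hypothetical Doppler/range/azimuth/elevation axes
points_5 = percentile_point_cloud(tensor, 5.0)
points_1 = percentile_point_cloud(tensor, 1.0)
print(len(points_5), len(points_1))          # the 1% setting keeps far fewer points
```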

Title: FreeBlend: Advancing Concept Blending with Staged Feedback-Driven Interpolation Diffusion

Authors: Yufan Zhou, Haoyu Shen, Huan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.05606
Pdf URL: https://arxiv.org/pdf/2502.05606
Copy Paste: [[2502.05606]] FreeBlend: Advancing Concept Blending with Staged Feedback-Driven Interpolation Diffusion(https://arxiv.org/abs/2502.05606)
Keywords: generative
Abstract: Concept blending is a promising yet underexplored area in generative models. While recent approaches, such as embedding mixing and latent modification based on structural sketches, have been proposed, they often suffer from incompatible semantic information and discrepancies in shape and appearance. In this work, we introduce FreeBlend, an effective, training-free framework designed to address these challenges. To mitigate cross-modal loss and enhance feature detail, we leverage transferred image embeddings as conditional inputs. The framework employs a stepwise increasing interpolation strategy between latents, progressively adjusting the blending ratio to seamlessly integrate auxiliary features. Additionally, we introduce a feedback-driven mechanism that updates the auxiliary latents in reverse order, facilitating global blending and preventing rigid or unnatural outputs. Extensive experiments demonstrate that our method significantly improves both the semantic coherence and visual quality of blended images, yielding compelling and coherent results.

Title: Training-Free Constrained Generation With Stable Diffusion Models

Authors: Stefano Zampini, Jacob Christopher, Luca Oneto, Davide Anguita, Ferdinando Fioretto
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.05625
Pdf URL: https://arxiv.org/pdf/2502.05625
Copy Paste: [[2502.05625]] Training-Free Constrained Generation With Stable Diffusion Models(https://arxiv.org/abs/2502.05625)
Keywords: generation
Abstract: Stable diffusion models represent the state-of-the-art in data synthesis across diverse domains and hold transformative potential for applications in science and engineering, e.g., by facilitating the discovery of novel solutions and simulating systems that are computationally intractable to model explicitly. However, their current utility in these fields is severely limited by an inability to enforce strict adherence to physical laws and domain-specific constraints. Without this grounding, the deployment of such models in critical applications, ranging from material science to safety-critical systems, remains impractical. This paper addresses this fundamental limitation by proposing a novel approach to integrate stable diffusion models with constrained optimization frameworks, enabling them to generate outputs that satisfy stringent physical and functional requirements. We demonstrate the effectiveness of this approach through material science experiments requiring adherence to precise morphometric properties, inverse design problems involving the generation of stress-strain responses using video generation with a simulator in the loop, and safety settings where outputs must avoid copyright infringement.

Title: TrackDiffuser: Nearly Model-Free Bayesian Filtering with Diffusion Model

Authors: Yangguang He, Wenhao Li, Minzhe Li, Juan Zhang, Xiangfeng Wang, Bo Jin
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2502.05629
Pdf URL: https://arxiv.org/pdf/2502.05629
Copy Paste: [[2502.05629]] TrackDiffuser: Nearly Model-Free Bayesian Filtering with Diffusion Model(https://arxiv.org/abs/2502.05629)
Keywords: generative
Abstract: State estimation remains a fundamental challenge across numerous domains, from autonomous driving and aircraft tracking to quantum system control. Although Bayesian filtering has been the cornerstone solution, its classical model-based paradigm faces two major limitations: it struggles with inaccurate state space models (SSMs) and requires extensive prior knowledge of noise characteristics. We present TrackDiffuser, a generative framework addressing both challenges by reformulating Bayesian filtering as a conditional diffusion model. Our approach implicitly learns system dynamics from data to mitigate the effects of inaccurate SSMs, while simultaneously circumventing the need for explicit measurement models and noise priors by establishing a direct relationship between measurements and states. Through an implicit predict-and-update mechanism, TrackDiffuser preserves the interpretability advantage of traditional model-based filtering methods. Extensive experiments demonstrate that our framework substantially outperforms both classical and contemporary hybrid methods, especially in challenging non-linear scenarios involving non-Gaussian noise. Notably, TrackDiffuser exhibits remarkable robustness to SSM inaccuracies, offering a practical solution for real-world state estimation problems where perfect models and prior knowledge are unavailable.
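
For readers unfamiliar with the classical model-based paradigm the abstract contrasts itself with, the sketch below shows the explicit predict-and-update cycle of a linear Kalman filter, one textbook instance of Bayesian filtering. It is not TrackDiffuser. The constant-velocity model, the noise covariances, and the helper `run_filter` are assumptions chosen for illustration, and they are exactly the kind of prior knowledge (SSM plus noise statistics) that the abstract says is hard to obtain.

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict-and-update cycle of a linear Kalman filter."""
    # Predict: propagate the state estimate and its covariance through the SSM.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: correct the prediction with the new measurement z.
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

def run_filter(measurements, dt=1.0, q=1e-4, r=0.25):
    """Track [position, velocity] from position-only measurements."""
    F = np.array([[1.0, dt], [0.0, 1.0]])    # constant-velocity dynamics (assumed)
    H = np.array([[1.0, 0.0]])               # only position is observed
    Q = q * np.eye(2)                        # assumed process-noise covariance
    R = np.array([[r]])                      # assumed measurement-noise covariance
    x, P = np.zeros(2), 10.0 * np.eye(2)     # vague initial guess
    for z in measurements:
        x, P = kalman_step(x, P, np.array([z]), F, H, Q, R)
    return x, P

rng = np.random.default_rng(0)
true_pos = np.arange(1, 51, dtype=float)     # target moving at 1 unit per step
noisy = true_pos + rng.normal(0.0, 0.5, size=true_pos.shape)
x_est, P_est = run_filter(noisy)
print("estimated [position, velocity]:", x_est)
```

If `F`, `Q`, or `R` is misspecified, the estimate degrades, which is the failure mode the abstract attributes to model-based filtering.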

Title: Mol-MoE: Training Preference-Guided Routers for Molecule Generation

Authors: Diego Calanzone, Pierluca D'Oro, Pierre-Luc Bacon
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.05633
Pdf URL: https://arxiv.org/pdf/2502.05633
Copy Paste: [[2502.05633]] Mol-MoE: Training Preference-Guided Routers for Molecule Generation(https://arxiv.org/abs/2502.05633)
Keywords: generation
Abstract: Recent advances in language models have enabled framing molecule generation as sequence modeling. However, existing approaches often rely on single-objective reinforcement learning, limiting their applicability to real-world drug design, where multiple competing properties must be optimized. Traditional multi-objective reinforcement learning (MORL) methods require costly retraining for each new objective combination, making rapid exploration of trade-offs impractical. To overcome these limitations, we introduce Mol-MoE, a mixture-of-experts (MoE) architecture that enables efficient test-time steering of molecule generation without retraining. Central to our approach is a preference-based router training objective that incentivizes the router to combine experts in a way that aligns with user-specified trade-offs. This provides improved flexibility in exploring the chemical property space at test time, facilitating rapid trade-off exploration. Benchmarking against state-of-the-art methods, we show that Mol-MoE achieves superior sample quality and steerability.

Title: The Evolution of Dataset Distillation: Toward Scalable and Generalizable Solutions

Authors: Ping Liu, Jiawei Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.05673
Pdf URL: https://arxiv.org/pdf/2502.05673
Copy Paste: [[2502.05673]] The Evolution of Dataset Distillation: Toward Scalable and Generalizable Solutions(https://arxiv.org/abs/2502.05673)
Keywords: generative
Abstract: Dataset distillation, which condenses large-scale datasets into compact synthetic representations, has emerged as a critical solution for training modern deep learning models efficiently. While prior surveys focus on developments before 2023, this work comprehensively reviews recent advances, emphasizing scalability to large-scale datasets such as ImageNet-1K and ImageNet-21K. We categorize progress into a few key methodologies: trajectory matching, gradient matching, distribution matching, scalable generative approaches, and decoupling optimization mechanisms. As a comprehensive examination of recent dataset distillation advances, this survey highlights breakthrough innovations: the SRe2L framework for efficient and effective condensation, soft label strategies that significantly enhance model accuracy, and lossless distillation techniques that maximize compression while maintaining performance. Beyond these methodological advancements, we address critical challenges, including robustness against adversarial and backdoor attacks and effective handling of non-IID data distributions. Additionally, we explore emerging applications in video and audio processing, multi-modal learning, medical imaging, and scientific computing, highlighting the domain versatility of dataset distillation. By offering extensive performance comparisons and actionable research directions, this survey equips researchers and practitioners with practical insights to advance efficient and generalizable dataset distillation, paving the way for future innovations.

Title: SSDD-GAN: Single-Step Denoising Diffusion GAN for Cochlear Implant Surgical Scene Completion

Authors: Yike Zhang, Eduardo Davalos, Jack Noble
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.05710
Pdf URL: https://arxiv.org/pdf/2502.05710
Copy Paste: [[2502.05710]] SSDD-GAN: Single-Step Denoising Diffusion GAN for Cochlear Implant Surgical Scene Completion(https://arxiv.org/abs/2502.05710)
Keywords: restoration, generation, generative
Abstract: Recent deep learning-based image completion methods, including both inpainting and outpainting, have demonstrated promising results in restoring corrupted images by effectively filling various missing regions. Among these, Generative Adversarial Networks (GANs) and Denoising Diffusion Probabilistic Models (DDPMs) have been employed as key generative image completion approaches, excelling at generating high-quality restorations with reduced artifacts and improved fine details. In previous work, we developed a method aimed at synthesizing views from novel microscope positions for mastoidectomy surgeries; however, that approach could not restore the surrounding surgical scene environment. In this paper, we propose an efficient method to complete the surgical scene of the synthetic postmastoidectomy dataset. Our approach leverages self-supervised learning on real surgical datasets to train a Single-Step Denoising Diffusion-GAN (SSDD-GAN), combining the advantages of diffusion models with the adversarial optimization of GANs and improving Structural Similarity results by 6%. The trained model is then directly applied to the synthetic postmastoidectomy dataset using a zero-shot approach, enabling the generation of realistic and complete surgical scenes without the need for explicit ground-truth labels from the synthetic postmastoidectomy dataset. This method addresses key limitations in previous work, offering a novel pathway for full surgical microscopy scene completion and enhancing the usability of the synthetic postmastoidectomy dataset in surgical preoperative planning and intraoperative navigation.

Title: Understanding Representation Dynamics of Diffusion Models via Low-Dimensional Modeling

Authors: Xiao Li, Zekai Zhang, Xiang Li, Siyi Chen, Zhihui Zhu, Peng Wang, Qing Qu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2502.05743
Pdf URL: https://arxiv.org/pdf/2502.05743
Copy Paste: [[2502.05743]] Understanding Representation Dynamics of Diffusion Models via Low-Dimensional Modeling(https://arxiv.org/abs/2502.05743)
Keywords: generation, generative
Abstract: This work addresses the critical question of why and when diffusion models, despite being designed for generative tasks, can excel at learning high-quality representations in a self-supervised manner. To address this, we develop a mathematical framework based on a low-dimensional data model and posterior estimation, revealing a fundamental trade-off between generation and representation quality near the final stage of image generation. Our analysis explains the unimodal representation dynamics across noise scales, mainly driven by the interplay between data denoising and class specification. Building on these insights, we propose an ensemble method that aggregates features across noise levels, significantly improving both clean performance and robustness under label noise. Extensive experiments on both synthetic and real-world datasets validate our findings.

Title: UniDB: A Unified Diffusion Bridge Framework via Stochastic Optimal Control

Authors: Kaizhen Zhu, Mokai Pan, Yuexin Ma, Yanwei Fu, Jingyi Yu, Jingya Wang, Ye Shi
Subjects: cs.CV, cs.AI, eess.SY
Abstract URL: https://arxiv.org/abs/2502.05749
Pdf URL: https://arxiv.org/pdf/2502.05749
Copy Paste: [[2502.05749]] UniDB: A Unified Diffusion Bridge Framework via Stochastic Optimal Control(https://arxiv.org/abs/2502.05749)
Keywords: restoration
Abstract: Recent advances in diffusion bridge models leverage Doob's $h$-transform to establish fixed endpoints between distributions, demonstrating promising results in image translation and restoration tasks. However, these approaches frequently produce blurred or excessively smoothed image details and lack a comprehensive theoretical foundation to explain these shortcomings. To address these limitations, we propose UniDB, a unified framework for diffusion bridges based on Stochastic Optimal Control (SOC). UniDB formulates the problem through an SOC-based optimization and derives a closed-form solution for the optimal controller, thereby unifying and generalizing existing diffusion bridge models. We demonstrate that existing diffusion bridges employing Doob's $h$-transform constitute a special case of our framework, emerging when the terminal penalty coefficient in the SOC cost function tends to infinity. By incorporating a tunable terminal penalty coefficient, UniDB achieves an optimal balance between control costs and terminal penalties, substantially improving detail preservation and output quality. Notably, UniDB seamlessly integrates with existing diffusion bridge models, requiring only minimal code modifications. Extensive experiments across diverse image restoration tasks validate the superiority and adaptability of the proposed framework. Our code is available at this https URL.
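
The role of the terminal penalty coefficient can be illustrated with a generic stochastic optimal control problem. The formulation below is a textbook-style sketch, not the paper's exact objective: the drift $f$, diffusion coefficient $g$, control $u_t$, target endpoint $x_T$, and coefficient $\gamma$ are notation assumed here for illustration.

```latex
\begin{aligned}
\min_{u}\quad & \mathbb{E}\left[\int_0^T \tfrac{1}{2}\,\lVert u_t\rVert^2\,\mathrm{d}t
  \;+\; \tfrac{\gamma}{2}\,\lVert x_T^{u} - x_T\rVert^2\right] \\
\text{s.t.}\quad & \mathrm{d}x_t^{u} = \bigl(f(x_t^{u}, t) + g(t)\,u_t\bigr)\,\mathrm{d}t + g(t)\,\mathrm{d}w_t,
  \qquad x_0^{u} = x_0 .
\end{aligned}
```

Under this reading, the first term is the control cost and the second is the terminal penalty. As $\gamma \to \infty$, any deviation of $x_T^{u}$ from $x_T$ becomes infinitely expensive, so the endpoint is effectively pinned, which matches the abstract's statement that Doob's $h$-transform bridges arise in that limit. A finite $\gamma$ relaxes the endpoint constraint in exchange for a lower control cost.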

Title: Effective Black-Box Multi-Faceted Attacks Breach Vision Large Language Model Guardrails

Authors: Yijun Yang, Lichao Wang, Xiao Yang, Lanqing Hong, Jun Zhu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.05772
Pdf URL: https://arxiv.org/pdf/2502.05772
Copy Paste: [[2502.05772]] Effective Black-Box Multi-Faceted Attacks Breach Vision Large Language Model Guardrails(https://arxiv.org/abs/2502.05772)
Keywords: generation
Abstract: Vision Large Language Models (VLLMs) integrate visual data processing, expanding their real-world applications, but also increasing the risk of generating unsafe responses. In response, leading companies have implemented Multi-Layered safety defenses, including alignment training, safety system prompts, and content moderation. However, their effectiveness against sophisticated adversarial attacks remains largely unexplored. In this paper, we propose MultiFaceted Attack, a novel attack framework designed to systematically bypass Multi-Layered Defenses in VLLMs. It comprises three complementary attack facets: Visual Attack that exploits the multimodal nature of VLLMs to inject toxic system prompts through images; Alignment Breaking Attack that manipulates the model's alignment mechanism to prioritize the generation of contrasting responses; and Adversarial Signature that deceives content moderators by strategically placing misleading information at the end of the response. Extensive evaluations on eight commercial VLLMs in a black-box setting demonstrate that MultiFaceted Attack achieves a 61.56% attack success rate, surpassing state-of-the-art methods by at least 42.18%.

Title: Predictive Crash Analytics for Traffic Safety using Deep Learning

Authors: Karthik Sivakoti
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.05777
Pdf URL: https://arxiv.org/pdf/2502.05777
Copy Paste: [[2502.05777]] Predictive Crash Analytics for Traffic Safety using Deep Learning(https://arxiv.org/abs/2502.05777)
Keywords: generation
Abstract: Traditional automated crash analysis systems heavily rely on static statistical models and historical data, requiring significant manual interpretation and lacking real-time predictive capabilities. This research presents an innovative approach to traffic safety analysis through the integration of ensemble learning methods and multi-modal data fusion for real-time crash risk assessment and prediction. Our primary contribution lies in developing a hierarchical severity classification system that combines spatial-temporal crash patterns with environmental conditions, achieving significant improvements over traditional statistical approaches. The system demonstrates a Mean Average Precision (mAP) of 0.893, representing a 15% improvement over current state-of-the-art methods (baseline mAP: 0.776). We introduce a novel feature engineering technique that integrates crash location data with incident reports and weather conditions, achieving 92.4% accuracy in risk prediction and 89.7% precision in hotspot identification. Through extensive validation using 500,000 initial crash records filtered to 59,496 high-quality samples, our solution shows marked improvements in both prediction accuracy and computational efficiency. Key innovations include a robust data cleaning pipeline, adaptive feature generation, and a scalable real-time prediction system capable of handling peak loads of 1,000 concurrent requests while maintaining sub-100ms response times.

Title: GOLD: Graph Out-of-Distribution Detection via Implicit Adversarial Latent Generation

Authors: Danny Wang, Ruihong Qiu, Guangdong Bai, Zi Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.05780
Pdf URL: https://arxiv.org/pdf/2502.05780
Copy Paste: [[2502.05780]] GOLD: Graph Out-of-Distribution Detection via Implicit Adversarial Latent Generation(https://arxiv.org/abs/2502.05780)
Keywords: generation, generative
Abstract: Despite graph neural networks' (GNNs) great success in modelling graph-structured data, out-of-distribution (OOD) test instances still pose a great challenge for current GNNs. One of the most effective techniques to detect OOD nodes is to expose the detector model to an additional OOD node set, yet the extra OOD instances are often difficult to obtain in practice. Recent methods for image data address this problem using OOD data synthesis, typically relying on pre-trained generative models like Stable Diffusion. However, these approaches require vast amounts of additional data, as well as one-for-all pre-trained generative models, which are not available for graph data. Therefore, we propose the GOLD framework for graph OOD detection, an implicit adversarial learning pipeline with synthetic OOD exposure without pre-trained models. The implicit adversarial training process employs a novel alternating optimisation framework by training: (1) a latent generative model to regularly imitate the in-distribution (ID) embeddings from an evolving GNN, and (2) a GNN encoder and an OOD detector to accurately classify ID data while increasing the energy divergence between the ID embeddings and the generative model's synthetic embeddings. This novel approach implicitly transforms the synthetic embeddings into pseudo-OOD instances relative to the ID data, effectively simulating exposure to OOD scenarios without auxiliary data. Extensive OOD detection experiments are conducted on five benchmark graph datasets, verifying the superior performance of GOLD without using real OOD data compared with the state-of-the-art OOD exposure and non-exposure baselines.
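
The abstract refers to an energy divergence without defining the energy. A common choice in energy-based OOD detection is the negative log-sum-exp of classifier logits, and the sketch below assumes that definition. Whether GOLD uses exactly this form is not stated in the abstract, so treat `energy_score` and its `temperature` parameter as hypothetical.

```python
import numpy as np

def energy_score(logits, temperature=1.0):
    """E(x) = -T * logsumexp(logits / T); lower energy suggests in-distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    m = z.max(axis=-1, keepdims=True)        # subtract the max for numerical stability
    lse = m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1))
    return -temperature * lse

confident = np.array([10.0, 0.0, 0.0])       # peaked logits, typical of ID inputs
uncertain = np.array([0.0, 0.0, 0.0])        # flat logits, typical of OOD inputs
print(energy_score(confident), energy_score(uncertain))
```

A detector built on this score would flag inputs whose energy exceeds a threshold. Training that widens the energy gap between ID embeddings and synthetic embeddings makes such a threshold easier to place.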

Title: Divide-and-Conquer: Tree-structured Strategy with Answer Distribution Estimator for Goal-Oriented Visual Dialogue

Authors: Shuo Cai, Xinzhe Han, Shuhui Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.05806
Pdf URL: https://arxiv.org/pdf/2502.05806
Copy Paste: [[2502.05806]] Divide-and-Conquer: Tree-structured Strategy with Answer Distribution Estimator for Goal-Oriented Visual Dialogue(https://arxiv.org/abs/2502.05806)
Keywords: generation
Abstract: Goal-oriented visual dialogue involves multi-round interaction between artificial agents, which has attracted remarkable attention due to its wide applications. Given a visual scene, this task occurs when a Questioner asks an action-oriented question and an Answerer responds with the intent of letting the Questioner know the correct action to take. The quality of questions affects the accuracy and efficiency of the target search process. However, existing methods lack a clear strategy to guide the generation of questions, resulting in randomness in the search process and non-convergent results. We propose a Tree-Structured Strategy with Answer Distribution Estimator (TSADE), which guides question generation by excluding half of the current candidate objects in each round. This process is implemented by maximizing a binary reward inspired by the "divide-and-conquer" paradigm. We further design a candidate-minimization reward, which encourages the model to narrow down the scope of candidate objects toward the end of the dialogue. We experimentally demonstrate that our method enables the agents to achieve high task-oriented accuracy with fewer repeated questions and rounds compared to traditional ergodic question generation approaches. Qualitative results further show that TSADE helps agents generate higher-quality questions.
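
The benefit of excluding half of the candidates per round can be seen with a small calculation. The sketch below is not TSADE itself. It only counts how many ideal yes/no questions are needed when each answer removes half of the remaining candidates, and compares that with checking candidates one at a time. The function name is hypothetical.

```python
import math

def rounds_to_isolate(num_candidates):
    """Rounds needed to reach one candidate if each question halves the set."""
    rounds = 0
    remaining = num_candidates
    while remaining > 1:
        remaining = math.ceil(remaining / 2)  # worst case keeps the larger half
        rounds += 1
    return rounds

for n in (16, 20, 100):
    # One-at-a-time questioning needs up to n - 1 questions in the worst case.
    print(n, rounds_to_isolate(n), n - 1)
```

The halving strategy grows logarithmically with the number of candidates, which is why a reward that favors evenly splitting questions can shorten dialogues.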

Title: Devil is in the Details: Density Guidance for Detail-Aware Generation with Flow Models

Authors: Rafał Karczewski, Markus Heinonen, Vikas Garg
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.05807
Pdf URL: https://arxiv.org/pdf/2502.05807
Copy Paste: [[2502.05807]] Devil is in the Details: Density Guidance for Detail-Aware Generation with Flow Models(https://arxiv.org/abs/2502.05807)
Keywords: generation, generative
Abstract: Diffusion models have emerged as a powerful class of generative models, capable of producing high-quality images by mapping noise to a data distribution. However, recent findings suggest that image likelihood does not align with perceptual quality: high-likelihood samples tend to be smooth, while lower-likelihood ones are more detailed. Controlling sample density is thus crucial for balancing realism and detail. In this paper, we analyze an existing technique, Prior Guidance, which scales the latent code to influence image detail. We introduce score alignment, a condition that explains why this method works and show that it can be tractably checked for any continuous normalizing flow model. We then propose Density Guidance, a principled modification of the generative ODE that enables exact log-density control during sampling. Finally, we extend Density Guidance to stochastic sampling, ensuring precise log-density control while allowing controlled variation in structure or fine details. Our experiments demonstrate that these techniques provide fine-grained control over image detail without compromising sample quality.

Title: MMGDreamer: Mixed-Modality Graph for Geometry-Controllable 3D Indoor Scene Generation

Authors: Zhifei Yang, Keyang Lu, Chao Zhang, Jiaxing Qi, Hanqi Jiang, Ruifei Ma, Shenglin Yin, Yifan Xu, Mingzhe Xing, Zhen Xiao, Jieyi Long, Xiangde Liu, Guangyao Zhai
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.05874
Pdf URL: https://arxiv.org/pdf/2502.05874
Copy Paste: [[2502.05874]] MMGDreamer: Mixed-Modality Graph for Geometry-Controllable 3D Indoor Scene Generation(https://arxiv.org/abs/2502.05874)
Keywords: generation
Abstract: Controllable 3D scene generation has extensive applications in virtual reality and interior design, where the generated scenes should exhibit high levels of realism and controllability in terms of geometry. Scene graphs provide a suitable data representation that facilitates these applications. However, current graph-based methods for scene generation are constrained to text-based inputs and exhibit insufficient adaptability to flexible user inputs, hindering the ability to precisely control object geometry. To address this issue, we propose MMGDreamer, a dual-branch diffusion model for scene generation that incorporates a novel Mixed-Modality Graph, visual enhancement module, and relation predictor. The mixed-modality graph allows object nodes to integrate textual and visual modalities, with optional relationships between nodes. It enhances adaptability to flexible user inputs and enables meticulous control over the geometry of objects in the generated scenes. The visual enhancement module enriches the visual fidelity of text-only nodes by constructing visual representations using text embeddings. Furthermore, our relation predictor leverages node representations to infer absent relationships between nodes, resulting in more coherent scene layouts. Extensive experimental results demonstrate that MMGDreamer exhibits superior control of object geometry, achieving state-of-the-art scene generation performance. Project page: this https URL.

Title: NeuralPrefix: A Zero-shot Sensory Data Imputation Plugin

Authors: Abdelwahed Khamis, Sara Khalifa
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2502.05883
Pdf URL: https://arxiv.org/pdf/2502.05883
Copy Paste: [[2502.05883]] NeuralPrefix: A Zero-shot Sensory Data Imputation Plugin(https://arxiv.org/abs/2502.05883)
Keywords: generative
Abstract: Real-world sensing challenges such as sensor failures, communication issues, and power constraints lead to data intermittency, an issue that is known to undermine the traditional classification task that assumes a continuous data stream. Previous works addressed this issue by designing bespoke solutions (i.e., task-specific and/or modality-specific imputation). These approaches, while effective for their intended purposes, had limitations in their applicability across different tasks and sensor modalities. This raises an important question: Can we build a task-agnostic imputation pipeline that is transferable to new sensors without requiring additional training? In this work, we formalise the concept of zero-shot imputation and propose a novel approach that enables the adaptation of pre-trained models to handle data intermittency. This framework, named NeuralPrefix, is a generative neural component that precedes a task model during inference, filling in gaps caused by data intermittency. NeuralPrefix is built as a continuous dynamical system, where its internal state can be estimated at any point in time by solving an Ordinary Differential Equation (ODE). This approach allows for a more versatile and adaptable imputation method, overcoming the limitations of task-specific and modality-specific solutions. We conduct a comprehensive evaluation of NeuralPrefix on multiple sensory datasets, demonstrating its effectiveness across various domains. When tested on intermittent data with a high 50% missing data rate, NeuralPrefix accurately recovers all the missing samples, achieving SSIM scores between 0.93 and 0.96. Zero-shot evaluations show that NeuralPrefix generalises well to unseen datasets, even when the measurements come from a different modality.
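The idea of estimating an internal state "at any point in time by solving an ODE" can be illustrated with a minimal sketch. This is not the NeuralPrefix architecture: the dynamics function f, the helper names integrate_state and fill_gaps, and the use of a fixed-step Euler solver are all assumptions made for illustration, and a simple exponential decay stands in for a learned network.

```python
import math
from typing import Callable, List, Optional

# Hypothetical dynamics signature: dh/dt = f(h, t). In a learned model this
# would be a neural network; here it is any plain Python function.
Dynamics = Callable[[float, float], float]


def integrate_state(f: Dynamics, h0: float, t0: float, t1: float,
                    steps: int = 1000) -> float:
    """Advance the state from time t0 to t1 with fixed-step Euler integration."""
    h, t = h0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        h += dt * f(h, t)
        t += dt
    return h


def fill_gaps(readings: List[Optional[float]], dt_sample: float,
              f: Dynamics) -> List[Optional[float]]:
    """Replace missing readings (None) by integrating from the last observation."""
    filled: List[Optional[float]] = []
    last_value: Optional[float] = None
    last_time = 0.0
    for i, reading in enumerate(readings):
        t = i * dt_sample
        if reading is not None:
            last_value, last_time = reading, t
            filled.append(reading)
        elif last_value is None:
            filled.append(None)  # nothing observed yet, so nothing to integrate from
        else:
            filled.append(integrate_state(f, last_value, last_time, t))
    return filled


# Toy dynamics (assumption): exponential decay, whose exact solution is exp(-t).
decay: Dynamics = lambda h, t: -h
recovered = fill_gaps([1.0, None, None], dt_sample=0.5, f=decay)
exact = [1.0, math.exp(-0.5), math.exp(-1.0)]
```

Because the state is defined in continuous time, the query time does not need to fall on the sensor's sampling grid, which is what makes this kind of formulation attractive for intermittent data.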

Title: Beyond Fine-Tuning: A Systematic Study of Sampling Techniques in Personalized Image Generation

Authors: Vera Soboleva, Maksim Nakhodnov, Aibek Alanov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.05895
Pdf URL: https://arxiv.org/pdf/2502.05895
Copy Paste: [[2502.05895]] Beyond Fine-Tuning: A Systematic Study of Sampling Techniques in Personalized Image Generation(https://arxiv.org/abs/2502.05895)
Keywords: generation
Abstract: Personalized text-to-image generation aims to create images tailored to user-defined concepts and textual descriptions. Balancing the fidelity of the learned concept with its ability for generation in various contexts presents a significant challenge. Existing methods often address this through diverse fine-tuning parameterizations and improved sampling strategies that integrate superclass trajectories during the diffusion process. While improved sampling offers a cost-effective, training-free solution for enhancing fine-tuned models, systematic analyses of these methods remain limited. Current approaches typically tie sampling strategies with fixed fine-tuning configurations, making it difficult to isolate their impact on generation outcomes. To address this issue, we systematically analyze sampling strategies beyond fine-tuning, exploring the impact of concept and superclass trajectories on the results. Building on this analysis, we propose a decision framework evaluating text alignment, computational constraints, and fidelity objectives to guide strategy selection. It integrates with diverse architectures and training approaches, systematically optimizing concept preservation, prompt adherence, and resource efficiency. The source code can be found at this https URL.

Title: Fast Omni-Directional Image Super-Resolution: Adapting the Implicit Image Function with Pixel and Semantic-Wise Spherical Geometric Priors

Authors: Xuelin Shen, Yitong Wang, Silin Zheng, Kang Xiao, Wenhan Yang, Xu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.05902
Pdf URL: https://arxiv.org/pdf/2502.05902
Copy Paste: [[2502.05902]] Fast Omni-Directional Image Super-Resolution: Adapting the Implicit Image Function with Pixel and Semantic-Wise Spherical Geometric Priors(https://arxiv.org/abs/2502.05902)
Keywords: super-resolution
Abstract: In the context of Omni-Directional Image (ODI) Super-Resolution (SR), the unique challenge arises from the non-uniform oversampling characteristics caused by EquiRectangular Projection (ERP). Considerable efforts in designing complex spherical convolutions or polyhedron reprojection offer significant performance improvements but at the expense of cumbersome processing procedures and slower inference speeds. Under these circumstances, this paper proposes a new ODI-SR model characterized by its capacity to perform Fast and Arbitrary-scale ODI-SR processes, denoted as FAOR. The key innovation lies in adapting the implicit image function from the planar image domain to the ERP image domain by incorporating spherical geometric priors at both the latent representation and image reconstruction stages, in a low-overhead manner. Specifically, at the latent representation stage, we adopt a pair of pixel-wise and semantic-wise sphere-to-planar distortion maps to perform affine transformations on the latent representation, thereby incorporating it with spherical properties. Moreover, during the image reconstruction stage, we introduce a geodesic-based resampling strategy, aligning the implicit image function with spherical geometrics without introducing additional parameters. As a result, the proposed FAOR outperforms the state-of-the-art ODI-SR models with a much faster inference speed. Extensive experimental results and ablation studies have demonstrated the effectiveness of our design.

Title: Multi-Branch Collaborative Learning Network for Video Quality Assessment in Industrial Video Search

Authors: Hengzhu Tang, Zefeng Zhang, Zhiping Li, Zhenyu Zhang, Xing Wu, Li Gao, Suqi Cheng, Dawei Yin
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2502.05924
Pdf URL: https://arxiv.org/pdf/2502.05924
Copy Paste: [[2502.05924]] Multi-Branch Collaborative Learning Network for Video Quality Assessment in Industrial Video Search(https://arxiv.org/abs/2502.05924)
Keywords: quality assessment
Abstract: Video Quality Assessment (VQA) is vital for large-scale video retrieval systems, aimed at identifying quality issues to prioritize high-quality videos. In industrial systems, low-quality video characteristics fall into four categories: visual-related issues like mosaics and black boxes, textual issues from video titles and OCR content, and semantic issues like frame incoherence and frame-text mismatch from AI-generated videos. Despite their prevalence in industrial settings, these low-quality videos have been largely overlooked in academic research, posing a challenge for accurate identification. To address this, we introduce the Multi-Branch Collaborative Network (MBCN) tailored for industrial video retrieval systems. MBCN features four branches, each designed to tackle one of the aforementioned quality issues. After each branch independently scores videos, we aggregate these scores using a weighted approach and a squeeze-and-excitation mechanism to dynamically address quality issues across different scenarios. We implement point-wise and pair-wise optimization objectives to ensure score stability and reasonableness. Extensive offline and online experiments on a world-level video search engine demonstrate MBCN's effectiveness in identifying video quality issues, significantly enhancing the retrieval system's ranking performance. Detailed experimental analyses confirm the positive contribution of all four evaluation branches. Furthermore, MBCN significantly improves recognition accuracy for low-quality AI-generated videos compared to the baseline.

Title: Acceleration Multiple Heads Decoding for LLM via Dynamic Tree Attention

Authors: Zhendong Zhang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2502.05947
Pdf URL: https://arxiv.org/pdf/2502.05947
Copy Paste: [[2502.05947]] Acceleration Multiple Heads Decoding for LLM via Dynamic Tree Attention(https://arxiv.org/abs/2502.05947)
Keywords: generation
Abstract: Multiple heads decoding accelerates the inference of Large Language Models (LLMs) by predicting the next several tokens simultaneously. It generates and verifies multiple candidate sequences in parallel via tree attention with a fixed structure. In this paper, we replace the fixed tree attention with dynamic tree attention in multiple heads decoding, specifically in the context of MEDUSA. We propose a simple, low-complexity strategy to generate candidates and construct the dynamic tree structure. Preliminary experiments show that the proposed method improves the decoding efficiency of multiple heads decoding for LLMs while maintaining the generation quality. This result demonstrates the potential for improving candidate generation in multiple heads decoding.
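To make "generate candidates and construct the dynamic tree structure" concrete, here is a minimal sketch under stated assumptions. The abstract does not specify the selection rule; the sketch assumes candidates are ranked by the product of per-head probabilities, which treats the heads as independent. The names top_candidates and build_tree are hypothetical, not taken from the paper or from MEDUSA.

```python
import heapq
from itertools import product
from math import prod
from typing import Dict, List, Tuple

Candidate = Tuple[str, ...]


def top_candidates(head_probs: List[Dict[str, float]], n: int) -> List[Candidate]:
    """Return the n token sequences with the highest product of head probabilities.

    head_probs[k] maps each token proposed by head k to its probability.
    Assumption: the heads are treated as independent, so the joint score of a
    sequence is approximated by the product of its per-head probabilities.
    """
    combos = product(*(hp.items() for hp in head_probs))
    scored = (
        (prod(p for _, p in combo), tuple(tok for tok, _ in combo))
        for combo in combos
    )
    return [seq for _, seq in heapq.nlargest(n, scored)]


def build_tree(candidates: List[Candidate]) -> List[Candidate]:
    """Collect the unique prefixes of all candidates; each prefix is one tree node."""
    nodes = set()
    for seq in candidates:
        for depth in range(1, len(seq) + 1):
            nodes.add(seq[:depth])
    # Sort by depth first so parents always precede their children.
    return sorted(nodes, key=lambda node: (len(node), node))


# Two heads, each proposing two tokens for its position (toy numbers).
heads = [{"a": 0.7, "b": 0.3}, {"c": 0.6, "d": 0.4}]
candidates = top_candidates(heads, n=3)
tree = build_tree(candidates)
```

Because the tree is rebuilt from the current head distributions at every step, its shape follows the model's confidence instead of a fixed template. Shared prefixes such as ("a",) appear once, so they are verified only once.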

Title: VFX Creator: Animated Visual Effect Generation with Controllable Diffusion Transformer

Authors: Xinyu Liu, Ailing Zeng, Wei Xue, Harry Yang, Wenhan Luo, Qifeng Liu, Yike Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.05979
Pdf URL: https://arxiv.org/pdf/2502.05979
Copy Paste: [[2502.05979]] VFX Creator: Animated Visual Effect Generation with Controllable Diffusion Transformer(https://arxiv.org/abs/2502.05979)
Keywords: generation, generative
Abstract: Crafting magic and illusions is one of the most thrilling aspects of filmmaking, with visual effects (VFX) serving as the powerhouse behind unforgettable cinematic experiences. While recent advances in generative artificial intelligence have driven progress in generic image and video synthesis, the domain of controllable VFX generation remains relatively underexplored. In this work, we propose a novel paradigm for animated VFX generation as image animation, where dynamic effects are generated from user-friendly textual descriptions and static reference images. Our work makes two primary contributions: (i) Open-VFX, the first high-quality VFX video dataset spanning 15 diverse effect categories, annotated with textual descriptions, instance segmentation masks for spatial conditioning, and start-end timestamps for temporal control. (ii) VFX Creator, a simple yet effective controllable VFX generation framework based on a Video Diffusion Transformer. The model incorporates a spatial and temporal controllable LoRA adapter, requiring minimal training videos. Specifically, a plug-and-play mask control module enables instance-level spatial manipulation, while tokenized start-end motion timestamps embedded in the diffusion process, alongside the text encoder, allow precise temporal control over effect timing and pace. Extensive experiments on the Open-VFX test set demonstrate the superiority of the proposed system in generating realistic and dynamic effects, achieving state-of-the-art performance and generalization ability in both spatial and temporal controllability. Furthermore, we introduce a specialized metric to evaluate the precision of temporal control. By bridging traditional VFX techniques with generative approaches, VFX Creator unlocks new possibilities for efficient and high-quality video effect generation, making advanced VFX accessible to a broader audience.

Title: Generating 3D Binding Molecules Using Shape-Conditioned Diffusion Models with Guidance

Authors: Ziqi Chen, Bo Peng, Tianhua Zhai, Daniel Adu-Ampratwum, Xia Ning
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.06027
Pdf URL: https://arxiv.org/pdf/2502.06027
Copy Paste: [[2502.06027]] Generating 3D Binding Molecules Using Shape-Conditioned Diffusion Models with Guidance(https://arxiv.org/abs/2502.06027)
Keywords: generative
Abstract: Drug development is a critical but notoriously resource- and time-consuming process. In this manuscript, we develop a novel generative artificial intelligence (genAI) method, DiffSMol, to facilitate drug development. DiffSMol generates 3D binding molecules based on the shapes of known ligands. DiffSMol encapsulates geometric details of ligand shapes within pre-trained, expressive shape embeddings and then generates new binding molecules through a diffusion model. DiffSMol further modifies the generated 3D structures iteratively via shape guidance to better resemble the ligand shapes. It also tailors the generated molecules toward optimal binding affinities under the guidance of protein pockets. Here, we show that DiffSMol outperforms the state-of-the-art methods on benchmark datasets. When generating binding molecules resembling ligand shapes, DiffSMol with shape guidance achieves a success rate of 61.4%, substantially outperforming the best baseline (11.2%), while producing molecules with novel molecular graph structures. DiffSMol with pocket guidance also outperforms the best baseline in binding affinities by 13.2%, and even by 17.7% when combined with shape guidance. Case studies for two critical drug targets demonstrate very favorable physicochemical and pharmacokinetic properties of the generated molecules and thus the potential of DiffSMol in developing promising drug candidates.

Title: Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization

Authors: Jiajun Fan, Shuaike Shen, Chaoran Cheng, Yuxin Chen, Chumeng Liang, Ge Liu
Subjects: cs.LG, cs.AI, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2502.06061
Pdf URL: https://arxiv.org/pdf/2502.06061
Copy Paste: [[2502.06061]] Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization(https://arxiv.org/abs/2502.06061)
Keywords: generation, generative
Abstract: Recent advancements in reinforcement learning (RL) have achieved great success in fine-tuning diffusion-based generative models. However, fine-tuning continuous flow-based generative models to align with arbitrary user-defined reward functions remains challenging, particularly due to issues such as policy collapse from overoptimization and the prohibitively high computational cost of likelihoods in continuous-time flows. In this paper, we propose an easy-to-use and theoretically sound RL fine-tuning method, which we term Online Reward-Weighted Conditional Flow Matching with Wasserstein-2 Regularization (ORW-CFM-W2). Our method integrates RL into the flow matching framework to fine-tune generative models with arbitrary reward functions, without relying on gradients of rewards or filtered datasets. By introducing an online reward-weighting mechanism, our approach guides the model to prioritize high-reward regions in the data manifold. To prevent policy collapse and maintain diversity, we incorporate Wasserstein-2 (W2) distance regularization into our method and derive a tractable upper bound for it in flow matching, effectively balancing exploration and exploitation of policy optimization. We provide theoretical analyses to demonstrate the convergence properties and induced data distributions of our method, establishing connections with traditional RL algorithms featuring Kullback-Leibler (KL) regularization and offering a more comprehensive understanding of the underlying mechanisms and learning behavior of our approach. Extensive experiments on tasks including target image generation, image compression, and text-image alignment demonstrate the effectiveness of our method, where our method achieves optimal policy convergence while allowing controllable trade-offs between reward maximization and diversity preservation.
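The structure of a reward-weighted flow matching objective with a regularizer toward a reference model can be sketched in one dimension. This is an illustration, not the ORW-CFM-W2 objective itself: the exponential weight exp(r/tau), the linear interpolation path, and the squared velocity-difference penalty standing in for the W2 bound are all assumptions, as are the names reward_weighted_cfm_loss, tau and lam.

```python
import math
from typing import Callable, List, Tuple

Velocity = Callable[[float, float], float]  # v(x_t, t) -> predicted velocity


def reward_weighted_cfm_loss(
    batch: List[Tuple[float, float, float]],  # (x0 noise, x1 sample, t in [0, 1])
    velocity: Velocity,       # model being fine-tuned
    ref_velocity: Velocity,   # frozen reference (pre-trained) model
    reward: Callable[[float], float],
    tau: float = 1.0,         # assumed temperature of the reward weighting
    lam: float = 0.1,         # assumed strength of the regularizer
) -> float:
    """Average of: weight * flow-matching error + lam * distance to the reference."""
    total = 0.0
    for x0, x1, t in batch:
        x_t = (1.0 - t) * x0 + t * x1        # point on the straight-line path
        target = x1 - x0                     # velocity of that path
        weight = math.exp(reward(x1) / tau)  # up-weight high-reward samples
        pred = velocity(x_t, t)
        fm_term = weight * (pred - target) ** 2
        reg_term = (pred - ref_velocity(x_t, t)) ** 2  # keeps the model near the reference
        total += fm_term + lam * reg_term
    return total / len(batch)


# Toy check: one pair whose true path velocity is 1.
batch = [(0.0, 1.0, 0.5)]
zero_reward = lambda x: 0.0
loss_perfect = reward_weighted_cfm_loss(
    batch, lambda x, t: 1.0, lambda x, t: 1.0, zero_reward)
loss_drifted = reward_weighted_cfm_loss(
    batch, lambda x, t: 1.0, lambda x, t: 0.0, zero_reward)
```

The two terms pull in opposite directions. The weighted term moves probability mass toward high-reward samples, while the penalty limits how far the fine-tuned velocity field can drift from the reference, which is one way to discourage collapse onto a few samples.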

Title: Debiasing Guidance for Discrete Diffusion with Sequential Monte Carlo

Authors: Cheuk Kit Lee, Paul Jeha, Jes Frellsen, Pietro Lio, Michael Samuel Albergo, Francisco Vargas
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.06079
Pdf URL: https://arxiv.org/pdf/2502.06079
Copy Paste: [[2502.06079]] Debiasing Guidance for Discrete Diffusion with Sequential Monte Carlo(https://arxiv.org/abs/2502.06079)
Keywords: generation, generative
Abstract: Discrete diffusion models are a class of generative models that produce samples from an approximated data distribution within a discrete state space. Often, there is a need to target specific regions of the data distribution. Current guidance methods aim to sample from a distribution with mass proportional to $p_0(x_0) p(\zeta|x_0)^\alpha$ but fail to achieve this in practice. We introduce a Sequential Monte Carlo algorithm that generates unbiasedly from this target distribution, utilising the learnt unconditional and guided process. We validate our approach on low-dimensional distributions, controlled images and text generations. For text generation, our method provides strong control while maintaining low perplexity compared to guidance-based approaches.
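The target distribution proportional to $p_0(x_0) p(\zeta|x_0)^\alpha$ can be made concrete on a tiny discrete state space. The sketch below is not the paper's Sequential Monte Carlo algorithm: it shows only the single-step building block (importance weighting followed by resampling), with proposals drawn from p_0 directly instead of from a diffusion process. The names tempered_target and importance_resample and all numbers are hypothetical.

```python
import random
from typing import List


def tempered_target(prior: List[float], likelihood: List[float],
                    alpha: float) -> List[float]:
    """Normalize p0(x) * p(zeta | x)**alpha over a finite state space."""
    unnorm = [p * (lik ** alpha) for p, lik in zip(prior, likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def importance_resample(prior: List[float], likelihood: List[float],
                        alpha: float, n: int, rng: random.Random) -> List[int]:
    """Draw n particles from the prior, weight them, and resample.

    The weight p(zeta | x)**alpha is the ratio target / proposal up to a
    constant, so the resampled particles approximate the tempered target.
    """
    states = list(range(len(prior)))
    particles = rng.choices(states, weights=prior, k=n)
    weights = [likelihood[x] ** alpha for x in particles]
    return rng.choices(particles, weights=weights, k=n)


prior = [0.5, 0.3, 0.2]       # p0 over three states (toy numbers)
likelihood = [0.1, 0.6, 0.3]  # p(zeta | x) for a fixed condition zeta
target = tempered_target(prior, likelihood, alpha=2.0)

samples = importance_resample(prior, likelihood, 2.0, n=20000,
                              rng=random.Random(0))
freq_state_1 = samples.count(1) / len(samples)
```

With alpha = 2 the unnormalized masses are 0.005, 0.108 and 0.018, so state 1 carries about 82% of the target mass even though the prior favors state 0. The resampled frequency approaches that value as n grows.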

Title: Graph Pseudotime Analysis and Neural Stochastic Differential Equations for Analyzing Retinal Degeneration Dynamics and Beyond

Authors: Dai Shi, Kuan Yan, Lequan Lin, Yue Zeng, Ting Zhang, Dmytro Matsypura, Mark C. Gillies, Ling Zhu, Junbin Gao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.06126
Pdf URL: https://arxiv.org/pdf/2502.06126
Copy Paste: [[2502.06126]] Graph Pseudotime Analysis and Neural Stochastic Differential Equations for Analyzing Retinal Degeneration Dynamics and Beyond(https://arxiv.org/abs/2502.06126)
Keywords: generation
Abstract: Understanding disease progression at the molecular pathway level usually requires capturing both structural dependencies between pathways and the temporal dynamics of disease evolution. In this work, we solve the former challenge by developing a biologically informed graph-forming method to efficiently construct pathway graphs for subjects from our newly curated JR5558 mouse transcriptomics dataset. We then develop Graph-level Pseudotime Analysis (GPA) to infer graph-level trajectories that reveal how disease progresses at the population level, rather than in individual subjects. Based on the trajectories estimated by GPA, we identify the most sensitive pathways that drive disease stage transitions. In addition, we measure changes in pathway features using neural stochastic differential equations (SDEs), which enables us to formally define and compute pathway stability and disease bifurcation points (points of no return), two fundamental problems in disease progression research. We further extend our theory to the case when pathways can interact with each other, enabling a more comprehensive and multi-faceted characterization of disease phenotypes. The comprehensive experimental results demonstrate the effectiveness of our framework in reconstructing the dynamics of the pathway, identifying critical transitions, and providing novel insights into the mechanistic understanding of disease evolution.

Title: Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models

Authors: Ce Zhang, Zifu Wan, Zhehan Kan, Martin Q. Ma, Simon Stepputtis, Deva Ramanan, Russ Salakhutdinov, Louis-Philippe Morency, Katia Sycara, Yaqi Xie
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2502.06130
Pdf URL: https://arxiv.org/pdf/2502.06130
Copy Paste: [[2502.06130]] Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models(https://arxiv.org/abs/2502.06130)
Keywords: generation, generative
Abstract: While recent Large Vision-Language Models (LVLMs) have shown remarkable performance in multi-modal tasks, they are prone to generating hallucinatory text responses that do not align with the given visual input, which restricts their practical applicability in real-world scenarios. In this work, inspired by the observation that the text-to-image generation process is the inverse of image-conditioned response generation in LVLMs, we explore the potential of leveraging text-to-image generative models to assist in mitigating hallucinations in LVLMs. We discover that generative models can offer valuable self-feedback for mitigating hallucinations at both the response and token levels. Building on this insight, we introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process to effectively mitigate hallucinations in LVLMs. Specifically, DeGF generates an image from the initial response produced by LVLMs, which acts as an auxiliary visual reference and provides self-feedback to verify and correct the initial response through complementary or contrastive decoding. Extensive experimental results validate the effectiveness of our approach in mitigating diverse types of hallucinations, consistently surpassing state-of-the-art methods across six benchmarks. Code is available at this https URL.

Title: Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile

Authors: Hangliang Ding, Dacheng Li, Runlong Su, Peiyuan Zhang, Zhijie Deng, Ion Stoica, Hao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.06155
Pdf URL: https://arxiv.org/pdf/2502.06155
Copy Paste: [[2502.06155]] Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile(https://arxiv.org/abs/2502.06155)
Keywords: generation
Abstract: Despite the promise of synthesizing high-fidelity videos, Diffusion Transformers (DiTs) with 3D full attention suffer from expensive inference due to the complexity of attention computation and numerous sampling steps. For example, the popular Open-Sora-Plan model consumes more than 9 minutes for generating a single video of 29 frames. This paper addresses the inefficiency issue from two aspects: 1) Prune the 3D full attention based on the redundancy within video data: we identify a prevalent tile-style repetitive pattern in the 3D attention maps for video data, and advocate a new family of sparse 3D attention that holds a linear complexity w.r.t. the number of video frames. 2) Shorten the sampling process by adopting existing multi-step consistency distillation: we split the entire sampling trajectory into several segments and perform consistency distillation within each one to activate few-step generation capacities. We further devise a three-stage training pipeline to conjoin the low-complexity attention and few-step generation capacities. Notably, with 0.1% pretraining data, we turn the Open-Sora-Plan-1.2 model into an efficient one that is 7.4x-7.8x faster for 29- and 93-frame 720p video generation with a marginal performance trade-off in VBench. In addition, we demonstrate that our approach is amenable to distributed inference, achieving an additional 3.91x speedup when running on 4 GPUs with sequence parallelism.
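The claim of linear complexity in the number of frames can be checked with a small counting sketch. The mask below is an assumption chosen for illustration and is not the paper's attention pattern: each frame attends to itself plus a fixed set of "global" frames, taken here to be the first num_global frames. The names frame_mask and attended_token_pairs are hypothetical.

```python
from typing import List


def frame_mask(num_frames: int, num_global: int) -> List[List[bool]]:
    """Frame-level mask: mask[q][k] is True when frame q may attend to frame k.

    Assumed pattern: every frame sees itself and the first num_global frames.
    """
    global_frames = set(range(min(num_global, num_frames)))
    return [
        [(k == q) or (k in global_frames) for k in range(num_frames)]
        for q in range(num_frames)
    ]


def attended_token_pairs(mask: List[List[bool]], tokens_per_frame: int) -> int:
    """Count query-key token pairs; each allowed frame pair is a dense tile."""
    frame_pairs = sum(sum(row) for row in mask)
    return frame_pairs * tokens_per_frame ** 2


tokens = 4
sparse_8 = attended_token_pairs(frame_mask(8, 2), tokens)    # 22 frame pairs
sparse_16 = attended_token_pairs(frame_mask(16, 2), tokens)  # 46 frame pairs
dense_8 = (8 * tokens) ** 2
dense_16 = (16 * tokens) ** 2
```

Doubling the frame count roughly doubles the sparse cost (352 to 736 token pairs) but quadruples the dense cost (1024 to 4096). Any mask in which each frame attends to a bounded number of frames has this property, whichever frames are chosen.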

Title: Universal Approximation of Visual Autoregressive Transformers

Authors: Yifang Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2502.06167
Pdf URL: https://arxiv.org/pdf/2502.06167
Copy Paste: [[2502.06167]] Universal Approximation of Visual Autoregressive Transformers(https://arxiv.org/abs/2502.06167)
Keywords: generation
Abstract: We investigate the fundamental limits of transformer-based foundation models, extending our analysis to include Visual Autoregressive (VAR) transformers. VAR represents a big step toward generating images using a novel, scalable, coarse-to-fine "next-scale prediction" framework. These models set a new quality bar, outperforming all previous methods, including Diffusion Transformers, while having state-of-the-art performance for image synthesis tasks. Our primary contributions establish that, for single-head VAR transformers with a single self-attention layer and single interpolation layer, the VAR Transformer is universal. From the statistical perspective, we prove that such simple VAR transformers are universal approximators for any image-to-image Lipschitz function. Furthermore, we demonstrate that flow-based autoregressive transformers inherit similar approximation capabilities. Our results provide important design principles for effective and computationally efficient VAR Transformer strategies that can be used to extend their utility to more sophisticated VAR models in image generation and other related areas.

Title: An Interpretable Implicit-Based Approach for Modeling Local Spatial Effects: A Case Study of Global Gross Primary Productivity

Authors: Siqi Du, Hongsheng Huang, Kaixin Shen, Ziqi Liu, Shengjun Tang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.06170
Pdf URL: https://arxiv.org/pdf/2502.06170
Copy Paste: [[2502.06170]] An Interpretable Implicit-Based Approach for Modeling Local Spatial Effects: A Case Study of Global Gross Primary Productivity(https://arxiv.org/abs/2502.06170)
Keywords: generation
Abstract: In Earth sciences, unobserved factors exhibit non-stationary spatial distributions, causing the relationships between features and targets to display spatial heterogeneity. In geographic machine learning tasks, conventional statistical learning methods often struggle to capture spatial heterogeneity, leading to unsatisfactory prediction accuracy and unreliable interpretability. While approaches like Geographically Weighted Regression (GWR) capture local variations, they fall short of uncovering global patterns and tracking the continuous evolution of spatial heterogeneity. Motivated by this limitation, we propose a novel perspective - that is, simultaneously modeling common features across different locations alongside spatial differences using deep neural networks. The proposed method is a dual-branch neural network with an encoder-decoder structure. In the encoding stage, the method aggregates node information in a spatiotemporal conditional graph using GCN and LSTM, encoding location-specific spatiotemporal heterogeneity as an implicit conditional vector. Additionally, a self-attention-based encoder is used to extract location-invariant common features from the data. In the decoding stage, the approach employs a conditional generation strategy that predicts response variables and interpretative weights based on data features under spatiotemporal conditions. The approach is validated by predicting vegetation gross primary productivity (GPP) using global climate and land cover data from 2001 to 2020. Trained on 50 million samples and tested on 2.8 million, the proposed model achieves an RMSE of 0.836, outperforming LightGBM (1.063) and TabNet (0.944). Visualization analyses indicate that our method can reveal the distribution differences of the dominant factors of GPP across various times and locations.

Title: Uncertainty-Aware Adaptation of Large Language Models for Protein-Protein Interaction Analysis

Authors: Sanket Jantre, Tianle Wang, Gilchan Park, Kriti Chopra, Nicholas Jeon, Xiaoning Qian, Nathan M. Urban, Byung-Jun Yoon
Subjects: cs.LG, cs.AI, cs.CL, stat.AP, stat.ML
Abstract URL: https://arxiv.org/abs/2502.06173
Pdf URL: https://arxiv.org/pdf/2502.06173
Copy Paste: [[2502.06173]] Uncertainty-Aware Adaptation of Large Language Models for Protein-Protein Interaction Analysis(https://arxiv.org/abs/2502.06173)
Keywords: generative
Abstract: Identification of protein-protein interactions (PPIs) helps derive cellular mechanistic understanding, particularly in the context of complex conditions such as neurodegenerative disorders, metabolic syndromes, and cancer. Large Language Models (LLMs) have demonstrated remarkable potential in predicting protein structures and interactions via automated mining of vast biomedical literature; yet their inherent uncertainty remains a key challenge for deriving reproducible findings, critical for biomedical applications. In this study, we present an uncertainty-aware adaptation of LLMs for PPI analysis, leveraging fine-tuned LLaMA-3 and BioMedGPT models. To enhance prediction reliability, we integrate LoRA ensembles and Bayesian LoRA models for uncertainty quantification (UQ), ensuring confidence-calibrated insights into protein behavior. Our approach achieves competitive performance in PPI identification across diverse disease contexts while addressing model uncertainty, thereby enhancing trustworthiness and reproducibility in computational biology. These findings underscore the potential of uncertainty-aware LLM adaptation for advancing precision medicine and biomedical research.
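
The uncertainty quantification idea behind an ensemble can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation: it assumes each ensemble member (for instance, one LoRA adapter) outputs class probabilities for a candidate protein pair, and it uses the standard entropy decomposition into total, aleatoric, and epistemic parts. The function names and the toy probabilities are hypothetical.

```python
import math

def mean_probs(member_probs):
    """Average class probabilities across ensemble members."""
    n = len(member_probs)
    k = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / n for c in range(k)]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ensemble_uncertainty(member_probs):
    """Return (mean prediction, total, aleatoric, epistemic) uncertainty."""
    avg = mean_probs(member_probs)
    total = entropy(avg)  # entropy of the averaged prediction
    aleatoric = sum(entropy(p) for p in member_probs) / len(member_probs)
    epistemic = total - aleatoric  # mutual information between label and member
    return avg, total, aleatoric, epistemic

# Hypothetical outputs of three ensemble members for [no interaction, interaction].
agree = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
disagree = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

for name, members in (("agree", agree), ("disagree", disagree)):
    avg, total, aleatoric, epistemic = ensemble_uncertainty(members)
    print(name, avg, round(total, 3), round(aleatoric, 3), round(epistemic, 3))
```

When the members agree, the epistemic term is essentially zero; when they disagree, it grows, which is the signal one would use to flag a prediction as unreliable.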

Title: CANeRV: Content Adaptive Neural Representation for Video Compression

Authors: Lv Tang, Jun Zhu, Xinfeng Zhang, Li Zhang, Siwei Ma, Qingming Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.06181
Pdf URL: https://arxiv.org/pdf/2502.06181
Copy Paste: [[2502.06181]] CANeRV: Content Adaptive Neural Representation for Video Compression(https://arxiv.org/abs/2502.06181)
Keywords: restoration
Abstract: Recent advances in video compression introduce implicit neural representation (INR) based methods, which effectively capture global dependencies and characteristics of entire video sequences. Unlike traditional and deep learning based approaches, INR-based methods optimize network parameters from a global perspective, resulting in superior compression potential. However, most current INR methods utilize a fixed and uniform network architecture across all frames, limiting their adaptability to dynamic variations within and between video sequences. This often leads to suboptimal compression outcomes as these methods struggle to capture the distinct nuances and transitions in video content. To overcome these challenges, we propose Content Adaptive Neural Representation for Video Compression (CANeRV), an innovative INR-based video compression network that adaptively conducts structure optimisation based on the specific content of each video sequence. To better capture dynamic information across video sequences, we propose a dynamic sequence-level adjustment (DSA). Furthermore, to enhance the capture of dynamics between frames within a sequence, we implement a dynamic frame-level adjustment (DFA). Finally, to effectively capture spatial structural information within video frames, thereby enhancing the detail restoration capabilities of CANeRV, we devise a structure-level hierarchical structural adaptation (HSA). Experimental results demonstrate that CANeRV can outperform both H.266/VVC and state-of-the-art INR-based video compression techniques across diverse video datasets.

Title: Comparing Image Segmentation Algorithms

Authors: Milind Cherukuri
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.06201
Pdf URL: https://arxiv.org/pdf/2502.06201
Copy Paste: [[2502.06201]] Comparing Image Segmentation Algorithms(https://arxiv.org/abs/2502.06201)
Keywords: restoration
Abstract: This paper presents a novel approach for denoising binary images using simulated annealing (SA), a global optimization technique that addresses the inherent challenges of non-convex energy functions. Binary images are often corrupted by noise, necessitating effective restoration methods. We propose an energy function E(x, y) that captures the relationship between the noisy image y and the desired clean image x. Our algorithm combines simulated annealing with a localized optimization strategy to efficiently navigate the solution space, minimizing the energy function while maintaining computational efficiency. We evaluate the performance of the proposed method against traditional iterated conditional modes (ICM), employing a binary image with 10% pixel corruption as a test case. Experimental results demonstrate that the simulated annealing method achieves a significant restoration improvement, yielding a 99.19% agreement with the original image compared to 96.21% for ICM. Visual assessments reveal that simulated annealing effectively removes noise while preserving structural details, making it a promising approach for binary image denoising. This work contributes to the field of image processing by highlighting the advantages of incorporating global optimization techniques in restoration tasks.
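
The annealing procedure described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the abstract does not give the form of E(x, y), so the code assumes a common Ising-style energy with a smoothness weight `BETA` and a data-fidelity weight `ETA`, pixels in {-1, +1}, single-pixel flip proposals, and a geometric cooling schedule. All names and parameter values are hypothetical.

```python
import math
import random

BETA = 1.0  # assumed weight rewarding agreement between neighboring pixels
ETA = 1.5   # assumed weight rewarding agreement with the noisy observation

def local_energy(x, y, i, j, value):
    """Energy terms that involve pixel (i, j) if it took `value` (+1 or -1)."""
    h, w = len(x), len(x[0])
    neighbor_sum = 0
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < h and 0 <= nj < w:
            neighbor_sum += x[ni][nj]
    return -BETA * value * neighbor_sum - ETA * value * y[i][j]

def anneal(y, steps=20000, t_start=2.0, t_end=0.05, seed=0):
    """Denoise y by Metropolis single-pixel flips under geometric cooling."""
    rng = random.Random(seed)
    x = [row[:] for row in y]  # start from the noisy image
    h, w = len(x), len(x[0])
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / max(1, steps - 1))
        i, j = rng.randrange(h), rng.randrange(w)
        delta = local_energy(x, y, i, j, -x[i][j]) - local_energy(x, y, i, j, x[i][j])
        # Always accept downhill moves; accept uphill moves with prob exp(-delta / t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            x[i][j] = -x[i][j]
    return x

def agreement(a, b):
    """Fraction of pixels on which two images agree."""
    total = sum(len(row) for row in a)
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total

# Toy test image: left half +1, right half -1, with roughly 10% of pixels flipped.
rng = random.Random(1)
clean = [[1 if j < 8 else -1 for j in range(16)] for _ in range(16)]
noisy = [[-p if rng.random() < 0.10 else p for p in row] for row in clean]
restored = anneal(noisy)
print(agreement(noisy, clean), agreement(restored, clean))
```

Accepting only moves with `delta <= 0` would turn the loop into a greedy, ICM-like update; the temperature-dependent acceptance of uphill moves is what lets annealing escape local minima of a non-convex energy.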

Title: DGNO: A Novel Physics-aware Neural Operator for Solving Forward and Inverse PDE Problems based on Deep, Generative Probabilistic Modeling

Authors: Yaohua Zang, Phaedon-Stelios Koutsourelakis
Subjects: cs.LG, math-ph
Abstract URL: https://arxiv.org/abs/2502.06250
Pdf URL: https://arxiv.org/pdf/2502.06250
Copy Paste: [[2502.06250]] DGNO: A Novel Physics-aware Neural Operator for Solving Forward and Inverse PDE Problems based on Deep, Generative Probabilistic Modeling(https://arxiv.org/abs/2502.06250)
Keywords: generative
Abstract: Solving parametric partial differential equations (PDEs) and associated PDE-based inverse problems is a central task in engineering and physics, yet existing neural operator methods struggle with high-dimensional, discontinuous inputs and require large amounts of labeled training data. We propose the Deep Generative Neural Operator (DGNO), a physics-aware framework that addresses these challenges by leveraging a deep, generative, probabilistic model in combination with a set of lower-dimensional, latent variables that simultaneously encode PDE-inputs and PDE-outputs. This formulation can make use of unlabeled data and significantly improves inverse problem-solving, particularly for discontinuous or discrete-valued input functions. DGNO enforces physics constraints without labeled data by incorporating, as virtual observables, weak-form residuals based on compactly supported radial basis functions (CSRBFs). These relax regularity constraints and eliminate higher-order derivatives from the objective function. We also introduce MultiONet, a novel neural operator architecture, which is a more expressive generalization of the popular DeepONet that significantly enhances the approximating power of the proposed model. These innovations make DGNO particularly effective for challenging forward and inverse PDE-based problems, such as those involving multi-phase media. Numerical experiments demonstrate that DGNO achieves higher accuracy across multiple benchmarks while exhibiting robustness to noise and strong generalization to out-of-distribution cases. Its adaptability, and the ability to handle sparse, noisy data while providing probabilistic estimates, make DGNO a powerful tool for scientific and engineering applications.

Title: Enhancing Ground-to-Aerial Image Matching for Visual Misinformation Detection Using Semantic Segmentation

Authors: Matteo Mule, Matteo Pannacci, Ali Ghasemi Goudarzi, Francesco Pro, Lorenzo Papa, Luca Maiano, Irene Amerini
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.06288
Pdf URL: https://arxiv.org/pdf/2502.06288
Copy Paste: [[2502.06288]] Enhancing Ground-to-Aerial Image Matching for Visual Misinformation Detection Using Semantic Segmentation(https://arxiv.org/abs/2502.06288)
Keywords: generative
Abstract: The recent advancements in generative AI techniques, which have significantly increased the online dissemination of altered images and videos, have raised serious concerns about the credibility of digital media available on the Internet and distributed through information channels and social networks. This issue particularly affects domains that rely heavily on trustworthy data, such as journalism, forensic analysis, and Earth observation. To address these concerns, the ability to geolocate a non-geo-tagged ground-view image without external information, such as GPS coordinates, has become increasingly critical. This study tackles the challenge of linking a ground-view image, potentially exhibiting varying fields of view (FoV), to its corresponding satellite image without the aid of GPS data. To achieve this, we propose a novel four-stream Siamese-like architecture, the Quadruple Semantic Align Net (SAN-QUAD), which extends previous state-of-the-art (SOTA) approaches by leveraging semantic segmentation applied to both ground and satellite imagery. Experimental results on a subset of the CVUSA dataset demonstrate significant improvements of up to 9.8% over prior methods across various FoV settings.

Title: UniDemoiré: Towards Universal Image Demoiréing with Data Generation and Synthesis

Authors: Zemin Yang, Yujing Sun, Xidong Peng, Siu Ming Yiu, Yuexin Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.06324
Pdf URL: https://arxiv.org/pdf/2502.06324
Copy Paste: [[2502.06324]] UniDemoiré: Towards Universal Image Demoiréing with Data Generation and Synthesis(https://arxiv.org/abs/2502.06324)
Keywords: restoration, generation
Abstract: Image demoiréing poses one of the most formidable challenges in image restoration, primarily due to the unpredictable and anisotropic nature of moiré patterns. Limited by the quantity and diversity of training data, current methods tend to overfit to a single moiré domain, resulting in performance degradation for new domains and restricting their robustness in real-world applications. In this paper, we propose a universal image demoiréing solution, UniDemoiré, which has superior generalization capability. Notably, we propose innovative and effective data generation and synthesis methods that can automatically provide vast high-quality moiré images to train a universal demoiréing model. Our extensive experiments demonstrate the cutting-edge performance and broad potential of our approach for generalized image demoiréing.

Title: LANTERN++: Enhanced Relaxed Speculative Decoding with Static Tree Drafting for Visual Auto-regressive Models

Authors: Sihwan Park, Doohyuk Jang, Sungyub Kim, Souvik Kundu, Eunho Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.06352
Pdf URL: https://arxiv.org/pdf/2502.06352
Copy Paste: [[2502.06352]] LANTERN++: Enhanced Relaxed Speculative Decoding with Static Tree Drafting for Visual Auto-regressive Models(https://arxiv.org/abs/2502.06352)
Keywords: generation
Abstract: Speculative decoding has been widely used to accelerate autoregressive (AR) text generation. However, its effectiveness in visual AR models remains limited due to token selection ambiguity, where multiple tokens receive similarly low probabilities, reducing acceptance rates. While dynamic tree drafting has been proposed to improve speculative decoding, we show that it fails to mitigate token selection ambiguity, resulting in shallow draft trees and suboptimal acceleration. To address this, we introduce LANTERN++, a novel framework that integrates static tree drafting with a relaxed acceptance condition, allowing drafts to be selected independently of low-confidence predictions. This enables deeper accepted sequences, improving decoding efficiency while preserving image quality. Extensive experiments on state-of-the-art visual AR models demonstrate that LANTERN++ significantly accelerates inference, achieving up to a $2.56\times$ speedup over standard AR decoding while maintaining high image quality.
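
To make the token selection ambiguity concrete, the sketch below compares the standard speculative-decoding acceptance probability, min(1, p/q), with one possible relaxation. The relaxed rule is an assumption made for illustration only: the abstract does not specify LANTERN++'s acceptance condition, so the code simply pools target probability over a hypothetical set of tokens treated as interchangeable with the drafted one. All names and numbers are hypothetical.

```python
def accept_prob(p_target, q_draft, token):
    """Standard speculative-decoding acceptance probability min(1, p/q)."""
    return min(1.0, p_target[token] / q_draft[token])

def relaxed_accept_prob(p_target, q_draft, token, neighbors):
    """Hypothetical relaxation: pool target mass over interchangeable tokens."""
    pooled = p_target[token] + sum(p_target[n] for n in neighbors.get(token, ()))
    return min(1.0, pooled / q_draft[token])

# The target model spreads its mass over several similar visual tokens,
# while the draft model is confident about token 0.
p_target = [0.26, 0.25, 0.25, 0.24]
q_draft = [0.70, 0.10, 0.10, 0.10]
neighbors = {0: [1, 2]}  # assumed: tokens 1 and 2 look nearly identical to token 0

strict = accept_prob(p_target, q_draft, 0)
relaxed = relaxed_accept_prob(p_target, q_draft, 0, neighbors)
print(strict, relaxed)
```

With the strict rule the drafted token is rejected most of the time even though the target model considers it about as good as any alternative; pooling raises the acceptance probability. Any such relaxation means the output is no longer sampled exactly from the target distribution, which is the trade-off a relaxed scheme has to control.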

Title: Solving Linear-Gaussian Bayesian Inverse Problems with Decoupled Diffusion Sequential Monte Carlo

Authors: Filip Ekström Kelvinius, Zheng Zhao, Fredrik Lindsten
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2502.06379
Pdf URL: https://arxiv.org/pdf/2502.06379
Copy Paste: [[2502.06379]] Solving Linear-Gaussian Bayesian Inverse Problems with Decoupled Diffusion Sequential Monte Carlo(https://arxiv.org/abs/2502.06379)
Keywords: generative
Abstract: A recent line of research has exploited pre-trained generative diffusion models as priors for solving Bayesian inverse problems. We contribute to this research direction by designing a sequential Monte Carlo method for linear-Gaussian inverse problems which builds on "decoupled diffusion", where the generative process is designed such that larger updates to the sample are possible. The method is asymptotically exact and we demonstrate the effectiveness of our Decoupled Diffusion Sequential Monte Carlo (DDSMC) algorithm on both synthetic data and image reconstruction tasks. Further, we demonstrate how the approach can be extended to discrete data.
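
For readers unfamiliar with the term, a linear-Gaussian inverse problem observes y = a·x + noise with Gaussian noise. The sketch below is a toy scalar case and not the DDSMC algorithm: it replaces the diffusion prior with a plain Gaussian prior (an assumption made purely for illustration), in which case the posterior is available in closed form. The function name and the numbers are hypothetical.

```python
def gaussian_posterior(prior_mean, prior_var, a, noise_var, y):
    """Posterior of x given y = a * x + e, with x ~ N(prior_mean, prior_var)
    and e ~ N(0, noise_var). Returns (posterior mean, posterior variance)."""
    precision = 1.0 / prior_var + a * a / noise_var
    post_var = 1.0 / precision
    post_mean = post_var * (prior_mean / prior_var + a * y / noise_var)
    return post_mean, post_var

# Prior N(0, 1), forward operator a = 2, unit noise variance, observation y = 4.
mean, var = gaussian_posterior(0.0, 1.0, 2.0, 1.0, 4.0)
print(mean, var)
```

With a diffusion model as the prior no such closed form exists for the full posterior, which is why sampling schemes such as sequential Monte Carlo are used instead.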

Title: How Humans Help LLMs: Assessing and Incentivizing Human Preference Annotators

Authors: Shang Liu, Hanzhao Wang, Zhongyao Ma, Xiaocheng Li
Subjects: cs.LG, cs.GT, econ.TH
Abstract URL: https://arxiv.org/abs/2502.06387
Pdf URL: https://arxiv.org/pdf/2502.06387
Copy Paste: [[2502.06387]] How Humans Help LLMs: Assessing and Incentivizing Human Preference Annotators(https://arxiv.org/abs/2502.06387)
Keywords: quality assessment
Abstract: Human-annotated preference data play an important role in aligning large language models (LLMs). In this paper, we investigate the questions of assessing the performance of human annotators and incentivizing them to provide high-quality annotations. The quality assessment of language/text annotation faces two challenges: (i) the intrinsic heterogeneity among annotators, which rules out classic methods that assume the existence of an underlying true label; and (ii) the unclear relationship between the annotation quality and the performance of downstream tasks, which excludes the possibility of inferring the annotators' behavior from the performance of models trained on the annotation data. We then formulate a principal-agent model to characterize the behaviors of, and the interactions between, the company and the human annotators. The model rationalizes a practical bonus scheme for incentivizing annotators that benefits both parties, and it underscores the importance of the joint presence of an assessment system and a proper contract scheme. From a technical perspective, our analysis extends the existing literature on the principal-agent model by considering a continuous action space for the agent. We show that the gap between the first-best and the second-best solutions (under the continuous action space) is $\Theta(1/\sqrt{n \log n})$ for the binary contracts and $\Theta(1/n)$ for the linear contracts, where $n$ is the number of samples used for performance assessment; this contrasts with the known result of $\exp(-\Theta(n))$ for the binary contracts when the action space is discrete. Throughout the paper, we use real preference annotation data to accompany our discussions.

Title: TANGLED: Generating 3D Hair Strands from Images with Arbitrary Styles and Viewpoints

Authors: Pengyu Long, Zijun Zhao, Min Ouyang, Qingcheng Zhao, Qixuan Zhang, Wei Yang, Lan Xu, Jingyi Yu
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2502.06392
Pdf URL: https://arxiv.org/pdf/2502.06392
Copy Paste: [[2502.06392]] TANGLED: Generating 3D Hair Strands from Images with Arbitrary Styles and Viewpoints(https://arxiv.org/abs/2502.06392)
Keywords: generation
Abstract: Hairstyles are intricate and culturally significant with various geometries, textures, and structures. Existing text or image-guided generation methods fail to handle the richness and complexity of diverse styles. We present TANGLED, a novel approach for 3D hair strand generation that accommodates diverse image inputs across styles, viewpoints, and quantities of input views. TANGLED employs a three-step pipeline. First, our MultiHair Dataset provides 457 diverse hairstyles annotated with 74 attributes, emphasizing complex and culturally significant styles to improve model generalization. Second, we propose a diffusion framework conditioned on multi-view linearts that can capture topological cues (e.g., strand density and parting lines) while filtering out noise. By leveraging a latent diffusion model with cross-attention on lineart features, our method achieves flexible and robust 3D hair generation across diverse input conditions. Third, a parametric post-processing module enforces braid-specific constraints to maintain coherence in complex structures. This framework not only advances hairstyle realism and diversity but also enables culturally inclusive digital avatars and novel applications like sketch-based 3D strand editing for animation and augmented reality.

Title: Hybrid State-Space and GRU-based Graph Tokenization Mamba for Hyperspectral Image Classification

Authors: Muhammad Ahmad, Muhammad Hassaan Farooq Butt, Muhammad Usama, Manuel Mazzara, Salvatore Distefano, Adil Mehmood Khan, Danfeng Hong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.06427
Pdf URL: https://arxiv.org/pdf/2502.06427
Copy Paste: [[2502.06427]] Hybrid State-Space and GRU-based Graph Tokenization Mamba for Hyperspectral Image Classification(https://arxiv.org/abs/2502.06427)
Keywords: generation
Abstract: Hyperspectral image (HSI) classification plays a pivotal role in domains such as environmental monitoring, agriculture, and urban planning. However, it faces significant challenges due to the high-dimensional nature of the data and the complex spectral-spatial relationships inherent in HSI. Traditional methods, including conventional machine learning and convolutional neural networks (CNNs), often struggle to effectively capture these intricate spectral-spatial features and global contextual information. Transformer-based models, while powerful in capturing long-range dependencies, often demand substantial computational resources, posing challenges in scenarios where labeled datasets are limited, as is commonly seen in HSI applications. To overcome these challenges, this work proposes GraphMamba, a hybrid model that combines spectral-spatial token generation, graph-based token prioritization, and cross-attention mechanisms. The model introduces a novel hybridization of state-space modeling and Gated Recurrent Units (GRU), capturing both linear and nonlinear spatial-spectral dynamics. GraphMamba enhances the ability to model complex spatial-spectral relationships while maintaining scalability and computational efficiency across diverse HSI datasets. Through comprehensive experiments, we demonstrate that GraphMamba outperforms existing state-of-the-art models, offering a scalable and robust solution for complex HSI classification tasks.

Title: FCVSR: A Frequency-aware Method for Compressed Video Super-Resolution

Authors: Qiang Zhu, Fan Zhang, Feiyu Chen, Shuyuan Zhu, David Bull, Bing Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.06431
Pdf URL: https://arxiv.org/pdf/2502.06431
Copy Paste: [[2502.06431]] FCVSR: A Frequency-aware Method for Compressed Video Super-Resolution(https://arxiv.org/abs/2502.06431)
Keywords: super-resolution
Abstract: Compressed video super-resolution (SR) aims to generate high-resolution (HR) videos from the corresponding low-resolution (LR) compressed videos. Recently, some compressed video SR methods attempt to exploit the spatio-temporal information in the frequency domain, showing great promise in super-resolution performance. However, these methods do not differentiate various frequency subbands spatially or capture the temporal frequency dynamics, potentially leading to suboptimal results. In this paper, we propose a deep frequency-based compressed video SR model (FCVSR) consisting of a motion-guided adaptive alignment (MGAA) network and a multi-frequency feature refinement (MFFR) module. Additionally, a frequency-aware contrastive loss is proposed for training FCVSR, in order to reconstruct finer spatial details. The proposed model has been evaluated on three public compressed video super-resolution datasets, with results demonstrating its effectiveness when compared to existing works in terms of super-resolution performance (up to a 0.14dB gain in PSNR over the second-best model) and complexity.

Title: Prompt-SID: Learning Structural Representation Prompt via Latent Diffusion for Single-Image Denoising

Authors: Huaqiu Li, Wang Zhang, Xiaowan Hu, Tao Jiang, Zikang Chen, Haoqian Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.06432
Pdf URL: https://arxiv.org/pdf/2502.06432
Copy Paste: [[2502.06432]] Prompt-SID: Learning Structural Representation Prompt via Latent Diffusion for Single-Image Denoising(https://arxiv.org/abs/2502.06432)
Keywords: generation
Abstract: Many studies have concentrated on constructing supervised models utilizing paired datasets for image denoising, which proves to be expensive and time-consuming. Current self-supervised and unsupervised approaches typically rely on blind-spot networks or sub-image pair sampling, resulting in pixel information loss and destruction of detailed structural information, thereby significantly constraining the efficacy of such methods. In this paper, we introduce Prompt-SID, a prompt-learning-based single image denoising framework that emphasizes preserving structural details. This approach is trained in a self-supervised manner using downsampled image pairs. It captures original-scale image information through structural encoding and integrates this prompt into the denoiser. To achieve this, we propose a structural representation generation model based on the latent diffusion process and design a structural attention module within the transformer-based denoiser architecture to decode the prompt. Additionally, we introduce a scale replay training mechanism, which effectively mitigates the scale gap between images of different resolutions. We conduct comprehensive experiments on synthetic, real-world, and fluorescence imaging datasets, showcasing the remarkable effectiveness of Prompt-SID.

Title: Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images

Authors: Lingao Xiao, Songhua Liu, Yang He, Xinchao Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2502.06434
Pdf URL: https://arxiv.org/pdf/2502.06434
Copy Paste: [[2502.06434]] Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images(https://arxiv.org/abs/2502.06434)
Keywords: generation
Abstract: Dataset distillation and dataset pruning are two prominent techniques for compressing datasets to improve computational and storage efficiency. Despite their overlapping objectives, these approaches are rarely compared directly. Even within each field, the evaluation protocols are inconsistent across various methods, which complicates fair comparisons and hinders reproducibility. Considering these limitations, we introduce in this paper a benchmark that equitably evaluates methodologies across both distillation and pruning literatures. Notably, our benchmark reveals that in the mainstream dataset distillation setting for large-scale datasets, which heavily rely on soft labels from pre-trained models, even randomly selected subsets can achieve surprisingly competitive performance. This finding suggests that an overemphasis on soft labels may be diverting attention from the intrinsic value of the image data, while also imposing additional burdens in terms of generation, storage, and application. To address these issues, we propose a new framework for dataset compression, termed Prune, Combine, and Augment (PCA), which focuses on leveraging image data exclusively, relies solely on hard labels for evaluation, and achieves state-of-the-art performance in this setup. By shifting the emphasis back to the images, our benchmark and PCA framework pave the way for more balanced and accessible techniques in dataset compression research. Our code is available at: this https URL

Title: Low-dimensional Functions are Efficiently Learnable under Randomly Biased Distributions

Authors: Elisabetta Cornacchia, Dan Mikulincer, Elchanan Mossel
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2502.06443
Pdf URL: https://arxiv.org/pdf/2502.06443
Copy Paste: [[2502.06443]] Low-dimensional Functions are Efficiently Learnable under Randomly Biased Distributions(https://arxiv.org/abs/2502.06443)
Keywords: generative
Abstract: The problem of learning single index and multi index models has gained significant interest as a fundamental task in high-dimensional statistics. Many recent works have analysed gradient-based methods, particularly in the setting of isotropic data distributions, often in the context of neural network training. Such studies have uncovered precise characterisations of algorithmic sample complexity in terms of certain analytic properties of the target function, such as the leap, information, and generative exponents. These properties establish a quantitative separation between low and high complexity learning tasks. In this work, we show that high complexity cases are rare. Specifically, we prove that introducing a small random perturbation to the data distribution--via a random shift in the first moment--renders any Gaussian single index model as easy to learn as a linear function. We further extend this result to a class of multi index models, namely sparse Boolean functions, also known as Juntas.

Title: UniMoD: Efficient Unified Multimodal Transformers with Mixture-of-Depths

Authors: Weijia Mao, Zhenheng Yang, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.06474
Pdf URL: https://arxiv.org/pdf/2502.06474
Copy Paste: [[2502.06474]] UniMoD: Efficient Unified Multimodal Transformers with Mixture-of-Depths(https://arxiv.org/abs/2502.06474)
Keywords: generation
Abstract: Unified multimodal transformers, which handle both generation and understanding tasks within a shared parameter space, have received increasing attention in recent research. Although various unified transformers have been proposed, training these models is costly due to redundant tokens and heavy attention computation. In the past, studies on large language models have demonstrated that token pruning methods, such as Mixture of Depths (MoD), can significantly improve computational efficiency. MoD employs a router to select the most important tokens for processing within a transformer layer. However, directly applying MoD-based token pruning to unified transformers will result in suboptimal performance because different tasks exhibit varying levels of token redundancy. In our work, we analyze the unified transformers by (1) examining attention weight patterns, (2) evaluating the layer importance and token redundancy, and (3) analyzing task interactions. Our findings reveal that token redundancy is primarily influenced by different tasks and layers. Building on these findings, we introduce UniMoD, a task-aware token pruning method that employs a separate router for each task to determine which tokens should be pruned. We apply our method to Show-o and Emu3, reducing training FLOPs by approximately 15% in Show-o and 40% in Emu3, while maintaining or improving performance on several benchmarks. Code will be released at this https URL.
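The routing idea in this abstract can be sketched in a few lines. The following is a simplified illustration, not the released UniMoD code: each task gets its own linear router, the router keeps the top-scoring fraction of tokens for a layer, and the remaining tokens bypass the layer unchanged. The router weights, keep ratios, task names, and the stand-in `layer_fn` are all hypothetical.

```python
import numpy as np

def route_tokens(hidden, router_weights, keep_ratio):
    """Return indices of the top-scoring tokens for one layer, in sequence order."""
    scores = hidden @ router_weights                      # one scalar score per token
    num_keep = max(1, int(round(keep_ratio * hidden.shape[0])))
    return np.sort(np.argsort(scores)[-num_keep:])        # top-k by score

def mod_layer(hidden, layer_fn, router_weights, keep_ratio):
    """Apply layer_fn to routed tokens only; all other tokens pass through unchanged."""
    keep_idx = route_tokens(hidden, router_weights, keep_ratio)
    output = hidden.copy()
    output[keep_idx] = hidden[keep_idx] + layer_fn(hidden[keep_idx])  # residual update
    return output, keep_idx

rng = np.random.default_rng(0)
dim = 8
# Hypothetical per-task routers and keep ratios (task-aware pruning).
routers = {"generation": rng.normal(size=dim), "understanding": rng.normal(size=dim)}
keep_ratios = {"generation": 0.5, "understanding": 0.25}
tokens = rng.normal(size=(16, dim))

def layer_fn(x):
    """Stand-in for the attention + MLP block of a transformer layer."""
    return 0.1 * x

outputs = {
    task: mod_layer(tokens, layer_fn, routers[task], keep_ratios[task])
    for task in routers
}
```

Using one router per task lets the pruning rate differ between generation and understanding, which is the property the abstract attributes to task-dependent token redundancy.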

Title: Image Intrinsic Scale Assessment: Bridging the Gap Between Quality and Resolution

Authors: Vlad Hosu, Lorenzo Agnolucci, Daisuke Iso, Dietmar Saupe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.06476
Pdf URL: https://arxiv.org/pdf/2502.06476
Copy Paste: [[2502.06476]] Image Intrinsic Scale Assessment: Bridging the Gap Between Quality and Resolution(https://arxiv.org/abs/2502.06476)
Keywords: quality assessment
Abstract: Image Quality Assessment (IQA) measures and predicts perceived image quality by human observers. Although recent studies have highlighted the critical influence that variations in the scale of an image have on its perceived quality, this relationship has not been systematically quantified. To bridge this gap, we introduce the Image Intrinsic Scale (IIS), defined as the largest scale where an image exhibits its highest perceived quality. We also present the Image Intrinsic Scale Assessment (IISA) task, which involves subjectively measuring and predicting the IIS based on human judgments. We develop a subjective annotation methodology and create the IISA-DB dataset, comprising 785 image-IIS pairs annotated by experts in a rigorously controlled crowdsourcing study. Furthermore, we propose WIISA (Weak-labeling for Image Intrinsic Scale Assessment), a strategy that leverages how the IIS of an image varies with downscaling to generate weak labels. Experiments show that applying WIISA during the training of several IQA methods adapted for IISA consistently improves the performance compared to using only ground-truth labels. We will release the code, dataset, and pre-trained models upon acceptance.

Title: Model-Based Offline Reinforcement Learning with Reliability-Guaranteed Sequence Modeling

Authors: Shenghong He
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.06491
Pdf URL: https://arxiv.org/pdf/2502.06491
Copy Paste: [[2502.06491]] Model-Based Offline Reinforcement Learning with Reliability-Guaranteed Sequence Modeling(https://arxiv.org/abs/2502.06491)
Keywords: generation
Abstract: Model-based offline reinforcement learning (MORL) aims to learn a policy by exploiting a dynamics model derived from an existing dataset. Applying conservative quantification to the dynamics model, most existing works on MORL generate trajectories that approximate the real data distribution to facilitate policy learning by using current information (e.g., the state and action at time step $t$). However, these works neglect the impact of historical information on environmental dynamics, leading to the generation of unreliable trajectories that may not align with the real data distribution. In this paper, we propose a new MORL algorithm, Reliability-guaranteed Transformer (RT), which can eliminate unreliable trajectories by calculating the cumulative reliability of the generated trajectory (i.e., using a weighted variational distance away from the real data). Moreover, by sampling candidate actions with high rewards, RT can efficiently generate high-return trajectories from the existing offline data. We theoretically prove the performance guarantees of RT in policy learning, and empirically demonstrate its effectiveness against state-of-the-art model-based methods on several benchmark tasks.

Title: Boost-and-Skip: A Simple Guidance-Free Diffusion for Minority Generation

Authors: Soobin Um, Beomsu Kim, Jong Chul Ye
Subjects: cs.LG, cs.AI, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2502.06516
Pdf URL: https://arxiv.org/pdf/2502.06516
Copy Paste: [[2502.06516]] Boost-and-Skip: A Simple Guidance-Free Diffusion for Minority Generation(https://arxiv.org/abs/2502.06516)
Keywords: generation, generative
Abstract: Minority samples are underrepresented instances located in low-density regions of a data manifold, and are valuable in many generative AI applications, such as data augmentation and creative content generation. Unfortunately, existing diffusion-based minority generators often rely on computationally expensive guidance dedicated to minority generation. To address this, here we present a simple yet powerful guidance-free approach called Boost-and-Skip for generating minority samples using diffusion models. The key advantage of our framework is that it requires only two minimal changes to standard generative processes: (i) variance-boosted initialization and (ii) timestep skipping. We highlight that these seemingly trivial modifications are supported by solid theoretical and empirical evidence, thereby effectively promoting the emergence of underrepresented minority features. Our comprehensive experiments demonstrate that Boost-and-Skip greatly enhances the capability of generating minority samples, even rivaling guidance-based state-of-the-art approaches while requiring significantly fewer computations.
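The two changes named in this abstract can be sketched against a generic reverse-diffusion loop. This is one plausible reading, not the authors' implementation: the initial noise is drawn with a standard deviation larger than 1, and the first few (noisiest) reverse steps are dropped from the schedule. The function names, the `boost` and `skip` values, the timestep convention, and the toy `denoise_step` are all assumptions.

```python
import numpy as np

def boost_and_skip_setup(shape, num_steps, boost=2.0, skip=5, seed=0):
    """Build a variance-boosted initial state and a timestep schedule with early steps skipped."""
    rng = np.random.default_rng(seed)
    x_init = boost * rng.standard_normal(shape)             # std raised from 1 to `boost`
    timesteps = list(range(num_steps - 1 - skip, -1, -1))   # omit the `skip` noisiest steps
    return x_init, timesteps

def sample(denoise_step, shape, num_steps, boost=2.0, skip=5, seed=0):
    """Run a reverse process from the boosted start over the shortened schedule."""
    x, timesteps = boost_and_skip_setup(shape, num_steps, boost, skip, seed)
    for t in timesteps:
        x = denoise_step(x, t)                               # any standard reverse update
    return x

def toy_denoise_step(x, t):
    """Toy stand-in for one reverse-diffusion update of a trained model."""
    return 0.9 * x

x_init, timesteps = boost_and_skip_setup((10000,), num_steps=50, boost=2.0, skip=5)
result = sample(toy_denoise_step, (10000,), num_steps=50, boost=2.0, skip=5)
```

Neither change touches the denoiser itself, which is consistent with the abstract's claim that the method avoids the extra cost of guidance.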

Title: CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers

Authors: D. She, Mushui Liu, Jingxuan Pang, Jin Wang, Zhen Yang, Wanggui He, Guanghao Zhang, Yi Wang, Qihan Huang, Haobin Tang, Yunlong Yu, Siming Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.06527
Pdf URL: https://arxiv.org/pdf/2502.06527
Copy Paste: [[2502.06527]] CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers(https://arxiv.org/abs/2502.06527)
Keywords: generation
Abstract: Customized generation has achieved significant progress in image synthesis, yet personalized video generation remains challenging due to temporal inconsistencies and quality degradation. In this paper, we introduce CustomVideoX, an innovative framework leveraging the video diffusion transformer for personalized video generation from a reference image. CustomVideoX capitalizes on pre-trained video networks by exclusively training the LoRA parameters to extract reference features, ensuring both efficiency and adaptability. To facilitate seamless interaction between the reference image and video content, we propose 3D Reference Attention, which enables direct and simultaneous engagement of reference image features with all video frames across spatial and temporal dimensions. To mitigate the excessive influence of reference image features and textual guidance on generated video content during inference, we implement the Time-Aware Reference Attention Bias (TAB) strategy, dynamically modulating reference bias over different time steps. Additionally, we introduce the Entity Region-Aware Enhancement (ERAE) module, aligning highly activated regions of key entity tokens with reference feature injection by adjusting attention bias. To thoroughly evaluate personalized video generation, we establish a new benchmark, VideoBench, comprising over 50 objects and 100 prompts for extensive assessment. Experimental results show that CustomVideoX significantly outperforms existing methods in terms of video consistency and quality.

Title: Dimension-free Regret for Learning Asymmetric Linear Dynamical Systems

Authors: Annie Marsden, Elad Hazan
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2502.06545
Pdf URL: https://arxiv.org/pdf/2502.06545
Copy Paste: [[2502.06545]] Dimension-free Regret for Learning Asymmetric Linear Dynamical Systems(https://arxiv.org/abs/2502.06545)
Keywords: generative
Abstract: Previously, methods for learning marginally stable linear dynamical systems either required the transition matrix to be symmetric or incurred regret bounds that scale polynomially with the system's hidden dimension. In this work, we introduce a novel method that overcomes this trade-off, achieving dimension-free regret despite the presence of asymmetric matrices and marginal stability. Our method combines spectral filtering with linear predictors and employs Chebyshev polynomials in the complex plane to construct a novel spectral filtering basis. This construction guarantees sublinear regret in an online learning framework, without relying on any statistical or generative assumptions. Specifically, we prove that as long as the transition matrix has eigenvalues with complex component bounded by $1/\mathrm{poly} \log T$, then our method achieves regret $\tilde{O}(T^{9/10})$ when compared to the best linear dynamical predictor in hindsight.

Title: Diffusion Models for Computational Neuroimaging: A Survey

Authors: Haokai Zhao, Haowei Lou, Lina Yao, Wei Peng, Ehsan Adeli, Kilian M Pohl, Yu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.06552
Pdf URL: https://arxiv.org/pdf/2502.06552
Copy Paste: [[2502.06552]] Diffusion Models for Computational Neuroimaging: A Survey(https://arxiv.org/abs/2502.06552)
Keywords: generation
Abstract: Computational neuroimaging involves analyzing brain images or signals to provide mechanistic insights and predictive tools for human cognition and behavior. While diffusion models have shown stability and high-quality generation in natural images, there is increasing interest in adapting them to analyze brain data for various neurological tasks such as data enhancement, disease diagnosis, and brain decoding. This survey provides an overview of recent efforts to integrate diffusion models into computational neuroimaging. We begin by introducing the common neuroimaging data modalities, followed by the diffusion formulations and conditioning mechanisms. We then discuss how variations in the denoising starting point, condition input, and generation target of diffusion models are developed and how they enhance specific neuroimaging tasks. For a comprehensive overview of the ongoing research, we provide a publicly available repository at this https URL.

Title: Is API Access to LLMs Useful for Generating Private Synthetic Tabular Data?

Authors: Marika Swanberg, Ryan McKenna, Edo Roth, Albert Cheu, Peter Kairouz
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2502.06555
Pdf URL: https://arxiv.org/pdf/2502.06555
Copy Paste: [[2502.06555]] Is API Access to LLMs Useful for Generating Private Synthetic Tabular Data?(https://arxiv.org/abs/2502.06555)
Keywords: generation
Abstract: Differentially private (DP) synthetic data is a versatile tool for enabling the analysis of private data. Recent advancements in large language models (LLMs) have inspired a number of algorithmic techniques for improving DP synthetic data generation. One family of approaches uses DP finetuning on the foundation model weights; however, the model weights for state-of-the-art models may not be public. In this work we propose two DP synthetic tabular data algorithms that only require API access to the foundation model. We adapt the Private Evolution algorithm (Lin et al., 2023; Xie et al., 2024) -- which was designed for image and text data -- to the tabular data domain. In our extension of Private Evolution, we define a query workload-based distance measure, which may be of independent interest. We propose a family of algorithms that use one-shot API access to LLMs, rather than adaptive queries to the LLM. Our findings reveal that API access to powerful LLMs does not always improve the quality of DP synthetic data compared to established baselines that operate without such access. We provide insights into the underlying reasons and propose improvements to LLMs that could make them more effective for this application.

Title: A Large-scale AI-generated Image Inpainting Benchmark

Authors: Paschalis Giakoumoglou, Dimitrios Karageorgiou, Symeon Papadopoulos, Panagiotis C. Petrantonakis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.06593
Pdf URL: https://arxiv.org/pdf/2502.06593
Copy Paste: [[2502.06593]] A Large-scale AI-generated Image Inpainting Benchmark(https://arxiv.org/abs/2502.06593)
Keywords: generative
Abstract: Recent advances in generative models enable highly realistic image manipulations, creating an urgent need for robust forgery detection methods. Current datasets for training and evaluating these methods are limited in scale and diversity. To address this, we propose a methodology for creating high-quality inpainting datasets and apply it to create DiQuID, comprising over 95,000 inpainted images generated from 78,000 original images sourced from MS-COCO, RAISE, and OpenImages. Our methodology consists of three components: (1) Semantically Aligned Object Replacement (SAOR) that identifies suitable objects through instance segmentation and generates contextually appropriate prompts, (2) Multiple Model Image Inpainting (MMII) that employs various state-of-the-art inpainting pipelines primarily based on diffusion models to create diverse manipulations, and (3) Uncertainty-Guided Deceptiveness Assessment (UGDA) that evaluates image realism through comparative analysis with originals. The resulting dataset surpasses existing ones in diversity, aesthetic quality, and technical quality. We provide comprehensive benchmarking results using state-of-the-art forgery detection methods, demonstrating the dataset's effectiveness in evaluating and improving detection algorithms. Through a human study with 42 participants on 1,000 images, we show that while humans struggle with images classified as deceiving by our methodology, models trained on our dataset maintain high performance on these challenging cases. Code and dataset are available at this https URL.

Title: TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models

Authors: Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, Yan-Pei Cao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.06608
Pdf URL: https://arxiv.org/pdf/2502.06608
Copy Paste: [[2502.06608]] TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models(https://arxiv.org/abs/2502.06608)
Keywords: generation, generative
Abstract: Recent advancements in diffusion techniques have propelled image and video generation to unprecedented levels of quality, significantly accelerating the deployment and application of generative AI. However, 3D shape generation technology has so far lagged behind, constrained by limitations in 3D data scale, complexity of 3D data processing, and insufficient exploration of advanced techniques in the 3D domain. Current approaches to 3D shape generation face substantial challenges in terms of output quality, generalization capability, and alignment with input conditions. We present TripoSG, a new streamlined shape diffusion paradigm capable of generating high-fidelity 3D meshes with precise correspondence to input images. Specifically, we propose: 1) A large-scale rectified flow transformer for 3D shape generation, achieving state-of-the-art fidelity through training on extensive, high-quality data. 2) A hybrid supervised training strategy combining SDF, normal, and eikonal losses for 3D VAE, achieving high-quality 3D reconstruction performance. 3) A data processing pipeline to generate 2 million high-quality 3D samples, highlighting the crucial rules for data quality and quantity in training 3D generative models. Through comprehensive experiments, we have validated the effectiveness of each component in our new framework. The seamless integration of these parts has enabled TripoSG to achieve state-of-the-art performance in 3D shape generation. The resulting 3D shapes exhibit enhanced detail due to high-resolution capabilities and demonstrate exceptional fidelity to input images. Moreover, TripoSG demonstrates improved versatility in generating 3D models from diverse image styles and contents, showcasing strong generalization capabilities. To foster progress and innovation in the field of 3D generation, we will make our model publicly available.

Title: Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language

Authors: Zhiqiang Zhong, Simon Sataa-Yu Larsen, Haoyu Guo, Tao Tang, Kuangyu Zhou, Davide Mottin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.06634
Pdf URL: https://arxiv.org/pdf/2502.06634
Copy Paste: [[2502.06634]] Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language(https://arxiv.org/abs/2502.06634)
Keywords: generation
Abstract: Recent advancements in AI for biological research focus on integrating molecular data with natural language to accelerate drug discovery. However, the scarcity of high-quality annotations limits progress in this area. This paper introduces LA$^3$, a Language-based Automatic Annotation Augmentation framework that leverages large language models to augment existing datasets, thereby improving AI training. We demonstrate the effectiveness of LA$^3$ by creating an enhanced dataset, LaChEBI-20, where we systematically rewrite the annotations of molecules from an established dataset. These rewritten annotations preserve essential molecular information while providing more varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5 based on a benchmark architecture to learn the mapping between molecular representations and augmented annotations. Experimental results on text-based *de novo* molecule generation and molecule captioning demonstrate that LaMolT5 outperforms state-of-the-art models. Notably, incorporating LA$^3$ leads to improvements of up to 301% over the benchmark architecture. Furthermore, we validate the effectiveness of LA$^3$ in notable applications across *image*, *text*, and *graph* tasks, affirming its versatility and utility.

Title: MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing

Authors: Seokjin Go, Divya Mahajan
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2502.06643
Pdf URL: https://arxiv.org/pdf/2502.06643
Copy Paste: [[2502.06643]] MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing(https://arxiv.org/abs/2502.06643)
Keywords: generation
Abstract: The Mixture-of-Experts (MoE) model architecture has emerged as a promising solution for scaling transformer models efficiently, offering sparse activation that reduces computational costs while increasing model capacity. However, as MoE models scale, they need to be distributed across GPU devices and thus face critical performance bottlenecks due to their large memory footprint. Expert parallelism distributes experts across GPUs but faces key challenges, including unbalanced token routing and expert activation, resulting in communication tail latency and processing inefficiencies. While existing solutions address some of these issues, they fail to resolve the dual challenges of load imbalance and communication skew. The imbalance in token processing load across experts causes uneven processing times on different GPUs, while communication skew between GPUs leads to unbalanced inter-GPU data transfers. These factors degrade the performance of MoE models by increasing tail latency and reducing overall throughput. To address these limitations, we propose an Integer Linear Programming (ILP) formulation to optimize expert placement by jointly considering token load, communication, and computation costs. We exploit the property that there is a token routing dependency across layers, where tokens routed to a specific expert in one layer are likely to be routed to a limited set of experts in the subsequent layer. Our solution, MoETuner, offers an optimal expert-to-GPU assignment that minimizes inter-GPU token routing costs and balances token processing across devices, thereby reducing tail latency and end-to-end execution time. Experimental results demonstrate end-to-end speedups of 9.3% and 17.5% for single-node and multi-node inference, respectively, showcasing the potential of our ILP-based optimization for offering expert parallel solutions for next-generation MoEs.
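The placement objective described here can be illustrated on a toy instance. The sketch below is not MoETuner's ILP: it swaps the ILP solver for exhaustive search over balanced two-GPU placements, and it optimizes only the cross-GPU token count between two adjacent layers. The traffic matrix, the fixed placement of the earlier layer, and the function names are hypothetical.

```python
from itertools import combinations

def cross_gpu_tokens(place_prev, place_next, traffic):
    """Count tokens that must move between GPUs from one layer to the next."""
    return sum(
        traffic[i][j]
        for i in range(len(place_prev))
        for j in range(len(place_next))
        if place_prev[i] != place_next[j]
    )

def best_next_layer_placement(place_prev, traffic):
    """Search every balanced two-GPU placement of the next layer's experts."""
    num_experts = len(traffic[0])
    best_cost, best_place = None, None
    for gpu0_experts in combinations(range(num_experts), num_experts // 2):
        place_next = tuple(0 if j in gpu0_experts else 1 for j in range(num_experts))
        cost = cross_gpu_tokens(place_prev, place_next, traffic)
        if best_cost is None or cost < best_cost:
            best_cost, best_place = cost, place_next
    return best_place, best_cost

# traffic[i][j]: tokens routed from expert i in one layer to expert j in the next layer.
# The strong diagonal mimics the cross-layer routing dependency noted in the abstract.
traffic = [
    [90, 5, 3, 2],
    [4, 80, 10, 6],
    [2, 8, 85, 5],
    [6, 3, 7, 84],
]
place_prev = (0, 0, 1, 1)   # experts 0 and 1 on GPU 0, experts 2 and 3 on GPU 1
placement, cost = best_next_layer_placement(place_prev, traffic)
print(placement, cost)
```

Exhaustive search only scales to a handful of experts; an ILP formulation such as the one the abstract describes would encode the same objective together with load and computation terms and hand it to a solver.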

Title: Transfer Your Perspective: Controllable 3D Generation from Any Viewpoint in a Driving Scene

Authors: Tai-Yu Pan, Sooyoung Jeon, Mengdi Fan, Jinsu Yoo, Zhenyang Feng, Mark Campbell, Kilian Q. Weinberger, Bharath Hariharan, Wei-Lun Chao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.06682
Pdf URL: https://arxiv.org/pdf/2502.06682
Copy Paste: [[2502.06682]] Transfer Your Perspective: Controllable 3D Generation from Any Viewpoint in a Driving Scene(https://arxiv.org/abs/2502.06682)
Keywords: generation
Abstract: Self-driving cars relying solely on ego-centric perception face limitations in sensing, often failing to detect occluded, faraway objects. Collaborative autonomous driving (CAV) seems like a promising direction, but collecting data for development is non-trivial. It requires placing multiple sensor-equipped agents in a real-world driving scene, simultaneously! As such, existing datasets are limited in locations and agents. We introduce a novel surrogate to the rescue, which is to generate realistic perception from different viewpoints in a driving scene, conditioned on a real-world sample - the ego-car's sensory data. This surrogate has huge potential: it could potentially turn any ego-car dataset into a collaborative driving one to scale up the development of CAV. We present the very first solution, using a combination of simulated collaborative data and real ego-car data. Our method, Transfer Your Perspective (TYP), learns a conditioned diffusion model whose output samples are not only realistic but also consistent in both semantics and layouts with the given ego-car data. Empirical results demonstrate TYP's effectiveness in aiding in a CAV setting. In particular, TYP enables us to (pre-)train collaborative perception algorithms like early and late fusion with little or no real-world collaborative data, greatly facilitating downstream CAV applications.

Title: No Trick, No Treat: Pursuits and Challenges Towards Simulation-free Training of Neural Samplers

Authors: Jiajun He, Yuanqi Du, Francisco Vargas, Dinghuai Zhang, Shreyas Padhy, RuiKang OuYang, Carla Gomes, José Miguel Hernández-Lobato
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2502.06685
Pdf URL: https://arxiv.org/pdf/2502.06685
Copy Paste: [[2502.06685]] No Trick, No Treat: Pursuits and Challenges Towards Simulation-free Training of Neural Samplers(https://arxiv.org/abs/2502.06685)
Keywords: generative
Abstract: We consider the sampling problem, where the aim is to draw samples from a distribution whose density is known only up to a normalization constant. Recent breakthroughs in generative modeling to approximate a high-dimensional data distribution have sparked significant interest in developing neural network-based methods for this challenging problem. However, neural samplers typically incur heavy computational overhead due to simulating trajectories during training. This motivates the pursuit of simulation-free training procedures of neural samplers. In this work, we propose an elegant modification to previous methods, which allows simulation-free training with the help of a time-dependent normalizing flow. However, it ultimately suffers from severe mode collapse. On closer inspection, we find that nearly all successful neural samplers rely on Langevin preconditioning to avoid mode collapsing. We systematically analyze several popular methods with various objective functions and demonstrate that, in the absence of Langevin preconditioning, most of them fail to adequately cover even a simple target. Finally, we draw attention to a strong baseline by combining the state-of-the-art MCMC method, Parallel Tempering (PT), with an additional generative model to shed light on future explorations of neural samplers.
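For readers unfamiliar with the Parallel Tempering baseline mentioned in this abstract, a minimal sketch follows. It shows only the MCMC part, not the additional generative model the authors combine it with. It assumes a one-dimensional bimodal target known up to a normalization constant, random-walk Metropolis local moves, and a hypothetical temperature ladder; every name and constant is illustrative.

```python
import numpy as np

def log_target(x):
    """Unnormalized log-density of an equal mixture of N(-4, 1) and N(4, 1)."""
    return np.logaddexp(-0.5 * (x + 4.0) ** 2, -0.5 * (x - 4.0) ** 2)

def parallel_tempering(num_iters, betas, step_size=1.0, seed=0):
    """Run replicas at inverse temperatures `betas` and return the beta = 1 chain."""
    rng = np.random.default_rng(seed)
    betas = np.asarray(betas, dtype=float)
    states = np.full(len(betas), -4.0)          # every replica starts in the left mode
    cold_samples = np.empty(num_iters)
    for it in range(num_iters):
        # Local move: random-walk Metropolis on each tempered target p(x)^beta.
        proposals = states + step_size * rng.standard_normal(len(betas))
        log_ratio = betas * (log_target(proposals) - log_target(states))
        accept = np.log(rng.uniform(size=len(betas))) < log_ratio
        states = np.where(accept, proposals, states)
        # Swap move: propose exchanging the states of one random adjacent pair.
        k = rng.integers(len(betas) - 1)
        log_swap = (betas[k] - betas[k + 1]) * (
            log_target(states[k + 1]) - log_target(states[k])
        )
        if np.log(rng.uniform()) < log_swap:
            states[[k, k + 1]] = states[[k + 1, k]]
        cold_samples[it] = states[0]            # betas[0] == 1 is the target itself
    return cold_samples

samples = parallel_tempering(5000, betas=[1.0, 0.5, 0.25, 0.1])
```

The hot replicas (small beta) see a flattened landscape and cross between modes easily; swaps pass those states down to the cold chain, which is how the method counters the mode collapse discussed in the abstract.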

Title: Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists

Authors: Bojia Zi, Penghui Ruan, Marco Chen, Xianbiao Qi, Shaozhe Hao, Shihao Zhao, Youze Huang, Bin Liang, Rong Xiao, Kam-Fai Wong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.06734
Pdf URL: https://arxiv.org/pdf/2502.06734
Copy Paste: [[2502.06734]] Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists(https://arxiv.org/abs/2502.06734)
Keywords: generation, generative
Abstract: Recent advancements in video generation have spurred the development of video editing techniques, which can be divided into inversion-based and end-to-end methods. However, current video editing methods still suffer from several challenges. Inversion-based methods, though training-free and flexible, are time-consuming during inference, struggle with fine-grained editing instructions, and produce artifacts and jitter. On the other hand, end-to-end methods, which rely on edited video pairs for training, offer faster inference speeds but often produce poor editing results due to a lack of high-quality training video pairs. In this paper, to close the gap in end-to-end methods, we introduce Señorita-2M, a high-quality video editing dataset. Señorita-2M consists of approximately 2 million video editing pairs. It is built using four high-quality, specialized video editing models, each designed and trained by our team to achieve state-of-the-art editing results. We also propose a filtering pipeline to eliminate poorly edited video pairs. Furthermore, we explore common video editing architectures to identify the most effective structure based on current pre-trained generative models. Extensive experiments show that our dataset can help to yield remarkably high-quality video editing results. More details are available at this https URL.

Title: VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data

Authors: Thomas Zeng, Shuibai Zhang, Shutong Wu, Christian Classen, Daewon Chae, Ethan Ewer, Minjae Lee, Heeju Kim, Wonjun Kang, Jackson Kunde, Ying Fan, Jungtaek Kim, Hyung Il Koo, Kannan Ramchandran, Dimitris Papailiopoulos, Kangwook Lee
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.06737
Pdf URL: https://arxiv.org/pdf/2502.06737
Copy Paste: [[2502.06737]] VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data(https://arxiv.org/abs/2502.06737)
Keywords: generation
Abstract: Process Reward Models (PRMs) have proven effective at enhancing mathematical reasoning for Large Language Models (LLMs) by leveraging increased inference-time computation. However, they are predominantly trained on mathematical data, and their generalizability to non-mathematical domains has not been rigorously studied. In response, this work first shows that current PRMs perform poorly in other domains. To address this limitation, we introduce VersaPRM, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method. VersaPRM achieves consistent performance gains across diverse domains. For instance, in the MMLU-Pro category of Law, VersaPRM, via weighted majority voting, achieves a 7.9% performance gain over the majority voting baseline -- surpassing Qwen2.5-Math-PRM's gain of 1.3%. We further contribute to the community by open-sourcing all data, code, and models for VersaPRM.
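The abstract above compares weighted majority voting against a plain majority voting baseline. A minimal sketch of the difference follows, assuming each sampled solution has already been reduced to a final answer and a single scalar reward score. How a PRM's per-step scores are aggregated into that scalar is not stated in the abstract, so the scores and function names here are purely illustrative.

```python
from collections import defaultdict


def weighted_majority_vote(answers, scores):
    """Return the final answer whose candidates have the largest summed score.

    answers: final answers, one per sampled solution (hypothetical input format).
    scores:  one scalar reward per sampled solution (assumed already aggregated).
    """
    if len(answers) != len(scores):
        raise ValueError("answers and scores must have the same length")
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    # Ties resolve to the answer that appeared first among the candidates.
    return max(totals, key=totals.get)


def majority_vote(answers):
    """Plain majority voting: every sampled solution counts with weight 1."""
    return weighted_majority_vote(answers, [1.0] * len(answers))


if __name__ == "__main__":
    sampled_answers = ["A", "B", "B", "C"]
    reward_scores = [0.9, 0.2, 0.3, 0.1]  # made-up scores for illustration
    print(majority_vote(sampled_answers))
    print(weighted_majority_vote(sampled_answers, reward_scores))
```

In this toy input the two rules disagree: "B" is the most frequent answer, while "A" carries the largest total reward.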

Title: ViSIR: Vision Transformer Single Image Reconstruction Method for Earth System Models

Authors: Ehsan Zeraatkar, Salah Faroughi, Jelena Tesic
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.06741
Pdf URL: https://arxiv.org/pdf/2502.06741
Copy Paste: [[2502.06741]] ViSIR: Vision Transformer Single Image Reconstruction Method for Earth System Models(https://arxiv.org/abs/2502.06741)
Keywords: generative
Abstract: Purpose: Earth system models (ESMs) integrate the interactions of the atmosphere, ocean, land, ice, and biosphere to estimate the state of regional and global climate under a wide variety of conditions. The ESMs are highly complex, and thus, deep neural network architectures are used to model the complexity and store the down-sampled data. In this paper, we propose the Vision Transformer Sinusoidal Representation Networks (ViSIR) to improve the single image super-resolution (SR) reconstruction task for the ESM data. Methods: ViSIR combines the SR capability of Vision Transformers (ViT) with the high-frequency detail preservation of the Sinusoidal Representation Network (SIREN) to address the spectral bias observed in SR tasks. Results: ViSIR outperforms ViT by 4.1 dB, SIREN by 7.5 dB, and SR-Generative Adversarial Networks (SR-GANs) by 7.1 dB PSNR on average for three different measurements. Conclusion: The proposed ViSIR is evaluated and compared with state-of-the-art methods. The results show that the proposed algorithm outperforms other methods in terms of Mean Square Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM).
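The results above are reported in dB of PSNR, which is derived from MSE. As a reminder of how the two metrics relate, here is a minimal sketch using the standard definition PSNR = 10 log10(R^2 / MSE), where R is the data range. It assumes images are given as flat sequences of pixel values; the function names and the `data_range` default are illustrative and are not taken from the paper.

```python
import math


def mean_squared_error(reference, estimate):
    """Mean squared error between two equally sized, flattened images."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB for pixel values spanning `data_range`."""
    mse = mean_squared_error(reference, estimate)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


if __name__ == "__main__":
    reference = [0.0, 0.0, 0.0, 0.0]
    estimate = [0.1, 0.1, 0.1, 0.1]
    print(psnr(reference, estimate))
```

Because PSNR is logarithmic, a gain of 4.1 dB at a fixed data range corresponds to an MSE lower by a factor of about 10^0.41, roughly 2.57.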

Title: History-Guided Video Diffusion

Authors: Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, Vincent Sitzmann
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2502.06764
Pdf URL: https://arxiv.org/pdf/2502.06764
Copy Paste: [[2502.06764]] History-Guided Video Diffusion(https://arxiv.org/abs/2502.06764)
Keywords: generation
Abstract: Classifier-free guidance (CFG) is a key technique for improving conditional generation in diffusion models, enabling more accurate control while enhancing sample quality. It is natural to extend this technique to video diffusion, which generates video conditioned on a variable number of context frames, collectively referred to as history. However, we find two key challenges to guiding with variable-length history: architectures that only support fixed-size conditioning, and the empirical observation that CFG-style history dropout performs poorly. To address this, we propose the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. We then introduce History Guidance, a family of guidance methods uniquely enabled by DFoT. We show that its simplest form, vanilla history guidance, already significantly improves video generation quality and temporal consistency. A more advanced method, history guidance across time and frequency, further enhances motion dynamics, enables compositional generalization to out-of-distribution history, and can stably roll out extremely long videos. Website: this https URL
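For readers unfamiliar with the CFG baseline that this abstract builds on, the sketch below shows the standard classifier-free guidance combination rule from the general diffusion literature: the guided prediction extrapolates from the unconditional prediction toward the conditional one. It is not DFoT's history guidance, whose details are not given in the abstract. Predictions are represented as plain lists of floats, and all names are illustrative.

```python
def classifier_free_guidance(cond_pred, uncond_pred, guidance_scale):
    """Combine conditional and unconditional model predictions.

    guided = uncond + guidance_scale * (cond - uncond)
    guidance_scale = 0 returns the unconditional prediction,
    guidance_scale = 1 returns the conditional prediction,
    guidance_scale > 1 extrapolates past the conditional prediction.
    """
    if len(cond_pred) != len(uncond_pred):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for c, u in zip(cond_pred, uncond_pred)]


if __name__ == "__main__":
    cond = [1.0, 2.0]    # toy prediction given the conditioning signal
    uncond = [0.0, 0.0]  # toy prediction with the conditioning dropped
    print(classifier_free_guidance(cond, uncond, 2.0))
```

In the video setting described above, the conditioning signal would be the history frames, which is why fixed-size conditioning and history dropout become the obstacles the abstract names.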

Title: Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions

Authors: Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham Kakade, Sitan Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.06768
Pdf URL: https://arxiv.org/pdf/2502.06768
Copy Paste: [[2502.06768]] Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions(https://arxiv.org/abs/2502.06768)
Keywords: generative
Abstract: In recent years, masked diffusion models (MDMs) have emerged as a promising alternative approach for generative modeling over discrete domains. Compared to autoregressive models (ARMs), MDMs trade off complexity at training time with flexibility at inference time. At training time, they must learn to solve an exponentially large number of infilling problems, but at inference time, they can decode tokens in essentially arbitrary order. In this work, we closely examine these two competing effects. On the training front, we theoretically and empirically demonstrate that MDMs indeed train on computationally intractable subproblems compared to their autoregressive counterparts. On the inference front, we show that a suitable strategy for adaptively choosing the token decoding order significantly enhances the capabilities of MDMs, allowing them to sidestep hard subproblems. On logic puzzles like Sudoku, we show that adaptive inference can boost solving accuracy in pretrained MDMs from $<7\%$ to $\approx 90\%$, even outperforming ARMs that have $7\times$ as many parameters and were explicitly trained via teacher forcing to learn the right order of decoding.
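The abstract does not specify which adaptive ordering strategy is used, so the following is only a toy sketch of the general idea: at each step, fill in the masked position where the model is currently most confident, then re-query the model. The `predict` callable is a hypothetical stand-in for a masked diffusion model, and the greedy top-probability rule is an assumption made for illustration.

```python
def pick_next_position(probs_by_position):
    """Return (position, token) for the masked position with the highest top probability.

    probs_by_position: dict mapping each masked position to a list of token probabilities.
    """
    best_pos = max(probs_by_position, key=lambda pos: max(probs_by_position[pos]))
    dist = probs_by_position[best_pos]
    return best_pos, dist.index(max(dist))


def adaptive_decode(predict, length):
    """Decode `length` tokens, always unmasking the most confident position first.

    predict(tokens, i) is a hypothetical model call that returns a probability
    list for masked position i, given the partially filled `tokens`.
    """
    tokens = [None] * length  # None marks a masked position
    while any(t is None for t in tokens):
        masked = {i: predict(tokens, i) for i, t in enumerate(tokens) if t is None}
        pos, tok = pick_next_position(masked)
        tokens[pos] = tok
    return tokens


def toy_predict(tokens, i):
    """Toy model: position 1 is easy; position 0 becomes easy once position 1 is known."""
    if i == 1:
        return [0.1, 0.9]
    return [0.95, 0.05] if tokens[1] is not None else [0.5, 0.5]


if __name__ == "__main__":
    print(adaptive_decode(toy_predict, 2))
```

With the toy model, position 1 is decoded first because it is confident from the start, after which position 0 stops being a hard subproblem. A fixed left-to-right order would have had to commit to position 0 while its distribution was still uniform.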

Title: Enhancing Performance of Explainable AI Models with Constrained Concept Refinement

Authors: Geyu Liang, Senne Michielssen, Salar Fattahi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.06775
Pdf URL: https://arxiv.org/pdf/2502.06775
Copy Paste: [[2502.06775]] Enhancing Performance of Explainable AI Models with Constrained Concept Refinement(https://arxiv.org/abs/2502.06775)
Keywords: generative
Abstract: The trade-off between accuracy and interpretability has long been a challenge in machine learning (ML). This tension is particularly significant for emerging interpretable-by-design methods, which aim to redesign ML algorithms for trustworthy interpretability but often sacrifice accuracy in the process. In this paper, we address this gap by investigating the impact of deviations in concept representations (an essential component of interpretable models) on prediction performance and propose a novel framework to mitigate these effects. The framework builds on the principle of optimizing concept embeddings under constraints that preserve interpretability. Using a generative model as a test-bed, we rigorously prove that our algorithm achieves zero loss while progressively enhancing the interpretability of the resulting model. Additionally, we evaluate the practical performance of our proposed framework in generating explainable predictions for image classification tasks across various benchmarks. Compared to existing explainable methods, our approach not only improves prediction accuracy while preserving model interpretability across various large-scale benchmarks but also achieves this with significantly lower computational cost.

Title: Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT

Authors: Dongyang Liu, Shicheng Li, Yutong Liu, Zhen Li, Kai Wang, Xinyue Li, Qi Qin, Yufei Liu, Yi Xin, Zhongyu Li, Bin Fu, Chenyang Si, Yuewen Cao, Conghui He, Ziwei Liu, Yu Qiao, Qibin Hou, Hongsheng Li, Peng Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.06782
Pdf URL: https://arxiv.org/pdf/2502.06782
Copy Paste: [[2502.06782]] Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT(https://arxiv.org/abs/2502.06782)
Keywords: generation, generative
Abstract: Recent advancements have established Diffusion Transformers (DiTs) as a dominant framework in generative modeling. Building on this success, Lumina-Next achieves exceptional performance in the generation of photorealistic images with Next-DiT. However, its potential for video generation remains largely untapped, with significant challenges in modeling the spatiotemporal complexity inherent to video data. To address this, we introduce Lumina-Video, a framework that leverages the strengths of Next-DiT while introducing tailored solutions for video synthesis. Lumina-Video incorporates a Multi-scale Next-DiT architecture, which jointly learns multiple patchifications to enhance both efficiency and flexibility. By incorporating the motion score as an explicit condition, Lumina-Video also enables direct control of the dynamic degree of generated videos. Combined with a progressive training scheme with increasingly higher resolution and FPS, and a multi-source training scheme with mixed natural and synthetic data, Lumina-Video achieves remarkable aesthetic quality and motion smoothness with high training and inference efficiency. We additionally propose Lumina-V2A, a video-to-audio model based on Next-DiT, to create synchronized sounds for generated videos. Code is released at this https URL.