2025-03-12

Title: FourierNAT: A Fourier-Mixing-Based Non-Autoregressive Transformer for Parallel Sequence Generation

Authors: Andrew Kiruluta, Eric Lundy, Andreas Lemos
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.07630
Pdf URL: https://arxiv.org/pdf/2503.07630
Copy Paste: [[2503.07630]] FourierNAT: A Fourier-Mixing-Based Non-Autoregressive Transformer for Parallel Sequence Generation(https://arxiv.org/abs/2503.07630)
Keywords: generation
Abstract: We present FourierNAT, a novel non-autoregressive Transformer (NAT) architecture that employs Fourier-based mixing in the decoder to generate output sequences in parallel. While traditional NAT approaches often face challenges with capturing global dependencies, our method leverages a discrete Fourier transform to mix token embeddings across the entire sequence dimension, coupled with learned frequency-domain gating. This allows the model to efficiently propagate context without explicit autoregressive steps. Empirically, FourierNAT achieves competitive results against leading NAT baselines on standard benchmarks like WMT machine translation and CNN/DailyMail summarization, providing significant speed advantages over autoregressive Transformers. We further demonstrate that learned frequency-domain parameters allow the model to adaptively focus on long-range or short-range dependencies, partially mitigating the well-known coherence gaps in one-pass NAT generation. Overall, FourierNAT highlights the potential of integrating spectral-domain operations to accelerate and improve parallel text generation. This approach can potentially provide substantial computational and time savings in LLM inference tasks.
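The mixing step described in the FourierNAT abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the mixing is a DFT along the sequence axis, an element-wise gate per frequency bin, and an inverse DFT. The function names and the plain list-of-lists layout are hypothetical, and the gates are fixed here rather than learned.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a list of numbers."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft; returns a list of complex values."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def fourier_mix(embeddings, gates):
    """Mix token embeddings across the sequence axis in the frequency domain.

    embeddings: list of seq_len vectors, each of length d_model.
    gates: one real gate per frequency bin. This is a hypothetical stand-in
           for the learned frequency-domain gating; a trained model would
           learn these values.
    """
    seq_len = len(embeddings)
    d_model = len(embeddings[0])
    mixed = [[0.0] * d_model for _ in range(seq_len)]
    for j in range(d_model):
        # Transform one embedding channel along the sequence dimension.
        column = [embeddings[t][j] for t in range(seq_len)]
        gated = [g * s for g, s in zip(gates, dft(column))]
        # Keep the real part; it is exact when gates[k] == gates[-k].
        for t, value in enumerate(idft(gated)):
            mixed[t][j] = value.real
    return mixed
```

With all gates set to 1 the input is returned unchanged. Keeping only the zero-frequency bin replaces every position with the sequence mean, which is the most global form of mixing. Gates that emphasize low or high bins shift the balance between long-range and short-range structure.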

Title: BrainNet-MoE: Brain-Inspired Mixture-of-Experts Learning for Neurological Disease Identification

Authors: Jing Zhang, Xiaowei Yu, Tong Chen, Chao Cao, Mingheng Chen, Yan Zhuang, Yanjun Lyu, Lu Zhang, Li Su, Tianming Liu, Dajiang Zhu
Subjects: cs.LG, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2503.07640
Pdf URL: https://arxiv.org/pdf/2503.07640
Copy Paste: [[2503.07640]] BrainNet-MoE: Brain-Inspired Mixture-of-Experts Learning for Neurological Disease Identification(https://arxiv.org/abs/2503.07640)
Keywords: generative
Abstract: Lewy body dementia (LBD) is the second most common neurodegenerative dementia after Alzheimer's disease (AD). Early differentiation between AD and LBD is crucial because they require different treatment approaches, but this is challenging due to significant clinical overlap, heterogeneity, complex pathogenesis, and the rarity of LBD. While recent advances in artificial intelligence (AI) demonstrate powerful learning capabilities and offer new hope for accurate diagnosis, existing methods primarily focus on designing "neural-level networks". Our work represents a pioneering effort in modeling a system-level artificial neural network, called BrainNet-MoE, for brain modeling and diagnosis. Inspired by the brain's hierarchical organization of bottom-up sensory integration and top-down control, we design a set of disease-specific expert groups to process brain sub-networks under different conditions. A disease gate mechanism guides the specialization of expert groups, while a transformer layer enables communication between all sub-networks, generating a comprehensive whole-brain representation for downstream disease classification. Experimental results show superior classification accuracy with interpretable insights into how brain sub-networks contribute to different neurodegenerative conditions.

Title: The day-ahead scenario generation method for new energy based on an improved conditional generative diffusion model

Authors: Changgang Wang, Wei Liu, Yu Cao, Dong Liang, Yang Li, Jingshan Mo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.07648
Pdf URL: https://arxiv.org/pdf/2503.07648
Copy Paste: [[2503.07648]] The day-ahead scenario generation method for new energy based on an improved conditional generative diffusion model(https://arxiv.org/abs/2503.07648)
Keywords: generation, generative
Abstract: In the context of the rising share of new energy generation, accurately generating new energy output scenarios is crucial for day-ahead power system scheduling. Deep learning-based scenario generation methods can address this need, but their black-box nature raises concerns about interpretability. To tackle this issue, this paper introduces a method for day-ahead new energy scenario generation based on an improved conditional generative diffusion model. This method is built on the theoretical framework of Markov chains and variational inference. It first transforms historical data into pure noise through a diffusion process, then uses conditional information to guide the denoising process, ultimately generating scenarios that satisfy the conditional distribution. Additionally, the noise table is improved to a cosine form, enhancing the quality of the generated scenarios. When applied to actual wind and solar output data, the results demonstrate that this method effectively generates new energy output scenarios with good adaptability.
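The abstract above says the noise table is changed to a cosine form but gives no formula. The sketch below assumes the widely used cosine schedule, in which the cumulative signal level alpha_bar(t) follows a squared cosine, and pairs it with the standard closed-form forward diffusion step. The offset s = 0.008, the clipping value, and all function names are assumptions, not details from the paper.

```python
import math
import random


def cosine_alpha_bar(num_steps, s=0.008):
    """Cumulative signal level alpha_bar[t] for t = 0..num_steps.

    Assumed form: f(t) = cos^2(((t / T + s) / (1 + s)) * pi / 2),
    normalised so that alpha_bar[0] == 1.
    """
    def f(t):
        return math.cos(((t / num_steps + s) / (1 + s)) * math.pi / 2) ** 2

    f0 = f(0)
    return [f(t) / f0 for t in range(num_steps + 1)]


def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """Per-step noise variances, clipped to avoid a singular final step."""
    return [
        min(1.0 - alpha_bar[t] / alpha_bar[t - 1], max_beta)
        for t in range(1, len(alpha_bar))
    ]


def diffuse(x0, t, alpha_bar, rng):
    """Sample x_t directly from x_0: sqrt(a) * x_0 + sqrt(1 - a) * noise."""
    a = alpha_bar[t]
    return [
        math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
        for v in x0
    ]
```

At t = 0 the sample equals the historical data, and at the final step it is essentially pure noise, which matches the diffusion process described in the abstract. The conditional denoising network is outside the scope of this sketch.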

Title: TS-RAG: Retrieval-Augmented Generation based Time Series Foundation Models are Stronger Zero-Shot Forecaster

Authors: Kanghui Ning, Zijie Pan, Yu Liu, Yushan Jiang, James Y. Zhang, Kashif Rasul, Anderson Schneider, Lintao Ma, Yuriy Nevmyvaka, Dongjin Song
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07649
Pdf URL: https://arxiv.org/pdf/2503.07649
Copy Paste: [[2503.07649]] TS-RAG: Retrieval-Augmented Generation based Time Series Foundation Models are Stronger Zero-Shot Forecaster(https://arxiv.org/abs/2503.07649)
Keywords: generation
Abstract: Recently, Large Language Models (LLMs) and Foundation Models (FMs) have become prevalent for time series forecasting tasks. However, fine-tuning large language models (LLMs) for forecasting enables the adaptation to specific domains but may not generalize well across diverse, unseen datasets. Meanwhile, existing time series foundation models (TSFMs) lack inherent mechanisms for domain adaptation and suffer from limited interpretability, making them suboptimal for zero-shot forecasting. To this end, we present TS-RAG, a retrieval-augmented generation based time series forecasting framework that enhances the generalization capability and interpretability of TSFMs. Specifically, TS-RAG leverages pre-trained time series encoders to retrieve semantically relevant time series segments from a dedicated knowledge database, incorporating contextual patterns for the given time series query. Next, we develop a learnable Mixture-of-Experts (MoE)-based augmentation module, which dynamically fuses retrieved time series patterns with the TSFM's representation of the input query, improving forecasting accuracy without requiring task-specific fine-tuning. Thorough empirical studies on seven public benchmark datasets demonstrate that TS-RAG achieves state-of-the-art zero-shot forecasting performance, outperforming TSFMs by up to 6.51% across diverse domains and showcasing desired interpretability.

Title: MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration

Authors: Jinguang Wang, Jingyu Wang, Haifeng Sun, Tingting Yang, Zirui Zhuang, Wanyi Ning, Yuexi Yin, Qi Qi, Jianxin Liao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.07654
Pdf URL: https://arxiv.org/pdf/2503.07654
Copy Paste: [[2503.07654]] MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration(https://arxiv.org/abs/2503.07654)
Keywords: generation
Abstract: Quantization has been widely used to compress and accelerate inference of large language models (LLMs). Existing methods focus on exploring the per-token dynamic calibration to ensure both inference acceleration and model accuracy under 4-bit quantization. However, in autoregressive generation inference of long sequences, the overhead of repeated dynamic quantization and dequantization steps becomes considerably expensive. In this work, we propose MergeQuant, an accurate and efficient per-channel static quantization framework. MergeQuant integrates the per-channel quantization steps with the corresponding scalings and linear mappings through a Quantization Step Migration (QSM) method, thereby eliminating the quantization overheads before and after matrix multiplication. Furthermore, in view of the significant differences between the different channel ranges, we propose dimensional reconstruction and adaptive clipping to address the non-uniformity of quantization scale factors and redistribute the channel variations to the subsequent modules to balance the parameter distribution under QSM. Within the static quantization setting of W4A4, MergeQuant reduces the accuracy gap on zero-shot tasks compared to FP16 baseline to 1.3 points on Llama-2-70B model. On Llama-2-7B model, MergeQuant achieves up to 1.77x speedup in decoding, and up to 2.06x speedup in end-to-end compared to FP16 baseline.
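The abstract does not spell out the Quantization Step Migration procedure. As a simplified illustration of the general idea, the sketch below assumes symmetric per-channel static quantization with scales fixed from calibration data, and folds each activation channel's scale into the matching weight column of the next linear layer so that no separate dequantization pass is needed at inference. This is a stand-in technique, not the paper's QSM, and every function name is hypothetical.

```python
def calibrate_scales(calibration_rows, bits=4):
    """Static per-channel scales from the peak magnitude seen in calibration."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit
    num_channels = len(calibration_rows[0])
    scales = []
    for c in range(num_channels):
        peak = max(abs(row[c]) for row in calibration_rows)
        scales.append(peak / qmax if peak > 0 else 1.0)
    return scales


def quantize(x, scales, bits=4):
    """Round each channel to a signed integer grid and clip to the bit range."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    return [max(qmin, min(qmax, round(v / s))) for v, s in zip(x, scales)]


def fold_scales_into_weights(weight, scales):
    """Multiply weight column c by scale c, so W @ (s * q) == W_folded @ q."""
    return [[w * s for w, s in zip(row, scales)] for row in weight]


def matvec(weight, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]
```

Because the scales are static, the folding is done once offline. At inference the layer consumes the integer codes directly, which is the kind of per-token overhead the abstract says dynamic schemes pay repeatedly during long autoregressive decoding.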

Title: Disrupting Model Merging: A Parameter-Level Defense Without Sacrificing Accuracy

Authors: Wei Junhao, Yu Zhe, Sakuma Jun
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2503.07661
Pdf URL: https://arxiv.org/pdf/2503.07661
Copy Paste: [[2503.07661]] Disrupting Model Merging: A Parameter-Level Defense Without Sacrificing Accuracy(https://arxiv.org/abs/2503.07661)
Keywords: generation
Abstract: Model merging is a technique that combines multiple finetuned models into a single model without additional training, allowing a free-rider to cheaply inherit specialized capabilities. This study investigates methodologies to suppress unwanted model merging by free-riders. Existing methods such as model watermarking or fingerprinting can only detect merging in hindsight. In contrast, we propose a first proactive defense against model merging. Specifically, our defense method modifies the model parameters so that the model is disrupted if the model is merged with any other model, while its functionality is kept unchanged if not merged with others. Our approach consists of two modules, rearranging MLP parameters and scaling attention heads, which push the model out of the shared basin in parameter space, causing the merging performance with other models to degrade significantly. We conduct extensive experiments on image classification, image generation, and text classification to demonstrate that our defense severely disrupts merging while retaining the functionality of the post-protect model. Moreover, we analyze potential adaptive attacks and further propose a dropout-based pruning to improve our proposal's robustness.
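A toy example helps show why rearranging MLP parameters can leave a model's behavior intact while breaking naive merging. The sketch assumes a two-layer ReLU MLP and plain weight averaging as the merge method. The paper's actual rearrangement and its attention-head scaling are not reproduced here, and all names are hypothetical.

```python
def matvec(weight, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def relu(vec):
    return [max(0.0, v) for v in vec]


def mlp(w1, w2, x):
    """Two-layer MLP without biases: w2 @ relu(w1 @ x)."""
    return matvec(w2, relu(matvec(w1, x)))


def permute_hidden(w1, w2, perm):
    """Reorder hidden units: rows of w1 and the matching columns of w2."""
    new_w1 = [w1[p] for p in perm]
    new_w2 = [[row[p] for p in perm] for row in w2]
    return new_w1, new_w2


def average_weights(a, b):
    """Naive merge: element-wise mean of two weight matrices."""
    return [[(u + v) / 2 for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy model (assumed values, for illustration only).
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[1.0, -1.0]]
x = [2.0, 1.0]

p1, p2 = permute_hidden(w1, w2, [1, 0])
original_out = mlp(w1, w2, x)    # function of the unprotected model
protected_out = mlp(p1, p2, x)   # same function after rearrangement
merged_out = mlp(average_weights(w1, p1), average_weights(w2, p2), x)
```

Here the permuted model computes the same output as the original, but averaging the two sets of weights cancels the second layer entirely. That is a small-scale picture of a model being pushed out of the shared basin that merging relies on.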

Title: RayFlow: Instance-Aware Diffusion Acceleration via Adaptive Flow Trajectories

Authors: Huiyang Shao, Xin Xia, Yuhong Yang, Yuxi Ren, Xing Wang, Xuefeng Xiao
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.07699
Pdf URL: https://arxiv.org/pdf/2503.07699
Copy Paste: [[2503.07699]] RayFlow: Instance-Aware Diffusion Acceleration via Adaptive Flow Trajectories(https://arxiv.org/abs/2503.07699)
Keywords: generation
Abstract: Diffusion models have achieved remarkable success across various domains. However, their slow generation speed remains a critical challenge. Existing acceleration methods, while aiming to reduce steps, often compromise sample quality, controllability, or introduce training complexities. Therefore, we propose RayFlow, a novel diffusion framework that addresses these limitations. Unlike previous methods, RayFlow guides each sample along a unique path towards an instance-specific target distribution. This method minimizes sampling steps while preserving generation diversity and stability. Furthermore, we introduce Time Sampler, an importance sampling technique to enhance training efficiency by focusing on crucial timesteps. Extensive experiments demonstrate RayFlow's superiority in generating high-quality images with improved speed, control, and training efficiency compared to existing acceleration techniques.

Title: Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

Authors: Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Linjie Yang, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, Weilin Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07703
Pdf URL: https://arxiv.org/pdf/2503.07703
Copy Paste: [[2503.07703]] Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model(https://arxiv.org/abs/2503.07703)
Keywords: generation
Abstract: Rapid advancement of diffusion models has catalyzed remarkable progress in the field of image generation. However, prevalent models such as Flux, SD3.5 and Midjourney still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese cultural nuances. To address these limitations, we present Seedream 2.0, a native Chinese-English bilingual image generation foundation model that excels across diverse dimensions, which adeptly manages text prompts in both Chinese and English, supporting bilingual image generation and text rendering. We develop a powerful data system that facilitates knowledge integration, and a caption system that balances the accuracy and richness for image description. Particularly, Seedream is integrated with a self-developed bilingual large language model as a text encoder, allowing it to learn native knowledge directly from massive data. This enables it to generate high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. Besides, Glyph-Aligned ByT5 is applied for flexible character-level text rendering, while a Scaled ROPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability. Through extensive experimentation, we demonstrate that Seedream 2.0 achieves state-of-the-art performance across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness. Furthermore, Seedream 2.0 has been optimized through multiple RLHF iterations to closely align its output with human preferences, as revealed by its outstanding ELO score. In addition, it can be readily adapted to an instruction-based image editing model, such as SeedEdit, with strong editing capability that balances instruction-following and image consistency.

Title: Blind Video Super-Resolution based on Implicit Kernels

Authors: Qiang Zhu, Yuxuan Jiang, Shuyuan Zhu, Fan Zhang, David Bull, Bing Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07856
Pdf URL: https://arxiv.org/pdf/2503.07856
Copy Paste: [[2503.07856]] Blind Video Super-Resolution based on Implicit Kernels(https://arxiv.org/abs/2503.07856)
Keywords: super-resolution
Abstract: Blind video super-resolution (BVSR) is a low-level vision task which aims to generate high-resolution videos from low-resolution counterparts in unknown degradation scenarios. Existing approaches typically predict blur kernels that are spatially invariant in each video frame or even the entire video. These methods do not consider potential spatio-temporal varying degradations in videos, resulting in suboptimal BVSR performance. In this context, we propose a novel BVSR model based on Implicit Kernels, BVSR-IK, which constructs a multi-scale kernel dictionary parameterized by implicit neural representations. It also employs a newly designed recurrent Transformer to predict the coefficient weights for accurate filtering in both frame correction and feature alignment. Experimental results have demonstrated the effectiveness of the proposed BVSR-IK, when compared with four state-of-the-art BVSR models on three commonly used datasets, with BVSR-IK outperforming the second best approach, FMA-Net, by up to 0.59 dB in PSNR. Source code will be available at this https URL.

Title: Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?

Authors: Yuru Jia, Valerio Marsocci, Ziyang Gong, Xue Yang, Maarten Vergauwen, Andrea Nascetti
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07890
Pdf URL: https://arxiv.org/pdf/2503.07890
Copy Paste: [[2503.07890]] Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?(https://arxiv.org/abs/2503.07890)
Keywords: generation, generative
Abstract: Self-supervised learning (SSL) has revolutionized representation learning in Remote Sensing (RS), advancing Geospatial Foundation Models (GFMs) to leverage vast unlabeled satellite imagery for diverse downstream tasks. Currently, GFMs primarily focus on discriminative objectives, such as contrastive learning or masked image modeling, owing to their proven success in learning transferable representations. However, generative diffusion models--which demonstrate the potential to capture multi-grained semantics essential for RS tasks during image generation--remain underexplored for discriminative applications. This prompts the question: can generative diffusion models also excel and serve as GFMs with sufficient discriminative power? In this work, we answer this question with SatDiFuser, a framework that transforms a diffusion-based generative geospatial foundation model into a powerful pretraining tool for discriminative RS. By systematically analyzing multi-stage, noise-dependent diffusion features, we develop three fusion strategies to effectively leverage these diverse representations. Extensive experiments on remote sensing benchmarks show that SatDiFuser outperforms state-of-the-art GFMs, achieving gains of up to +5.7% mIoU in semantic segmentation and +7.9% F1-score in classification, demonstrating the capacity of diffusion-based generative foundation models to rival or exceed discriminative GFMs. Code will be released.

Title: FunGraph: Functionality Aware 3D Scene Graphs for Language-Prompted Scene Interaction

Authors: Dennis Rotondi, Fabio Scaparro, Hermann Blum, Kai O. Arras
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.07909
Pdf URL: https://arxiv.org/pdf/2503.07909
Copy Paste: [[2503.07909]] FunGraph: Functionality Aware 3D Scene Graphs for Language-Prompted Scene Interaction(https://arxiv.org/abs/2503.07909)
Keywords: generation
Abstract: The concept of 3D scene graphs is increasingly recognized as a powerful semantic and hierarchical representation of the environment. Current approaches often address this at a coarse, object-level resolution. In contrast, our goal is to develop a representation that enables robots to directly interact with their environment by identifying both the location of functional interactive elements and how these can be used. To achieve this, we focus on detecting and storing objects at a finer resolution, focusing on affordance-relevant parts. The primary challenge lies in the scarcity of data that extends beyond instance-level detection and the inherent difficulty of capturing detailed object features using robotic sensors. We leverage currently available 3D resources to generate 2D data and train a detector, which is then used to augment the standard 3D scene graph generation pipeline. Through our experiments, we demonstrate that our approach achieves functional element segmentation comparable to state-of-the-art 3D models and that our augmentation enables task-driven affordance grounding with higher accuracy than the current solutions.

Title: Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

Authors: Samuel Cahyawijaya, Holy Lovenia, Joel Ruben Antony Moniz, Tack Hwa Wong, Mohammad Rifqi Farhansyah, Thant Thiri Maung, Frederikus Hudi, David Anugraha, Muhammad Ravi Shulthan Habibi, Muhammad Reza Qorib, Amit Agarwal, Joseph Marvin Imperial, Hitesh Laxmichand Patel, Vicky Feliren, Bahrul Ilmi Nasution, Manuel Antonio Rufino, Genta Indra Winata, Rian Adam Rajagede, Carlos Rafael Catalan, Mohamed Fazli Imam, Priyaranjan Pattnayak, Salsabila Zahirah Pranida, Kevin Pratama, Yeshil Bangera, Adisai Na-Thalang, Patricia Nicole Monderin, Yueqi Song, Christian Simon, Lynnette Hui Xian Ng, Richardy Lobo' Sapan, Taki Hasan Rafi, Bin Wang, Supryadi, Kanyakorn Veerakanjana, Piyalitt Ittichaiwong, Matthew Theodore Roque, Karissa Vincentio, Takdanai Kreangphet, Phakphum Artkaew, Kadek Hendrawan Palgunadi, Yanzhi Yu, Rochana Prih Hastuti, William Nixon, Mithil Bangera, Adrian Xuan Wei Lim, Aye Hninn Khine, Hanif Muhammad Zhafran, Teddy Ferdinan, Audra Aurora Izzani, Ayushman Singh, Evan, Jauza Akbar Krito, Michael Anugraha, Fenal Ashokbhai Ilasariya, Haochen Li, John Amadeo Daniswara, Filbert Aurelian Tjiaranata, Eryawan Presma Yulianrifat, Can Udomcharoenchaikit, Fadil Risdian Ansori, Mahardika Krisna Ihsani, Giang Nguyen, Anab Maulana Barik, Dan John Velasco, Rifo Ahmad Genadi, Saptarshi Saha, Chengwei Wei, Isaiah Flores, Kenneth Ko Han Chen, Anjela Gail Santos, Wan Shen Lim, Kaung Si Phyo, Tim Santos, Meisyarah Dwiastuti, Jiayun Luo, Jan Christian Blaise Cruz, Ming Shan Hee, Ikhlasul Akmal Hanif, M.Alif Al Hakim, Muhammad Rizky Sya'ban, Kun Kerdthaisong, Lester James V. Miranda, Fajri Koto, Tirana Noor Fatyanosa, Alham Fikri Aji, Jostin Jerico Rosal, Jun Kevin, Robert Wijaya, Onno P. Kampman, Ruochen Zhang, Börje F. Karlsson, Peerat Limkonchotiwat
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.07920
Pdf URL: https://arxiv.org/pdf/2503.07920
Copy Paste: [[2503.07920]] Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia(https://arxiv.org/abs/2503.07920)
Keywords: generation, generative
Abstract: Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. This often results in artificial intelligence (AI) models that fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA languages. By involving contributors from SEA countries, SEA-VL aims to ensure better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages in VL research. Beyond crowdsourcing, our initiative goes one step further in the exploration of the automatic collection of culturally relevant images through crawling and image generation. First, we find that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing. Second, despite the substantial progress in generative vision models, synthetic images remain unreliable in accurately reflecting SEA cultures. The generated images often fail to reflect the nuanced traditions and cultural contexts of the region. Collectively, we gather 1.28M SEA culturally-relevant images, more than 50 times larger than other existing datasets. Through SEA-VL, we aim to bridge the representation gap in SEA, fostering the development of more inclusive AI systems that authentically represent diverse cultures across SEA.

Title: Counterfactual Explanations for Model Ensembles Using Entropic Risk Measures

Authors: Erfaun Noorani, Pasan Dissanayake, Faisal Hamman, Sanghamitra Dutta
Subjects: cs.LG, cs.CY, eess.SY, stat.ME, stat.ML
Abstract URL: https://arxiv.org/abs/2503.07934
Pdf URL: https://arxiv.org/pdf/2503.07934
Copy Paste: [[2503.07934]] Counterfactual Explanations for Model Ensembles Using Entropic Risk Measures(https://arxiv.org/abs/2503.07934)
Keywords: generation
Abstract: Counterfactual explanations indicate the smallest change in input that can translate to a different outcome for a machine learning model. Counterfactuals have generated immense interest in high-stakes applications such as finance, education, hiring, etc. In several use-cases, the decision-making process often relies on an ensemble of models rather than just one. Despite significant research on counterfactuals for one model, the problem of generating a single counterfactual explanation for an ensemble of models has received limited interest. Each individual model might lead to a different counterfactual, whereas trying to find a counterfactual accepted by all models might significantly increase cost (effort). We propose a novel strategy to find the counterfactual for an ensemble of models using the perspective of entropic risk measure. Entropic risk is a convex risk measure that satisfies several desirable properties. We incorporate our proposed risk measure into a novel constrained optimization to generate counterfactuals for ensembles that stay valid for several models. The main significance of our measure is that it provides a knob that allows for the generation of counterfactuals that stay valid under an adjustable fraction of the models. We also show that a limiting case of our entropic-risk-based strategy yields a counterfactual valid for all models in the ensemble (worst-case min-max approach). We study the trade-off between the cost (effort) for the counterfactual and its validity for an ensemble by varying degrees of risk aversion, as determined by our risk parameter knob. We validate our performance on real-world datasets.
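The risk knob described in the abstract above can be made concrete with the standard entropic risk formula, rho_theta(L) = (1 / theta) * log E[exp(theta * L)]. The sketch assumes each model in the ensemble assigns a scalar loss to a candidate counterfactual and that the models are weighted equally. The loss values and the function name are hypothetical, and the paper's constrained optimization is not reproduced.

```python
import math


def entropic_risk(losses, theta):
    """Entropic risk of per-model losses under a uniform ensemble weighting.

    theta > 0 is the risk-aversion knob. theta == 0 is treated as the
    risk-neutral limit (the plain mean).
    """
    n = len(losses)
    if theta == 0:
        return sum(losses) / n
    # Log-sum-exp with the maximum factored out for numerical stability.
    peak = max(theta * loss for loss in losses)
    total = sum(math.exp(theta * loss - peak) for loss in losses)
    return (peak + math.log(total / n)) / theta


# Hypothetical per-model losses for one candidate counterfactual.
losses = [0.1, 0.2, 0.9]
risk_neutral = entropic_risk(losses, 1e-6)   # close to the mean loss
moderate = entropic_risk(losses, 1.0)
worst_case = entropic_risk(losses, 1e3)      # close to the maximum loss
```

As theta grows, the value moves from the ensemble's average loss toward its worst-case loss. This mirrors the limiting case in the abstract, where the strategy yields a counterfactual valid for all models.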

Title: CAD-VAE: Leveraging Correlation-Aware Latents for Comprehensive Fair Disentanglement

Authors: Chenrui Ma, Rongchang Zhao, Xi Xiao, Hongyang Xie, Tianyang Wang, Xiao Wang, Hao Zhang, Yanning Shen
Subjects: cs.LG, cs.CV, stat.ME
Abstract URL: https://arxiv.org/abs/2503.07938
Pdf URL: https://arxiv.org/pdf/2503.07938
Copy Paste: [[2503.07938]] CAD-VAE: Leveraging Correlation-Aware Latents for Comprehensive Fair Disentanglement(https://arxiv.org/abs/2503.07938)
Keywords: generative
Abstract: While deep generative models have significantly advanced representation learning, they may inherit or amplify biases and fairness issues by encoding sensitive attributes alongside predictive features. Enforcing strict independence in disentanglement is often unrealistic when target and sensitive factors are naturally correlated. To address this challenge, we propose CAD-VAE (Correlation-Aware Disentangled VAE), which introduces a correlated latent code to capture the shared information between target and sensitive attributes. Given this correlated latent, our method effectively separates overlapping factors without extra domain knowledge by directly minimizing the conditional mutual information between target and sensitive codes. A relevance-driven optimization strategy refines the correlated code by efficiently capturing essential correlated features and eliminating redundancy. Extensive experiments on benchmark datasets demonstrate that CAD-VAE produces fairer representations, realistic counterfactuals, and improved fairness-aware image editing.

Title: STRMs: Spatial Temporal Reasoning Models for Vision-Based Localization Rivaling GPS Precision

Authors: Hin Wai Lui, Jeffrey L. Krichmar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07939
Pdf URL: https://arxiv.org/pdf/2503.07939
Copy Paste: [[2503.07939]] STRMs: Spatial Temporal Reasoning Models for Vision-Based Localization Rivaling GPS Precision(https://arxiv.org/abs/2503.07939)
Keywords: generative
Abstract: This paper explores vision-based localization through a biologically-inspired approach that mirrors how humans and animals link views or perspectives when navigating their world. We introduce two sequential generative models, VAE-RNN and VAE-Transformer, which transform first-person perspective (FPP) observations into global map perspective (GMP) representations and precise geographical coordinates. Unlike retrieval-based methods, our approach frames localization as a generative task, learning direct mappings between perspectives without relying on dense satellite image databases. We evaluate these models across two real-world environments: a university campus navigated by a Jackal robot and an urban downtown area navigated by a Tesla sedan. The VAE-Transformer achieves impressive precision, with median deviations of 2.29m (1.37% of environment size) and 4.45m (0.35% of environment size) respectively, outperforming both VAE-RNN and prior cross-view geo-localization approaches. Our comprehensive Localization Performance Characteristics (LPC) analysis demonstrates superior performance with the VAE-Transformer achieving an AUC of 0.777 compared to 0.295 for VIGOR 200 and 0.225 for TransGeo, establishing a new state-of-the-art in vision-based localization. In some scenarios, our vision-based system rivals commercial smartphone GPS accuracy (AUC of 0.797) while requiring 5x less GPU memory and delivering 3x faster inference than existing methods in cross-view geo-localization. These results demonstrate that models inspired by biological spatial navigation can effectively memorize complex, dynamic environments and provide precise localization with minimal computational resources.

Title: BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes

Authors: Minkyun Seo, Hyungtae Lim, Kanghee Lee, Luca Carlone, Jaesik Park
Subjects: cs.CV, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2503.07940
Pdf URL: https://arxiv.org/pdf/2503.07940
Copy Paste: [[2503.07940]] BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes(https://arxiv.org/abs/2503.07940)
Keywords: generation
Abstract: Recent advances in deep learning-based point cloud registration have improved generalization, yet most methods still require retraining or manual parameter tuning for each new environment. In this paper, we identify three key factors limiting generalization: (a) reliance on environment-specific voxel size and search radius, (b) poor out-of-domain robustness of learning-based keypoint detectors, and (c) raw coordinate usage, which exacerbates scale discrepancies. To address these issues, we present a zero-shot registration pipeline called BUFFER-X by (a) adaptively determining voxel size/search radii, (b) using farthest point sampling to bypass learned detectors, and (c) leveraging patch-wise scale normalization for consistent coordinate bounds. In particular, we present a multi-scale patch-based descriptor generation and a hierarchical inlier search across scales to improve robustness in diverse scenes. We also propose a novel generalizability benchmark using 11 datasets that cover various indoor/outdoor scenarios and sensor modalities, demonstrating that BUFFER-X achieves substantial generalization without prior information or manual parameter tuning for the test datasets. Our code is available at this https URL.
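Two of the ingredients named in the BUFFER-X abstract, farthest point sampling and patch-wise scale normalization, are easy to sketch. The code below is a minimal pure-Python illustration, not the authors' pipeline: the function names, the fixed 3.0 patch radius, and the choice to center on the patch centroid and scale into the unit ball are all assumptions.

```python
import math
import random

def farthest_point_sampling(points, k, start=0):
    """Pick k indices so each new point is farthest from those already chosen."""
    if not 0 < k <= len(points):
        raise ValueError("k must be in 1..len(points)")
    selected = [start]
    # Distance from every point to its nearest selected point so far.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected

def normalize_patch(patch):
    """Center a patch on its centroid and scale it into the unit ball (assumed scheme)."""
    dim = len(patch[0])
    centroid = [sum(p[d] for p in patch) / len(patch) for d in range(dim)]
    centered = [[p[d] - centroid[d] for d in range(dim)] for p in patch]
    radius = max(math.hypot(*c) for c in centered) or 1.0  # guard single-point patches
    return [[x / radius for x in c] for c in centered]

cloud_rng = random.Random(0)  # fixed seed so the sketch is reproducible
cloud = [tuple(cloud_rng.uniform(0, 10) for _ in range(3)) for _ in range(200)]
keypoints = farthest_point_sampling(cloud, 8)

# Hypothetical patch: all points within a fixed radius of the first keypoint.
center = cloud[keypoints[0]]
patch = normalize_patch([p for p in cloud if math.dist(p, center) <= 3.0])
```

Because the sampling needs no learned detector and the normalized patch always lies within the same bounds, neither step depends on the scale of a particular dataset, which is the motivation the abstract gives.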

Title: Recent Advances in Hypergraph Neural Networks

Authors: Murong Yang, Xin-Jian Xu
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2503.07959
Pdf URL: https://arxiv.org/pdf/2503.07959
Copy Paste: [[2503.07959]] Recent Advances in Hypergraph Neural Networks(https://arxiv.org/abs/2503.07959)
Keywords: generative
Abstract: The growing interest in hypergraph neural networks (HGNNs) is driven by their capacity to capture the complex relationships and patterns within hypergraph-structured data across various domains, including computer vision, complex networks, and natural language processing. This paper comprehensively reviews recent advances in HGNNs and presents a taxonomy of mainstream models based on their architectures: hypergraph convolutional networks (HGCNs), hypergraph attention networks (HGATs), hypergraph autoencoders (HGAEs), hypergraph recurrent networks (HGRNs), and deep hypergraph generative models (DHGGMs). For each category, we delve into its practical applications, mathematical mechanisms, literature contributions, and open problems. Finally, we discuss some common challenges and promising research directions. This paper aspires to be a helpful resource that provides guidance for future research and applications of HGNNs.

Title: 7ABAW-Compound Expression Recognition via Curriculum Learning

Authors: Chen Liu, Feng Qiu, Wei Zhang, Lincheng Li, Dadong Wang, Xin Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07969
Pdf URL: https://arxiv.org/pdf/2503.07969
Copy Paste: [[2503.07969]] 7ABAW-Compound Expression Recognition via Curriculum Learning(https://arxiv.org/abs/2503.07969)
Keywords: generation
Abstract: With the advent of deep learning, expression recognition has made significant advancements. However, due to the limited availability of annotated compound expression datasets and the subtle variations of compound expressions, Compound Emotion Recognition (CE) still holds considerable potential for exploration. To advance this task, the 7th Affective Behavior Analysis in-the-wild (ABAW) competition introduces the Compound Expression Challenge based on C-EXPR-DB, a limited dataset without labels. In this paper, we present a curriculum learning-based framework that initially trains the model on single-expression tasks and subsequently incorporates multi-expression data. This design ensures that our model first masters the fundamental features of basic expressions before being exposed to the complexities of compound emotions. Specifically, our designs can be summarized as follows: 1) Single-Expression Pre-training: The model is first trained on datasets containing single expressions to learn the foundational facial features associated with basic emotions. 2) Dynamic Compound Expression Generation: Given the scarcity of annotated compound expression datasets, we employ CutMix and Mixup techniques on the original single-expression images to create hybrid images exhibiting characteristics of multiple basic emotions. 3) Incremental Multi-Expression Integration: After performing well on single-expression tasks, the model is progressively exposed to multi-expression data, allowing the model to adapt to the complexity and variability of compound expressions. The official results indicate that our method achieves the best performance in this competition track with an F-score of 0.6063. Our code is released at this https URL.
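Step 2 of the abstract relies on Mixup and CutMix, two standard augmentation techniques. The following sketch shows their usual definitions on tiny grayscale images stored as nested lists. It is an illustration under assumed names (`mixup`, `cutmix`) and toy data, not the authors' released code. In practice the blend weight is typically drawn from a Beta distribution (for example with `random.betavariate`), but it is fixed here to keep the example deterministic.

```python
def mixup(img_a, img_b, label_a, label_b, lam):
    """Blend two images pixel-wise; labels are blended with the same weight."""
    image = [[lam * a + (1 - lam) * b for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(img_a, img_b)]
    label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return image, label

def cutmix(img_a, img_b, label_a, label_b, box):
    """Paste a rectangle of img_b into img_a; labels are weighted by area."""
    top, left, bottom, right = box  # half-open ranges: rows top..bottom-1, cols left..right-1
    height, width = len(img_a), len(img_a[0])
    image = [row[:] for row in img_a]
    for r in range(top, bottom):
        image[r][left:right] = img_b[r][left:right]
    lam = 1 - (bottom - top) * (right - left) / (height * width)
    label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return image, label

# Toy 4x4 "images" for two basic expressions with one-hot labels (hypothetical data).
happy = [[1.0] * 4 for _ in range(4)]
surprised = [[0.0] * 4 for _ in range(4)]
mixed_img, mixed_label = mixup(happy, surprised, [1, 0], [0, 1], lam=0.5)
cut_img, cut_label = cutmix(happy, surprised, [1, 0], [0, 1], box=(0, 0, 2, 2))
```

Either way the resulting soft label carries weight on two basic expressions, which is what lets single-expression data stand in for scarce compound-expression annotations.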

Title: Regulatory DNA sequence Design with Reinforcement Learning

Authors: Zhao Yang, Bing Su, Chuan Cao, Ji-Rong Wen
Subjects: cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2503.07981
Pdf URL: https://arxiv.org/pdf/2503.07981
Copy Paste: [[2503.07981]] Regulatory DNA sequence Design with Reinforcement Learning(https://arxiv.org/abs/2503.07981)
Keywords: generative
Abstract: Cis-regulatory elements (CREs), such as promoters and enhancers, are relatively short DNA sequences that directly regulate gene expression. The fitness of CREs, measured by their ability to modulate gene expression, highly depends on the nucleotide sequences, especially specific motifs known as transcription factor binding sites (TFBSs). Designing high-fitness CREs is crucial for therapeutic and bioengineering applications. Current CRE design methods are limited by two major drawbacks: (1) they typically rely on iterative optimization strategies that modify existing sequences and are prone to local optima, and (2) they lack the guidance of biological prior knowledge in sequence optimization. In this paper, we address these limitations by proposing a generative approach that leverages reinforcement learning (RL) to fine-tune a pre-trained autoregressive (AR) model. Our method incorporates data-driven biological priors by deriving computational inference-based rewards that simulate the addition of activator TFBSs and removal of repressor TFBSs, which are then integrated into the RL process. We evaluate our method on promoter design tasks in two yeast media conditions and enhancer design tasks for three human cell types, demonstrating its ability to generate high-fitness CREs while maintaining sequence diversity. The code is available at this https URL.

Title: DiffEGG: Diffusion-Driven Edge Generation as a Pixel-Annotation-Free Alternative for Instance Annotation

Authors: Sanghyun Jo, Ziseok Lee, Wooyeol Lee, Kyungsu Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07982
Pdf URL: https://arxiv.org/pdf/2503.07982
Copy Paste: [[2503.07982]] DiffEGG: Diffusion-Driven Edge Generation as a Pixel-Annotation-Free Alternative for Instance Annotation(https://arxiv.org/abs/2503.07982)
Keywords: generation
Abstract: Achieving precise panoptic segmentation relies on pixel-wise instance annotations, but obtaining such datasets is costly. Unsupervised instance segmentation (UIS) eliminates annotation requirements but struggles with adjacent instance merging and single-instance fragmentation, largely due to the limitations of DINO-based backbones which lack strong instance separation cues. Weakly-supervised panoptic segmentation (WPS) reduces annotation costs using sparse labels (e.g., points, boxes), yet these annotations remain expensive and introduce human bias and boundary errors. To address these challenges, we propose DiffEGG (Diffusion-Driven EdGe Generation), a fully annotation-free method that extracts instance-aware features from pretrained diffusion models to generate precise instance edge maps. Unlike DINO-based UIS methods, diffusion models inherently capture fine-grained, instance-aware features, enabling more precise boundary delineation. For WPS, DiffEGG eliminates annotation costs and human bias by operating without any form of manual supervision, addressing the key limitations of prior best methods. Additionally, we introduce RIP, a post-processing technique that fuses DiffEGG's edge maps with segmentation masks in a task-agnostic manner. RIP allows DiffEGG to be seamlessly integrated into various segmentation frameworks. When applied to UIS, DiffEGG and RIP achieve an average +4.4 AP improvement over prior best UIS methods. When combined with weakly-supervised semantic segmentation (WSS), DiffEGG enables WPS without instance annotations, outperforming prior best point-supervised WPS methods by +1.7 PQ. These results demonstrate that DiffEGG's edge maps serve as a cost-effective, annotation-free alternative to instance annotations, significantly improving segmentation without human intervention. Code is available at this https URL.

Title: CDI3D: Cross-guided Dense-view Interpolation for 3D Reconstruction

Authors: Zhiyuan Wu, Xibin Song, Senbo Wang, Weizhe Liu, Jiayu Yang, Ziang Cheng, Shenzhou Chen, Taizhang Shang, Weixuan Sun, Shan Luo, Pan Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08005
Pdf URL: https://arxiv.org/pdf/2503.08005
Copy Paste: [[2503.08005]] CDI3D: Cross-guided Dense-view Interpolation for 3D Reconstruction(https://arxiv.org/abs/2503.08005)
Keywords: generation
Abstract: 3D object reconstruction from a single-view image is a fundamental task in computer vision with wide-ranging applications. Recent advancements in Large Reconstruction Models (LRMs) have shown great promise in leveraging multi-view images generated by 2D diffusion models to extract 3D content. However, challenges remain as 2D diffusion models often struggle to produce dense images with strong multi-view consistency, and LRMs tend to amplify these inconsistencies during the 3D reconstruction process. Addressing these issues is critical for achieving high-quality and efficient 3D reconstruction. In this paper, we present CDI3D, a feed-forward framework designed for efficient, high-quality image-to-3D generation with view interpolation. To tackle the aforementioned challenges, we propose to integrate 2D diffusion-based view interpolation into the LRM pipeline to enhance the quality and consistency of the generated mesh. Specifically, our approach introduces a Dense View Interpolation (DVI) module, which synthesizes interpolated images between main views generated by the 2D diffusion model, effectively densifying the input views with better multi-view consistency. We also design a tilt camera pose trajectory to capture views with different elevations and perspectives. Subsequently, we employ a tri-plane-based mesh reconstruction strategy to extract robust tokens from these interpolated and original views, enabling the generation of high-quality 3D meshes with superior texture and geometry. Extensive experiments demonstrate that our method significantly outperforms previous state-of-the-art approaches across various benchmarks, producing 3D content with enhanced texture fidelity and geometric accuracy.

Title: Exploring Bias in over 100 Text-to-Image Generative Models

Authors: Jordan Vice, Naveed Akhtar, Richard Hartley, Ajmal Mian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08012
Pdf URL: https://arxiv.org/pdf/2503.08012
Copy Paste: [[2503.08012]] Exploring Bias in over 100 Text-to-Image Generative Models(https://arxiv.org/abs/2503.08012)
Keywords: generative
Abstract: We investigate bias trends in text-to-image generative models over time, focusing on the increasing availability of models through open platforms like Hugging Face. While these platforms democratize AI, they also facilitate the spread of inherently biased models, often shaped by task-specific fine-tuning. Ensuring ethical and transparent AI deployment requires robust evaluation frameworks and quantifiable bias metrics. To this end, we assess bias across three key dimensions: (i) distribution bias, (ii) generative hallucination, and (iii) generative miss-rate. Analyzing over 100 models, we reveal how bias patterns evolve over time and across generative tasks. Our findings indicate that artistic and style-transferred models exhibit significant bias, whereas foundation models, benefiting from broader training distributions, are becoming progressively less biased. By identifying these systemic trends, we contribute a large-scale evaluation corpus to inform bias research and mitigation strategies, fostering more responsible AI development. Keywords: Bias, Ethical AI, Text-to-Image, Generative Models, Open-Source Models

Title: GPT-PPG: A GPT-based Foundation Model for Photoplethysmography Signals

Authors: Zhaoliang Chen, Cheng Ding, Saurabh Kataria, Runze Yan, Minxiao Wang, Randall Lee, Xiao Hu
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2503.08015
Pdf URL: https://arxiv.org/pdf/2503.08015
Copy Paste: [[2503.08015]] GPT-PPG: A GPT-based Foundation Model for Photoplethysmography Signals(https://arxiv.org/abs/2503.08015)
Keywords: generative
Abstract: This study introduces a novel application of a Generative Pre-trained Transformer (GPT) model tailored for photoplethysmography (PPG) signals, serving as a foundation model for various downstream tasks. Adapting the standard GPT architecture to suit the continuous characteristics of PPG signals, our approach demonstrates promising results. Our models are pre-trained on our extensive dataset that contains more than 200 million 30s PPG samples. We explored different supervised fine-tuning techniques to adapt our model to downstream tasks, resulting in performance comparable to or surpassing current state-of-the-art (SOTA) methods in tasks like atrial fibrillation detection. A standout feature of our GPT model is its inherent capability to perform generative tasks such as signal denoising effectively, without the need for further fine-tuning. This success is attributed to the generative nature of the GPT framework.

Title: HOFAR: High-Order Augmentation of Flow Autoregressive Transformers

Authors: Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Mingda Wan
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08032
Pdf URL: https://arxiv.org/pdf/2503.08032
Copy Paste: [[2503.08032]] HOFAR: High-Order Augmentation of Flow Autoregressive Transformers(https://arxiv.org/abs/2503.08032)
Keywords: generation
Abstract: Flow Matching and Transformer architectures have demonstrated remarkable performance in image generation tasks, with recent work FlowAR [Ren et al., 2024] synergistically integrating both paradigms to advance synthesis fidelity. However, current FlowAR implementations remain constrained by first-order trajectory modeling during the generation process. This paper introduces a novel framework that systematically enhances flow autoregressive transformers through high-order supervision. We provide theoretical analysis and empirical evaluation showing that our High-Order FlowAR (HOFAR) demonstrates measurable improvements in generation quality compared to baseline models. The proposed approach advances the understanding of flow-based autoregressive modeling by introducing a systematic framework for analyzing trajectory dynamics through high-order expansion.

Title: SphOR: A Representation Learning Perspective on Open-set Recognition for Identifying Unknown Classes in Deep Learning Models

Authors: Nadarasar Bahavan, Sachith Seneviratne, Saman Halgamuge
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08049
Pdf URL: https://arxiv.org/pdf/2503.08049
Copy Paste: [[2503.08049]] SphOR: A Representation Learning Perspective on Open-set Recognition for Identifying Unknown Classes in Deep Learning Models(https://arxiv.org/abs/2503.08049)
Keywords: generative
Abstract: The widespread use of deep learning classifiers necessitates open-set recognition (OSR), which enables the identification of input data not only from classes known during training but also from unknown classes that might be present in test data. Many existing OSR methods are computationally expensive due to the reliance on complex generative models or suffer from high training costs. We investigate OSR from a representation-learning perspective, specifically through spherical embeddings. We introduce SphOR, a computationally efficient representation learning method that models the feature space as a mixture of von Mises-Fisher distributions. This approach enables the use of semantically ambiguous samples during training, to improve the detection of samples from unknown classes. We further explore the relationship between OSR performance and key representation learning properties which influence how well features are structured in high-dimensional space. Extensive experiments on multiple OSR benchmarks demonstrate the effectiveness of our method, producing state-of-the-art results, with improvements of up to 6% that validate its performance.
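A von Mises-Fisher (vMF) component on the unit sphere has density proportional to exp(kappa * mu·x). If every class shares the same concentration kappa and mixture weight, the normalizing constants cancel and the class posterior reduces to a softmax over scaled cosine similarities. The sketch below illustrates that observation together with a simple cosine threshold for rejecting unknowns. The threshold rule, the function names, and the toy class means are assumptions for illustration; the abstract does not describe SphOR's actual training objective or rejection rule.

```python
import math

def cosines(feature, class_means):
    """Cosine similarity between a feature and each class mean direction."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))  # assumes a non-zero vector
        return [x / norm for x in v]
    z = unit(feature)
    return [sum(a * b for a, b in zip(z, unit(mu))) for mu in class_means]

def vmf_posteriors(feature, class_means, kappa):
    """Class posteriors under equal-weight vMF components with a shared kappa."""
    logits = [kappa * c for c in cosines(feature, class_means)]
    top = max(logits)  # subtract the max before exponentiating for stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_open_set(feature, class_means, min_cosine=0.5):
    """Return the best class index, or -1 if the feature is far from every mean."""
    sims = cosines(feature, class_means)
    best = max(range(len(sims)), key=sims.__getitem__)
    return best if sims[best] >= min_cosine else -1

means = [[1.0, 0.0], [0.0, 1.0]]                 # hypothetical known-class directions
known = predict_open_set([2.0, 0.1], means)      # close to class 0
unknown = predict_open_set([-1.0, -1.0], means)  # far from both known classes
post = vmf_posteriors([2.0, 0.1], means, kappa=10.0)
```

Scoring needs only dot products against a handful of mean directions, which is consistent with the abstract's emphasis on avoiding expensive generative models.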

Title: Unmasking the Unknown: Facial Deepfake Detection in the Open-Set Paradigm

Authors: Nadarasar Bahavan, Sanjay Saha, Ken Chen, Sachith Seneviratne, Sanka Rasnayaka, Saman Halgamuge
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08055
Pdf URL: https://arxiv.org/pdf/2503.08055
Copy Paste: [[2503.08055]] Unmasking the Unknown: Facial Deepfake Detection in the Open-Set Paradigm(https://arxiv.org/abs/2503.08055)
Keywords: generative
Abstract: Facial forgery methods such as deepfakes can be misused for identity manipulation and spreading misinformation. They have evolved alongside advancements in generative AI, leading to new and more sophisticated forgery techniques that diverge from existing 'known' methods. Conventional deepfake detection methods use the closed-set paradigm, which limits their ability to detect forgeries created using methods that are not part of the training dataset. In this paper, we propose a shift from the closed-set paradigm for deepfake detection. In the open-set paradigm, models are designed not only to identify images created by known facial forgery methods but also to identify and flag those produced by previously unknown methods as 'unknown' and not as unforged/real/unmanipulated. In this paper, we propose an open-set deepfake classification algorithm based on supervised contrastive learning. The open-set paradigm used in our model allows it to function as a more robust tool capable of handling emerging and unseen deepfake techniques, enhancing reliability and confidence, and complementing forensic analysis. In the open-set paradigm, we identify three groups, including the "unknown" group that is considered neither known deepfake nor real. We investigate deepfake open-set classification across three scenarios: classifying deepfakes from unknown methods as not real, distinguishing real images from deepfakes, and classifying deepfakes from known methods, using the FaceForensics++ dataset as a benchmark. Our method achieves state-of-the-art results in the first two tasks and competitive results in the third task.

Title: Seeing Beyond Haze: Generative Nighttime Image Dehazing

Authors: Beibei Lin, Stephen Lin, Robby Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08073
Pdf URL: https://arxiv.org/pdf/2503.08073
Copy Paste: [[2503.08073]] Seeing Beyond Haze: Generative Nighttime Image Dehazing(https://arxiv.org/abs/2503.08073)
Keywords: generative
Abstract: Nighttime image dehazing is particularly challenging when dense haze and intense glow severely degrade or completely obscure background information. Existing methods often encounter difficulties due to insufficient background priors and limited generative ability, both essential for handling such conditions. In this paper, we introduce BeyondHaze, a generative nighttime dehazing method that not only significantly reduces haze and glow effects but also infers background information in regions where it may be absent. Our approach is developed on two main ideas: gaining strong background priors by adapting image diffusion models to the nighttime dehazing problem, and enhancing generative ability for haze- and glow-obscured scene areas through guided training. Task-specific nighttime dehazing knowledge is distilled into an image diffusion model in a manner that preserves its capacity to generate clean images. The diffusion model is additionally trained on image pairs designed to improve its ability to generate background details and content that are missing in the input image due to haze effects. Since generative models are susceptible to hallucinations, we develop our framework to allow user control over the generative level, balancing visual realism and factual accuracy. Experiments on real-world images demonstrate that BeyondHaze effectively restores visibility in dense nighttime haze.

Title: PRISM: Privacy-Preserving Improved Stochastic Masking for Federated Generative Models

Authors: Kyeongkook Seo, Dong-Jun Han, Jaejun Yoo
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2503.08085
Pdf URL: https://arxiv.org/pdf/2503.08085
Copy Paste: [[2503.08085]] PRISM: Privacy-Preserving Improved Stochastic Masking for Federated Generative Models(https://arxiv.org/abs/2503.08085)
Keywords: generative
Abstract: Despite recent advancements in federated learning (FL), the integration of generative models into FL has been limited due to challenges such as high communication costs and unstable training in heterogeneous data environments. To address these issues, we propose PRISM, an FL framework tailored for generative models that ensures (i) stable performance in heterogeneous data distributions and (ii) resource efficiency in terms of communication cost and final model size. The key to our method is to search for an optimal stochastic binary mask for a random network rather than updating the model weights, identifying a sparse subnetwork with high generative performance; i.e., a "strong lottery ticket". By communicating binary masks in a stochastic manner, PRISM minimizes communication overhead. This approach, combined with the utilization of maximum mean discrepancy (MMD) loss and a mask-aware dynamic moving average aggregation method (MADA) on the server side, facilitates stable and strong generative capabilities by mitigating local divergence in FL scenarios. Moreover, thanks to its sparsifying characteristic, PRISM yields a lightweight model without extra pruning or quantization, making it ideal for environments such as edge devices. Experiments on MNIST, FMNIST, CelebA, and CIFAR10 demonstrate that PRISM outperforms existing methods, while maintaining privacy with minimal communication costs. PRISM is the first to successfully generate images under challenging non-IID and privacy-preserving FL environments on complex datasets, where previous methods have struggled.
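The idea of communicating stochastic binary masks instead of weights can be sketched in a few lines. In this illustration each client turns per-weight scores into keep-probabilities, samples one bit per weight, and sends only the bits; the server averages them to estimate the probabilities. All function names are hypothetical, and the fixed-momentum `moving_average` is only a stand-in for MADA, whose mask-aware dynamic rule is not specified in the abstract.

```python
import math
import random

def sample_mask(scores, rng):
    """Draw a binary mask with keep-probability sigmoid(score) per weight."""
    return [1 if rng.random() < 1 / (1 + math.exp(-s)) else 0 for s in scores]

def aggregate_masks(masks):
    """Server side: the mean of the binary masks estimates the keep-probabilities."""
    return [sum(bits) / len(masks) for bits in zip(*masks)]

def moving_average(previous, current, momentum):
    """Blend the previous global probabilities with the newly aggregated ones."""
    return [momentum * p + (1 - momentum) * c for p, c in zip(previous, current)]

def apply_mask(frozen_weights, mask):
    """The subnetwork keeps a frozen random weight only where the mask bit is 1."""
    return [w * m for w, m in zip(frozen_weights, mask)]

rng = random.Random(0)     # fixed seed so the sketch is reproducible
scores = [4.0, -4.0, 0.0]  # hypothetical learned scores for three weights
masks = [sample_mask(scores, rng) for _ in range(2000)]  # one mask per simulated client
probs = aggregate_masks(masks)
smoothed = moving_average([0.5, 0.5, 0.5], probs, momentum=0.9)
subnet = apply_mask([0.3, -0.2, 0.7], masks[0])
```

Each client uploads one bit per weight rather than a floating-point value, and the final model is the frozen random network with the mask applied, which is where the communication and model-size savings described in the abstract come from.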

Title: MegaSR: Mining Customized Semantics and Expressive Guidance for Image Super-Resolution

Authors: Xinrui Li, Jianlong Wu, Xinchuan Huang, Chong Chen, Weili Guan, Xian-Sheng Hua, Liqiang Nie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08096
Pdf URL: https://arxiv.org/pdf/2503.08096
Copy Paste: [[2503.08096]] MegaSR: Mining Customized Semantics and Expressive Guidance for Image Super-Resolution(https://arxiv.org/abs/2503.08096)
Keywords: super-resolution
Abstract: Pioneering text-to-image (T2I) diffusion models have ushered in a new era of real-world image super-resolution (Real-ISR), significantly enhancing the visual perception of reconstructed images. However, existing methods typically integrate uniform abstract textual semantics across all blocks, overlooking the distinct semantic requirements at different depths and the fine-grained, concrete semantics inherently present in the images themselves. Moreover, relying solely on a single type of guidance further disrupts the consistency of reconstruction. To address these issues, we propose MegaSR, a novel framework that mines customized block-wise semantics and expressive guidance for diffusion-based ISR. Compared to uniform textual semantics, MegaSR enables flexible adaptation to multi-granularity semantic awareness by dynamically incorporating image attributes at each block. Furthermore, we experimentally identify HED edge maps, depth maps, and segmentation maps as the most expressive guidance, and propose a multi-stage aggregation strategy to modulate them into the T2I models. Extensive experiments demonstrate the superiority of MegaSR in terms of semantic richness and structural consistency.
摘要：开创性的文本对图像（T2I）扩散模型已迎来了现实世界图像超分辨率（Real-ISR）的新时代，从而显着增强了重建图像的视觉感知。但是，现有方法通常会在所有块上整合统一的抽象文本语义，从而忽略了不同深度的不同语义要求以及图像本身本质上存在的细粒度，具体的语义。此外，仅依靠一种类型的指导进一步破坏了重建的一致性。为了解决这些问题，我们提出了MEGASR，这是一个新颖的框架，可挖掘自定义的基于扩散的ISR的宽敞语义和表达性指南。与统一的文本语义相比，MEGASR通过在每个块上动态合并图像属性来灵活适应多晶格语义意识。此外，我们在实验上识别出HED边缘图，深度图和分割图是最表现力的指导，并提出了将它们调节到T2I模型中的多阶段聚合策略。广泛的实验证明了Megasr在语义丰富性和结构一致性方面的优越性。

Title: ACE: Concept Editing in Diffusion Models without Performance Degradation

Authors: Ruipeng Wang, Junfeng Fang, Jiaqi Li, Hao Chen, Jie Shi, Kun Wang, Xiang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08116
Pdf URL: https://arxiv.org/pdf/2503.08116
Copy Paste: [[2503.08116]] ACE: Concept Editing in Diffusion Models without Performance Degradation(https://arxiv.org/abs/2503.08116)
Keywords: generation
Abstract: Diffusion-based text-to-image models have demonstrated remarkable capabilities in generating realistic images, but they raise societal and ethical concerns, such as the creation of unsafe content. While concept editing has been proposed to address these issues, existing methods often struggle to balance the removal of unsafe concepts with maintaining the model's general generative capabilities. In this work, we propose ACE, a new editing method that enhances concept editing in diffusion models. ACE introduces a novel cross null-space projection approach to precisely erase unsafe concepts while maintaining the model's ability to generate high-quality, semantically consistent images. Extensive experiments demonstrate that ACE significantly outperforms advanced baselines, improving semantic consistency by 24.56% and image generation quality by 34.82% on average with only 1% of the time cost. These results highlight the practical utility of concept editing by mitigating its potential risks, paving the way for broader applications in the field. Code is available at this https URL
摘要：基于扩散的文本对图像模型在产生逼真的图像方面表现出了显着的功能，但是它们引起了社会和道德问题，例如创建不安全的内容。尽管提出了概念编辑来解决这些问题，但他们通常很难平衡对不安全概念的去除与保持模型的一般属性功能的平衡。在这项工作中，我们提出了ACE，这是一种新的编辑方法，可增强扩散模型中的概念编辑。 ACE引入了一种新颖的杂交空空间投影方法，以精确消除不安全的概念，同时保持模型产生高质量的语义一致图像的能力。广泛的实验表明，ACE明显胜过前进的基线，将语义一致性提高24.56％，图像产生质量平均提高了34.82％，只有1％的时间成本。这些结果通过减轻其潜在风险，为该领域的更广泛的应用铺平道路，从而突出了概念编辑的实际实用性。在此https URL上可以使用代码

Title: Convergence Dynamics and Stabilization Strategies of Co-Evolving Generative Models

Authors: Weiguo Gao, Ming Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08117
Pdf URL: https://arxiv.org/pdf/2503.08117
Copy Paste: [[2503.08117]] Convergence Dynamics and Stabilization Strategies of Co-Evolving Generative Models(https://arxiv.org/abs/2503.08117)
Keywords: generative
Abstract: The increasing prevalence of synthetic data in training loops has raised concerns about model collapse, where generative models degrade when trained on their own outputs. While prior work focuses on this self-consuming process, we study an underexplored yet prevalent phenomenon: co-evolving generative models that shape each other's training through iterative feedback. This is common in multimodal AI ecosystems, such as social media platforms, where text models generate captions that guide image models, and the resulting images influence the future adaptation of the text model. We take a first step by analyzing such a system, modeling the text model as a multinomial distribution and the image model as a conditional multi-dimensional Gaussian distribution. Our analysis uncovers three key results. First, when one model remains fixed, the other collapses: a frozen image model causes the text model to lose diversity, while a frozen text model leads to an exponential contraction of image diversity, though fidelity remains bounded. Second, in fully interactive systems, mutual reinforcement accelerates collapse, with image contraction amplifying text homogenization and vice versa, leading to a Matthew effect where dominant texts sustain higher image diversity while rarer texts collapse faster. Third, we analyze stabilization strategies implicitly introduced by real-world external influences. Random corpus injections for text models and user-content injections for image models prevent collapse while preserving both diversity and fidelity. Our theoretical findings are further validated through experiments.
摘要：训练回路中合成数据的普遍性日益增加引起了人们对模型崩溃的担忧，在这种模型以自己的输出进行培训时，生成模型会降低。尽管先前的工作着重于这个自我消耗的过程，但我们研究了一种不受欢迎但普遍的现象：共同发展的生成模型，这些模型通过迭代反馈来塑造彼此的训练。这在多模式AI生态系统（例如社交媒体平台）中很常见，文本模型会生成指导图像模型的标题，而所产生的图像会影响文本模型的未来适应。我们通过分析这样的系统，将文本模型作为多项式分布建模，将图像模型建模为条件多维高斯分布，从而迈出第一步。我们的分析发现了三个关键结果。首先，当一个模型保持固定时，另一个模型会崩溃：冷冻图像模型会导致文本模型失去多样性，而冷冻文本模型会导致图像多样性的指数收缩，尽管富裕性仍然有限。其次，在完全互动的系统中，相互加固会加速崩溃，图像收缩扩大了文本均匀化，反之亦然，导致MATTHEW效应，其中主导文本具有更高的图像多样性，而稀有文本则更快地塌陷。第三，我们分析了由现实世界外部影响隐含引入的稳定策略。文本模型和图像模型用户符合注射的随机语料库注射，可阻止崩溃，同时保留多样性和保真度。我们的理论发现将通过实验进一步验证。
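
The loss of text diversity described in the abstract above can be illustrated with a deliberately simplified, self-consuming loop. This is a toy stand-in, not the paper's coupled text-image system: a multinomial "text model" is repeatedly re-estimated from a finite sample of its own outputs, and all names and parameter values are illustrative assumptions.

```python
import random


def resample_step(probs, n_samples, rng):
    """Re-estimate a multinomial from n_samples draws of itself."""
    counts = [0] * len(probs)
    for idx in rng.choices(range(len(probs)), weights=probs, k=n_samples):
        counts[idx] += 1
    return [c / n_samples for c in counts]


def support_size(probs):
    """Number of categories that still have non-zero probability."""
    return sum(1 for p in probs if p > 0)


rng = random.Random(0)
probs = [0.1] * 10  # start from a uniform distribution over 10 hypothetical texts
history = [support_size(probs)]
for _ in range(50):
    probs = resample_step(probs, n_samples=20, rng=rng)
    history.append(support_size(probs))
```

Once a category receives zero samples it can never return, so the support size recorded in `history` can only shrink. Injecting fresh samples from an external corpus at each step, as the abstract describes for stabilization, is what breaks this absorbing behaviour.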

Title: Uni$\textbf{F}^2$ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models

Authors: Junzhe Li, Xuerui Qiu, Linrui Xu, Liya Guo, Delin Qu, Tingting Long, Chun Fan, Ming Li
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2503.08120
Pdf URL: https://arxiv.org/pdf/2503.08120
Copy Paste: [[2503.08120]] Uni$\textbf{F}^2$ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models(https://arxiv.org/abs/2503.08120)
Keywords: generation, generative
Abstract: Unified multimodal models (UMMs) have emerged as a powerful paradigm in foundational computer vision research, demonstrating significant potential in both image understanding and generation. However, existing research in the face domain primarily focuses on $\textbf{coarse}$ facial attribute understanding, with limited capacity to handle $\textbf{fine-grained}$ facial attributes and without addressing generation capabilities. To overcome these limitations, we propose Uni$\textbf{F}^2$ace, the first UMM tailored specifically for fine-grained face understanding and generation. In general, we train Uni$\textbf{F}^2$ace on a self-constructed, specialized dataset utilizing two mutually beneficial diffusion techniques and a two-level mixture-of-experts architecture. Specifically, we first build a large-scale facial dataset, Uni$\textbf{F}^2$ace-130K, which contains 130K image-text pairs with one million question-answering pairs that span a wide range of facial attributes. Second, we establish a theoretical connection between discrete diffusion score matching and masked generative models, optimizing both evidence lower bounds simultaneously, which significantly improves the model's ability to synthesize facial details. Finally, we introduce both token-level and sequence-level mixture-of-experts, enabling efficient fine-grained representation learning for both understanding and generation tasks. Extensive experiments on Uni$\textbf{F}^2$ace-130K demonstrate that Uni$\textbf{F}^2$ace outperforms existing UMMs and generative models, achieving superior performance across both understanding and generation tasks.
摘要：统一的多模型模型（UMM）已成为基础计算机视觉研究中的强大范式，在图像理解和产生中都表现出了巨大的潜力。但是，面部域中的现有研究主要集中在$ \ textbf {croun} $面部属性理解上，处理$ \ textbf {fine-Graining} $面部属性的能力有限，而无需解决生成功能。为了克服这些限制，我们建议Uni $ \ textbf {f}^2 $ ace，这是专门针对细粒度的面部理解和生成而定制的第一个UMM。通常，我们利用两种互惠互利的扩散技术和两级Experts体系结构训练Uni $ \ textbf {f}^2 $ ace在一个自我结构的专业数据集上进行训练。具体来说，我们首先构建了一个大型面部数据集，Uni $ \ textbf {f}^2 $ ACE-130k，其中包含130k映像 - 文本对，带有一百万个问答对，跨越各种面部属性。其次，我们在离散扩散分数匹配和掩盖生成模型之间建立了理论上的联系，同时优化了这两个证据，从而显着提高了该模型合成面部细节的能力。最后，我们介绍了代币级别和序列级别的专家混合物，从而为理解和生成任务提供了有效的细粒度表示学习。对Uni $ \ textbf {f}^2 $ ACE-130K进行的广泛实验表明，Uni $ \ textbf {f}^2 $ Ace优于现有的UMMS和生成模型，在理解和生成任务中都实现了卓越的性能。

Title: Toward Stable World Models: Measuring and Addressing World Instability in Generative Environments

Authors: Soonwoo Kwon, Jin-Young Kim, Hyojun Go, Kyungjune Baek
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.08122
Pdf URL: https://arxiv.org/pdf/2503.08122
Copy Paste: [[2503.08122]] Toward Stable World Models: Measuring and Addressing World Instability in Generative Environments(https://arxiv.org/abs/2503.08122)
Keywords: generative
Abstract: We present a novel study on enhancing the capability of preserving the content in world models, focusing on a property we term World Stability. Recent diffusion-based generative models have advanced the synthesis of immersive and realistic environments that are pivotal for applications such as reinforcement learning and interactive game engines. However, while these models excel in quality and diversity, they often neglect the preservation of previously generated scenes over time--a shortfall that can introduce noise into agent learning and compromise performance in safety-critical settings. In this work, we introduce an evaluation framework that measures world stability by having world models perform a sequence of actions followed by their inverses to return to their initial viewpoint, thereby quantifying the consistency between the starting and ending observations. Our comprehensive assessment of state-of-the-art diffusion-based world models reveals significant challenges in achieving high world stability. Moreover, we investigate several improvement strategies to enhance world stability. Our results underscore the importance of world stability in world modeling and provide actionable insights for future research in this domain.
摘要：我们提出了一项新的研究，以增强保留世界模型中内容的能力，重点是我们称世界稳定性的财产。最近的基于扩散的生成模型已经提出了对沉浸式和现实的环境的综合，这些环境对诸如增强学习和互动游戏引擎等应用至关重要。但是，尽管这些模型在质量和多样性方面都表现出色，但它们经常忽略了随着时间的流逝的先前生成的场景的保存 - 这种短缺可以将噪声引入代理学习并损害安全至关重要的环境中的性能。在这项工作中，我们介绍了一个评估框架，该框架通过让世界模型执行一系列动作，然后对其进行逆，以返回其初始观点，从而量化了开始观测和结束观测之间的一致性。我们对基于最新扩散的世界模型的全面评估揭示了实现高世界稳定性的重大挑战。此外，我们研究了增强世界稳定性的几种改进策略。我们的结果强调了世界稳定在世界建模中的重要性，并为该领域的未来研究提供了可行的见解。
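
The round-trip evaluation described in the abstract above can be sketched as follows. The world model is abstracted as a `step(state, action)` function, and the two toy one-dimensional worlds, the action names, and the distance function are all hypothetical; the paper's actual metric and models are not reproduced.

```python
def invert(actions, inverse_of):
    """Return the action sequence that undoes `actions`, in reverse order."""
    return [inverse_of[a] for a in reversed(actions)]


def round_trip_error(step, observe, state, actions, inverse_of, distance):
    """Apply actions followed by their inverses, then compare start and end observations."""
    start = observe(state)
    for action in actions + invert(actions, inverse_of):
        state = step(state, action)
    return distance(start, observe(state))


inverse_of = {"right": "left", "left": "right"}


def exact_step(pos, action):
    """Toy world that moves exactly one unit per action."""
    return pos + (1.0 if action == "right" else -1.0)


def drifting_step(pos, action):
    """Toy world whose 'left' move undershoots, so content drifts over time."""
    return pos + (1.0 if action == "right" else -0.9)


def abs_distance(a, b):
    return abs(a - b)


plan = ["right", "right", "right"]
stable = round_trip_error(exact_step, lambda s: s, 0.0, plan, inverse_of, abs_distance)
unstable = round_trip_error(drifting_step, lambda s: s, 0.0, plan, inverse_of, abs_distance)
```

A perfectly stable world yields zero error, and any non-zero value measures how far the scene has drifted after returning to the initial viewpoint.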

Title: MGHanD: Multi-modal Guidance for authentic Hand Diffusion

Authors: Taehyeon Eum, Jieun Choi, Tae-Kyun Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08133
Pdf URL: https://arxiv.org/pdf/2503.08133
Copy Paste: [[2503.08133]] MGHanD: Multi-modal Guidance for authentic Hand Diffusion(https://arxiv.org/abs/2503.08133)
Keywords: generation, generative
Abstract: Diffusion-based methods have achieved significant successes in T2I generation, providing realistic images from text prompts. Despite their capabilities, these models face persistent challenges in generating realistic human hands, often producing images with incorrect finger counts and structurally deformed hands. MGHanD addresses this challenge by applying multi-modal guidance during the inference process. For visual guidance, we employ a discriminator trained on a dataset comprising paired real and generated images with captions, derived from various hand-in-the-wild datasets. We also employ textual guidance with a LoRA adapter, which learns the direction from 'hands' towards more detailed prompts such as 'natural hands' and 'anatomically correct fingers' at the latent level. A cumulative hand mask which is gradually enlarged in the assigned time step is applied to the added guidance, allowing the hand to be refined while maintaining the rich generative capabilities of the pre-trained model. In the experiments, our method achieves superior hand generation qualities, without any specific conditions or priors. We carry out both quantitative and qualitative evaluations, along with user studies, to showcase the benefits of our approach in producing high-quality hand images.
摘要：基于扩散的方法在T2i生成中取得了重大成功，从文本提示中提供了逼真的图像。尽管具有能力，但这些模型在产生逼真的人手时仍面临持续的挑战，通常会产生手指计数不正确的图像和结构变形的手。 MGHAND通过在推理过程中应用多模式指导来应对这一挑战。为了进行视觉指导，我们采用了一个歧视器，该歧视器在包含配对的真实图像和带有标题的配对的数据集中训练，并从各种野外数据集中得出。我们还使用Lora适配器采用文本指导，该指导从“手”方向学习了诸如“自然手”之类的更详细提示，并在潜在的层面上学习了“自然手”的方向。在分配的时间步骤中逐渐扩大的累积手掩模将应用于附加的指导，从而使手得到完善，同时保持预训练模型的丰富生成能力。在实验中，我们的方法可以在没有任何特定条件或先验的情况下达到高级生成质量。我们同时进行定量和定性评估以及用户研究，以展示我们在生产高质量手图像中的方法的好处。

Title: FlowDPS: Flow-Driven Posterior Sampling for Inverse Problems

Authors: Jeongsol Kim, Bryan Sangwoo Kim, Jong Chul Ye
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08136
Pdf URL: https://arxiv.org/pdf/2503.08136
Copy Paste: [[2503.08136]] FlowDPS: Flow-Driven Posterior Sampling for Inverse Problems(https://arxiv.org/abs/2503.08136)
Keywords: generative
Abstract: Flow matching is a recent state-of-the-art framework for generative modeling based on ordinary differential equations (ODEs). While closely related to diffusion models, it provides a more general perspective on generative modeling. Although inverse problem solving has been extensively explored using diffusion models, it has not been rigorously examined within the broader context of flow models. Therefore, here we extend the diffusion inverse solvers (DIS), which perform posterior sampling by combining a denoising diffusion prior with a likelihood gradient, into the flow framework. Specifically, by deriving the flow version of Tweedie's formula, we decompose the flow ODE into two components: one for clean image estimation and the other for noise estimation. By integrating the likelihood gradient and stochastic noise into each component, respectively, we demonstrate that posterior sampling for inverse problem solving can be effectively achieved using flows. Our proposed solver, Flow-Driven Posterior Sampling (FlowDPS), can also be seamlessly integrated into a latent flow model with a transformer architecture. Across four linear inverse problems, we confirm that FlowDPS outperforms state-of-the-art alternatives, all without requiring additional training.
摘要：流匹配是基于普通微分方程（ODE）的生成建模的最新框架。虽然与扩散模型密切相关，但它为生成建模提供了更一般的观点。尽管使用扩散模型对解决问题的解决方案进行了广泛的探索，但在更广泛的流程模型背景下并未对其进行严格检查。因此，在这里，我们将扩散逆求器（DIS）扩散 - 通过将降解扩散与可能性梯度相结合 - 进行后验采样 - 将其延伸到流动框架中。具体而言，通过驱动Tweedie公式的流动，我们将流ode分解为两个组件：一个用于干净的图像估计，另一个用于噪声估计。通过分别将似然梯度和随机噪声整合到每个组件中，我们证明了使用流量可以有效地实现反问题解决方案的后验采样。我们提出的求解器流动驱动后采样（FlowDPS）也可以无缝地集成到具有变压器体系结构的潜在流模型中。在四个线性反问题中，我们确认FlowDPS的表现优于最先进的替代方案，而无需额外的培训。
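
The split into a clean-image estimate and a noise estimate mentioned in the abstract above can be illustrated under one common convention. This sketch assumes a linear path x_t = (1 - t) * x0 + t * eps with velocity v = eps - x0, which gives x0 = x_t - t * v and eps = x_t + (1 - t) * v. The convention and the function names are assumptions made for illustration; the paper's exact formulation, its likelihood gradient step, and its stochastic noise step are not reproduced.

```python
def interpolate(x0, eps, t):
    """Point on the assumed linear path between clean data x0 and noise eps."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, eps)]


def velocity(x0, eps):
    """Velocity of the linear path; a trained flow model would predict this."""
    return [b - a for a, b in zip(x0, eps)]


def clean_estimate(x_t, v, t):
    """Clean-image component: x0 = x_t - t * v."""
    return [x - t * u for x, u in zip(x_t, v)]


def noise_estimate(x_t, v, t):
    """Noise component: eps = x_t + (1 - t) * v."""
    return [x + (1.0 - t) * u for x, u in zip(x_t, v)]


x0 = [0.5, -1.0, 2.0]   # hypothetical clean signal
eps = [1.5, 0.25, -0.75]  # hypothetical noise sample
t = 0.3
x_t = interpolate(x0, eps, t)
v = velocity(x0, eps)
x0_hat = clean_estimate(x_t, v, t)
eps_hat = noise_estimate(x_t, v, t)
```

With the exact velocity both components are recovered exactly. With a learned velocity they become estimates, and a solver in this family can modify each one separately before recombining them into the next state.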

Title: FilmComposer: LLM-Driven Music Production for Silent Film Clips

Authors: Zhifeng Xie, Qile He, Youjia Zhu, Qiwei He, Mengtian Li
Subjects: cs.CV, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.08147
Pdf URL: https://arxiv.org/pdf/2503.08147
Copy Paste: [[2503.08147]] FilmComposer: LLM-Driven Music Production for Silent Film Clips(https://arxiv.org/abs/2503.08147)
Keywords: generation, generative
Abstract: In this work, we implement music production for silent film clips using an LLM-driven method. Given the strong professional demands of film music production, we propose FilmComposer, which simulates the actual workflows of professional musicians. FilmComposer is the first to combine large generative models with a multi-agent approach, leveraging the advantages of both waveform music and symbolic music generation. Additionally, FilmComposer is the first to focus on the three core elements of music production for film (audio quality, musicality, and musical development) and introduces various controls, such as rhythm, semantics, and visuals, to enhance these key aspects. Specifically, FilmComposer consists of the visual processing module, rhythm-controllable MusicGen, and multi-agent assessment, arrangement and mix. In addition, our framework can seamlessly integrate into the actual music production pipeline and allows user intervention in every step, providing strong interactivity and a high degree of creative freedom. Furthermore, considering the lack of a professional and high-quality film music dataset, we propose MusicPro-7k, which includes 7,418 film clips, music, description, rhythm spots and main melody. Finally, both the standard metrics and the new specialized metrics we propose demonstrate that the music generated by our model achieves state-of-the-art performance in terms of quality, consistency with video, diversity, musicality, and musical development. Project page: this https URL
摘要：在这项工作中，我们使用LLM驱动的方法为无声电影剪辑实施音乐制作。鉴于电影音乐制作的专业要求很强，我们提出了电影节目，模拟了专业音乐家的实际工作流程。 FilmComposer是第一个将大型生成模型与多代理方法相结合的人，利用了波形音乐和象征性音乐发电的优势。此外，FilmComposer是第一个专注于电影原告质量，音乐性和音乐发展的三个核心元素，并引入各种控制，例如节奏，语义和视觉效果，以增强这些关键方面。具体而言，电影公司由视觉处理模块，节奏控制的音乐以及多代理评估，布置和混合组成。此外，我们的框架可以无缝地集成到实际的音乐生产管道中，并允许在每个步骤中进行干预，从而提供强大的互动性和高度的创作自由度。此外，考虑到缺乏专业且高质量的电影音乐数据集，我们提出了MusicPro-7K，其中包括7,418个电影剪辑，音乐，描述，节奏点和主要旋律。最后，我们提出的标准指标和新的专业指标都表明，我们的模型产生的音乐在质量，与视频，多样性，音乐性和音乐发展方面达到了最先进的性能。项目页面：此HTTPS URL

Title: Few-Shot Class-Incremental Model Attribution Using Learnable Representation From CLIP-ViT Features

Authors: Hanbyul Lee, Juneho Yi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08148
Pdf URL: https://arxiv.org/pdf/2503.08148
Copy Paste: [[2503.08148]] Few-Shot Class-Incremental Model Attribution Using Learnable Representation From CLIP-ViT Features(https://arxiv.org/abs/2503.08148)
Keywords: generative
Abstract: Recently, images that distort or fabricate facts using generative models have become a social concern. To cope with the continuous evolution of generative artificial intelligence (AI) models, model attribution (MA) is necessary beyond just detection of synthetic images. However, current deep learning-based MA methods must be trained from scratch with new data to recognize unseen models, which is time-consuming and data-intensive. This work proposes a new strategy to deal with persistently emerging generative models. We adapt few-shot class-incremental learning (FSCIL) mechanisms to the MA problem to uncover novel generative AI models. Unlike existing FSCIL approaches that focus on object classification using high-level information, MA requires analyzing low-level details like color and texture in synthetic images. Thus, we utilize a learnable representation from different levels of CLIP-ViT features. To learn an effective representation, we propose an Adaptive Integration Module (AIM) to calculate a weighted sum of CLIP-ViT block features for each image, enhancing the ability to identify generative models. Extensive experiments show our method effectively extends from prior generative models to recent ones.
摘要：最近，使用生成模型扭曲或捏造事实的图像已成为社会问题。为了应对生成人工智能（AI）模型的连续演变，模型归因（MA）是必需的，而不仅仅是检测合成图像。但是，当前基于深度学习的MA方法必须从头开始培训新数据，以识别耗时和数据密集型的模型。这项工作提出了一种新的策略来处理持续的新兴生成模型。我们适应了MA问题的几个射击课程学习（FSCIL）机制，以发现新颖的生成AI模型。与使用高级信息专注于对象分类的现有FSCIL方法不同，MA需要分析综合图像中的颜色和纹理等低级细节。因此，我们从不同级别的剪辑量特征中利用可学习的表示形式。为了学习有效的表示形式，我们提出了自适应集成模块（AIM），以计算每个图像的夹子vit块特征的加权总和，从而增强了识别生成模型的能力。广泛的实验表明，我们的方法有效地从先前的生成模型扩展到最近的模型。
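
The weighted sum of block features described in the abstract above can be sketched in a few lines. The abstract does not say how the per-image weights are produced, so this example assumes they come from a softmax over one hypothetical score per block; the function names and the toy feature values are likewise illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_block_sum(block_features, scores):
    """Combine per-block feature vectors into one vector using softmax weights."""
    weights = softmax(scores)
    dim = len(block_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, block_features))
        for d in range(dim)
    ]


# Two hypothetical transformer blocks, each giving a 2-dimensional feature.
features = [[1.0, 2.0], [3.0, 4.0]]
uniform = weighted_block_sum(features, scores=[0.0, 0.0])
shallow_heavy = weighted_block_sum(features, scores=[5.0, 0.0])
```

Equal scores reduce to a plain average, while a high score for an early block pulls the representation toward low-level features, which the abstract identifies as important for attribution.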

Title: Depth-Assisted Network for Indiscernible Marine Object Counting with Adaptive Motion-Differentiated Feature Encoding

Authors: Chengzhi Ma, Kunqian Li, Shuaixin Liu, Han Mei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08152
Pdf URL: https://arxiv.org/pdf/2503.08152
Copy Paste: [[2503.08152]] Depth-Assisted Network for Indiscernible Marine Object Counting with Adaptive Motion-Differentiated Feature Encoding(https://arxiv.org/abs/2503.08152)
Keywords: generation
Abstract: Indiscernible marine object counting encounters numerous challenges, including limited visibility in underwater scenes, mutual occlusion and overlap among objects, and the dynamic similarity in appearance, color, and texture between the background and foreground. These factors significantly complicate the counting process. To address the scarcity of video-based indiscernible object counting datasets, we have developed a novel dataset comprising 50 videos, from which approximately 800 frames have been extracted and annotated with around 40,800 point-wise object labels. This dataset accurately represents real underwater environments where indiscernible marine objects are intricately integrated with their surroundings, thereby comprehensively illustrating the aforementioned challenges in object counting. To address these challenges, we propose a depth-assisted network with adaptive motion-differentiated feature encoding. The network consists of a backbone encoding module and three branches: a depth-assisting branch, a density estimation branch, and a motion weight generation branch. Depth-aware features extracted by the depth-assisting branch are enhanced via a depth-enhanced encoder to improve object representation. Meanwhile, weights from the motion weight generation branch refine multi-scale perception features in the adaptive flow estimation module. Experimental results demonstrate that our method not only achieves state-of-the-art performance on the proposed dataset but also yields competitive results on three additional video-based crowd counting datasets. The pre-trained model, code, and dataset are publicly available at this https URL.
摘要：难以置信的海洋对象计数遇到了许多挑战，包括在水下场景中的可见度有限，对象之间的相互阻塞和重叠以及背景和前景之间的动态相似性。这些因素显着使计数过程复杂化。为了解决基于视频的不可见为对象计数数据集的稀缺性，我们开发了一个包含50个视频的新颖数据集，从中提取了大约800帧并使用约40,800个点的对象标签进行注释。该数据集准确地代表了真正的水下环境，在这些环境中，不可见作的海洋对象与周围环境复杂地集成在一起，从而全面说明了对象计数中上述挑战。为了应对这些挑战，我们提出了一个深度辅助网络，具有自适应运动分化的特征编码。该网络由一个编码模块和三个分支组成：深度辅助分支，密度估计分支和运动重量产生分支。深度辅助分支提取的深度感知特征通过深度增强的编码器增强，以改善对象表示。同时，自适应流量估计模块中的运动重量产生分支的重量提炼了多尺度感知特征。实验结果表明，我们的方法不仅可以在拟议的数据集中实现最先进的性能，而且还可以在三个基于视频的人群计数数据集上产生竞争结果。预先培训的模型，代码和数据集可在此HTTPS URL上公开可用。

Title: WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation

Authors: Jing Wang, Ao Ma, Ke Cao, Jun Zheng, Zhanjie Zhang, Jiasong Feng, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, Yuhui Yin, Xiaodan Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08153
Pdf URL: https://arxiv.org/pdf/2503.08153
Copy Paste: [[2503.08153]] WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation(https://arxiv.org/abs/2503.08153)
Keywords: generation
Abstract: Recent rapid advancements in text-to-video (T2V) generation, such as Sora and Kling, have shown great potential for building world simulators. However, current T2V models struggle to grasp abstract physical principles and generate videos that adhere to physical laws. This challenge arises primarily from a lack of clear guidance on physical information due to a significant gap between abstract physical principles and generation models. To this end, we introduce the World Simulator Assistant (WISA), an effective framework for decomposing and incorporating physical principles into T2V models. Specifically, WISA decomposes physical principles into textual physical descriptions, qualitative physical categories, and quantitative physical properties. To effectively embed these physical attributes into the generation process, WISA incorporates several key designs, including Mixture-of-Physical-Experts Attention (MoPA) and a Physical Classifier, enhancing the model's physics awareness. Furthermore, most existing datasets feature videos where physical phenomena are either weakly represented or entangled with multiple co-occurring processes, limiting their suitability as dedicated resources for learning explicit physical principles. We propose a novel video dataset, WISA-32K, collected based on qualitative physical categories. It consists of 32,000 videos, representing 17 physical laws across three domains of physics: dynamics, thermodynamics, and optics. Experimental results demonstrate that WISA can effectively enhance the compatibility of T2V models with real-world physical laws, achieving a considerable improvement on the VideoPhy benchmark. The visual exhibitions of WISA and WISA-32K are available at this https URL.
摘要：Sora和Kling等文本到视频（T2V）一代的最新快速进步表现出了建立世界模拟器的巨大潜力。但是，当前的T2V模型努力掌握抽象的物理原理并生成遵守物理定律的视频。由于抽象的物理原理和发电模型之间存在很大的差距，这一挑战主要源于缺乏对物理信息的明确指导。为此，我们介绍了世界模拟器助理（WISA），这是将物理原理分解和纳入T2V模型的有效框架。具体而言，WISA将物理原理分解为文本物理描述，定性物理类别和定量物理特性。为了有效地将这些物理属性嵌入生成过程中，WISA结合了几种关键设计，包括物理外科专家的混合物（MOPA）和物理分类器，增强了模型的物理意识。此外，大多数现有的数据集都具有视频，其中物理现象要么薄弱地代表或纠缠于多个同时发生的过程，因此将其适合性限制为学习明确的物理原理的专用资源。我们建议根据定性物理类别收集的一个新颖的视频数据集WISA-32K。它由32,000个视频组成，代表三个物理领域的17个物理定律：动力学，热力学和光学。实验结果表明，WISA可以有效地增强T2V模型与现实世界定律的兼容性，从而在录像基准上取得了可观的改进。 WISA和WISA-32K的视觉展览在此HTTPS URL中可用。

Title: Towards Large-scale Chemical Reaction Image Parsing via a Multimodal Large Language Model

Authors: Yufan Chen, Ching Ting Leung, Jianwei Sun, Yong Huang, Linyan Li, Hao Chen, Hanyu Gao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08156
Pdf URL: https://arxiv.org/pdf/2503.08156
Copy Paste: [[2503.08156]] Towards Large-scale Chemical Reaction Image Parsing via a Multimodal Large Language Model(https://arxiv.org/abs/2503.08156)
Keywords: generation
Abstract: Artificial intelligence (AI) has demonstrated significant promise in advancing organic chemistry research; however, its effectiveness depends on the availability of high-quality chemical reaction data. Currently, most published chemical reactions are not available in machine-readable form, limiting the broader application of AI in this field. The extraction of published chemical reactions into structured databases still relies heavily on manual curation, and robust automatic parsing of chemical reaction images into machine-readable data remains a significant challenge. To address this, we introduce the Reaction Image Multimodal large language model (RxnIM), the first multimodal large language model specifically designed to parse chemical reaction images into machine-readable reaction data. RxnIM not only extracts key chemical components from reaction images but also interprets the textual content that describes reaction conditions. Together with a specially designed large-scale dataset generation method to support model training, our approach achieves excellent performance, with an average F1 score of 88% on various benchmarks, surpassing literature methods by 5%. This represents a crucial step toward the automatic construction of large databases of machine-readable reaction data parsed from images in the chemistry literature, providing essential data resources for AI research in chemistry. The source code, model checkpoints, and datasets developed in this work are released under permissive licenses. An instance of the RxnIM web application can be accessed at this https URL.
摘要：人工智能（AI）在推进有机化学研究方面表现出了巨大的希望。但是，其有效性取决于高质量化学反应数据的可用性。当前，大多数已发表的化学反应都无法以机器可读形式获得，这限制了AI在该领域的更广泛应用。将已发表的化学反应提取到结构化数据库中仍然在很大程度上取决于手动策展，并且将化学反应图像的强大自动解析在可读的机器可读数据中仍然是一个重大挑战。为了解决这个问题，我们介绍了反应图像多模式大语言模型（RXNIM），这是第一个专门设计用于将化学反应图像解析到机器可读反应数据中的多模式大型语言模型。 RXNIM不仅从反应图像中提取了关键的化学成分，而且还解释了描述反应条件的文本内容。加上专门设计的大规模数据集生成方法来支持模型培训，我们的方法可实现出色的性能，平均F1得分在各种基准测试中达到88％，超过了文献方法5％。这是朝着自动构建从化学文献中图像解析的机器可读反应数据的大型数据库的重要步骤，为化学研究提供了必不可少的数据资源。本工作中开发的源代码，模型检查点和数据集在允许许可下发布。可以在此HTTPS URL访问RXNIM Web应用程序的实例。

Title: U-StyDiT: Ultra-high Quality Artistic Style Transfer Using Diffusion Transformers

Authors: Zhanjie Zhang, Ao Ma, Ke Cao, Jing Wang, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, Yuhui Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08157
Pdf URL: https://arxiv.org/pdf/2503.08157
Copy Paste: [[2503.08157]] U-StyDiT: Ultra-high Quality Artistic Style Transfer Using Diffusion Transformers(https://arxiv.org/abs/2503.08157)
Keywords: generation
Abstract: Ultra-high quality artistic style transfer refers to repainting an ultra-high quality content image using the style information learned from the style image. Existing artistic style transfer methods can be categorized into style reconstruction-based and content-style disentanglement-based style transfer approaches. Although these methods can generate some artistic stylized images, they still exhibit obvious artifacts and disharmonious patterns, which hinder their ability to produce ultra-high quality artistic stylized images. To address these issues, we propose a novel artistic image style transfer method, U-StyDiT, which is built on transformer-based diffusion (DiT) and learns content-style disentanglement, generating ultra-high quality artistic stylized images. Specifically, we first design a Multi-view Style Modulator (MSM) to learn style information from a style image from local and global perspectives, conditioning U-StyDiT to generate stylized images with the learned style information. Then, we introduce a StyDiT Block to learn content and style conditions simultaneously from a style image. Additionally, we propose an ultra-high quality artistic image dataset, Aes4M, comprising 10 categories, each containing 400,000 style images. This dataset effectively solves the problem that the existing style transfer methods cannot produce high-quality artistic stylized images due to the size of the dataset and the quality of the images in the dataset. Finally, the extensive qualitative and quantitative experiments validate that our U-StyDiT can create higher quality stylized images compared to state-of-the-art artistic style transfer methods. To our knowledge, our proposed method is the first to address the generation of ultra-high quality stylized images using transformer-based diffusion.
摘要：超高质量的艺术风格转移是指使用从样式图像中学到的样式信息重新粉刷超高质量的内容图像。现有的艺术风格转移方法可以分为基于样式的基于样式的基于内容的基于内容式的基于内容的样式转移方法。尽管这些方法可以产生一些艺术风格化的图像，但它们仍然表现出明显的文物和不和谐的图案，这阻碍了它们产生超高质量艺术风格的图像的能力。为了解决这些问题，我们提出了一种新型的艺术图像样式转移方法U-Stydit，该方法建立在基于变压器的扩散（DIT）上，并学习内容式的脱节，从而产生超高质量的艺术风格化图像。具体来说，我们首先设计了多视图样式调制器（MSM），从本地和全局的角度从样式图像中学习样式信息，从而使U-Stydit调节使用学识渊博的样式信息生成风格化的图像。然后，我们引入了一个stydit块，以同时从样式图像中同时学习内容和样式条件。此外，我们提出了一个超高质量的艺术图像数据集，AES4M，包括10个类别，每个类别包含400,000个样式图像。该数据集有效地解决了由于数据集的大小和数据集中图像的质量，现有样式传输方法无法产生高质量的艺术风格化图像的问题。最后，与最先进的艺术风格转移方法相比，广泛的定性和定量实验验证了我们的U-风格可以创建更高质量风格的图像。据我们所知，我们提出的方法是第一个使用基于变压器的扩散来解决超高质量风格化图像的生成。

Title: Concept-Driven Deep Learning for Enhanced Protein-Specific Molecular Generation

Authors: Taojie Kuang, Qianli Ma, Athanasios V. Vasilakos, Yu Wang, Qiang (Shawn) Cheng, Zhixiang Ren
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.08160
Pdf URL: https://arxiv.org/pdf/2503.08160
Copy Paste: [[2503.08160]] Concept-Driven Deep Learning for Enhanced Protein-Specific Molecular Generation(https://arxiv.org/abs/2503.08160)
Keywords: generation
Abstract: In recent years, deep learning techniques have made significant strides in molecular generation for specific targets, driving advancements in drug discovery. However, existing molecular generation methods present significant limitations: those operating at the atomic level often lack synthetic feasibility, drug-likeness, and interpretability, while fragment-based approaches frequently overlook comprehensive factors that influence protein-molecule interactions. To address these challenges, we propose a novel fragment-based molecular generation framework tailored for specific proteins. Our method begins by constructing a protein subpocket and molecular arm concept-based neural network, which systematically integrates interaction force information and geometric complementarity to sample molecular arms for specific protein subpockets. Subsequently, we introduce a diffusion model to generate molecular backbones that connect these arms, ensuring structural integrity and chemical diversity. Our approach significantly improves synthetic feasibility and binding affinity, with a 4% increase in drug-likeness and a 6% improvement in synthetic feasibility. Furthermore, by integrating explicit interaction data through a concept-based model, our framework enhances interpretability, offering valuable insights into the molecular design process.

Title: Multimodal Generation of Animatable 3D Human Models with AvatarForge

Authors: Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08165
Pdf URL: https://arxiv.org/pdf/2503.08165
Copy Paste: [[2503.08165]] Multimodal Generation of Animatable 3D Human Models with AvatarForge(https://arxiv.org/abs/2503.08165)
Keywords: generation
Abstract: We introduce AvatarForge, a framework for generating animatable 3D human avatars from text or image inputs using AI-driven procedural generation. While diffusion-based methods have made strides in general 3D object generation, they struggle with high-quality, customizable human avatars due to the complexity and diversity of human body shapes and poses, a problem exacerbated by the scarcity of high-quality data. Additionally, animating these avatars remains a significant challenge for existing methods. AvatarForge overcomes these limitations by combining LLM-based commonsense reasoning with off-the-shelf 3D human generators, enabling fine-grained control over body and facial details. Unlike diffusion models, which often rely on pre-trained datasets lacking precise control over individual human features, AvatarForge offers a more flexible approach: it brings humans into the iterative design and modeling loop, and its auto-verification system allows continuous refinement of the generated avatars, promoting high accuracy and customization. Our evaluations show that AvatarForge outperforms state-of-the-art methods in both text- and image-to-avatar generation, making it a versatile tool for artistic creation and animation.

Title: Towards Synthesized and Editable Motion In-Betweening Through Part-Wise Phase Representation

Authors: Minyue Dai, Jingbo Wang, Ke Fan, Bin Ji, Haoyu Zhao, Junting Dong, Bo Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08180
Pdf URL: https://arxiv.org/pdf/2503.08180
Copy Paste: [[2503.08180]] Towards Synthesized and Editable Motion In-Betweening Through Part-Wise Phase Representation(https://arxiv.org/abs/2503.08180)
Keywords: generation
Abstract: Styled motion in-betweening is crucial for computer animation and gaming. However, existing methods typically encode motion styles by modeling whole-body motions, often overlooking the representation of individual body parts. This limitation reduces the flexibility of infilled motion, particularly in adjusting the motion styles of specific limbs independently. To overcome this challenge, we propose a novel framework that models motion styles at the body-part level, enhancing both the diversity and controllability of infilled motions. Our approach enables more nuanced and expressive animations by allowing precise modifications to individual limb motions while maintaining overall motion coherence. Leveraging phase-related insights, our framework employs periodic autoencoders to automatically extract the phase of each body part, capturing distinctive local style features. Additionally, we effectively decouple the motion source from synthesis control by integrating motion manifold learning and conditional generation techniques from both image and motion domains. This allows the motion source to generate high-quality motions across various styles, with extracted motion and style features readily available for controlled synthesis in subsequent tasks. Comprehensive evaluations demonstrate that our method achieves superior speed, robust generalization, and effective generation of extended motion sequences.

Title: A Cascading Cooperative Multi-agent Framework for On-ramp Merging Control Integrating Large Language Models

Authors: Miao Zhang, Zhenlong Fang, Tianyi Wang, Qian Zhang, Shuai Lu, Junfeng Jiao, Tianyu Shi
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08199
Pdf URL: https://arxiv.org/pdf/2503.08199
Copy Paste: [[2503.08199]] A Cascading Cooperative Multi-agent Framework for On-ramp Merging Control Integrating Large Language Models(https://arxiv.org/abs/2503.08199)
Keywords: generation
Abstract: Traditional Reinforcement Learning (RL) struggles with replicating human-like behaviors, generalizing effectively in multi-agent scenarios, and overcoming inherent interpretability issues. These tasks are compounded when deep environment understanding, agent coordination, and dynamic optimization are required. While Large Language Model (LLM) enhanced methods have shown promise in generalization and interoperability, they often neglect necessary multi-agent coordination. Therefore, we introduce the Cascading Cooperative Multi-agent (CCMA) framework, which integrates RL for individual interactions, a fine-tuned LLM for regional cooperation, a reward function for global optimization, and a Retrieval-Augmented Generation mechanism to dynamically optimize decision-making across complex driving scenarios. Our experiments show that CCMA outperforms existing RL methods, with significant improvements in both micro- and macro-level performance in complex driving environments.

Title: ExMAG: Learning of Maximally Ancestral Graphs

Authors: Petr Ryšavý, Pavel Rytíř, Xiaoyu He, Jakub Mareček, Georgios Korpas
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.08245
Pdf URL: https://arxiv.org/pdf/2503.08245
Copy Paste: [[2503.08245]] ExMAG: Learning of Maximally Ancestral Graphs(https://arxiv.org/abs/2503.08245)
Keywords: generation
Abstract: As one transitions from statistical to causal learning, one seeks the most appropriate causal model. Dynamic Bayesian networks are a popular model, where a weighted directed acyclic graph represents the causal relationships. The graph's vertices represent stochastic processes, and its weighted oriented edges suggest the strength of the causal relationships. When there are confounders, one would like to utilize both oriented edges (when the direction of causality is clear) and edges that are not oriented (when there is a confounder), yielding mixed graphs. A little-studied extension of acyclicity to this mixed-graph setting is known as maximally ancestral graphs. We propose a score-based learning algorithm for learning maximally ancestral graphs. A mixed-integer quadratic program is formulated, and an algorithmic approach is proposed, in which the pre-generation of exponentially many constraints is avoided by generating only violated constraints in the so-called branch-and-cut ("lazy constraint") method. Comparing the novel approach to the state of the art, we show that the proposed approach produces more accurate results when applied to small and medium-sized synthetic instances containing up to 25 variables.
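The "lazy constraint" idea in the abstract above can be illustrated on a toy problem. The sketch below is not the ExMAG formulation: it assumes a plain directed graph with positive edge scores, replaces the mixed-integer quadratic program with a brute-force search (`solve_relaxed`, a hypothetical stand-in for a real solver), and adds a cycle-elimination constraint only when the current solution violates acyclicity. All function names are illustrative.

```python
from itertools import combinations


def find_cycle(nodes, edges):
    """Return the edges of one directed cycle, or None if the graph is acyclic."""
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
    color = {n: 0 for n in nodes}  # 0 = unvisited, 1 = on DFS stack, 2 = finished
    path = []

    def dfs(u):
        color[u] = 1
        path.append(u)
        for v in adj[u]:
            if color[v] == 1:  # back edge closes a cycle
                cyc = path[path.index(v):] + [v]
                return list(zip(cyc, cyc[1:]))
            if color[v] == 0:
                found = dfs(v)
                if found:
                    return found
        color[u] = 2
        path.pop()
        return None

    for n in nodes:
        if color[n] == 0:
            found = dfs(n)
            if found:
                return found
    return None


def solve_relaxed(scores, forbidden_cycles):
    """Toy master problem: best-scoring edge subset that contains no known cycle in full."""
    edges = list(scores)
    best, best_val = frozenset(), 0.0
    for r in range(len(edges) + 1):
        for subset in combinations(edges, r):
            chosen = set(subset)
            if any(cyc <= chosen for cyc in forbidden_cycles):
                continue  # violates a constraint generated earlier
            val = sum(scores[e] for e in subset)
            if val > best_val:
                best, best_val = frozenset(subset), val
    return best, best_val


def lazy_acyclic_subgraph(nodes, scores):
    """Add cycle-elimination constraints only when the current solution violates them."""
    forbidden = []
    while True:
        chosen, val = solve_relaxed(scores, forbidden)
        cycle = find_cycle(nodes, chosen)
        if cycle is None:
            return chosen, val, len(forbidden)
        forbidden.append(frozenset(cycle))


nodes = ["a", "b", "c"]
scores = {("a", "b"): 3, ("b", "c"): 2, ("c", "a"): 1}
chosen, value, cuts_added = lazy_acyclic_subgraph(nodes, scores)
```

The point of the design is that the number of possible cycle constraints grows exponentially with the number of vertices, while only the few that a candidate solution actually violates ever need to be generated.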

Title: Aligning Text to Image in Diffusion Models is Easier Than You Think

Authors: Jaa-Yeon Lee, Byunghee Cha, Jeongsol Kim, Jong Chul Ye
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08250
Pdf URL: https://arxiv.org/pdf/2503.08250
Copy Paste: [[2503.08250]] Aligning Text to Image in Diffusion Models is Easier Than You Think(https://arxiv.org/abs/2503.08250)
Keywords: generation, generative
Abstract: While recent advancements in generative modeling have significantly improved text-image alignment, some residual misalignment between text and image representations still remains. Although many approaches have attempted to address this issue by fine-tuning models using various reward models, etc., we revisit the challenge from the perspective of representation alignment, an approach that has gained popularity with the success of REPresentation Alignment (REPA). We first argue that conventional text-to-image (T2I) diffusion models, typically trained on paired image and text data (i.e., positive pairs) by minimizing score matching or flow matching losses, are suboptimal from the standpoint of representation alignment. Instead, a better alignment can be achieved through contrastive learning that leverages both positive and negative pairs. To achieve this efficiently even with pretrained models, we introduce a lightweight contrastive fine-tuning strategy called SoftREPA that uses soft text tokens. This approach improves alignment with minimal computational overhead by adding fewer than 1M trainable parameters to the pretrained model. Our theoretical analysis demonstrates that our method explicitly increases the mutual information between text and image representations, leading to enhanced semantic consistency. Experimental results across text-to-image generation and text-guided image editing tasks validate the effectiveness of our approach in improving the semantic consistency of T2I generative models.
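To make "contrastive learning that leverages both positive and negative pairs" concrete, here is a minimal sketch of the generic InfoNCE loss over an image-text similarity matrix. It is not SoftREPA's objective, which the abstract does not spell out; the function name, the temperature value, and the convention that matched pairs sit on the diagonal are assumptions.

```python
import math


def info_nce(sim, temperature=0.1):
    """Generic InfoNCE loss. sim[i][j] is the similarity of image i and text j;
    the diagonal holds the positive pairs, every other entry is a negative."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # -log softmax probability of the positive
    return total / n


aligned = [[1.0, 0.0], [0.0, 1.0]]     # each image is closest to its own caption
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # each image is closest to the wrong caption
loss_aligned = info_nce(aligned)
loss_misaligned = info_nce(misaligned)
```

A loss trained only on positive pairs has no term that penalizes an image for also matching the wrong caption; the denominator above is what supplies that pressure.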

Title: SARA: Structural and Adversarial Representation Alignment for Training-efficient Diffusion Models

Authors: Hesen Chen, Junyan Wang, Zhiyu Tan, Hao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08253
Pdf URL: https://arxiv.org/pdf/2503.08253
Copy Paste: [[2503.08253]] SARA: Structural and Adversarial Representation Alignment for Training-efficient Diffusion Models(https://arxiv.org/abs/2503.08253)
Keywords: generation
Abstract: Modern diffusion models encounter a fundamental trade-off between training efficiency and generation quality. While existing representation alignment methods, such as REPA, accelerate convergence through patch-wise alignment, they often fail to capture structural relationships within visual representations and ensure global distribution consistency between pretrained encoders and denoising networks. To address these limitations, we introduce SARA, a hierarchical alignment framework that enforces multi-level representation constraints: (1) patch-wise alignment to preserve local semantic details, (2) autocorrelation matrix alignment to maintain structural consistency within representations, and (3) adversarial distribution alignment to mitigate global representation discrepancies. Unlike previous approaches, SARA explicitly models both intra-representation correlations via self-similarity matrices and inter-distribution coherence via adversarial alignment, enabling comprehensive alignment across local and global scales. Experiments on ImageNet-256 show that SARA achieves an FID of 1.36 while converging twice as fast as REPA, surpassing recent state-of-the-art image generation methods. This work establishes a systematic paradigm for optimizing diffusion training through hierarchical representation alignment.
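The second constraint in the abstract above, alignment of self-similarity (autocorrelation) matrices, can be sketched as follows. This is an illustration under assumptions, not SARA's actual loss: cosine similarity and a mean-squared difference are choices made here, and the function names are hypothetical.

```python
import math


def self_similarity(feats):
    """Cosine similarity between every pair of patch feature vectors (an N x N matrix)."""
    norms = [math.sqrt(sum(x * x for x in f)) or 1.0 for f in feats]
    n = len(feats)
    return [
        [sum(a * b for a, b in zip(feats[i], feats[j])) / (norms[i] * norms[j])
         for j in range(n)]
        for i in range(n)
    ]


def structural_alignment_loss(student_feats, teacher_feats):
    """Mean squared difference between the two self-similarity matrices."""
    s = self_similarity(student_feats)
    t = self_similarity(teacher_feats)
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


# Two patches each; the feature widths differ (2 vs 3), but both matrices are 2 x 2.
denoiser_patches = [[1.0, 0.0], [0.0, 1.0]]
encoder_patches = [[0.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
loss = structural_alignment_loss(denoiser_patches, encoder_patches)
```

Because the comparison happens between N x N matrices, the two networks do not need the same feature dimension for this term, only the same number of patches.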

Title: DexGrasp Anything: Towards Universal Robotic Dexterous Grasping with Physics Awareness

Authors: Yiming Zhong, Qi Jiang, Jingyi Yu, Yuexin Ma
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.08257
Pdf URL: https://arxiv.org/pdf/2503.08257
Copy Paste: [[2503.08257]] DexGrasp Anything: Towards Universal Robotic Dexterous Grasping with Physics Awareness(https://arxiv.org/abs/2503.08257)
Keywords: generative
Abstract: A dexterous hand capable of grasping any object is essential for the development of general-purpose embodied intelligent robots. However, due to the high degree of freedom in dexterous hands and the vast diversity of objects, generating high-quality, usable grasping poses in a robust manner is a significant challenge. In this paper, we introduce DexGrasp Anything, a method that effectively integrates physical constraints into both the training and sampling phases of a diffusion-based generative model, achieving state-of-the-art performance across nearly all open datasets. Additionally, we present a new dexterous grasping dataset containing over 3.4 million diverse grasping poses for more than 15k different objects, demonstrating its potential to advance universal dexterous grasping. The code of our method and our dataset will be publicly released soon.

Title: Adv-CPG: A Customized Portrait Generation Framework with Facial Adversarial Attacks

Authors: Junying Wang, Hongyuan Zhang, Yuan Yuan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08269
Pdf URL: https://arxiv.org/pdf/2503.08269
Copy Paste: [[2503.08269]] Adv-CPG: A Customized Portrait Generation Framework with Facial Adversarial Attacks(https://arxiv.org/abs/2503.08269)
Keywords: generation
Abstract: Recent Customized Portrait Generation (CPG) methods, taking a facial image and a textual prompt as inputs, have attracted substantial attention. Although these methods generate high-fidelity portraits, they fail to prevent the generated portraits from being tracked and misused by malicious face recognition systems. To address this, this paper proposes a Customized Portrait Generation framework with facial Adversarial attacks (Adv-CPG). Specifically, to achieve facial privacy protection, we devise a lightweight local ID encryptor and an encryption enhancer. They implement progressive double-layer encryption protection by directly injecting the target identity and adding additional identity guidance, respectively. Furthermore, to accomplish fine-grained and personalized portrait generation, we develop a multi-modal image customizer capable of generating controlled fine-grained facial features. To the best of our knowledge, Adv-CPG is the first study that introduces facial adversarial attacks into CPG. Extensive experiments demonstrate the superiority of Adv-CPG, e.g., the average attack success rate of the proposed Adv-CPG is 28.1% and 2.86% higher compared to the SOTA noise-based attack methods and unconstrained attack methods, respectively.

Title: HERO: Human Reaction Generation from Videos

Authors: Chengjun Yu, Wei Zhai, Yuhang Yang, Yang Cao, Zheng-Jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08270
Pdf URL: https://arxiv.org/pdf/2503.08270
Copy Paste: [[2503.08270]] HERO: Human Reaction Generation from Videos(https://arxiv.org/abs/2503.08270)
Keywords: generation
Abstract: Human reaction generation represents a significant research domain for interactive AI, as humans constantly interact with their surroundings. Previous works focus mainly on synthesizing the reactive motion given a human motion sequence. This paradigm limits interaction categories to human-human interactions and ignores emotions that may influence reaction generation. In this work, we propose to generate 3D human reactions from RGB videos, which involves a wider range of interaction categories and naturally provides information about expressions that may reflect the subject's emotions. To cope with this task, we present HERO, a simple yet powerful framework for Human rEaction geneRation from videOs. HERO considers both global and frame-level local representations of the video to extract the interaction intention, and then uses the extracted interaction intention to guide the synthesis of the reaction. Besides, local visual representations are continuously injected into the model to maximize the exploitation of the dynamic properties inherent in videos. Furthermore, the ViMo dataset containing paired Video-Motion data is collected to support the task. In addition to human-human interactions, these video-motion pairs also cover animal-human interactions and scene-human interactions. Extensive experiments demonstrate the superiority of our methodology. The code and dataset will be publicly available at this https URL.

Title: OminiControl2: Efficient Conditioning for Diffusion Transformers

Authors: Zhenxiong Tan, Qiaochu Xue, Xingyi Yang, Songhua Liu, Xinchao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08280
Pdf URL: https://arxiv.org/pdf/2503.08280
Copy Paste: [[2503.08280]] OminiControl2: Efficient Conditioning for Diffusion Transformers(https://arxiv.org/abs/2503.08280)
Keywords: generation
Abstract: Fine-grained control of text-to-image diffusion transformer (DiT) models remains a critical challenge for practical deployment. While recent advances such as OminiControl and others have enabled controllable generation from diverse control signals, these methods face significant computational inefficiency when handling long conditional inputs. We present OminiControl2, an efficient framework for image-conditional image generation. OminiControl2 introduces two key innovations: (1) a dynamic compression strategy that streamlines conditional inputs by preserving only the most semantically relevant tokens during generation, and (2) a conditional feature reuse mechanism that computes condition token features only once and reuses them across denoising steps. These architectural improvements preserve the original framework's parameter efficiency and multi-modal versatility while dramatically reducing computational costs. Our experiments demonstrate that OminiControl2 reduces conditional processing overhead by over 90% compared to its predecessor, achieving an overall 5.9× speedup in multi-conditional generation scenarios. This efficiency enables the practical implementation of complex, multi-modal control for high-quality image synthesis with DiT models.
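The feature-reuse mechanism described above (compute the condition token features once, reuse them at every denoising step) amounts to a cache. The toy sketch below shows only that caching pattern; `encode_condition` and the update rule are hypothetical placeholders, not OminiControl2 code, and the sketch assumes the condition features do not depend on the timestep or on the noisy tokens.

```python
def encode_condition(cond_tokens, counter):
    """Stand-in for the expensive condition-feature computation."""
    counter["calls"] += 1
    return [2.0 * t for t in cond_tokens]


def denoise(x, cond_tokens, steps, reuse=True):
    """Toy denoising loop; with reuse=True the condition features are computed once."""
    counter = {"calls": 0}
    cached = None
    for _ in range(steps):
        if reuse:
            if cached is None:
                cached = encode_condition(cond_tokens, counter)
            feats = cached
        else:
            feats = encode_condition(cond_tokens, counter)
        shift = sum(feats) / len(feats)
        x = [0.9 * xi + 0.1 * shift for xi in x]  # placeholder update rule
    return x, counter["calls"]


out_cached, calls_cached = denoise([1.0, -1.0], [0.5, 1.5], steps=4, reuse=True)
out_naive, calls_naive = denoise([1.0, -1.0], [0.5, 1.5], steps=4, reuse=False)
```

In this toy the two runs produce identical outputs while the cached run calls the encoder once instead of once per step.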

Title: D3PO: Preference-Based Alignment of Discrete Diffusion Models

Authors: Umberto Borso, Davide Paglieri, Jude Wells, Tim Rocktäschel
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08295
Pdf URL: https://arxiv.org/pdf/2503.08295
Copy Paste: [[2503.08295]] D3PO: Preference-Based Alignment of Discrete Diffusion Models(https://arxiv.org/abs/2503.08295)
Keywords: generation, generative
Abstract: Diffusion models have achieved state-of-the-art performance across multiple domains, with recent advancements extending their applicability to discrete data. However, aligning discrete diffusion models with task-specific preferences remains challenging, particularly in scenarios where explicit reward functions are unavailable. In this work, we introduce Discrete Diffusion DPO (D3PO), the first adaptation of Direct Preference Optimization (DPO) to discrete diffusion models formulated as continuous-time Markov chains. Our approach derives a novel loss function that directly fine-tunes the generative process using preference data while preserving fidelity to a reference distribution. We validate D3PO on a structured binary sequence generation task, demonstrating that the method effectively aligns model outputs with preferences while maintaining structural validity. Our results highlight that D3PO enables controlled fine-tuning without requiring explicit reward models, making it a practical alternative to reinforcement learning-based approaches. Future research will explore extending D3PO to more complex generative tasks, including language modeling and protein sequence generation, as well as investigating alternative noise schedules, such as uniform noising, to enhance flexibility across different applications.
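For orientation, the standard DPO loss that D3PO adapts can be written in a few lines. This is the usual sequence-level DPO objective, not the continuous-time Markov chain loss derived in the paper; the argument names and the default `beta` are assumptions.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l: log-probabilities of the preferred and rejected samples under
    the model being tuned; ref_*: the same under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


no_change = dpo_loss(-1.0, -2.0, -1.0, -2.0)     # model equals the reference
prefers_win = dpo_loss(-0.5, -3.0, -1.0, -2.0)   # model shifted toward the winner
```

No reward model appears anywhere in the expression: the preference pair and the reference log-probabilities are the only inputs, which is what the abstract means by fine-tuning without explicit reward models.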

Title: Feature Alignment with Equivariant Convolutions for Burst Image Super-Resolution

Authors: Xinyi Liu, Feiyu Tan, Qi Xie, Qian Zhao, Deyu Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08300
Pdf URL: https://arxiv.org/pdf/2503.08300
Copy Paste: [[2503.08300]] Feature Alignment with Equivariant Convolutions for Burst Image Super-Resolution(https://arxiv.org/abs/2503.08300)
Keywords: super-resolution
Abstract: Burst image processing (BIP), which captures and integrates multiple frames into a single high-quality image, is widely used in consumer cameras. As a typical BIP task, Burst Image Super-Resolution (BISR) has achieved notable progress through deep learning in recent years. Existing BISR methods typically involve three key stages: alignment, upsampling, and fusion, often in varying orders and implementations. Among these stages, alignment is particularly critical for ensuring accurate feature matching and further reconstruction. However, existing methods often rely on techniques such as deformable convolutions and optical flow to realize alignment, which either focus only on local transformations or lack theoretical grounding, thereby limiting their performance. To alleviate these issues, we propose a novel framework for BISR, featuring an equivariant convolution-based alignment, ensuring consistent transformations between the image and feature domains. This enables the alignment transformation to be learned via explicit supervision in the image domain and easily applied in the feature domain in a theoretically sound way, effectively improving alignment accuracy. Additionally, we design an effective reconstruction module with advanced deep architectures for upsampling and fusion to obtain the final BISR result. Extensive experiments on BISR benchmarks show the superior performance of our approach in both quantitative metrics and visual quality.

Title: $^R$FLAV: Rolling Flow matching for infinite Audio Video generation

Authors: Alex Ergasti, Giuseppe Gabriele Tarollo, Filippo Botti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08307
Pdf URL: https://arxiv.org/pdf/2503.08307
Copy Paste: [[2503.08307]] $^R$FLAV: Rolling Flow matching for infinite Audio Video generation(https://arxiv.org/abs/2503.08307)
Keywords: generation, generative
Abstract: Joint audio-video (AV) generation is still a significant challenge in generative AI, primarily due to three critical requirements: quality of the generated samples, seamless multimodal synchronization and temporal coherence, with audio tracks that match the visual data and vice versa, and limitless video duration. In this paper, we present $^R$FLAV, a novel transformer-based architecture that addresses all the key challenges of AV generation. We explore three distinct cross modality interaction modules, with our lightweight temporal fusion module emerging as the most effective and computationally efficient approach for aligning audio and visual modalities. Our experimental results demonstrate that $^R$FLAV outperforms existing state-of-the-art models in multimodal AV generation tasks. Our code and checkpoints are available at this https URL.

Title: Pathology-Aware Adaptive Watermarking for Text-Driven Medical Image Synthesis

Authors: Chanyoung Kim, Dayun Ju, Jinyeong Kim, Woojung Han, Roberto Alcover-Couso, Seong Jae Hwang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08346
Pdf URL: https://arxiv.org/pdf/2503.08346
Copy Paste: [[2503.08346]] Pathology-Aware Adaptive Watermarking for Text-Driven Medical Image Synthesis(https://arxiv.org/abs/2503.08346)
Keywords: generation
Abstract: As recent text-conditioned diffusion models have enabled the generation of high-quality images, concerns over their potential misuse have also grown. This issue is critical in the medical domain, where text-conditioned generated medical images could enable insurance fraud or falsified records, highlighting the urgent need for reliable safeguards against unethical use. While watermarking techniques have emerged as a promising solution in general image domains, their direct application to medical imaging presents significant challenges. A key challenge is preserving fine-grained disease manifestations, as even minor distortions from a watermark may lead to clinical misinterpretation, which compromises diagnostic integrity. To overcome this gap, we present MedSign, a deep learning-based watermarking framework specifically designed for text-to-medical image synthesis, which preserves pathologically significant regions by adaptively adjusting watermark strength. Specifically, we generate a pathology localization map using cross-attention between medical text tokens and the diffusion denoising network, aggregating token-wise attention across layers, heads, and time steps. Leveraging this map, we optimize the LDM decoder to incorporate watermarking during image synthesis, ensuring cohesive integration while minimizing interference in diagnostically critical regions. Experimental results show that our MedSign preserves diagnostic integrity while ensuring watermark robustness, achieving state-of-the-art performance in image quality and detection accuracy on MIMIC-CXR and OIA-ODIR datasets.

Title: Robust Latent Matters: Boosting Image Generation with Sampling Error

Authors: Kai Qiu, Xiang Li, Jason Kuen, Hao Chen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, Marios Savvides
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08354
Pdf URL: https://arxiv.org/pdf/2503.08354
Copy Paste: [[2503.08354]] Robust Latent Matters: Boosting Image Generation with Sampling Error(https://arxiv.org/abs/2503.08354)
Keywords: generation, generative
Abstract: Recent image generation schemes typically capture image distribution in a pre-constructed latent space relying on a frozen image tokenizer. Though the performance of the tokenizer plays an essential role in successful generation, its current evaluation metrics (e.g. rFID) fail to precisely assess the tokenizer and correlate its performance to the generation quality (e.g. gFID). In this paper, we comprehensively analyze the reason for the discrepancy between reconstruction and generation qualities in a discrete latent space, from which we propose a novel plug-and-play tokenizer training scheme to facilitate latent space construction. Specifically, a latent perturbation approach is proposed to simulate sampling noises, i.e., the unexpected tokens sampled, from the generative process. With the latent perturbation, we further propose (1) a novel tokenizer evaluation metric, i.e., pFID, which successfully correlates the tokenizer performance to generation quality and (2) a plug-and-play tokenizer training scheme, which significantly enhances the robustness of the tokenizer, thus boosting the generation quality and convergence speed. Extensive benchmarking is conducted with 11 advanced discrete image tokenizers and 2 autoregressive generation models to validate our approach. The tokenizer trained with our proposed latent perturbation achieves a notable 1.60 gFID with classifier-free guidance (CFG) and 3.45 gFID without CFG with a ~400M generator. Code: this https URL.
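One simple way to picture "simulating the unexpected tokens sampled" is to corrupt a sequence of discrete token indices at random. The sketch below is an assumption-laden illustration: the abstract does not say how the perturbation is drawn, so uniform random replacement with probability `p` is a choice made here, and `perturb_tokens` is a hypothetical name.

```python
import random


def perturb_tokens(tokens, codebook_size, p, seed=0):
    """Replace each token index with a uniformly random codebook index with probability p.

    Uniform replacement is an assumption for illustration; the paper's actual
    perturbation scheme is not described in the abstract.
    """
    rng = random.Random(seed)
    out = []
    for t in tokens:
        if rng.random() < p:
            out.append(rng.randrange(codebook_size))
        else:
            out.append(t)
    return out


clean = [5, 17, 3, 3, 42, 8]
noisy = perturb_tokens(clean, codebook_size=64, p=0.3)
```

Decoding such perturbed sequences during tokenizer training, or scoring reconstructions from them, exposes the decoder to the kind of token errors a generator makes at sampling time.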

Title: Layton: Latent Consistency Tokenizer for 1024-pixel Image Reconstruction and Generation by 256 Tokens

Authors: Qingsong Xie, Zhao Zhang, Zhe Huang, Yanhao Zhang, Haonan Lu, Zhenyu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08377
Pdf URL: https://arxiv.org/pdf/2503.08377
Copy Paste: [[2503.08377]] Layton: Latent Consistency Tokenizer for 1024-pixel Image Reconstruction and Generation by 256 Tokens(https://arxiv.org/abs/2503.08377)
Keywords: generation
Abstract: Image tokenization has significantly advanced visual generation and multimodal modeling, particularly when paired with autoregressive models. However, current methods face challenges in balancing efficiency and fidelity: high-resolution image reconstruction either requires an excessive number of tokens or compromises critical details through token reduction. To resolve this, we propose the Latent Consistency Tokenizer (Layton), which bridges discrete visual tokens with the compact latent space of pre-trained Latent Diffusion Models (LDMs), enabling efficient representation of 1024x1024 images using only 256 tokens, a 16-fold compression over VQGAN. Layton integrates a transformer encoder, a quantized codebook, and a latent consistency decoder. Direct application of an LDM as the decoder results in color and brightness discrepancies. Thus, we convert it to a latent consistency decoder, reducing multi-step sampling to 1-2 steps for direct pixel-level supervision. Experiments demonstrate Layton's superiority in high-fidelity reconstruction, with a reconstruction Frechet Inception Distance of 10.8 on the MSCOCO-2017 5K benchmark for 1024x1024 image reconstruction. We also extend Layton to a text-to-image generation model, LaytonGen, which operates autoregressively. It achieves a score of 0.73 on the GenEval benchmark, surpassing current state-of-the-art methods. The code and model will be released.
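
The 16-fold compression figure in the Layton abstract can be sanity-checked with a few lines of arithmetic. The sketch below assumes a VQGAN-style tokenizer with 16x spatial downsampling per side; that factor is an assumption for illustration and is not stated in the abstract. The helper name `token_count` is hypothetical.

```python
def token_count(image_size: int, downsample: int) -> int:
    """Tokens for a square image tokenized on a grid with the given per-side downsampling."""
    side = image_size // downsample
    return side * side


# Assumption: VQGAN-style baseline with 16x downsampling per side (not from the abstract).
vqgan_tokens = token_count(1024, 16)

# Figure stated in the abstract: 256 tokens for a 1024x1024 image.
layton_tokens = 256

compression = vqgan_tokens / layton_tokens
print(f"baseline tokens: {vqgan_tokens}, compression factor: {compression:.0f}x")
```

Under that assumption the baseline grid is 64 by 64 tokens, which is consistent with the ratio quoted in the abstract.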

Title: Recognition-Synergistic Scene Text Editing

Authors: Zhengyao Fang, Pengyuan Lyu, Jingjing Wu, Chengquan Zhang, Jun Yu, Guangming Lu, Wenjie Pei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08387
Pdf URL: https://arxiv.org/pdf/2503.08387
Copy Paste: [[2503.08387]] Recognition-Synergistic Scene Text Editing(https://arxiv.org/abs/2503.08387)
Keywords: generation
Abstract: Scene text editing aims to modify text content within scene images while maintaining style consistency. Traditional methods achieve this by explicitly disentangling style and content from the source image and then fusing the style with the target content, while ensuring content consistency using a pre-trained recognition model. Despite notable progress, these methods suffer from complex pipelines, leading to suboptimal performance in complex scenarios. In this work, we introduce Recognition-Synergistic Scene Text Editing (RS-STE), a novel approach that fully exploits the intrinsic synergy of text recognition for editing. Our model seamlessly integrates text recognition with text editing within a unified framework, and leverages the recognition model's ability to implicitly disentangle style and content while ensuring content consistency. Specifically, our approach employs a multi-modal parallel decoder based on transformer architecture, which predicts both text content and stylized images in parallel. Additionally, our cyclic self-supervised fine-tuning strategy enables effective training on unpaired real-world data without ground truth, enhancing style and content consistency through a twice-cyclic generation process. Built on a relatively simple architecture, RS-STE achieves state-of-the-art performance on both synthetic and real-world benchmarks, and further demonstrates the effectiveness of leveraging the generated hard cases to boost the performance of downstream recognition tasks. Code is available at this https URL.

Title: Controlling Latent Diffusion Using Latent CLIP

Authors: Jason Becker, Chris Wendler, Peter Baylies, Robert West, Christian Wressnegger
Subjects: cs.CV, cs.AI, cs.LG, eess.IV, stat.ML
Abstract URL: https://arxiv.org/abs/2503.08455
Pdf URL: https://arxiv.org/pdf/2503.08455
Copy Paste: [[2503.08455]] Controlling Latent Diffusion Using Latent CLIP(https://arxiv.org/abs/2503.08455)
Keywords: generation
Abstract: Instead of performing text-conditioned denoising in the image domain, latent diffusion models (LDMs) operate in the latent space of a variational autoencoder (VAE), enabling more efficient processing at reduced computational costs. However, while the diffusion process has moved to the latent space, the contrastive language-image pre-training (CLIP) models, as used in many image processing tasks, still operate in pixel space. Doing so requires costly VAE-decoding of latent images before they can be processed. In this paper, we introduce Latent-CLIP, a CLIP model that operates directly in the latent space. We train Latent-CLIP on 2.7B pairs of latent images and descriptive texts, and show that it matches the zero-shot classification performance of similarly sized CLIP models on both the ImageNet benchmark and an LDM-generated version of it, demonstrating its effectiveness in assessing both real and generated content. Furthermore, we construct Latent-CLIP rewards for reward-based noise optimization (ReNO) and show that they match the performance of their CLIP counterparts on GenEval and T2I-CompBench while cutting the cost of the total pipeline by 21%. Finally, we use Latent-CLIP to guide generation away from harmful content, achieving strong performance on the inappropriate image prompts (I2P) benchmark and a custom evaluation, without ever requiring the costly step of decoding intermediate images.

Title: Generalizable AI-Generated Image Detection Based on Fractal Self-Similarity in the Spectrum

Authors: Shengpeng Xiao, Yuanfang Guo, Heqi Peng, Zeming Liu, Liang Yang, Yunhong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08484
Pdf URL: https://arxiv.org/pdf/2503.08484
Copy Paste: [[2503.08484]] Generalizable AI-Generated Image Detection Based on Fractal Self-Similarity in the Spectrum(https://arxiv.org/abs/2503.08484)
Keywords: generative
Abstract: The generalization performance of AI-generated image detection remains a critical challenge. Although most existing methods perform well in detecting images from generative models included in the training set, their accuracy drops significantly when faced with images from unseen generators. To address this limitation, we propose a novel detection method based on the fractal self-similarity of the spectrum, a common feature among images generated by different models. Specifically, we demonstrate that AI-generated images exhibit fractal-like spectral growth through periodic extension and low-pass filtering. This observation motivates us to exploit the similarity among different fractal branches of the spectrum. Instead of directly analyzing the spectrum, our method mitigates the impact of varying spectral characteristics across different generators, improving detection performance for images from unseen models. Experiments on a public benchmark demonstrated the generalized detection performance across both GANs and diffusion models.

Title: Learning to Match Unpaired Data with Minimum Entropy Coupling

Authors: Mustapha Bounoua, Giulio Franzese, Pietro Michiardi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.08501
Pdf URL: https://arxiv.org/pdf/2503.08501
Copy Paste: [[2503.08501]] Learning to Match Unpaired Data with Minimum Entropy Coupling(https://arxiv.org/abs/2503.08501)
Keywords: generative
Abstract: Multimodal data is a precious asset enabling a variety of downstream tasks in machine learning. However, real-world data collected across different modalities is often not paired, which poses a significant challenge for learning a joint distribution. A prominent approach to address the modality coupling problem is Minimum Entropy Coupling (MEC), which seeks to minimize the joint entropy while satisfying constraints on the marginals. Existing approaches to the MEC problem focus on finite, discrete distributions, limiting their application in cases involving continuous data. In this work, we propose a novel method to solve the continuous MEC problem, using well-known generative diffusion models that learn to approximate and minimize the joint entropy through a cooperative scheme, while satisfying a relaxed version of the marginal constraints. We empirically demonstrate that our method, DDMEC, is general and can be easily used to address challenging tasks, including unsupervised single-cell multi-omics data alignment and unpaired image translation, outperforming specialized methods.
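
To make the discrete version of the MEC problem concrete, here is a minimal sketch of a well-known greedy heuristic for coupling two finite marginals. This is not the DDMEC method from the abstract, which targets continuous data with diffusion models; it only illustrates the objective (low joint entropy subject to fixed marginals). The function names `greedy_coupling` and `entropy` and the example marginals are illustrative assumptions.

```python
import numpy as np


def greedy_coupling(p, q, tol=1e-12):
    """Greedy heuristic: repeatedly pair the largest remaining masses of p and q."""
    p = np.asarray(p, dtype=float).copy()
    q = np.asarray(q, dtype=float).copy()
    joint = np.zeros((p.size, q.size))
    # Each iteration zeroes at least one entry of p or q, so the loop terminates.
    while p.max() > tol and q.max() > tol:
        i, j = int(p.argmax()), int(q.argmax())
        mass = min(p[i], q[j])
        joint[i, j] += mass
        p[i] -= mass
        q[j] -= mass
    return joint


def entropy(dist):
    """Shannon entropy in bits, ignoring zero entries."""
    d = np.asarray(dist, dtype=float).ravel()
    d = d[d > 0]
    return float(-(d * np.log2(d)).sum())


# Hypothetical marginals chosen only for illustration.
p = [0.5, 0.3, 0.2]
q = [0.6, 0.4]
joint = greedy_coupling(p, q)

print("greedy coupling entropy:     ", entropy(joint))
print("independent coupling entropy:", entropy(np.outer(p, q)))
```

The greedy coupling keeps both marginals intact while concentrating mass on few cells, so its joint entropy is lower than that of the independent (product) coupling for this example.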

Title: DISTINGUISH Workflow: A New Paradigm of Dynamic Well Placement Using Generative Machine Learning

Authors: Sergey Alyaev, Kristian Fossum, Hibat Errahmen Djecta, Jan Tveranger, Ahmed H. Elsheikh
Subjects: cs.LG, math.OC, physics.geo-ph, stat.AP
Abstract URL: https://arxiv.org/abs/2503.08509
Pdf URL: https://arxiv.org/pdf/2503.08509
Copy Paste: [[2503.08509]] DISTINGUISH Workflow: A New Paradigm of Dynamic Well Placement Using Generative Machine Learning(https://arxiv.org/abs/2503.08509)
Keywords: generative
Abstract: The real-time process of directional changes while drilling, known as geosteering, is crucial for hydrocarbon extraction and emerging directional drilling applications such as geothermal energy, civil infrastructure, and CO2 storage. The geo-energy industry seeks an automatic geosteering workflow that continually updates the subsurface uncertainties and captures the latest geological understanding given the most recent observations in real-time. We propose "DISTINGUISH": a real-time, AI-driven workflow designed to transform geosteering by integrating Generative Adversarial Networks (GANs) for geological parameterization, ensemble methods for model updating, and global discrete dynamic programming (DDP) optimization for complex decision-making during directional drilling operations. The DISTINGUISH framework relies on offline training of a GAN model to reproduce relevant geology realizations and a Forward Neural Network (FNN) to model Logging-While-Drilling (LWD) tools' response for a given geomodel. This paper introduces a first-of-its-kind workflow that progressively reduces GAN-geomodel uncertainty around and ahead of the drilling bit and adjusts the well plan accordingly. The workflow automatically integrates real-time LWD data with a DDP-based decision support system, enhancing predictive models of geology ahead of drilling and leading to better steering decisions. We present a simple yet representative benchmark case and document the performance target achieved by the DISTINGUISH workflow prototype. This benchmark will be a foundation for future methodological advancements and workflow refinements.

Title: High-Quality 3D Head Reconstruction from Any Single Portrait Image

Authors: Jianfu Zhang, Yujie Gao, Jiahui Zhan, Wentao Wang, Yiyi Zhang, Haohua Zhao, Liqing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08516
Pdf URL: https://arxiv.org/pdf/2503.08516
Copy Paste: [[2503.08516]] High-Quality 3D Head Reconstruction from Any Single Portrait Image(https://arxiv.org/abs/2503.08516)
Keywords: generation, generative
Abstract: In this work, we introduce a novel high-fidelity 3D head reconstruction method from a single portrait image, regardless of perspective, expression, or accessories. Despite significant efforts in adapting 2D generative models for novel view synthesis and 3D optimization, most methods struggle to produce high-quality 3D portraits. The lack of crucial information, such as identity, expression, hair, and accessories, limits these approaches in generating realistic 3D head models. To address these challenges, we construct a new high-quality dataset containing 227 sequences of digital human portraits captured from 96 different perspectives, totalling 21,792 frames, featuring diverse expressions and accessories. To further improve performance, we integrate identity and expression information into the multi-view diffusion process to enhance facial consistency across views. Specifically, we apply identity- and expression-aware guidance and supervision to extract accurate facial representations, which guide the model and enforce objective functions to ensure high identity and expression consistency during generation. Finally, we generate an orbital video around the portrait consisting of 96 multi-view frames, which can be used for 3D portrait model reconstruction. Our method demonstrates robust performance across challenging scenarios, including side-face angles and complex accessories.
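
The dataset size quoted in the abstract can be cross-checked directly. The short sketch below assumes that each sequence contributes exactly one frame per perspective, which is an interpretation of the abstract and not something it states explicitly.

```python
# Figures taken from the abstract.
sequences = 227
perspectives_per_sequence = 96
reported_total_frames = 21_792

# Assumption: one frame per perspective in every sequence.
computed_total_frames = sequences * perspectives_per_sequence

print(computed_total_frames == reported_total_frames)
```

Under that assumption the product of the two counts agrees with the reported frame total.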

Title: 3D Point Cloud Generation via Autoregressive Up-sampling

Authors: Ziqiao Meng, Qichao Wang, Zhipeng Zhou, Irwin King, Peilin Zhao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08594
Pdf URL: https://arxiv.org/pdf/2503.08594
Copy Paste: [[2503.08594]] 3D Point Cloud Generation via Autoregressive Up-sampling(https://arxiv.org/abs/2503.08594)
Keywords: generation, generative
Abstract: We introduce a pioneering autoregressive generative model for 3D point cloud generation. Inspired by visual autoregressive modeling (VAR), we conceptualize point cloud generation as an autoregressive up-sampling process. This leads to our novel model, PointARU, which progressively refines 3D point clouds from coarse to fine scales. PointARU follows a two-stage training paradigm: first, it learns multi-scale discrete representations of point clouds, and then it trains an autoregressive transformer for next-scale prediction. To address the inherent unordered and irregular structure of point clouds, we incorporate specialized point-based up-sampling network modules in both stages and integrate 3D absolute positional encoding based on the decoded point cloud at each scale during the second stage. Our model surpasses state-of-the-art (SoTA) diffusion-based approaches in both generation quality and parameter efficiency across diverse experimental settings, marking a new milestone for autoregressive methods in 3D point cloud generation. Furthermore, PointARU demonstrates exceptional performance in completing partial 3D shapes and up-sampling sparse point clouds, outperforming existing generative models in these tasks.

Title: Tuning-Free Multi-Event Long Video Generation via Synchronized Coupled Sampling

Authors: Subin Kim, Seoung Wug Oh, Jui-Hsien Wang, Joon-Young Lee, Jinwoo Shin
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08605
Pdf URL: https://arxiv.org/pdf/2503.08605
Copy Paste: [[2503.08605]] Tuning-Free Multi-Event Long Video Generation via Synchronized Coupled Sampling(https://arxiv.org/abs/2503.08605)
Keywords: generation
Abstract: While recent advancements in text-to-video diffusion models enable high-quality short video generation from a single prompt, generating real-world long videos in a single pass remains challenging due to limited data and high computational costs. To address this, several works propose tuning-free approaches, i.e., extending existing models for long video generation, specifically using multiple prompts to allow for dynamic and controlled content changes. However, these methods primarily focus on ensuring smooth transitions between adjacent frames, often leading to content drift and a gradual loss of semantic coherence over longer sequences. To tackle such an issue, we propose Synchronized Coupled Sampling (SynCoS), a novel inference framework that synchronizes denoising paths across the entire video, ensuring long-range consistency across both adjacent and distant frames. Our approach combines two complementary sampling strategies: reverse and optimization-based sampling, which ensure seamless local transitions and enforce global coherence, respectively. However, directly alternating between these samplings misaligns denoising trajectories, disrupting prompt guidance and introducing unintended content changes as they operate independently. To resolve this, SynCoS synchronizes them through a grounded timestep and a fixed baseline noise, ensuring fully coupled sampling with aligned denoising paths. Extensive experiments show that SynCoS significantly improves multi-event long video generation, achieving smoother transitions and superior long-range coherence, outperforming previous approaches both quantitatively and qualitatively.

Title: LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization

Authors: Xianfeng Wu, Yajing Bai, Haoze Zheng, Harold Haodong Chen, Yexin Liu, Zihao Wang, Xuran Ma, Wen-Jie Shu, Xianzu Wu, Harry Yang, Ser-Nam Lim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08619
Pdf URL: https://arxiv.org/pdf/2503.08619
Copy Paste: [[2503.08619]] LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization(https://arxiv.org/abs/2503.08619)
Keywords: generation
Abstract: Recent advances in text-to-image generation have primarily relied on extensive datasets and parameter-heavy architectures. These requirements severely limit accessibility for researchers and practitioners who lack substantial computational resources. In this paper, we introduce LightGen, an efficient training paradigm for image generation models that uses knowledge distillation (KD) and Direct Preference Optimization (DPO). Drawing inspiration from the success of data KD techniques widely adopted in Multi-Modal Large Language Models (MLLMs), LightGen distills knowledge from state-of-the-art (SOTA) text-to-image models into a compact Masked Autoregressive (MAR) architecture with only 0.7B parameters. Using a compact synthetic dataset of just 2M high-quality images generated from varied captions, we demonstrate that data diversity significantly outweighs data volume in determining model performance. This strategy dramatically reduces computational demands and reduces pre-training time from potentially thousands of GPU-days to merely 88 GPU-days. Furthermore, to address the inherent shortcomings of synthetic data, particularly poor high-frequency details and spatial inaccuracies, we integrate the DPO technique that refines image fidelity and positional accuracy. Comprehensive experiments confirm that LightGen achieves image generation quality comparable to SOTA models while significantly reducing computational resources and expanding accessibility for resource-constrained environments. Code is available at this https URL.

Title: MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention

Authors: Yuhan Wang, Fangzhou Hong, Shuai Yang, Liming Jiang, Wayne Wu, Chen Change Loy
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08664
Pdf URL: https://arxiv.org/pdf/2503.08664
Copy Paste: [[2503.08664]] MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention(https://arxiv.org/abs/2503.08664)
Keywords: generation
Abstract: Multiview diffusion models have shown considerable success in image-to-3D generation for general objects. However, when applied to human data, existing methods have yet to deliver promising results, largely due to the challenges of scaling multiview attention to higher resolutions. In this paper, we explore human multiview diffusion models at the megapixel level and introduce a solution called mesh attention to enable training at 1024x1024 resolution. Using a clothed human mesh as a central coarse geometric representation, the proposed mesh attention leverages rasterization and projection to establish direct cross-view coordinate correspondences. This approach significantly reduces the complexity of multiview attention while maintaining cross-view consistency. Building on this foundation, we devise a mesh attention block and combine it with keypoint conditioning to create our human-specific multiview diffusion model, MEAT. In addition, we present valuable insights into applying multiview human motion videos for diffusion training, addressing the longstanding issue of data scarcity. Extensive experiments show that MEAT effectively generates dense, consistent multiview human images at the megapixel level, outperforming existing multiview diffusion methods.

Title: REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder

Authors: Yitian Zhang, Long Mai, Aniruddha Mahapatra, David Bourgin, Yicong Hong, Jonah Casebeer, Feng Liu, Yun Fu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08665
Pdf URL: https://arxiv.org/pdf/2503.08665
Copy Paste: [[2503.08665]] REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder(https://arxiv.org/abs/2503.08665)
Keywords: generation, generative
Abstract: We present a novel perspective on learning video embedders for generative modeling: rather than requiring an exact reproduction of an input video, an effective embedder should focus on synthesizing visually plausible reconstructions. This relaxed criterion enables substantial improvements in compression ratios without compromising the quality of downstream generative models. Specifically, we propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework that employs a diffusion transformer (DiT) to synthesize missing details from a compact latent space. Therein, we develop a dedicated latent conditioning module to condition the DiT decoder on the encoded video latent embedding. Our experiments demonstrate that our approach enables superior encoding-decoding performance compared to state-of-the-art methods, particularly as the compression ratio increases. To demonstrate the efficacy of our approach, we report results from our video embedders achieving a temporal compression ratio of up to 32x (8x higher than leading video embedders) and validate the robustness of this ultra-compact latent space for text-to-video generation, providing a significant efficiency boost in latent diffusion model training and inference.

Title: Understanding and Mitigating Distribution Shifts For Machine Learning Force Fields

Authors: Tobias Kreiman, Aditi S. Krishnapriyan
Subjects: cs.LG, cond-mat.mtrl-sci, physics.chem-ph, q-bio.BM
Abstract URL: https://arxiv.org/abs/2503.08674
Pdf URL: https://arxiv.org/pdf/2503.08674
Copy Paste: [[2503.08674]] Understanding and Mitigating Distribution Shifts For Machine Learning Force Fields(https://arxiv.org/abs/2503.08674)
Keywords: generation
Abstract: Machine Learning Force Fields (MLFFs) are a promising alternative to expensive ab initio quantum mechanical molecular simulations. Given the diversity of chemical spaces that are of interest and the cost of generating new data, it is important to understand how MLFFs generalize beyond their training distributions. In order to characterize and better understand distribution shifts in MLFFs, we conduct diagnostic experiments on chemical datasets, revealing common shifts that pose significant challenges, even for large foundation models trained on extensive data. Based on these observations, we hypothesize that current supervised training methods inadequately regularize MLFFs, resulting in overfitting and learning poor representations of out-of-distribution systems. We then propose two new methods as initial steps for mitigating distribution shifts for MLFFs. Our methods focus on test-time refinement strategies that incur minimal computational cost and do not use expensive ab initio reference labels. The first strategy, based on spectral graph theory, modifies the edges of test graphs to align with graph structures seen during training. Our second strategy improves representations for out-of-distribution systems at test-time by taking gradient steps using an auxiliary objective, such as a cheap physical prior. Our test-time refinement strategies significantly reduce errors on out-of-distribution systems, suggesting that MLFFs are capable of and can move towards modeling diverse chemical spaces, but are not being effectively trained to do so. Our experiments establish clear benchmarks for evaluating the generalization capabilities of the next generation of MLFFs. Our code is available at this https URL.

Title: OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting

Authors: Yongsheng Yu, Ziyun Zeng, Haitian Zheng, Jiebo Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08677
Pdf URL: https://arxiv.org/pdf/2503.08677
Copy Paste: [[2503.08677]] OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting(https://arxiv.org/abs/2503.08677)
Keywords: generative
Abstract: Diffusion-based generative models have revolutionized object-oriented image editing, yet their deployment in realistic object removal and insertion remains hampered by challenges such as the intricate interplay of physical effects and insufficient paired training data. In this work, we introduce OmniPaint, a unified framework that re-conceptualizes object removal and insertion as interdependent processes rather than isolated tasks. Leveraging a pre-trained diffusion prior along with a progressive training pipeline comprising initial paired sample optimization and subsequent large-scale unpaired refinement via CycleFlow, OmniPaint achieves precise foreground elimination and seamless object insertion while faithfully preserving scene geometry and intrinsic properties. Furthermore, our novel CFD metric offers a robust, reference-free evaluation of context consistency and object hallucination, establishing a new benchmark for high-fidelity image editing. Project page: this https URL

Title: OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models

Authors: Jialv Zou, Bencheng Liao, Qian Zhang, Wenyu Liu, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08686
Pdf URL: https://arxiv.org/pdf/2503.08686
Copy Paste: [[2503.08686]] OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models(https://arxiv.org/abs/2503.08686)
Keywords: generation
Abstract: Recent advancements in unified multimodal understanding and visual generation (or multimodal generation) models have been hindered by their quadratic computational complexity and dependence on large-scale training data. We present OmniMamba, the first linear-architecture-based multimodal generation model that generates both text and images through a unified next-token prediction paradigm. The model fully leverages Mamba-2's high computational and memory efficiency, extending its capabilities from text generation to multimodal generation. To address the data inefficiency of existing unified models, we propose two key innovations: (1) decoupled vocabularies to guide modality-specific generation, and (2) task-specific LoRA for parameter-efficient adaptation. Furthermore, we introduce a decoupled two-stage training strategy to mitigate data imbalance between two tasks. Equipped with these techniques, OmniMamba achieves competitive performance with JanusFlow while surpassing Show-o across benchmarks, despite being trained on merely 2M image-text pairs, which is 1,000 times fewer than Show-o. Notably, OmniMamba stands out with outstanding inference efficiency, achieving up to a 119.2 times speedup and 63% GPU memory reduction for long-sequence generation compared to Transformer-based counterparts. Code and models are released at this https URL