2025-05-09

Title: Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers

Authors: Divyansh Srivastava, Xiang Zhang, He Wen, Chenru Wen, Zhuowen Tu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.04718
Pdf URL: https://arxiv.org/pdf/2505.04718
Copy Paste: [[2505.04718]] Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers(https://arxiv.org/abs/2505.04718)
Keywords: generation
Abstract: We present Lay-Your-Scene (shorthand LayouSyn), a novel text-to-layout generation pipeline for natural scenes. Prior scene layout generation methods are either closed-vocabulary or use proprietary large language models for open-vocabulary generation, limiting their modeling capabilities and broader applicability in controllable image generation. In this work, we propose to use lightweight open-source language models to obtain scene elements from text prompts and a novel aspect-aware diffusion Transformer architecture trained in an open-vocabulary manner for conditional layout generation. Extensive experiments demonstrate that LayouSyn outperforms existing methods and achieves state-of-the-art performance on challenging spatial and numerical reasoning benchmarks. Additionally, we present two applications of LayouSyn. First, we show that coarse initialization from large language models can be seamlessly combined with our method to achieve better results. Second, we present a pipeline for adding objects to images, demonstrating the potential of LayouSyn in image editing applications.

Title: When Bad Data Leads to Good Models

Authors: Kenneth Li, Yida Chen, Fernanda Viégas, Martin Wattenberg
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.04741
Pdf URL: https://arxiv.org/pdf/2505.04741
Copy Paste: [[2505.04741]] When Bad Data Leads to Good Models(https://arxiv.org/abs/2505.04741)
Keywords: generation
Abstract: In large language model (LLM) pretraining, data quality is believed to determine model quality. In this paper, we re-examine the notion of "quality" from the perspective of pre- and post-training co-design. Specifically, we explore the possibility that pre-training on more toxic data can lead to better control in post-training, ultimately decreasing a model's output toxicity. First, we use a toy experiment to study how data composition affects the geometry of features in the representation space. Next, through controlled experiments with Olmo-1B models trained on varying ratios of clean and toxic data, we find that the concept of toxicity enjoys a less entangled linear representation as the proportion of toxic data increases. Furthermore, we show that although toxic data increases the generational toxicity of the base model, it also makes the toxicity easier to remove. Evaluations on Toxigen and Real Toxicity Prompts demonstrate that models trained on toxic data achieve a better trade-off between reducing generational toxicity and preserving general capabilities when detoxifying techniques such as inference-time intervention (ITI) are applied. Our findings suggest that, with post-training taken into account, bad data may lead to good models.

Title: Replay to Remember (R2R): An Efficient Uncertainty-driven Unsupervised Continual Learning Framework Using Generative Replay

Authors: Sriram Mandalika, Harsha Vardhan, Athira Nambiar
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.04787
Pdf URL: https://arxiv.org/pdf/2505.04787
Copy Paste: [[2505.04787]] Replay to Remember (R2R): An Efficient Uncertainty-driven Unsupervised Continual Learning Framework Using Generative Replay(https://arxiv.org/abs/2505.04787)
Keywords: generative
Abstract: Continual Learning entails progressively acquiring knowledge from new data while retaining previously acquired knowledge, thereby mitigating "Catastrophic Forgetting" in neural networks. Our work presents a novel uncertainty-driven Unsupervised Continual Learning framework using Generative Replay, namely "Replay to Remember (R2R)". The proposed R2R architecture efficiently uses unlabelled and synthetic labelled data in a balanced proportion using a cluster-level uncertainty-driven feedback mechanism and a VLM-powered generative replay module. Unlike traditional memory-buffer methods that depend on pretrained models and pseudo-labels, our R2R framework operates without any prior training. It leverages visual features from unlabeled data and adapts continuously using clustering-based uncertainty estimation coupled with dynamic thresholding. Concurrently, a generative replay mechanism along with DeepSeek-R1 powered CLIP VLM produces labelled synthetic data representative of past experiences, resembling biological visual thinking that replays memory to remember and act in new, unseen tasks. Extensive experimental analyses are carried out in CIFAR-10, CIFAR-100, CINIC-10, SVHN and TinyImageNet datasets. Our proposed R2R approach improves knowledge retention, achieving a state-of-the-art performance of 98.13%, 73.06%, 93.41%, 95.18%, 59.74%, respectively, surpassing state-of-the-art performance by over 4.36%.

Title: Guide your favorite protein sequence generative model

Authors: Junhao Xiong, Hunter Nisonoff, Ishan Gaur, Jennifer Listgarten
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2505.04823
Pdf URL: https://arxiv.org/pdf/2505.04823
Copy Paste: [[2505.04823]] Guide your favorite protein sequence generative model(https://arxiv.org/abs/2505.04823)
Keywords: generation, generative
Abstract: Generative machine learning models have begun to transform protein engineering, yet no principled framework for conditioning on auxiliary information in a plug-and-play manner exists; one may want to iteratively incorporate experimental feedback, or make use of an existing classifier -- such as for predicting enzyme commission number -- in order to guide the sampling of the generative model to generate sequences with desired properties. Herein, we present ProteinGuide, a rigorous and general framework to achieve just that: through unifying a broad class of protein generative models that includes masked language, (order-agnostic) autoregressive, diffusion and flow-matching models, we provide an approach to statistically condition pre-trained protein generative models. We demonstrate applicability of our approach by guiding each of two commonly used protein generative models, ProteinMPNN and ESM3, to generate amino acid and structure token sequences conditioned on several user-specified properties, namely, enhanced stability and CATH-labeled fold generation.

Title: Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers

Authors: Kusha Sareen, Morgane M Moss, Alessandro Sordoni, Rishabh Agarwal, Arian Hosseini
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04842
Pdf URL: https://arxiv.org/pdf/2505.04842
Copy Paste: [[2505.04842]] Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers(https://arxiv.org/abs/2505.04842)
Keywords: generative
Abstract: Prevalent reinforcement learning (RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value-function for verification. In this work, we propose RL$^V$ that augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL$^V$ boosts MATH accuracy by over 20% with parallel sampling and enables $8-32\times$ efficient test-time compute scaling compared to the base RL method. RL$^V$ also exhibits strong generalization capabilities for both easy-to-hard and out-of-domain tasks. Furthermore, RL$^V$ achieves $1.2-1.6\times$ higher performance when jointly scaling parallel and sequential test-time compute with a long reasoning R1 model.
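Illustrative sketch (not taken from the paper): one common way to use a verifier with parallel sampling is to sum verifier scores per distinct final answer and return the answer with the largest total. The abstract does not say which aggregation rule RL^V uses, so the function name `weighted_vote`, the `(answer, score)` input format, and the rule itself are assumptions made for illustration.

```python
from collections import defaultdict


def weighted_vote(samples):
    """Return the answer whose samples have the largest total verifier score.

    samples: list of (answer, verifier_score) pairs, with scores in [0, 1].
    The scores are assumed to come from a verifier; here they are given.
    """
    if not samples:
        raise ValueError("need at least one sample")
    totals = defaultdict(float)
    for answer, score in samples:
        totals[answer] += score
    # Ties are broken in favour of the answer that was seen first.
    return max(totals, key=totals.get)


# Five parallel samples with hypothetical verifier scores.
samples = [("42", 0.9), ("41", 0.4), ("41", 0.3), ("42", 0.2), ("7", 0.5)]
best = weighted_vote(samples)
```

Unlike plain majority voting, a single high-scoring sample can outweigh several low-scoring ones, which is why the quality of the verifier matters for this kind of test-time scaling.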

Title: ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning

Authors: Ziqing Qiao, Yongheng Deng, Jiali Zeng, Dong Wang, Lai Wei, Fandong Meng, Jie Zhou, Ju Ren, Yaoxue Zhang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.04881
Pdf URL: https://arxiv.org/pdf/2505.04881
Copy Paste: [[2505.04881]] ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning(https://arxiv.org/abs/2505.04881)
Keywords: generation
Abstract: Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs caused by redundant content, increasing computational overhead, and degrading user experience. Existing compression methods either operate post-hoc pruning, risking disruption to reasoning coherence, or rely on sampling-based selection, which fails to intervene effectively during generation. In this work, we introduce a confidence-guided perspective to explain the emergence of redundant reflection in LRMs, identifying two key patterns: Confidence Deficit, where the model reconsiders correct steps due to low internal confidence, and Termination Delay, where reasoning continues even after reaching a confident answer. Based on this analysis, we propose ConCISE (Confidence-guided Compression In Step-by-step Efficient Reasoning), a framework that simplifies reasoning chains by reinforcing the model's confidence during inference, thus preventing the generation of redundant reflection steps. It integrates Confidence Injection to stabilize intermediate steps and Early Stopping to terminate reasoning when confidence is sufficient. Extensive experiments demonstrate that fine-tuning LRMs on ConCISE-generated data yields significantly shorter outputs, reducing length by up to approximately 50% under SimPO, while maintaining high task accuracy. ConCISE consistently outperforms existing baselines across multiple reasoning benchmarks.
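Illustrative sketch (not taken from the paper): the Early Stopping idea, ending the reasoning chain once confidence is sufficient, can be pictured as truncating a list of steps at the first step whose confidence crosses a threshold. How ConCISE actually estimates confidence is not stated in the abstract; the per-step confidence values, the `threshold` default, and the function name `truncate_reasoning` are hypothetical.

```python
def truncate_reasoning(steps, threshold=0.9):
    """Keep reasoning steps up to and including the first confident one.

    steps: list of (text, confidence) pairs. The confidence values in [0, 1]
    are assumed to come from an external estimator (hypothetical here).
    """
    kept = []
    for text, confidence in steps:
        kept.append(text)
        if confidence >= threshold:
            break  # confident enough: later reflection steps are dropped
    return kept


steps = [
    ("compute 12 * 7 = 84", 0.40),
    ("so the answer is 84", 0.95),
    ("wait, let me re-check the product", 0.96),  # redundant reflection
]
short_chain = truncate_reasoning(steps)
```

If no step reaches the threshold, the whole chain is kept, so the truncation never removes steps the model still needs.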

Title: Cross-Branch Orthogonality for Improved Generalization in Face Deepfake Detection

Authors: Tharindu Fernando, Clinton Fookes, Sridha Sridharan, Simon Denman
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04888
Pdf URL: https://arxiv.org/pdf/2505.04888
Copy Paste: [[2505.04888]] Cross-Branch Orthogonality for Improved Generalization in Face Deepfake Detection(https://arxiv.org/abs/2505.04888)
Keywords: generation, generative
Abstract: Remarkable advancements in generative AI technology have given rise to a spectrum of novel deepfake categories with unprecedented leaps in their realism, and deepfakes are increasingly becoming a nuisance to law enforcement authorities and the general public. In particular, we observe alarming levels of confusion, deception, and loss of faith regarding multimedia content within society caused by face deepfakes, and existing deepfake detectors are struggling to keep up with the pace of improvements in deepfake generation. This is primarily due to their reliance on specific forgery artifacts, which limits their ability to generalise and detect novel deepfake types. To combat the spread of malicious face deepfakes, this paper proposes a new strategy that leverages coarse-to-fine spatial information, semantic information, and their interactions while ensuring feature distinctiveness and reducing the redundancy of the modelled features. A novel feature orthogonality-based disentanglement strategy is introduced to ensure branch-level and cross-branch feature disentanglement, which allows us to integrate multiple feature vectors without adding complexity to the feature space or compromising generalisation. Comprehensive experiments on three public benchmarks: FaceForensics++, Celeb-DF, and the Deepfake Detection Challenge (DFDC) show that these design choices enable the proposed approach to outperform current state-of-the-art methods by 5% on the Celeb-DF dataset and 7% on the DFDC dataset in a cross-dataset evaluation setting.
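Illustrative sketch (not taken from the paper): a feature-orthogonality constraint is often written as a penalty on the squared cosine similarity between feature vectors, which is zero when the vectors are mutually orthogonal. The abstract does not give the exact loss, so the formulation below and the names `cosine` and `orthogonality_penalty` are assumptions.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def orthogonality_penalty(features):
    """Mean squared cosine similarity over all pairs of feature vectors."""
    pairs = list(combinations(features, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) ** 2 for u, v in pairs) / len(pairs)


# Hypothetical feature vectors from three branches.
branch_features = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
penalty = orthogonality_penalty(branch_features)
```

Minimising such a penalty alongside the detection loss pushes branches toward non-redundant features, which matches the stated goal of reducing redundancy in the modelled features.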

Title: Clustering with Communication: A Variational Framework for Single Cell Representation Learning

Authors: Cong Qi, Yeqing Chen, Jie Zhang, Wei Zhi
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2505.04891
Pdf URL: https://arxiv.org/pdf/2505.04891
Copy Paste: [[2505.04891]] Clustering with Communication: A Variational Framework for Single Cell Representation Learning(https://arxiv.org/abs/2505.04891)
Keywords: generation, generative
Abstract: Single-cell RNA sequencing (scRNA-seq) has revealed complex cellular heterogeneity, but recent studies emphasize that understanding biological function also requires modeling cell-cell communication (CCC), the signaling interactions mediated by ligand-receptor pairs that coordinate cellular behavior. Tools like CellChat have demonstrated that CCC plays a critical role in processes such as cell differentiation, tissue regeneration, and immune response, and that transcriptomic data inherently encodes rich information about intercellular signaling. We propose CCCVAE, a novel variational autoencoder framework that incorporates CCC signals into single-cell representation learning. By leveraging a communication-aware kernel derived from ligand-receptor interactions and a sparse Gaussian process, CCCVAE encodes biologically informed priors into the latent space. Unlike conventional VAEs that treat each cell independently, CCCVAE encourages latent embeddings to reflect both transcriptional similarity and intercellular signaling context. Empirical results across four scRNA-seq datasets show that CCCVAE improves clustering performance, achieving higher evaluation scores than standard VAE baselines. This work demonstrates the value of embedding biological priors into deep generative models for unsupervised single-cell analysis.

Title: OWT: A Foundational Organ-Wise Tokenization Framework for Medical Imaging

Authors: Sifan Song, Siyeop Yoon, Pengfei Jin, Sekeun Kim, Matthew Tivnan, Yujin Oh, Runqi Meng, Ling Chen, Zhiliang Lyu, Dufan Wu, Ning Guo, Xiang Li, Quanzheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04899
Pdf URL: https://arxiv.org/pdf/2505.04899
Copy Paste: [[2505.04899]] OWT: A Foundational Organ-Wise Tokenization Framework for Medical Imaging(https://arxiv.org/abs/2505.04899)
Keywords: generation
Abstract: Recent advances in representation learning often rely on holistic, black-box embeddings that entangle multiple semantic components, limiting interpretability and generalization. These issues are especially critical in medical imaging. To address these limitations, we propose an Organ-Wise Tokenization (OWT) framework with a Token Group-based Reconstruction (TGR) training paradigm. Unlike conventional approaches that produce holistic features, OWT explicitly disentangles an image into separable token groups, each corresponding to a distinct organ or semantic entity. Our design ensures each token group encapsulates organ-specific information, boosting interpretability, generalization, and efficiency while allowing fine-grained control in downstream tasks. Experiments on CT and MRI datasets demonstrate the effectiveness of OWT in not only achieving strong image reconstruction and segmentation performance, but also enabling novel semantic-level generation and retrieval applications that are out of reach for standard holistic embedding methods. These findings underscore the potential of OWT as a foundational framework for semantically disentangled representation learning, offering broad scalability and applicability to real-world medical imaging scenarios and beyond.

Title: SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models

Authors: Shun Taguchi, Hideki Deguchi, Takumi Hamazaki, Hiroyuki Sakai
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.04911
Pdf URL: https://arxiv.org/pdf/2505.04911
Copy Paste: [[2505.04911]] SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models(https://arxiv.org/abs/2505.04911)
Keywords: generation
Abstract: This study introduces SpatialPrompting, a novel framework that harnesses the emergent reasoning capabilities of off-the-shelf multimodal large language models to achieve zero-shot spatial reasoning in three-dimensional (3D) environments. Unlike existing methods that rely on expensive 3D-specific fine-tuning with specialized 3D inputs such as point clouds or voxel-based features, SpatialPrompting employs a keyframe-driven prompt generation strategy. This framework uses metrics such as vision-language similarity, Mahalanobis distance, field of view, and image sharpness to select a diverse and informative set of keyframes from image sequences and then integrates them with corresponding camera pose data to effectively abstract spatial relationships and infer complex 3D structures. The proposed framework not only establishes a new paradigm for flexible spatial reasoning that utilizes intuitive visual and positional cues but also achieves state-of-the-art zero-shot performance on benchmark datasets, such as ScanQA and SQA3D, across several metrics. The proposed method effectively eliminates the need for specialized 3D inputs and fine-tuning, offering a simpler and more scalable alternative to conventional approaches.

Title: GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing

Authors: Tong Wang, Ting Liu, Xiaochao Qu, Chengjing Wu, Luoqi Liu, Xiaolin Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04915
Pdf URL: https://arxiv.org/pdf/2505.04915
Copy Paste: [[2505.04915]] GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing(https://arxiv.org/abs/2505.04915)
Keywords: generation
Abstract: Scene text editing, a subfield of image editing, requires modifying texts in images while preserving style consistency and visual coherence with the surrounding environment. While diffusion-based methods have shown promise in text generation, they still struggle to produce high-quality results. These methods often generate distorted or unrecognizable characters, particularly when dealing with complex characters like Chinese. In such systems, characters are composed of intricate stroke patterns and spatial relationships that must be precisely maintained. We present GlyphMastero, a specialized glyph encoder designed to guide the latent diffusion model for generating texts with stroke-level precision. Our key insight is that existing methods, despite using pretrained OCR models for feature extraction, fail to capture the hierarchical nature of text structures - from individual strokes to stroke-level interactions to overall character-level structure. To address this, our glyph encoder explicitly models and captures the cross-level interactions between local-level individual characters and global-level text lines through our novel glyph attention module. Meanwhile, our model implements a feature pyramid network to fuse the multi-scale OCR backbone features at the global-level. Through these cross-level and multi-scale fusions, we obtain more detailed glyph-aware guidance, enabling precise control over the scene text generation process. Our method achieves an 18.02% improvement in sentence accuracy over the state-of-the-art multi-lingual scene text editing baseline, while simultaneously reducing the text-region Fréchet inception distance by 53.28%.

Title: Canny2Palm: Realistic and Controllable Palmprint Generation for Large-scale Pre-training

Authors: Xingzeng Lan, Xing Duan, Chen Chen, Weiyu Lin, Bo Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04922
Pdf URL: https://arxiv.org/pdf/2505.04922
Copy Paste: [[2505.04922]] Canny2Palm: Realistic and Controllable Palmprint Generation for Large-scale Pre-training(https://arxiv.org/abs/2505.04922)
Keywords: generation
Abstract: Palmprint recognition is a secure and privacy-friendly method of biometric identification. One of the major challenges to improve palmprint recognition accuracy is the scarcity of palmprint data. Recently, a popular line of research revolves around the synthesis of virtual palmprints for large-scale pre-training purposes. In this paper, we propose a novel synthesis method named Canny2Palm that extracts palm textures with Canny edge detector and uses them to condition a Pix2Pix network for realistic palmprint generation. By re-assembling palmprint textures from different identities, we are able to create new identities by seeding the generator with new assemblies. Canny2Palm not only synthesizes realistic data following the distribution of real palmprints but also enables controllable diversity to generate large-scale new identities. On open-set palmprint recognition benchmarks, models pre-trained with Canny2Palm synthetic data outperform the state-of-the-art with up to 7.2% higher identification accuracy. Moreover, the performance of models pre-trained with Canny2Palm continues to improve given 10,000 synthetic IDs while those with existing methods already saturate, demonstrating the potential of our method for large-scale pre-training.

Title: Building-Guided Pseudo-Label Learning for Cross-Modal Building Damage Mapping

Authors: Jiepan Li, He Huang, Yu Sheng, Yujun Guo, Wei He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04941
Pdf URL: https://arxiv.org/pdf/2505.04941
Copy Paste: [[2505.04941]] Building-Guided Pseudo-Label Learning for Cross-Modal Building Damage Mapping(https://arxiv.org/abs/2505.04941)
Keywords: generation
Abstract: Accurate building damage assessment using bi-temporal multi-modal remote sensing images is essential for effective disaster response and recovery planning. This study proposes a novel Building-Guided Pseudo-Label Learning Framework to address the challenges of mapping building damage from pre-disaster optical and post-disaster SAR images. First, we train a series of building extraction models using pre-disaster optical images and building labels. To enhance building segmentation, we employ multi-model fusion and test-time augmentation strategies to generate pseudo-probabilities, followed by a low-uncertainty pseudo-label training method for further refinement. Next, a change detection model is trained on bi-temporal cross-modal images and damaged building labels. To improve damage classification accuracy, we introduce a building-guided low-uncertainty pseudo-label refinement strategy, which leverages building priors from the previous step to guide pseudo-label generation for damaged buildings, reducing uncertainty and enhancing reliability. Experimental results on the 2025 IEEE GRSS Data Fusion Contest dataset demonstrate the effectiveness of our approach, which achieved the highest mIoU score (54.28%) and secured first place in the competition.
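Illustrative sketch (not taken from the paper): the combination of multi-model fusion and low-uncertainty pseudo-labels can be pictured as averaging per-pixel building probabilities across models and keeping a label only where the average is confidently high or low. The `confidence` threshold, the `IGNORE` value, and the function name `fuse_and_label` are hypothetical; the abstract does not specify the authors' uncertainty criterion.

```python
IGNORE = 255  # hypothetical label for pixels that are too uncertain to use


def fuse_and_label(model_probs, confidence=0.8):
    """Fuse per-model building probabilities into low-uncertainty pseudo-labels.

    model_probs: one list per model, each holding the predicted building
    probability for every pixel (flattened). All lists must be the same length.
    Returns 1 (building), 0 (background), or IGNORE for each pixel.
    """
    if not model_probs:
        raise ValueError("need predictions from at least one model")
    n_pixels = len(model_probs[0])
    if any(len(probs) != n_pixels for probs in model_probs):
        raise ValueError("all models must predict the same number of pixels")

    labels = []
    for i in range(n_pixels):
        mean_p = sum(probs[i] for probs in model_probs) / len(model_probs)
        if mean_p >= confidence:
            labels.append(1)
        elif mean_p <= 1.0 - confidence:
            labels.append(0)
        else:
            labels.append(IGNORE)  # models disagree or are unsure
    return labels


# Two hypothetical models, three pixels.
pseudo_labels = fuse_and_label([[0.90, 0.10, 0.60], [0.95, 0.05, 0.40]])
```

Pixels marked `IGNORE` would be excluded from the loss in the next training round, so only confident regions drive the refinement.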

Title: T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models

Authors: Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, Jiale Zhao
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.04946
Pdf URL: https://arxiv.org/pdf/2505.04946
Copy Paste: [[2505.04946]] T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models(https://arxiv.org/abs/2505.04946)
Keywords: generation
Abstract: Thanks to recent advancements in scalable deep architectures and large-scale pretraining, text-to-video generation has achieved unprecedented capabilities in producing high-fidelity, instruction-following content across a wide range of styles, enabling applications in advertising, entertainment, and education. However, these models' ability to render precise on-screen text, such as captions or mathematical formulas, remains largely untested, posing significant challenges for applications requiring exact textual accuracy. In this work, we introduce T2VTextBench, the first human-evaluation benchmark dedicated to evaluating on-screen text fidelity and temporal consistency in text-to-video models. Our suite of prompts integrates complex text strings with dynamic scene changes, testing each model's ability to maintain detailed instructions across frames. We evaluate ten state-of-the-art systems, ranging from open-source solutions to commercial offerings, and find that most struggle to generate legible, consistent text. These results highlight a critical gap in current video generators and provide a clear direction for future research aimed at enhancing textual manipulation in video synthesis.

Title: Graffe: Graph Representation Learning via Diffusion Probabilistic Models

Authors: Dingshuo Chen, Shuchen Xue, Liuji Chen, Yingheng Wang, Qiang Liu, Shu Wu, Zhi-Ming Ma, Liang Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.04956
Pdf URL: https://arxiv.org/pdf/2505.04956
Copy Paste: [[2505.04956]] Graffe: Graph Representation Learning via Diffusion Probabilistic Models(https://arxiv.org/abs/2505.04956)
Keywords: generative
Abstract: Diffusion probabilistic models (DPMs), widely recognized for their potential to generate high-quality samples, tend to go unnoticed in representation learning. While recent progress has highlighted their potential for capturing visual semantics, adapting DPMs to graph representation learning remains in its infancy. In this paper, we introduce Graffe, a self-supervised diffusion model proposed for graph representation learning. It features a graph encoder that distills a source graph into a compact representation, which, in turn, serves as the condition to guide the denoising process of the diffusion decoder. To evaluate the effectiveness of our model, we first explore the theoretical foundations of applying diffusion models to representation learning, proving that the denoising objective implicitly maximizes the conditional mutual information between data and its representation. Specifically, we prove that the negative logarithm of the denoising score matching loss is a tractable lower bound for the conditional mutual information. Empirically, we conduct a series of case studies to validate our theoretical insights. In addition, Graffe delivers competitive results under the linear probing setting on node and graph classification tasks, achieving state-of-the-art performance on 9 of the 11 real-world datasets. These findings indicate that powerful generative models, especially diffusion models, serve as an effective tool for graph representation learning.

Title: CAG-VLM: Fine-Tuning of a Large-Scale Model to Recognize Angiographic Images for Next-Generation Diagnostic Systems

Authors: Yuto Nakamura, Satoshi Kodera, Haruki Settai, Hiroki Shinohara, Masatsugu Tamura, Tomohiro Noguchi, Tatsuki Furusawa, Ryo Takizawa, Tempei Kabayama, Norihiko Takeda
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04964
Pdf URL: https://arxiv.org/pdf/2505.04964
Copy Paste: [[2505.04964]] CAG-VLM: Fine-Tuning of a Large-Scale Model to Recognize Angiographic Images for Next-Generation Diagnostic Systems(https://arxiv.org/abs/2505.04964)
Keywords: generation
Abstract: Coronary angiography (CAG) is the gold-standard imaging modality for evaluating coronary artery disease, but its interpretation and subsequent treatment planning rely heavily on expert cardiologists. To enable AI-based decision support, we introduce a two-stage, physician-curated pipeline and a bilingual (Japanese/English) CAG image-report dataset. First, we sample 14,686 frames from 539 exams and annotate them for key-frame detection and left/right laterality; a ConvNeXt-Base CNN trained on this data achieves 0.96 F1 on laterality classification, even on low-contrast frames. Second, we apply the CNN to 243 independent exams, extract 1,114 key frames, and pair each with its pre-procedure report and expert-validated diagnostic and treatment summary, yielding a parallel corpus. We then fine-tune three open-source VLMs (PaliGemma2, Gemma3, and ConceptCLIP-enhanced Gemma3) via LoRA and evaluate them using VLScore and cardiologist review. Although PaliGemma2 w/LoRA attains the highest VLScore, Gemma3 w/LoRA achieves the top clinician rating (mean 7.20/10); we designate this best-performing model as CAG-VLM. These results demonstrate that specialized, fine-tuned VLMs can effectively assist cardiologists in generating clinical reports and treatment recommendations from CAG images.

Title: ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment

Authors: Wanjiang Weng, Xiaofeng Tan, Hongsong Wang, Pan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.04974
Pdf URL: https://arxiv.org/pdf/2505.04974
Copy Paste: [[2505.04974]] ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment(https://arxiv.org/abs/2505.04974)
Keywords: generation
Abstract: Bilingual text-to-motion generation, which synthesizes 3D human motions from bilingual text inputs, holds immense potential for cross-linguistic applications in gaming, film, and robotics. However, this task faces critical challenges: the absence of bilingual motion-language datasets and the misalignment between text and motion distributions in diffusion models, leading to semantically inconsistent or low-quality motions. To address these challenges, we propose BiHumanML3D, a novel bilingual human motion dataset, which establishes a crucial benchmark for bilingual text-to-motion generation models. Furthermore, we propose a Bilingual Motion Diffusion model (BiMD), which leverages cross-lingual aligned representations to capture semantics, thereby achieving a unified bilingual model. Building upon this, we propose the Reward-guided sampling Alignment (ReAlign) method, comprising a step-aware reward model to assess alignment quality during sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. This reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency and a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Experiments demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods. Project page: this https URL.

Title: Generating Reliable Synthetic Clinical Trial Data: The Role of Hyperparameter Optimization and Domain Constraints

Authors: Waldemar Hahn, Jan-Niklas Eckardt, Christoph Röllig, Martin Sedlmayr, Jan Moritz Middeke, Markus Wolfien
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.05019
Pdf URL: https://arxiv.org/pdf/2505.05019
Copy Paste: [[2505.05019]] Generating Reliable Synthetic Clinical Trial Data: The Role of Hyperparameter Optimization and Domain Constraints(https://arxiv.org/abs/2505.05019)
Keywords: generation, generative
Abstract: The generation of synthetic clinical trial data offers a promising approach to mitigating privacy concerns and data accessibility limitations in medical research. However, ensuring that synthetic datasets maintain high fidelity, utility, and adherence to domain-specific constraints remains a key challenge. While hyperparameter optimization (HPO) has been shown to improve generative model performance, the effectiveness of different optimization strategies for synthetic clinical data remains unclear. This study systematically evaluates four HPO strategies across eight generative models, comparing single-metric optimization against compound metric optimization approaches. Our results demonstrate that HPO consistently improves synthetic data quality, with TVAE, CTGAN, and CTAB-GAN+ achieving improvements of up to 60%, 39%, and 38%, respectively. Compound metric optimization outperformed single-metric strategies, producing more balanced and generalizable synthetic datasets. Interestingly, HPO alone is insufficient to ensure clinically valid synthetic data, as all models exhibited violations of fundamental survival constraints. Preprocessing and postprocessing played a crucial role in reducing these violations, as models lacking robust processing steps produced invalid data in up to 61% of cases. These findings underscore the necessity of integrating explicit domain knowledge alongside HPO to create high quality synthetic datasets. Our study provides actionable recommendations for improving synthetic data generation, with future research needed to refine metric selection and validate these findings on larger datasets to enhance clinical applicability.

Title: Generative Models for Long Time Series: Approximately Equivariant Recurrent Network Structures for an Adjusted Training Scheme

Authors: Ruwen Fulek, Markus Lange-Hegermann
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.05020
Pdf URL: https://arxiv.org/pdf/2505.05020
Copy Paste: [[2505.05020]] Generative Models for Long Time Series: Approximately Equivariant Recurrent Network Structures for an Adjusted Training Scheme(https://arxiv.org/abs/2505.05020)
Keywords: generative
Abstract: We present a simple yet effective generative model for time series data based on a Variational Autoencoder (VAE) with recurrent layers, referred to as the Recurrent Variational Autoencoder with Subsequent Training (RVAE-ST). Our method introduces an adapted training scheme that progressively increases the sequence length, addressing the challenge recurrent layers typically face when modeling long sequences. By leveraging the recurrent architecture, the model maintains a constant number of parameters regardless of sequence length. This design encourages approximate time-shift equivariance and enables efficient modeling of long-range temporal dependencies. Rather than introducing a fundamentally new architecture, we show that a carefully composed combination of known components can match or outperform state-of-the-art generative models on several benchmark datasets. Our model performs particularly well on time series that exhibit quasi-periodic structure, while remaining competitive on datasets with more irregular or partially non-stationary behavior. We evaluate its performance using ELBO, Fréchet Distance, discriminative scores, and visualizations of the learned embeddings.
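Note: the RVAE-ST abstract describes a training scheme that progressively increases the sequence length but gives no concrete schedule. The following is a minimal sketch of what such a curriculum could look like, not the authors' procedure. The geometric growth factor and the helper names `length_schedule` and `windows` are assumptions made for illustration only.

```python
def length_schedule(start_len, final_len, factor=2):
    """Return increasing training sequence lengths, ending at final_len.

    Assumption: lengths grow geometrically by `factor`; the abstract
    does not say how the length is increased between stages.
    """
    if start_len < 1 or final_len < start_len or factor < 2:
        raise ValueError("need 1 <= start_len <= final_len and factor >= 2")
    lengths = []
    length = start_len
    while length < final_len:
        lengths.append(length)
        length *= factor
    lengths.append(final_len)  # always finish on the target length
    return lengths


def windows(series, length, stride):
    """Cut a long series into (possibly overlapping) windows of a given length."""
    return [series[i:i + length]
            for i in range(0, len(series) - length + 1, stride)]


# Hypothetical usage: one training stage per length. Because a recurrent
# model has the same number of parameters for every sequence length, the
# weights from one stage can be carried over to the next.
series = list(range(1000))
for stage_len in length_schedule(100, 1000):
    batch_source = windows(series, stage_len, stride=stage_len // 2)
    # train_one_stage(model, batch_source)  # placeholder, not defined here
```

Carrying weights across stages is the reason the constant parameter count mentioned in the abstract matters: no layer has to be resized when the windows get longer.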

Title: SOAP: Style-Omniscient Animatable Portraits

Authors: Tingting Liao, Yujian Zheng, Adilbek Karmanov, Liwen Hu, Leyang Jin, Yuliang Xiu, Hao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05022
Pdf URL: https://arxiv.org/pdf/2505.05022
Copy Paste: [[2505.05022]] SOAP: Style-Omniscient Animatable Portraits(https://arxiv.org/abs/2505.05022)
Keywords: generation
Abstract: Creating animatable 3D avatars from a single image remains challenging due to style limitations (realistic, cartoon, anime) and difficulties in handling accessories or hairstyles. While 3D diffusion models advance single-view reconstruction for general objects, outputs often lack animation controls or suffer from artifacts because of the domain gap. We propose SOAP, a style-omniscient framework to generate rigged, topology-consistent avatars from any portrait. Our method leverages a multiview diffusion model trained on 24K 3D heads with multiple styles and an adaptive optimization pipeline to deform the FLAME mesh while maintaining topology and rigging via differentiable rendering. The resulting textured avatars support FACS-based animation, integrate with eyeballs and teeth, and preserve details like braided hair or accessories. Extensive experiments demonstrate the superiority of our method over state-of-the-art techniques for both single-view head modeling and diffusion-based generation of Image-to-3D. Our code and data are publicly available for research purposes at this https URL.

Title: CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts

Authors: Manik Sheokand, Parth Sawant
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2505.05063
Pdf URL: https://arxiv.org/pdf/2505.05063
Copy Paste: [[2505.05063]] CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts(https://arxiv.org/abs/2505.05063)
Keywords: generation
Abstract: Large Language Models (LLMs) have achieved remarkable success in code generation tasks, powering various applications like code completion, debugging, and programming assistance. However, existing benchmarks such as HumanEval, MBPP, and BigCodeBench primarily evaluate LLMs on English-only prompts, overlooking the real-world scenario where multilingual developers often use code-mixed language while interacting with LLMs. To address this gap, we introduce CodeMixBench, a novel benchmark designed to evaluate the robustness of LLMs on code generation from code-mixed prompts. Built upon BigCodeBench, CodeMixBench introduces controlled code-mixing (CMD) into the natural language parts of prompts across three language pairs: Hinglish (Hindi-English), Spanish-English, and Chinese Pinyin-English. We comprehensively evaluate a diverse set of open-source code generation models ranging from 1.5B to 15B parameters. Our results show that code-mixed prompts consistently degrade Pass@1 performance compared to their English-only counterparts, with performance drops increasing under higher CMD levels for smaller models. CodeMixBench provides a realistic evaluation framework for studying multilingual code generation and highlights new challenges and directions for building robust code generation models that generalize well across diverse linguistic settings.
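Note: the CodeMixBench abstract reports Pass@1 but does not say how it is computed. A common choice for code generation benchmarks is the unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the unit tests. The sketch below assumes that estimator; the function names are illustrative and not taken from the paper.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass the tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample.
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k=1):
    """Average pass@k over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass, so with one greedy sample per problem Pass@1 is simply the share of problems solved. Comparing this number between an English-only prompt set and its code-mixed counterpart gives the kind of degradation the abstract describes.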

Title: PIDiff: Image Customization for Personalized Identities with Diffusion Models

Authors: Jinyu Gu, Haipeng Liu, Meng Wang, Yang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05081
Pdf URL: https://arxiv.org/pdf/2505.05081
Copy Paste: [[2505.05081]] PIDiff: Image Customization for Personalized Identities with Diffusion Models(https://arxiv.org/abs/2505.05081)
Keywords: generation, generative
Abstract: Text-to-image generation for personalized identities aims at incorporating the specific identity into images using a text prompt and an identity image. Based on the powerful generative capabilities of DDPMs, many previous works adopt additional prompts, such as text embeddings and CLIP image embeddings, to represent the identity information, while they fail to disentangle the identity information and background information. As a result, the generated images not only lose key identity characteristics but also suffer from significantly reduced diversity. To address this issue, previous works have combined the W+ space from StyleGAN with diffusion models, leveraging this space to provide a more accurate and comprehensive representation of identity features through multi-level feature extraction. However, the entanglement of identity and background information in in-the-wild images during training prevents accurate identity localization, resulting in severe semantic interference between identity and background. In this paper, we propose a novel fine-tuning-based diffusion model for personalized identities text-to-image generation, named PIDiff, which leverages the W+ space and an identity-tailored fine-tuning strategy to avoid semantic entanglement and achieves accurate feature extraction and localization. Style editing can also be achieved by PIDiff through preserving the characteristics of identity features in the W+ space, which vary from coarse to fine. Through the combination of the proposed cross-attention block and parameter optimization strategy, PIDiff preserves the identity information and maintains the generation capability for in-the-wild images of the pre-trained model during inference. Our experimental results validate the effectiveness of our method in this task.

Title: ItDPDM: Information-Theoretic Discrete Poisson Diffusion Model

Authors: Sagnik Bhattacharya, Abhiram R. Gorle, Ahmed Mohsin, Ahsan Bilal, Connor Ding, Amit Kumar Singh Yadav, Tsachy Weissman
Subjects: cs.LG, cs.IT, math.PR
Abstract URL: https://arxiv.org/abs/2505.05082
Pdf URL: https://arxiv.org/pdf/2505.05082
Copy Paste: [[2505.05082]] ItDPDM: Information-Theoretic Discrete Poisson Diffusion Model(https://arxiv.org/abs/2505.05082)
Keywords: generative
Abstract: Existing methods for generative modeling of discrete data, such as symbolic music tokens, face two primary challenges: (1) they either embed discrete inputs into continuous state-spaces or (2) rely on variational losses that only approximate the true negative log-likelihood. Previous efforts have individually targeted these limitations. While information-theoretic Gaussian diffusion models alleviate the suboptimality of variational losses, they still perform modeling in continuous domains. In this work, we introduce the Information-Theoretic Discrete Poisson Diffusion Model (ItDPDM), which simultaneously addresses both limitations by directly operating in a discrete state-space via a Poisson diffusion process inspired by photon arrival processes in camera sensors. We introduce a novel Poisson Reconstruction Loss (PRL) and derive an exact relationship between PRL and the true negative log-likelihood, thereby eliminating the need for approximate evidence lower bounds. Experiments conducted on the Lakh MIDI symbolic music dataset and the CIFAR-10 image benchmark demonstrate that ItDPDM delivers significant improvements, reducing test NLL by up to 80% compared to prior baselines, while also achieving faster convergence.

Title: EAM: Enhancing Anything with Diffusion Transformers for Blind Super-Resolution

Authors: Haizhen Xie, Kunpeng Du, Qiangyu Yan, Sen Lu, Jianhong Han, Hanting Chen, Hailin Hu, Jie Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05209
Pdf URL: https://arxiv.org/pdf/2505.05209
Copy Paste: [[2505.05209]] EAM: Enhancing Anything with Diffusion Transformers for Blind Super-Resolution(https://arxiv.org/abs/2505.05209)
Keywords: restoration, super-resolution, generation
Abstract: Utilizing pre-trained Text-to-Image (T2I) diffusion models to guide Blind Super-Resolution (BSR) has become a predominant approach in the field. While T2I models have traditionally relied on U-Net architectures, recent advancements have demonstrated that Diffusion Transformers (DiT) achieve significantly higher performance in this domain. In this work, we introduce Enhancing Anything Model (EAM), a novel BSR method that leverages DiT and outperforms previous U-Net-based approaches. We introduce a novel block, $\Psi$-DiT, which effectively guides the DiT to enhance image restoration. This block employs a low-resolution latent as a separable flow injection control, forming a triple-flow architecture that effectively leverages the prior knowledge embedded in the pre-trained DiT. To fully exploit the prior guidance capabilities of T2I models and enhance their generalization in BSR, we introduce a progressive Masked Image Modeling strategy, which also reduces training costs. Additionally, we propose a subject-aware prompt generation strategy that employs a robust multi-modal model in an in-context learning framework. This strategy automatically identifies key image areas, provides detailed descriptions, and optimizes the utilization of T2I diffusion priors. Our experiments demonstrate that EAM achieves state-of-the-art results across multiple datasets, outperforming existing methods in both quantitative metrics and visual quality.

Title: Diffusion Model Quantization: A Review

Authors: Qian Zeng, Chenggong Hu, Mingli Song, Jie Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05215
Pdf URL: https://arxiv.org/pdf/2505.05215
Copy Paste: [[2505.05215]] Diffusion Model Quantization: A Review(https://arxiv.org/abs/2505.05215)
Keywords: generative
Abstract: Recent success of large text-to-image models has empirically underscored the exceptional performance of diffusion models in generative tasks. To facilitate their efficient deployment on resource-constrained edge devices, model quantization has emerged as a pivotal technique for both compression and acceleration. This survey offers a thorough review of the latest advancements in diffusion model quantization, encapsulating and analyzing the current state of the art in this rapidly advancing domain. First, we provide an overview of the key challenges encountered in the quantization of diffusion models, including those based on U-Net architectures and Diffusion Transformers (DiT). We then present a comprehensive taxonomy of prevalent quantization techniques, engaging in an in-depth discussion of their underlying principles. Subsequently, we perform a meticulous analysis of representative diffusion model quantization schemes from both qualitative and quantitative perspectives. From a quantitative standpoint, we rigorously benchmark a variety of methods using widely recognized datasets, delivering an extensive evaluation of the most recent and impactful research in the field. From a qualitative standpoint, we categorize and synthesize the effects of quantization errors, elucidating these impacts through both visual analysis and trajectory examination. In conclusion, we outline prospective avenues for future research, proposing novel directions for the quantization of generative models in practical applications. The list of related papers, corresponding codes, pre-trained models and comparison results are publicly available at the survey project homepage this https URL.

Title: GFlowNets for Active Learning Based Resource Allocation in Next Generation Wireless Networks

Authors: Charbel Bou Chaaya, Mehdi Bennis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.05224
Pdf URL: https://arxiv.org/pdf/2505.05224
Copy Paste: [[2505.05224]] GFlowNets for Active Learning Based Resource Allocation in Next Generation Wireless Networks(https://arxiv.org/abs/2505.05224)
Keywords: generation, generative
Abstract: In this work, we consider the radio resource allocation problem in a wireless system with various integrated functionalities, such as communication, sensing and computing. We design suitable resource management techniques that can simultaneously cater to those heterogeneous requirements, and scale appropriately with the high-dimensional and discrete nature of the problem. We propose a novel active learning framework where resource allocation patterns are drawn sequentially, evaluated in the environment, and then used to iteratively update a surrogate model of the environment. Our method leverages a generative flow network (GFlowNet) to sample favorable solutions, as such models are trained to generate compositional objects proportionally to their training reward, hence providing an appropriate coverage of its modes. As such, GFlowNet generates diverse and high return resource management designs that update the surrogate model and swiftly discover suitable solutions. We provide simulation results showing that our method can allocate radio resources achieving 20% performance gains against benchmarks, while requiring less than half of the number of acquisition rounds.

Title: Does CLIP perceive art the same way we do?

Authors: Andrea Asperti, Leonardo Dessì, Maria Chiara Tonetti, Nico Wu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2505.05229
Pdf URL: https://arxiv.org/pdf/2505.05229
Copy Paste: [[2505.05229]] Does CLIP perceive art the same way we do?(https://arxiv.org/abs/2505.05229)
Keywords: generative
Abstract: CLIP has emerged as a powerful multimodal model capable of connecting images and text through joint embeddings, but to what extent does it "see" the same way humans do - especially when interpreting artworks? In this paper, we investigate CLIP's ability to extract high-level semantic and stylistic information from paintings, including both human-created and AI-generated imagery. We evaluate its perception across multiple dimensions: content, scene understanding, artistic style, historical period, and the presence of visual deformations or artifacts. By designing targeted probing tasks and comparing CLIP's responses to human annotations and expert benchmarks, we explore its alignment with human perceptual and contextual understanding. Our findings reveal both strengths and limitations in CLIP's visual representations, particularly in relation to aesthetic cues and artistic intent. We further discuss the implications of these insights for using CLIP as a guidance mechanism during generative processes, such as style transfer or prompt-based image synthesis. Our work highlights the need for deeper interpretability in multimodal systems, especially when applied to creative domains where nuance and subjectivity play a central role.

Title: Joint Super-Resolution and Segmentation for 1-m Impervious Surface Area Mapping in China's Yangtze River Economic Belt

Authors: Jie Deng, Danfeng Hong, Chenyu Li, Naoto Yokoya
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.05367
Pdf URL: https://arxiv.org/pdf/2505.05367
Copy Paste: [[2505.05367]] Joint Super-Resolution and Segmentation for 1-m Impervious Surface Area Mapping in China's Yangtze River Economic Belt(https://arxiv.org/abs/2505.05367)
Keywords: super-resolution, generation
Abstract: We propose a novel joint framework by integrating super-resolution and segmentation, called JointSeg, which enables the generation of 1-meter ISA maps directly from freely available Sentinel-2 imagery. JointSeg was trained on multimodal cross-resolution inputs, offering a scalable and affordable alternative to traditional approaches. This synergistic design enables gradual resolution enhancement from 10m to 1m while preserving fine-grained spatial textures, and ensures high classification fidelity through effective cross-scale feature fusion. This method has been successfully applied to the Yangtze River Economic Belt (YREB), a region characterized by complex urban-rural patterns and diverse topography. As a result, a comprehensive ISA mapping product for 2021, referred to as ISA-1, was generated, covering an area of over 2.2 million square kilometers. Quantitative comparisons against the 10m ESA WorldCover and other benchmark products reveal that ISA-1 achieves an F1-score of 85.71%, outperforming bilinear-interpolation-based segmentation by 9.5%, and surpassing other ISA datasets by 21.43%-61.07%. In densely urbanized areas (e.g., Suzhou, Nanjing), ISA-1 reduces ISA overestimation through improved discrimination of green spaces and water bodies. Conversely, in mountainous regions (e.g., Ganzi, Zhaotong), it identifies significantly more ISA due to its enhanced ability to detect fragmented anthropogenic features such as rural roads and sparse settlements, demonstrating its robustness across diverse landscapes. Moreover, we present biennial ISA maps from 2017 to 2023, capturing spatiotemporal urbanization dynamics across representative cities. The results highlight distinct regional growth patterns: rapid expansion in upstream cities, moderate growth in midstream regions, and saturation in downstream metropolitan areas.
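Note: the JointSeg abstract reports an F1-score of 85.71% for the ISA-1 product. As a reminder of what that number measures, here is a minimal sketch of a binary F1 computation over flattened masks (1 = impervious surface, 0 = other). It assumes a plain pixel-wise comparison; the abstract does not describe the sampling or averaging protocol, and `f1_score` is an illustrative helper, not the authors' evaluation code.

```python
def f1_score(pred, truth):
    """Binary F1 from two equal-length sequences of 0/1 labels."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)          # true positives
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)      # false positives
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)      # false negatives
    if tp == 0:
        return 0.0  # no correct positive prediction; avoids division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

Because F1 is the harmonic mean of precision and recall, both overestimation (false positives in dense urban areas) and missed fragments (false negatives in mountainous regions) pull the score down, which matches the two failure modes the abstract discusses.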

Title: TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

Authors: Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, Ying Shan
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.05422
Pdf URL: https://arxiv.org/pdf/2505.05422
Copy Paste: [[2505.05422]] TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation(https://arxiv.org/abs/2505.05422)
Keywords: generation, generative
Abstract: Pioneering token-based works such as Chameleon and Emu3 have established a foundation for multimodal unification but face challenges of high training computational overhead and limited comprehension performance due to a lack of high-level semantics. In this paper, we introduce TokLIP, a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLIP-level semantics while enabling end-to-end multimodal autoregressive training with standard VQ tokens. TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics. Unlike previous approaches (e.g., VILA-U) that discretize high-level features, TokLIP disentangles training objectives for comprehension and generation, allowing the direct application of advanced VQ tokenizers without the need for tailored quantization operations. Our empirical results demonstrate that TokLIP achieves exceptional data efficiency, empowering visual tokens with high-level semantic understanding while enhancing low-level generative capacity, making it well-suited for autoregressive Transformers in both comprehension and generation tasks. The code and models are available at this https URL.

Title: Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

Authors: Han Xiao, Yina Xie, Guanxin Tan, Yinghao Chen, Rui Hu, Ke Wang, Aojun Zhou, Hao Li, Hao Shao, Xudong Lu, Peng Gao, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2505.05446
Pdf URL: https://arxiv.org/pdf/2505.05446
Copy Paste: [[2505.05446]] Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding(https://arxiv.org/abs/2505.05446)
Keywords: generation
Abstract: Visual Document Understanding has become essential with the increase of text-rich visual content. This field poses significant challenges due to the need for effective integration of visual perception and textual comprehension, particularly across diverse document types with complex layouts. Moreover, existing fine-tuning datasets for this domain often fall short in providing the detailed contextual information needed for robust understanding, leading to hallucinations and limited comprehension of spatial relationships among visual elements. To address these challenges, we propose an innovative pipeline that utilizes adaptive generation of markup languages, such as Markdown, JSON, HTML, and TikZ, to build highly structured document representations and deliver contextually-grounded responses. We introduce two fine-grained structured datasets: DocMark-Pile, comprising approximately 3.8M pretraining data pairs for document parsing, and DocMark-Instruct, featuring 624k fine-tuning data annotations for grounded instruction following. Extensive experiments demonstrate that our proposed model significantly outperforms existing state-of-the-art MLLMs across a range of visual document understanding benchmarks, facilitating advanced reasoning and comprehension capabilities in complex visual scenarios. Our code and models are released at https://github.com/Euphoria16/DocMark.

Title: Flow-GRPO: Training Flow Matching Models via Online RL

Authors: Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, Wanli Ouyang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.05470
Pdf URL: https://arxiv.org/pdf/2505.05470
Copy Paste: [[2505.05470]] Flow-GRPO: Training Flow Matching Models via Online RL(https://arxiv.org/abs/2505.05470)
Keywords: generation
Abstract: We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number, significantly improving sampling efficiency without performance degradation. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For complex compositions, RL-tuned SD3.5 generates nearly perfect object counts, spatial relations, and fine-grained attributes, boosting GenEval accuracy from $63\%$ to $95\%$. In visual text rendering, its accuracy improves from $59\%$ to $92\%$, significantly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, little to no reward hacking occurred, meaning rewards did not increase at the cost of image quality or diversity, and both remained stable in our experiments.
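The ODE-to-SDE idea in the abstract can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a 1-D Gaussian "data" distribution so that the marginal velocity and score are known in closed form, a linear interpolation path x_t = (1 - t) x_0 + t eps, and a constant noise level `g`; all of these, and the names `MU`, `S`, `velocity`, `score`, and `sample`, are hypothetical. It relies on the standard Fokker-Planck fact that adding noise of strength g together with a drift correction of (g^2 / 2) times the score leaves the marginals unchanged, so `g = 0` gives the deterministic sampler and `g > 0` gives a stochastic one that should reach the same distribution.

```python
import numpy as np

MU, S = 2.0, 0.5  # hypothetical 1-D data distribution N(MU, S^2)


def mean_var(t):
    # Marginal of x_t = (1 - t) * x_0 + t * eps, with x_0 ~ N(MU, S^2), eps ~ N(0, 1).
    return (1.0 - t) * MU, (1.0 - t) ** 2 * S**2 + t**2


def velocity(x, t):
    # Exact marginal velocity E[eps - x_0 | x_t = x] for this Gaussian toy case.
    m, var = mean_var(t)
    dvar_dt = -2.0 * (1.0 - t) * S**2 + 2.0 * t
    return -MU + 0.5 * dvar_dt / var * (x - m)


def score(x, t):
    # Gradient of the log marginal density at time t.
    m, var = mean_var(t)
    return -(x - m) / var


def sample(n, steps, g, rng):
    """Integrate from t=1 (noise) to t=0 (data) with Euler / Euler-Maruyama steps."""
    x = rng.standard_normal(n)
    h = 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * h
        # Reverse-time drift: ODE term plus the score correction that
        # compensates for the injected noise.
        drift = -velocity(x, t) + 0.5 * g**2 * score(x, t)
        x = x + drift * h + g * np.sqrt(h) * rng.standard_normal(n)
    return x


rng = np.random.default_rng(0)
x_ode = sample(20000, 200, g=0.0, rng=rng)  # deterministic sampler
x_sde = sample(20000, 200, g=0.5, rng=rng)  # stochastic sampler, same marginals
print(x_ode.mean(), x_ode.std(), x_sde.mean(), x_sde.std())
```

Both runs should end close to mean 2.0 and standard deviation 0.5, but only the stochastic one yields different trajectories from the same starting noise, which is the kind of randomness the abstract says RL exploration needs.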

Title: Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

Authors: Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, Weilin Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05472
Pdf URL: https://arxiv.org/pdf/2505.05472
Copy Paste: [[2505.05472]] Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation(https://arxiv.org/abs/2505.05472)
Keywords: generation
Abstract: Recent progress in unified models for image understanding and generation has been impressive, yet most approaches remain limited to single-modal generation conditioned on multiple modalities. In this paper, we present Mogao, a unified framework that advances this paradigm by enabling interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance, which allow it to harness the strengths of both autoregressive models for text generation and diffusion models for high-quality image synthesis. These practical improvements also make Mogao particularly effective at processing arbitrarily interleaved sequences of text and images. To further unlock the potential of unified models, we introduce an efficient training strategy on a large-scale, in-house dataset specifically curated for joint text and image generation. Extensive experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs. Its emergent capabilities in zero-shot image editing and compositional generation highlight Mogao as a practical omni-modal foundation model, paving the way for the future development and scaling of unified multi-modal systems.

Title: 3D Scene Generation: A Survey

Authors: Beichen Wen, Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05474
Pdf URL: https://arxiv.org/pdf/2505.05474
Copy Paste: [[2505.05474]] 3D Scene Generation: A Survey(https://arxiv.org/abs/2505.05474)
Keywords: generation, generative
Abstract: 3D scene generation seeks to synthesize spatially structured, semantically meaningful, and photorealistic environments for applications such as immersive media, robotics, autonomous driving, and embodied AI. Early methods based on procedural rules offered scalability but limited diversity. Recent advances in deep generative models (e.g., GANs, diffusion models) and 3D representations (e.g., NeRF, 3D Gaussians) have enabled the learning of real-world scene distributions, improving fidelity, diversity, and view consistency. Recent advances like diffusion models bridge 3D scene synthesis and photorealism by reframing generation as image or video synthesis problems. This survey provides a systematic overview of state-of-the-art approaches, organizing them into four paradigms: procedural generation, neural 3D-based generation, image-based generation, and video-based generation. We analyze their technical foundations, trade-offs, and representative results, and review commonly used datasets, evaluation protocols, and downstream applications. We conclude by discussing key challenges in generation capacity, 3D representation, data and annotations, and evaluation, and outline promising directions including higher fidelity, physics-aware and interactive generation, and unified perception-generation models. This review organizes recent advances in 3D scene generation and highlights promising directions at the intersection of generative AI, 3D vision, and embodied intelligence. To track ongoing developments, we maintain an up-to-date project page: this https URL.

Title: SVAD: From Single Image to 3D Avatar via Synthetic Data Generation with Video Diffusion and Data Augmentation

Authors: Yonwoo Choi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.05475
Pdf URL: https://arxiv.org/pdf/2505.05475
Copy Paste: [[2505.05475]] SVAD: From Single Image to 3D Avatar via Synthetic Data Generation with Video Diffusion and Data Augmentation(https://arxiv.org/abs/2505.05475)
Keywords: restoration, generation, generative
Abstract: Creating high-quality animatable 3D human avatars from a single image remains a significant challenge in computer vision due to the inherent difficulty of reconstructing complete 3D information from a single viewpoint. Current approaches face a clear limitation: 3D Gaussian Splatting (3DGS) methods produce high-quality results but require multiple views or video sequences, while video diffusion models can generate animations from single images but struggle with consistency and identity preservation. We present SVAD, a novel approach that addresses these limitations by leveraging complementary strengths of existing techniques. Our method generates synthetic training data through video diffusion, enhances it with identity preservation and image restoration modules, and utilizes this refined data to train 3DGS avatars. Comprehensive evaluations demonstrate that SVAD outperforms state-of-the-art (SOTA) single-image methods in maintaining identity consistency and fine details across novel poses and viewpoints, while enabling real-time rendering capabilities. Through our data augmentation pipeline, we overcome the dependency on dense monocular or multi-view training data typically required by traditional 3DGS approaches. Extensive quantitative and qualitative comparisons show that our method outperforms baseline models across multiple metrics. By effectively combining the generative power of diffusion models with both the high-quality results and rendering efficiency of 3DGS, our work establishes a new approach for high-fidelity avatar generation from a single image input.