2025-04-16

Title: Beyond the Generative Learning Trilemma: Generative Model Assessment in Data Scarcity Domains

Authors: Marco Salmè, Lorenzo Tronchin, Rosa Sicilia, Paolo Soda, Valerio Guarrasi
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2504.10555
Pdf URL: https://arxiv.org/pdf/2504.10555
Copy Paste: [[2504.10555]] Beyond the Generative Learning Trilemma: Generative Model Assessment in Data Scarcity Domains(https://arxiv.org/abs/2504.10555)
Keywords: diffusion, generative
Abstract: Data scarcity remains a critical bottleneck impeding technological advancements across various domains, including but not limited to medicine and precision agriculture. To address this challenge, we explore the potential of Deep Generative Models (DGMs) in producing synthetic data that satisfies the Generative Learning Trilemma: fidelity, diversity, and sampling efficiency. However, recognizing that these criteria alone are insufficient for practical applications, we extend the trilemma to include utility, robustness, and privacy, factors crucial for ensuring the applicability of DGMs in real-world scenarios. Evaluating these metrics becomes particularly challenging in data-scarce environments, as DGMs traditionally rely on large datasets to perform optimally. This limitation is especially pronounced in domains like medicine and precision agriculture, where ensuring acceptable model performance under data constraints is vital. To address these challenges, we assess the Generative Learning Trilemma in data-scarcity settings using state-of-the-art evaluation metrics, comparing three prominent DGMs: Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Diffusion Models (DMs). Furthermore, we propose a comprehensive framework to assess utility, robustness, and privacy in synthetic data generated by DGMs. Our findings demonstrate varying strengths among DGMs, with each model exhibiting unique advantages based on the application context. This study broadens the scope of the Generative Learning Trilemma, aligning it with real-world demands and providing actionable guidance for selecting DGMs tailored to specific applications.

Title: VAE-based Feature Disentanglement for Data Augmentation and Compression in Generalized GNSS Interference Classification

Authors: Lucas Heublein, Simon Kocher, Tobias Feigl, Alexander Rügamer, Christopher Mutschler, Felix Ott
Subjects: cs.LG, cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2504.10556
Pdf URL: https://arxiv.org/pdf/2504.10556
Copy Paste: [[2504.10556]] VAE-based Feature Disentanglement for Data Augmentation and Compression in Generalized GNSS Interference Classification(https://arxiv.org/abs/2504.10556)
Keywords: generative
Abstract: Distributed learning and Edge AI necessitate efficient data processing, low-latency communication, decentralized model training, and stringent data privacy to facilitate real-time intelligence on edge devices while reducing dependency on centralized infrastructure and ensuring high model performance. In the context of global navigation satellite system (GNSS) applications, the primary objective is to accurately monitor and classify interferences that degrade system performance in distributed environments, thereby enhancing situational awareness. To achieve this, machine learning (ML) models can be deployed on low-resource devices, ensuring minimal communication latency and preserving data privacy. The key challenge is to compress ML models while maintaining high classification accuracy. In this paper, we propose variational autoencoders (VAEs) for disentanglement to extract essential latent features that enable accurate classification of interferences. We demonstrate that the disentanglement approach can be leveraged for both data compression and data augmentation by interpolating the lower-dimensional latent representations of signal power. To validate our approach, we evaluate three VAE variants - vanilla, factorized, and conditional generative - on four distinct datasets, including two collected in controlled indoor environments and two real-world highway datasets. Additionally, we conduct extensive hyperparameter searches to optimize performance. Our proposed VAE achieves a data compression rate ranging from 512 to 8,192 and achieves an accuracy up to 99.92%.

Title: H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models

Authors: Yushu Wu, Yanyu Li, Ivan Skorokhodov, Anil Kag, Willi Menapace, Sharath Girish, Aliaksandr Siarohin, Yanzhi Wang, Sergey Tulyakov
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.10567
Pdf URL: https://arxiv.org/pdf/2504.10567
Copy Paste: [[2504.10567]] H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models(https://arxiv.org/abs/2504.10567)
Keywords: diffusion
Abstract: Autoencoder (AE) is the key to the success of latent diffusion models for image and video generation, reducing the denoising resolution and improving efficiency. However, the power of AE has long been underexplored in terms of network design, compression ratio, and training strategy. In this work, we systematically examine the architecture design choices and optimize the computation distribution to obtain a series of efficient and high-compression video AEs that can decode in real time on mobile devices. We also unify the design of plain Autoencoder and image-conditioned I2V VAE, achieving multifunctionality in a single network. In addition, we find that the widely adopted discriminative losses, i.e., GAN, LPIPS, and DWT losses, provide no significant improvements when training AEs at scale. We propose a novel latent consistency loss that does not require complicated discriminator design or hyperparameter tuning, but provides stable improvements in reconstruction quality. Our AE achieves an ultra-high compression ratio and real-time decoding speed on mobile while outperforming prior art in terms of reconstruction metrics by a large margin. We finally validate our AE by training a DiT on its latent space and demonstrate fast, high-quality text-to-video generation capability.

Title: Demo: ViolentUTF as An Accessible Platform for Generative AI Red Teaming

Authors: Tam N. Nguyen
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.10603
Pdf URL: https://arxiv.org/pdf/2504.10603
Copy Paste: [[2504.10603]] Demo: ViolentUTF as An Accessible Platform for Generative AI Red Teaming(https://arxiv.org/abs/2504.10603)
Keywords: generative
Abstract: The rapid integration of Generative AI (GenAI) into various applications necessitates robust risk management strategies, which include Red Teaming (RT), an evaluation method for simulating adversarial attacks. Unfortunately, RT for GenAI is often hindered by technical complexity, lack of user-friendly interfaces, and inadequate reporting features. This paper introduces ViolentUTF, an accessible, modular, and scalable platform for GenAI red teaming. Through intuitive interfaces (Web GUI, CLI, API, MCP) powered by LLMs and for LLMs, ViolentUTF aims to empower non-technical domain experts and students alongside technical experts, and to facilitate comprehensive security evaluation by unifying capabilities from RT frameworks like Microsoft PyRIT and Nvidia Garak with its own specialized evaluators. ViolentUTF is being used for evaluating the robustness of a flagship LLM-based product in a large US Government department. It also demonstrates effectiveness in evaluating LLMs' cross-domain reasoning capability between cybersecurity and behavioral psychology.

Title: Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling

Authors: Michal Balcerak, Tamaz Amiranashvili, Suprosanna Shit, Antonio Terpin, Sebastian Kaltenbach, Petros Koumoutsakos, Bjoern Menze
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2504.10612
Pdf URL: https://arxiv.org/pdf/2504.10612
Copy Paste: [[2504.10612]] Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling(https://arxiv.org/abs/2504.10612)
Keywords: generative
Abstract: Generative models often map noise to data by matching flows or scores, but these approaches become cumbersome for incorporating partial observations or additional priors. Inspired by recent advances in Wasserstein gradient flows, we propose Energy Matching, a framework that unifies flow-based approaches with the flexibility of energy-based models (EBMs). Far from the data manifold, samples move along curl-free, optimal transport paths from noise to data. As they approach the data manifold, an entropic energy term guides the system into a Boltzmann equilibrium distribution, explicitly capturing the underlying likelihood structure of the data. We parameterize this dynamic with a single time-independent scalar field, which serves as both a powerful generator and a flexible prior for effective regularization of inverse problems. Our method substantially outperforms existing EBMs on CIFAR-10 generation (FID 3.97 compared to 8.61), while retaining the simulation-free training of transport-based approaches away from the data manifold. Additionally, we exploit the flexibility of our method and introduce an interaction energy for diverse mode exploration. Our approach focuses on learning a static scalar potential energy -- without time conditioning, auxiliary generators, or additional networks -- marking a significant departure from recent EBM methods. We believe this simplified framework significantly advances EBM capabilities and paves the way for their broader adoption in generative modeling across diverse domains.

Title: Improving In-Context Learning with Reasoning Distillation

Authors: Nafis Sadeq, Xin Xu, Zhouhang Xie, Julian McAuley, Byungkyu Kang, Prarit Lamba, Xiang Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10647
Pdf URL: https://arxiv.org/pdf/2504.10647
Copy Paste: [[2504.10647]] Improving In-Context Learning with Reasoning Distillation(https://arxiv.org/abs/2504.10647)
Keywords: in-context
Abstract: Language models rely on semantic priors to perform in-context learning, which leads to poor performance on tasks involving inductive reasoning. Instruction-tuning methods based on imitation learning can superficially enhance the in-context learning performance of language models, but they often fail to improve the model's understanding of the underlying rules that connect inputs and outputs in few-shot demonstrations. We propose ReDis, a reasoning distillation technique designed to improve the inductive reasoning capabilities of language models. Through a careful combination of data augmentation, filtering, supervised fine-tuning, and alignment, ReDis achieves significant performance improvements across a diverse range of tasks, including 1D-ARC, List Function, ACRE, and MiniSCAN. Experiments on three language model backbones show that ReDis outperforms equivalent few-shot prompting baselines across all tasks and even surpasses the teacher model, GPT-4o, in some cases. ReDis, based on the LLaMA-3 backbone, achieves relative improvements of 23.2%, 2.8%, and 66.6% over GPT-4o on 1D-ARC, ACRE, and MiniSCAN, respectively, within a similar hypothesis search space. The code, dataset, and model checkpoints will be made available at this https URL.

Title: H-MoRe: Learning Human-centric Motion Representation for Action Analysis

Authors: Zhanbo Huang, Xiaoming Liu, Yu Kong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10676
Pdf URL: https://arxiv.org/pdf/2504.10676
Copy Paste: [[2504.10676]] H-MoRe: Learning Human-centric Motion Representation for Action Analysis(https://arxiv.org/abs/2504.10676)
Keywords: self-supervised
Abstract: In this paper, we propose H-MoRe, a novel pipeline for learning precise human-centric motion representation. Our approach dynamically preserves relevant human motion while filtering out background movement. Notably, unlike previous methods relying on fully supervised learning from synthetic data, H-MoRe learns directly from real-world scenarios in a self-supervised manner, incorporating both human pose and body shape information. Inspired by kinematics, H-MoRe represents absolute and relative movements of each body point in a matrix format that captures nuanced motion details, termed world-local flows. H-MoRe offers refined insights into human motion, which can be integrated seamlessly into various action-related applications. Experimental results demonstrate that H-MoRe brings substantial improvements across various downstream tasks, including gait recognition (CL@R1: +16.01%), action recognition (Acc@1: +8.92%), and video generation (FVD: -67.07%). Additionally, H-MoRe exhibits high inference efficiency (34 fps), making it suitable for most real-time scenarios. Models and code will be released upon publication.

Title: Achieving Optimal Tissue Repair Through MARL with Reward Shaping and Curriculum Learning

Authors: Muhammad Al-Zafar Khan, Jamal Al-Karaki
Subjects: cs.LG, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2504.10677
Pdf URL: https://arxiv.org/pdf/2504.10677
Copy Paste: [[2504.10677]] Achieving Optimal Tissue Repair Through MARL with Reward Shaping and Curriculum Learning(https://arxiv.org/abs/2504.10677)
Keywords: diffusion
Abstract: In this paper, we present a multi-agent reinforcement learning (MARL) framework for optimizing tissue repair processes using engineered biological agents. Our approach integrates: (1) stochastic reaction-diffusion systems modeling molecular signaling, (2) neural-like electrochemical communication with Hebbian plasticity, and (3) a biologically informed reward function combining chemical gradient tracking, neural synchronization, and robust penalties. A curriculum learning scheme guides the agent through progressively complex repair scenarios. In silico experiments demonstrate emergent repair strategies, including dynamic secretion control and spatial coordination.

Title: SpinMeRound: Consistent Multi-View Identity Generation Using Diffusion Models

Authors: Stathis Galanakis, Alexandros Lattas, Stylianos Moschoglou, Bernhard Kainz, Stefanos Zafeiriou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10716
Pdf URL: https://arxiv.org/pdf/2504.10716
Copy Paste: [[2504.10716]] SpinMeRound: Consistent Multi-View Identity Generation Using Diffusion Models(https://arxiv.org/abs/2504.10716)
Keywords: diffusion
Abstract: Despite recent progress in diffusion models, generating realistic head portraits from novel viewpoints remains a significant challenge. Most current approaches are constrained to limited angular ranges, predominantly focusing on frontal or near-frontal views. Moreover, although the recent emerging large-scale diffusion models have been proven robust in handling 3D scenes, they underperform on facial data, given their complex structure and the uncanny valley pitfalls. In this paper, we propose SpinMeRound, a diffusion-based approach designed to generate consistent and accurate head portraits from novel viewpoints. By leveraging a number of input views alongside an identity embedding, our method effectively synthesizes diverse viewpoints of a subject whilst robustly maintaining its unique identity features. Through experimentation, we showcase our model's generation capabilities in 360 head synthesis, while beating current state-of-the-art multiview diffusion models.

Title: Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization

Authors: Darryl Hannan, John Cooper, Dylan White, Timothy Doster, Henry Kvinge, Yijing Watkins
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10727
Pdf URL: https://arxiv.org/pdf/2504.10727
Copy Paste: [[2504.10727]] Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization(https://arxiv.org/abs/2504.10727)
Keywords: foundation model
Abstract: Multimodal large language models (MLLMs) have altered the landscape of computer vision, obtaining impressive results across a wide range of tasks, especially in zero-shot settings. Unfortunately, their strong performance does not always transfer to out-of-distribution domains, such as earth observation (EO) imagery. Prior work has demonstrated that MLLMs excel at some EO tasks, such as image captioning and scene understanding, while failing at tasks that require more fine-grained spatial reasoning, such as object localization. However, MLLMs are advancing rapidly and insights quickly become out-dated. In this work, we analyze more recent MLLMs that have been explicitly trained to include fine-grained spatial reasoning capabilities, benchmarking them on EO object localization tasks. We demonstrate that these models are performant in certain settings, making them well suited for zero-shot scenarios. Additionally, we provide a detailed discussion focused on prompt selection, ground sample distance (GSD) optimization, and analyzing failure cases. We hope that this work will prove valuable as others evaluate whether an MLLM is well suited for a given EO localization task and how to optimize it.

Title: Power-scaled Bayesian Inference with Score-based Generative Models

Authors: Huseyin Tuna Erdinc, Yunlin Zeng, Abhinav Prakash Gahlot, Felix J. Herrmann
Subjects: cs.LG, cs.CV, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2504.10807
Pdf URL: https://arxiv.org/pdf/2504.10807
Copy Paste: [[2504.10807]] Power-scaled Bayesian Inference with Score-based Generative Models(https://arxiv.org/abs/2504.10807)
Keywords: generative
Abstract: We propose a score-based generative algorithm for sampling from power-scaled priors and likelihoods within the Bayesian inference framework. Our algorithm enables flexible control over prior-likelihood influence without requiring retraining for different power-scaling configurations. Specifically, we focus on synthesizing seismic velocity models conditioned on imaged seismic. Our method enables sensitivity analysis by sampling from intermediate power posteriors, allowing us to assess the relative influence of the prior and likelihood on samples of the posterior distribution. Through a comprehensive set of experiments, we evaluate the effects of varying the power parameter in different settings: applying it solely to the prior, to the likelihood of a Bayesian formulation, and to both simultaneously. The results show that increasing the power of the likelihood up to a certain threshold improves the fidelity of posterior samples to the conditioning data (e.g., seismic images), while decreasing the prior power promotes greater structural diversity among samples. Moreover, we find that moderate scaling of the likelihood leads to a reduced shot data residual, confirming its utility in posterior refinement.
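The power-scaling idea in this abstract can be written out in a short sketch. It assumes the usual tempered form of Bayes' rule; the symbols $x$ (velocity model), $y$ (imaged seismic), $\alpha$ (likelihood power), and $\beta$ (prior power) are notation introduced here for illustration and are not taken from the abstract.

```latex
\begin{align}
  p_{\alpha,\beta}(x \mid y) &\propto p(y \mid x)^{\alpha}\, p(x)^{\beta}, \\
  \nabla_x \log p_{\alpha,\beta}(x \mid y)
    &= \alpha\, \nabla_x \log p(y \mid x) + \beta\, \nabla_x \log p(x).
\end{align}
```

Under this assumed form the normalizing constant does not depend on $x$, so the score of the power-scaled posterior is a weighted sum of the likelihood score and the prior score, and $\alpha = \beta = 1$ recovers the standard posterior. This would be consistent with the claim that different power settings need no retraining, although how the weights enter the sampler at each noise level is not stated in the abstract.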

Title: Tabular foundation model to detect empathy from visual cues

Authors: Md Rakibul Hasan, Shafin Rahman, Md Zakir Hossain, Aneesh Krishna, Tom Gedeon
Subjects: cs.CV, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10808
Pdf URL: https://arxiv.org/pdf/2504.10808
Copy Paste: [[2504.10808]] Tabular foundation model to detect empathy from visual cues(https://arxiv.org/abs/2504.10808)
Keywords: foundation model, in-context
Abstract: Detecting empathy from video interactions is an emerging area of research. Video datasets, however, are often released as extracted features (i.e., tabular data) rather than raw footage due to privacy and ethical concerns. Prior research on such tabular datasets established tree-based classical machine learning approaches as the best-performing models. Motivated by the recent success of textual foundation models (i.e., large language models), we explore the use of tabular foundation models in empathy detection from tabular visual features. We experiment with two recent tabular foundation models, TabPFN v2 and TabICL, through in-context learning and fine-tuning setups. Our experiments on a public human-robot interaction benchmark demonstrate a significant boost in cross-subject empathy detection accuracy over several strong baselines (accuracy: $0.590 \rightarrow 0.730$; AUC: $0.564 \rightarrow 0.669$). In addition to performance improvement, we contribute novel insights and an evaluation setup to ensure generalisation on unseen subjects in this public benchmark. As the practice of releasing video features as tabular datasets is likely to persist due to privacy constraints, our findings will be widely applicable to future empathy detection video datasets as well.

Title: GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR

Authors: Christophe Bolduc, Yannick Hold-Geoffroy, Zhixin Shu, Jean-François Lalonde
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10809
Pdf URL: https://arxiv.org/pdf/2504.10809
Copy Paste: [[2504.10809]] GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR(https://arxiv.org/abs/2504.10809)
Keywords: diffusion
Abstract: We present GaSLight, a method that generates spatially-varying lighting from regular images. Our method proposes using HDR Gaussian Splats as light source representation, marking the first time regular images can serve as light sources in a 3D renderer. Our two-stage process first enhances the dynamic range of images plausibly and accurately by leveraging the priors embedded in diffusion models. Next, we employ Gaussian Splats to model 3D lighting, achieving spatially variant lighting. Our approach yields state-of-the-art results on HDR estimations and their applications in illuminating virtual objects and scenes. To facilitate the benchmarking of images as light sources, we introduce a novel dataset of calibrated and unsaturated HDR to evaluate images as light sources. We assess our method using a combination of this novel dataset and an existing dataset from the literature. The code to reproduce our method will be available upon acceptance.

Title: IlluSign: Illustrating Sign Language Videos by Leveraging the Attention Mechanism

Authors: Janna Bruner, Amit Moryossef, Lior Wolf
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10822
Pdf URL: https://arxiv.org/pdf/2504.10822
Copy Paste: [[2504.10822]] IlluSign: Illustrating Sign Language Videos by Leveraging the Attention Mechanism(https://arxiv.org/abs/2504.10822)
Keywords: diffusion, generative
Abstract: Sign languages are dynamic visual languages that involve hand gestures, in combination with non-manual elements such as facial expressions. While video recordings of sign language are commonly used for education and documentation, the dynamic nature of signs can make it challenging to study them in detail, especially for new learners and educators. This work aims to convert sign language video footage into static illustrations, which serve as an additional educational resource to complement video content. This process is usually done by an artist, and is therefore quite costly. We propose a method that illustrates sign language videos by leveraging generative models' ability to understand both the semantic and geometric aspects of images. Our approach focuses on transferring a sketch-like illustration style to video footage of sign language, combining the start and end frames of a sign into a single illustration, and using arrows to highlight the hand's direction and motion. While many style transfer methods address domain adaptation at varying levels of abstraction, applying a sketch-like style to sign languages, especially for hand gestures and facial expressions, poses a significant challenge. To tackle this, we intervene in the denoising process of a diffusion model, injecting style as keys and values into high-resolution attention layers, and fusing geometric information from the image and edges as queries. For the final illustration, we use the attention mechanism to combine the attention weights from both the start and end illustrations, resulting in a soft combination. Our method offers a cost-effective solution for generating sign language illustrations at inference time, addressing the lack of such resources in educational materials.
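The attention intervention described in this abstract (style supplied as keys and values, geometry supplied as queries) can be illustrated with a generic single-head attention sketch. This is not the authors' implementation: the function name `style_injected_attention`, the projection matrices, and the feature shapes are all hypothetical, and the real method operates inside the attention layers of a diffusion model during denoising.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def style_injected_attention(geometry_feats, style_feats, w_q, w_k, w_v):
    """Single-head attention with queries from one source and keys/values from another.

    geometry_feats: (n_geo, d_model) features carrying layout (hypothetical input).
    style_feats:    (n_style, d_model) features carrying style (hypothetical input).
    w_q, w_k, w_v:  (d_model, d_head) projection matrices.
    Returns the attended output (n_geo, d_head) and the weights (n_geo, n_style).
    """
    q = geometry_feats @ w_q          # queries come from the geometry source
    k = style_feats @ w_k             # keys come from the style source
    v = style_feats @ w_v             # values come from the style source
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    geometry = rng.normal(size=(5, 8))
    style = rng.normal(size=(7, 8))
    w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
    out, weights = style_injected_attention(geometry, style, w_q, w_k, w_v)
    print(out.shape, weights.shape)
```

Each output row is a convex combination of style values, selected by how well the corresponding geometry query matches the style keys. In this simplified picture, that is why the output can follow the layout of one image while taking its appearance from another.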

Title: OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding

Authors: Dianbing Xi, Jiepeng Wang, Yuanzhi Liang, Xi Qiu, Yuchi Huo, Rui Wang, Chi Zhang, Xuelong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10825
Pdf URL: https://arxiv.org/pdf/2504.10825
Copy Paste: [[2504.10825]] OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding(https://arxiv.org/abs/2504.10825)
Keywords: diffusion
Abstract: In this paper, we propose a novel framework for controllable video diffusion, OmniVDiff, aiming to synthesize and comprehend multiple types of video visual content in a single diffusion model. To achieve this, OmniVDiff treats all video visual modalities in the color space to learn a joint distribution, while employing an adaptive control strategy that dynamically adjusts the role of each visual modality during the diffusion process, either as a generation modality or a conditioning modality. This allows flexible manipulation of each modality's role, enabling support for a wide range of tasks. Consequently, our model supports three key functionalities: (1) Text-conditioned video generation: multi-modal visual video sequences (i.e., rgb, depth, canny, segmentation) are generated based on the text conditions in one diffusion process; (2) Video understanding: OmniVDiff can estimate the depth, canny map, and semantic segmentation across the input rgb frames while ensuring coherence with the rgb input; and (3) X-conditioned video generation: OmniVDiff generates videos conditioned on fine-grained attributes (e.g., depth maps or segmentation maps). By integrating these diverse tasks into a unified video diffusion framework, OmniVDiff enhances the flexibility and scalability for controllable video diffusion, making it an effective tool for a variety of downstream applications, such as video-to-video translation. Extensive experiments demonstrate the effectiveness of our approach, highlighting its potential for various video-related applications.

Title: LayoutCoT: Unleashing the Deep Reasoning Potential of Large Language Models for Layout Generation

Authors: Hengyu Shi, Junhao Su, Huansheng Ning, Xiaoming Wei, Jialin Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10829
Pdf URL: https://arxiv.org/pdf/2504.10829
Copy Paste: [[2504.10829]] LayoutCoT: Unleashing the Deep Reasoning Potential of Large Language Models for Layout Generation(https://arxiv.org/abs/2504.10829)
Keywords: generative, in-context
Abstract: Conditional layout generation aims to automatically generate visually appealing and semantically coherent layouts from user-defined constraints. While recent methods based on generative models have shown promising results, they typically require substantial amounts of training data or extensive fine-tuning, limiting their versatility and practical applicability. Alternatively, some training-free approaches leveraging in-context learning with Large Language Models (LLMs) have emerged, but they often suffer from limited reasoning capabilities and overly simplistic ranking mechanisms, which restrict their ability to generate consistently high-quality layouts. To this end, we propose LayoutCoT, a novel approach that leverages the reasoning capabilities of LLMs through a combination of Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) techniques. Specifically, LayoutCoT transforms layout representations into a standardized serialized format suitable for processing by LLMs. A Layout-aware RAG is used to facilitate effective retrieval and generate a coarse layout by LLMs. This preliminary layout, together with the selected exemplars, is then fed into a specially designed CoT reasoning module for iterative refinement, significantly enhancing both semantic coherence and visual quality. We conduct extensive experiments on five public datasets spanning three conditional layout generation tasks. Experimental results demonstrate that LayoutCoT achieves state-of-the-art performance without requiring training or fine-tuning. Notably, our CoT reasoning module enables standard LLMs, even those without explicit deep reasoning abilities, to outperform specialized deep-reasoning models such as deepseek-R1, highlighting the potential of our approach in unleashing the deep reasoning capabilities of LLMs for layout generation tasks.

Title: Moving Beyond Next-Token Prediction: Transformers are Context-Sensitive Language Generators

Authors: Phill Kyu Rhee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10845
Pdf URL: https://arxiv.org/pdf/2504.10845
Copy Paste: [[2504.10845]] Moving Beyond Next-Token Prediction: Transformers are Context-Sensitive Language Generators(https://arxiv.org/abs/2504.10845)
Keywords: generative
Abstract: Large Language Models (LLMs), powered by Transformers, have demonstrated human-like intelligence capabilities, yet their underlying mechanisms remain poorly understood. This paper presents a novel framework for interpreting LLMs as probabilistic generators of left context-sensitive languages (CSLs). We hypothesize that Transformers can be effectively decomposed into three fundamental components: context windows, attention mechanisms, and autoregressive generation frameworks. This decomposition allows for the development of more flexible and interpretable computational models, moving beyond the traditional view of attention and autoregression as inseparable processes. We argue that next-token predictions can be understood as probabilistic, dynamic approximations of left CSL production rules, providing an intuitive explanation for how simple token predictions can yield human-like intelligence outputs. Given that all CSLs are left context-sensitive (Penttonen, 1974), we conclude that Transformers stochastically approximate CSLs, which are widely recognized as models of human-like intelligence. This interpretation bridges the gap between Formal Language Theory and the observed generative power of Transformers, laying a foundation for future advancements in generative AI theory and applications. Our novel perspective on Transformer architectures will foster a deeper understanding of LLMs and their future potential.

Title: How to Enhance Downstream Adversarial Robustness (almost) without Touching the Pre-Trained Foundation Model?

Authors: Meiqi Liu, Zhuoqun Huang, Yue Xing
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2504.10850
Pdf URL: https://arxiv.org/pdf/2504.10850
Copy Paste: [[2504.10850]] How to Enhance Downstream Adversarial Robustness (almost) without Touching the Pre-Trained Foundation Model?(https://arxiv.org/abs/2504.10850)
Keywords: foundation model
Abstract: With the rise of powerful foundation models, a pre-training-fine-tuning paradigm has become increasingly popular: a foundation model is pre-trained using a huge amount of data from various sources, and then the downstream users only need to fine-tune and adapt it to specific downstream tasks. However, due to the high computation complexity of adversarial training, it is not feasible to fine-tune the foundation model to improve its robustness on the downstream task. Observing the above challenge, we want to improve the downstream robustness without updating/accessing the weights in the foundation model. Inspired by existing literature in robustness inheritance (Kim et al., 2020), through theoretical investigation, we identify a close relationship between robust contrastive learning and the adversarial robustness of supervised learning. To further validate and utilize this theoretical insight, we design a simple-yet-effective robust auto-encoder as a data pre-processing method before feeding the data into the foundation model. The proposed approach has zero access to the foundation model when training the robust auto-encoder. Extensive experiments demonstrate the effectiveness of the proposed method in improving the robustness of downstream tasks, verifying the connection between the feature robustness (implied by small adversarial contrastive loss) and the robustness of the downstream task.

Title: Enhancing Features in Long-tailed Data Using Large Vision Model

Authors: Pengxiao Han, Changkun Ye, Jinguang Tong, Cuicui Jiang, Jie Hong, Li Fang, Xuesong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10852
Pdf URL: https://arxiv.org/pdf/2504.10852
Copy Paste: [[2504.10852]] Enhancing Features in Long-tailed Data Using Large Vision Model(https://arxiv.org/abs/2504.10852)
Keywords: foundation model
Abstract: Language-based foundation models, such as large language models (LLMs) or large vision-language models (LVLMs), have been widely studied in long-tailed recognition. However, the need for linguistic data is not applicable to all practical tasks. In this study, we aim to explore using large vision models (LVMs) or visual foundation models (VFMs) to enhance long-tailed data features without any language information. Specifically, we extract features from the LVM and fuse them with features in the baseline network's map and latent space to obtain the augmented features. Moreover, we design several prototype-based losses in the latent space to further exploit the potential of the augmented features. In the experimental section, we validate our approach on two benchmark datasets: ImageNet-LT and iNaturalist2018.

Title: PT-Mark: Invisible Watermarking for Text-to-image Diffusion Models via Semantic-aware Pivotal Tuning

Authors: Yaopeng Wang, Huiyu Xu, Zhibo Wang, Jiacheng Du, Zhichao Li, Yiming Li, Qiu Wang, Kui Ren
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.10853
Pdf URL: https://arxiv.org/pdf/2504.10853
Copy Paste: [[2504.10853]] PT-Mark: Invisible Watermarking for Text-to-image Diffusion Models via Semantic-aware Pivotal Tuning(https://arxiv.org/abs/2504.10853)
Keywords: diffusion
Abstract: Watermarking for diffusion images has drawn considerable attention due to the widespread use of text-to-image diffusion models and the increasing need for their copyright protection. Recently, advanced watermarking techniques, such as Tree Ring, integrate watermarks by embedding traceable patterns (e.g., Rings) into the latent distribution during the diffusion process. Such methods disrupt the original semantics of the generated images due to the inevitable distribution shift caused by the watermarks, thereby limiting their practicality, particularly in digital art creation. In this work, we present Semantic-aware Pivotal Tuning Watermarks (PT-Mark), a novel invisible watermarking method that preserves both the semantics of diffusion images and the traceability of the watermark. PT-Mark preserves the original semantics of the watermarked image by gradually aligning the generation trajectory with the original (pivotal) trajectory while maintaining the traceable watermarks during the whole diffusion denoising process. To achieve this, we first compute the salient regions of the watermark at each diffusion denoising step as a spatial prior to identify areas that can be aligned without disrupting the watermark pattern. Guided by these regions, we then introduce an additional pivotal tuning branch that optimizes the text embedding to align the semantics while preserving the watermarks. Extensive evaluations demonstrate that PT-Mark can preserve the original semantics of the diffusion images while integrating robust watermarks. It achieves a 10% improvement in the performance of semantic preservation (i.e., SSIM, PSNR, and LPIPS) compared to state-of-the-art watermarking methods, while also showing comparable robustness against real-world perturbations and four times greater efficiency.
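Note (illustrative sketch, not taken from the paper): of the three semantic-preservation metrics named above, PSNR has a simple closed form, 10 * log10(MAX^2 / MSE). The function below computes it over flat pixel sequences. The name `psnr` and the 8-bit default `max_value=255.0` are assumptions made for illustration only; how the authors actually compute the metric is not stated in the abstract.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equal-length pixel sequences.

    max_value is the largest possible pixel value (255 assumes 8-bit images).
    """
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the two images.
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher PSNR means the watermarked image is closer, pixel by pixel, to the unwatermarked one. SSIM and LPIPS are structural and learned-perceptual measures respectively and are not covered by this sketch.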

Title: LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation

Authors: Hanning Chen, Yang Ni, Wenjun Huang, Hyunwoo Oh, Yezi Liu, Tamoghno Das, Mohsen Imani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10854
Pdf URL: https://arxiv.org/pdf/2504.10854
Copy Paste: [[2504.10854]] LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation(https://arxiv.org/abs/2504.10854)
Keywords: foundation model
Abstract: Large Vision Language Models (LVLMs) have been widely adopted to guide vision foundation models in performing reasoning segmentation tasks, achieving impressive performance. However, the substantial computational overhead associated with LVLMs presents a new challenge. The primary source of this computational cost is the processing of hundreds of image tokens. Therefore, an effective strategy to mitigate such overhead is to reduce the number of image tokens, a process known as image token pruning. Previous studies on image token pruning for LVLMs have primarily focused on high-level visual understanding tasks, such as visual question answering and image captioning. In contrast, guiding vision foundation models to generate accurate visual masks based on textual queries demands precise semantic and spatial reasoning capabilities. Consequently, pruning methods must carefully control individual image tokens throughout the LVLM reasoning process. Our empirical analysis reveals that existing methods struggle to adequately balance reductions in computational overhead with the necessity to maintain high segmentation accuracy. In this work, we propose LVLM_CSP, a novel training-free visual token pruning method specifically designed for LVLM-based reasoning segmentation tasks. LVLM_CSP consists of three stages: clustering, scattering, and pruning. Initially, the LVLM performs coarse-grained visual reasoning using a subset of selected image tokens. Next, fine-grained reasoning is conducted, and finally, most visual tokens are pruned in the last stage. Extensive experiments demonstrate that LVLM_CSP achieves a 65% reduction in image token inference FLOPs with virtually no accuracy degradation, and a 70% reduction with only a minor 1% drop in accuracy on the 7B LVLM.
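Note (illustrative sketch, not the LVLM_CSP algorithm): the abstract describes image token pruning only in general terms. A generic score-based variant, assuming each image token already has an importance score (for example, attention received from the text query), could look like the following. The function name `prune_tokens`, the scoring source, and the top-k rule are all assumptions; the paper's clustering and scattering stages are not reproduced here.

```python
def prune_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order.

    Returns (kept_tokens, kept_indices).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 1 <= keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")
    # Rank token positions by score, highest first, and take the top `keep`.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_indices = sorted(ranked[:keep])  # restore spatial/sequence order
    return [tokens[i] for i in kept_indices], kept_indices
```

Restoring the original order after selection matters for segmentation-style tasks, because positional information is tied to where a token sits in the sequence.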

Title: Bringing together invertible UNets with invertible attention modules for memory-efficient diffusion models

Authors: Karan Jain, Mohammad Nayeem Teli
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10883
Pdf URL: https://arxiv.org/pdf/2504.10883
Copy Paste: [[2504.10883]] Bringing together invertible UNets with invertible attention modules for memory-efficient diffusion models(https://arxiv.org/abs/2504.10883)
Keywords: diffusion
Abstract: Diffusion models have recently achieved state-of-the-art performance on many image generation tasks. However, most models require significant computational resources to achieve this. This becomes apparent in the application of medical image synthesis due to the 3D nature of medical datasets such as CT scans, MRIs, and electron microscopy. In this paper we propose a novel architecture for memory-efficient single-GPU training of diffusion models on high-dimensional medical datasets. The proposed model is built by using an invertible UNet architecture with invertible attention modules. This leads to the following two contributions: 1. denoising diffusion models and thus enabling memory usage to be independent of the dimensionality of the dataset, and 2. reducing the energy usage during training. While this new model can be applied to a multitude of image generation tasks, we showcase its memory efficiency on the 3D BraTS2020 dataset, achieving up to a 15% decrease in peak memory consumption during training while maintaining image quality comparable to SOTA.

Title: Bridging Distribution Gaps in Time Series Foundation Model Pretraining with Prototype-Guided Normalization

Authors: Peiliang Gong, Emadeldeen Eldele, Min Wu, Zhenghua Chen, Xiaoli Li, Daoqiang Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10900
Pdf URL: https://arxiv.org/pdf/2504.10900
Copy Paste: [[2504.10900]] Bridging Distribution Gaps in Time Series Foundation Model Pretraining with Prototype-Guided Normalization(https://arxiv.org/abs/2504.10900)
Keywords: foundation model
Abstract: Foundation models have achieved remarkable success across diverse machine-learning domains through large-scale pretraining on large, diverse datasets. However, pretraining on such datasets introduces significant challenges due to substantial mismatches in data distributions, a problem particularly pronounced with time series data. In this paper, we tackle this issue by proposing a domain-aware adaptive normalization strategy within the Transformer architecture. Specifically, we replace the traditional LayerNorm with a prototype-guided dynamic normalization mechanism (ProtoNorm), where learned prototypes encapsulate distinct data distributions, and sample-to-prototype affinity determines the appropriate normalization layer. This mechanism effectively captures the heterogeneity of time series characteristics, aligning pretrained representations with downstream tasks. Through comprehensive empirical evaluation, we demonstrate that our method significantly outperforms conventional pretraining techniques across both classification and forecasting tasks, while effectively mitigating the adverse effects of distribution shifts during pretraining. Incorporating ProtoNorm is as simple as replacing a single line of code. Extensive experiments on diverse real-world time series benchmarks validate the robustness and generalizability of our approach, advancing the development of more versatile time series foundation models.
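Note (illustrative sketch, not the authors' ProtoNorm implementation): the abstract says that learned prototypes capture distinct data distributions and that sample-to-prototype affinity determines which normalization layer is applied. One minimal reading of that idea is sketched below. Cosine similarity as the affinity, hard (argmax) selection, and the names `layer_norm`, `cosine`, and `proto_norm` are assumptions made for illustration; the paper may use a different affinity or a soft mixture.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard layer normalization of one feature vector with affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def proto_norm(x, prototypes, norm_params):
    """Normalize x with the (gamma, beta) pair of its most similar prototype.

    prototypes:  list of vectors, one per assumed data distribution.
    norm_params: list of (gamma, beta) pairs, aligned with prototypes.
    Returns (normalized_vector, selected_prototype_index).
    """
    affinities = [cosine(x, p) for p in prototypes]
    k = max(range(len(prototypes)), key=affinities.__getitem__)
    gamma, beta = norm_params[k]
    return layer_norm(x, gamma, beta), k
```

In a trained model the prototypes and the per-prototype affine parameters would be learned jointly with the Transformer; here they are simply passed in.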

Title: InterAnimate: Taming Region-aware Diffusion Model for Realistic Human Interaction Animation

Authors: Yukang Lin, Yan Hong, Zunnan Xu, Xindi Li, Chao Xu, Chuanbiao Song, Ronghui Li, Haoxing Chen, Jun Lan, Huijia Zhu, Weiqiang Wang, Jianfu Zhang, Xiu Li
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2504.10905
Pdf URL: https://arxiv.org/pdf/2504.10905
Copy Paste: [[2504.10905]] InterAnimate: Taming Region-aware Diffusion Model for Realistic Human Interaction Animation(https://arxiv.org/abs/2504.10905)
Keywords: diffusion
Abstract: Recent video generation research has focused heavily on isolated actions, leaving interactive motions, such as hand-face interactions, largely unexamined. These interactions are essential for emerging biometric authentication systems, which rely on interactive motion-based anti-spoofing approaches. From a security perspective, there is a growing need for large-scale, high-quality interactive videos to train and strengthen authentication models. In this work, we introduce a novel paradigm for animating realistic hand-face interactions. Our approach simultaneously learns spatio-temporal contact dynamics and biomechanically plausible deformation effects, enabling natural interactions where hand movements induce anatomically accurate facial deformations while maintaining collision-free contact. To facilitate this research, we present InterHF, a large-scale hand-face interaction dataset featuring 18 interaction patterns and 90,000 annotated videos. Additionally, we propose InterAnimate, a region-aware diffusion model designed specifically for interaction animation. InterAnimate leverages learnable spatial and temporal latents to effectively capture dynamic interaction priors and integrates a region-aware interaction mechanism that injects these priors into the denoising process. To the best of our knowledge, this work represents the first large-scale effort to systematically study human hand-face interactions. Qualitative and quantitative results show InterAnimate produces highly realistic animations, setting a new benchmark. Code and data will be made public to advance research.

Title: Towards A Universal Graph Structural Encoder

Authors: Jialin Chen, Haolan Zuo, Haoyu Peter Wang, Siqi Miao, Pan Li, Rex Ying
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10917
Pdf URL: https://arxiv.org/pdf/2504.10917
Copy Paste: [[2504.10917]] Towards A Universal Graph Structural Encoder(https://arxiv.org/abs/2504.10917)
Keywords: self-supervised
Abstract: Recent advancements in large-scale pre-training have shown the potential to learn generalizable representations for downstream tasks. In the graph domain, however, capturing and transferring structural information across different graph domains remains challenging, primarily due to the inherent differences in topological patterns across various contexts. Additionally, most existing models struggle to capture the complexity of rich graph structures, leading to inadequate exploration of the embedding space. To address these challenges, we propose GFSE, a universal graph structural encoder designed to capture transferable structural patterns across diverse domains such as molecular graphs, social networks, and citation networks. GFSE is the first cross-domain graph structural encoder pre-trained with multiple self-supervised learning objectives. Built on a Graph Transformer, GFSE incorporates attention mechanisms informed by graph inductive bias, enabling it to encode intricate multi-level and fine-grained topological features. The pre-trained GFSE produces generic and theoretically expressive positional and structural encoding for graphs, which can be seamlessly integrated with various downstream graph feature encoders, including graph neural networks for vectorized features and Large Language Models for text-attributed graphs. Comprehensive experiments on synthetic and real-world datasets demonstrate GFSE's capability to significantly enhance the model's performance while requiring substantially less task-specific fine-tuning. Notably, GFSE achieves state-of-the-art performance in 81.6% of evaluated cases, spanning diverse graph models and datasets, highlighting its potential as a powerful and versatile encoder for graph-structured data.

Title: Transfer Learning for Temporal Link Prediction

Authors: Ayan Chatterjee, Barbara Ikica, Babak Ravandi, John Palowitch
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10925
Pdf URL: https://arxiv.org/pdf/2504.10925
Copy Paste: [[2504.10925]] Transfer Learning for Temporal Link Prediction(https://arxiv.org/abs/2504.10925)
Keywords: foundation model
Abstract: Link prediction on graphs has applications spanning from recommender systems to drug discovery. Temporal link prediction (TLP) refers to predicting future links in a temporally evolving graph and adds complexity related to the dynamic nature of graphs. State-of-the-art TLP models incorporate memory modules alongside graph neural networks to learn both the temporal mechanisms of incoming nodes and the evolving graph topology. However, memory modules only store information about nodes seen at train time, and hence such models cannot be directly transferred to entirely new graphs at test time and deployment. In this work, we study a new transfer learning task for temporal link prediction, and develop transfer-effective methods for memory-laden models. Specifically, motivated by work showing the informativeness of structural signals for the TLP task, we augment existing TLP model architectures with a structural mapping module, which learns a mapping from graph structural (topological) features to memory embeddings. Our work paves the way for a memory-free foundation model for TLP.

Title: AFiRe: Anatomy-Driven Self-Supervised Learning for Fine-Grained Representation in Radiographic Images

Authors: Yihang Liu, Lianghua He, Ying Wen, Longzhen Yang, Hongzhou Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10972
Pdf URL: https://arxiv.org/pdf/2504.10972
Copy Paste: [[2504.10972]] AFiRe: Anatomy-Driven Self-Supervised Learning for Fine-Grained Representation in Radiographic Images(https://arxiv.org/abs/2504.10972)
Keywords: self-supervised, anomaly
Abstract: Current self-supervised methods, such as contrastive learning, predominantly focus on global discrimination, neglecting the critical fine-grained anatomical details required for accurate radiographic analysis. To address this challenge, we propose an Anatomy-driven self-supervised framework for enhancing Fine-grained Representation in radiographic image analysis (AFiRe). The core idea of AFiRe is to align the anatomical consistency with the unique token-processing characteristics of Vision Transformer. Specifically, AFiRe synergistically performs two self-supervised schemes: (i) Token-wise anatomy-guided contrastive learning, which aligns image tokens based on structural and categorical consistency, thereby enhancing fine-grained spatial-anatomical discrimination; (ii) Pixel-level anomaly-removal restoration, which particularly focuses on local anomalies, thereby refining the learned discrimination with detailed geometrical information. Additionally, we propose Synthetic Lesion Mask to enhance anatomical diversity while preserving intra-consistency, which is typically corrupted by traditional data augmentations, such as Cropping and Affine transformations. Experimental results show that AFiRe: (i) provides robust anatomical discrimination, achieving more cohesive feature clusters compared to state-of-the-art contrastive learning methods; (ii) demonstrates superior generalization, surpassing 7 radiography-specific self-supervised methods in multi-label classification tasks with limited labeling; and (iii) integrates fine-grained information, enabling precise anomaly detection using only image-level annotations.

Title: Self-Supervised Enhancement of Forward-Looking Sonar Images: Bridging Cross-Modal Degradation Gaps through Feature Space Transformation and Multi-Frame Fusion

Authors: Zhisheng Zhang, Peng Zhang, Fengxiang Wang, Liangli Ma, Fuchun Sun
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.10974
Pdf URL: https://arxiv.org/pdf/2504.10974
Copy Paste: [[2504.10974]] Self-Supervised Enhancement of Forward-Looking Sonar Images: Bridging Cross-Modal Degradation Gaps through Feature Space Transformation and Multi-Frame Fusion(https://arxiv.org/abs/2504.10974)
Keywords: self-supervised
Abstract: Enhancing forward-looking sonar images is critical for accurate underwater target detection. Current deep learning methods mainly rely on supervised training with simulated data, but the difficulty in obtaining high-quality real-world paired data limits their practical use and generalization. Although self-supervised approaches from remote sensing partially alleviate data shortages, they neglect the cross-modal degradation gap between sonar and remote sensing images. Directly transferring pretrained weights often leads to overly smooth sonar images, detail loss, and insufficient brightness. To address this, we propose a feature-space transformation that maps sonar images from the pixel domain to a robust feature domain, effectively bridging the degradation gap. Additionally, our self-supervised multi-frame fusion strategy leverages complementary inter-frame information to naturally remove speckle noise and enhance target-region brightness. Experiments on three self-collected real-world forward-looking sonar datasets show that our method significantly outperforms existing approaches, effectively suppressing noise, preserving detailed edges, and substantially improving brightness, demonstrating strong potential for underwater target detection applications.

Title: ProtFlow: Fast Protein Sequence Design via Flow Matching on Compressed Protein Language Model Embeddings

Authors: Zitai Kong, Yiheng Zhu, Yinlong Xu, Hanjing Zhou, Mingzhe Yin, Jialu Wu, Hongxia Xu, Chang-Yu Hsieh, Tingjun Hou, Jian Wu
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2504.10983
Pdf URL: https://arxiv.org/pdf/2504.10983
Copy Paste: [[2504.10983]] ProtFlow: Fast Protein Sequence Design via Flow Matching on Compressed Protein Language Model Embeddings(https://arxiv.org/abs/2504.10983)
Keywords: diffusion, generative
Abstract: The design of protein sequences with desired functionalities is a fundamental task in protein engineering. Deep generative methods, such as autoregressive models and diffusion models, have greatly accelerated the discovery of novel protein sequences. However, these methods mainly focus on local or shallow residual semantics and suffer from low inference efficiency, large modeling space and high training cost. To address these challenges, we introduce ProtFlow, a fast flow matching-based protein sequence design framework that operates on embeddings derived from semantically meaningful latent space of protein language models. By compressing and smoothing the latent space, ProtFlow enhances performance while training on limited computational resources. Leveraging reflow techniques, ProtFlow enables high-quality single-step sequence generation. Additionally, we develop a joint design pipeline for the design scene of multichain proteins. We evaluate ProtFlow across diverse protein design tasks, including general peptides and long-chain proteins, antimicrobial peptides, and antibodies. Experimental results demonstrate that ProtFlow outperforms task-specific methods in these applications, underscoring its potential and broad applicability in computational protein sequence design and analysis.

Title: TMCIR: Token Merge Benefits Composed Image Retrieval

Authors: Chaoyang Wang, Zeyu Zhang, Long Teng, Zijun Li, Shichao Kan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10995
Pdf URL: https://arxiv.org/pdf/2504.10995
Copy Paste: [[2504.10995]] TMCIR: Token Merge Benefits Composed Image Retrieval(https://arxiv.org/abs/2504.10995)
Keywords: diffusion
Abstract: Composed Image Retrieval (CIR) retrieves target images using a multi-modal query that combines a reference image with text describing desired modifications. The primary challenge is effectively fusing this visual and textual information. Current cross-modal feature fusion approaches for CIR exhibit an inherent bias in intention interpretation. These methods tend to disproportionately emphasize either the reference image features (visual-dominant fusion) or the textual modification intent (text-dominant fusion through image-to-text conversion). Such an imbalanced representation often fails to accurately capture and reflect the actual search intent of the user in the retrieval results. To address this challenge, we propose TMCIR, a novel framework that advances composed image retrieval through two key innovations: 1) Intent-Aware Cross-Modal Alignment. We first fine-tune CLIP encoders contrastively using intent-reflecting pseudo-target images, synthesized from reference images and textual descriptions via a diffusion model. This step enhances the text encoder's ability to capture nuanced intents in textual descriptions. 2) Adaptive Token Fusion. We further fine-tune all encoders contrastively by comparing adaptive token-fusion features with the target image. This mechanism dynamically balances visual and textual representations within the contrastive learning pipeline, optimizing the composed feature for retrieval. Extensive experiments on Fashion-IQ and CIRR datasets demonstrate that TMCIR significantly outperforms state-of-the-art methods, particularly in capturing nuanced user intent.
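Note (illustrative sketch, not the TMCIR training objective): the abstract states only that the encoders are fine-tuned "contrastively". A common form of contrastive objective is the InfoNCE loss, shown below for a batch where row i of a similarity matrix holds the similarities between composed query i and every candidate target, with the matching target on the diagonal. The function name `info_nce`, the default temperature of 0.07, and the one-directional (query-to-target) form are assumptions; the paper's exact loss is not given in the abstract.

```python
import math


def info_nce(similarity, temperature=0.07):
    """Mean InfoNCE loss for a square similarity matrix.

    similarity[i][j] is the similarity between query i and target j;
    the positive pair for query i is target i (the diagonal entry).
    """
    total = 0.0
    for i, row in enumerate(similarity):
        logits = [s / temperature for s in row]
        # Numerically stable log-sum-exp over all candidates for query i.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
        # Negative log-probability assigned to the positive pair.
        total += log_denominator - logits[i]
    return total / len(similarity)
```

When all similarities are equal the loss equals log(batch size), the value for a model that cannot tell targets apart; it approaches zero as the diagonal entries dominate their rows.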

Title: AnimeDL-2M: Million-Scale AI-Generated Anime Image Detection and Localization in Diffusion Era

Authors: Chenyang Zhu, Xing Zhang, Yuyang Sun, Ching-Chun Chang, Isao Echizen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11015
Pdf URL: https://arxiv.org/pdf/2504.11015
Copy Paste: [[2504.11015]] AnimeDL-2M: Million-Scale AI-Generated Anime Image Detection and Localization in Diffusion Era(https://arxiv.org/abs/2504.11015)
Keywords: diffusion
Abstract: Recent advances in image generation, particularly diffusion models, have significantly lowered the barrier for creating sophisticated forgeries, making image manipulation detection and localization (IMDL) increasingly challenging. While prior work in IMDL has focused largely on natural images, the anime domain remains underexplored, despite its growing vulnerability to AI-generated forgeries. Misrepresentations of AI-generated images as hand-drawn artwork, copyright violations, and inappropriate content modifications pose serious threats to the anime community and industry. To address this gap, we propose AnimeDL-2M, the first large-scale benchmark for anime IMDL with comprehensive annotations. It comprises over two million images including real, partially manipulated, and fully AI-generated samples. Experiments indicate that models trained on existing IMDL datasets of natural images perform poorly when applied to anime images, highlighting a clear domain gap between anime and natural images. To better handle IMDL tasks in the anime domain, we further propose AniXplore, a novel model tailored to the visual characteristics of anime imagery. Extensive evaluations demonstrate that AniXplore achieves superior performance compared to existing methods. Dataset and code can be found in this https URL.

Title: Defending Against Frequency-Based Attacks with Diffusion Models

Authors: Fatemeh Amerehi, Patrick Healy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11034
Pdf URL: https://arxiv.org/pdf/2504.11034
Copy Paste: [[2504.11034]] Defending Against Frequency-Based Attacks with Diffusion Models(https://arxiv.org/abs/2504.11034)
Keywords: diffusion, generative
Abstract: Adversarial training is a common strategy for enhancing model robustness against adversarial attacks. However, it is typically tailored to the specific attack types it is trained on, limiting its ability to generalize to unseen threat models. Adversarial purification offers an alternative by leveraging a generative model to remove perturbations before classification. Since the purifier is trained independently of both the classifier and the threat models, it is better equipped to handle previously unseen attack scenarios. Diffusion models have proven highly effective for noise purification, not only in countering pixel-wise adversarial perturbations but also in addressing non-adversarial data shifts. In this study, we broaden the focus beyond pixel-wise robustness to explore the extent to which purification can mitigate both spectral and spatial adversarial attacks. Our findings highlight its effectiveness in handling diverse distortion patterns across low- to high-frequency regions.

Title: Zero-Shot Whole-Body Humanoid Control via Behavioral Foundation Models

Authors: Andrea Tirinzoni, Ahmed Touati, Jesse Farebrother, Mateusz Guzek, Anssi Kanervisto, Yingchen Xu, Alessandro Lazaric, Matteo Pirotta
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.11054
Pdf URL: https://arxiv.org/pdf/2504.11054
Copy Paste: [[2504.11054]] Zero-Shot Whole-Body Humanoid Control via Behavioral Foundation Models(https://arxiv.org/abs/2504.11054)
Keywords: foundation model
Abstract: Unsupervised reinforcement learning (RL) aims at pre-training agents that can solve a wide range of downstream tasks in complex environments. Despite recent advancements, existing approaches suffer from several limitations: they may require running an RL process on each downstream task to achieve a satisfactory performance, they may need access to datasets with good coverage or well-curated task-specific samples, or they may pre-train policies with unsupervised losses that are poorly correlated with the downstream tasks of interest. In this paper, we introduce a novel algorithm regularizing unsupervised RL towards imitating trajectories from unlabeled behavior datasets. The key technical novelty of our method, called Forward-Backward Representations with Conditional-Policy Regularization, is to train forward-backward representations to embed the unlabeled trajectories to the same latent space used to represent states, rewards, and policies, and use a latent-conditional discriminator to encourage policies to "cover" the states in the unlabeled behavior dataset. As a result, we can learn policies that are well aligned with the behaviors in the dataset, while retaining zero-shot generalization capabilities for reward-based and imitation tasks. We demonstrate the effectiveness of this new approach in a challenging humanoid control problem: leveraging observation-only motion capture datasets, we train Meta Motivo, the first humanoid behavioral foundation model that can be prompted to solve a variety of whole-body tasks, including motion tracking, goal reaching, and reward optimization. The resulting model is capable of expressing human-like behaviors and it achieves competitive performance with task-specific methods while outperforming state-of-the-art unsupervised RL and model-based baselines.

Title: Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detections

Authors: Alireza Salehi, Mohammadreza Salehi, Reshad Hosseini, Cees G. M. Snoek, Makoto Yamada, Mohammad Sabokrou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11055
Pdf URL: https://arxiv.org/pdf/2504.11055
Copy Paste: [[2504.11055]] Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detections(https://arxiv.org/abs/2504.11055)
Keywords: anomaly
Abstract: Anomaly Detection (AD) involves identifying deviations from normal data distributions and is critical in fields such as medical diagnostics and industrial defect detection. Traditional AD methods typically require the availability of normal training samples; however, this assumption is not always feasible, as collecting such data can be impractical. Additionally, these methods often struggle to generalize across different domains. Recent advancements, such as AnomalyCLIP and AdaCLIP, utilize the zero-shot generalization capabilities of CLIP but still face a performance gap between image-level and pixel-level anomaly detection. To address this gap, we propose a novel approach that conditions the prompts of the text encoder based on image context extracted from the vision encoder. Also, to capture fine-grained variations more effectively, we have modified the CLIP vision encoder and altered the extraction of dense features. These changes ensure that the features retain richer spatial and structural information for both normal and anomalous prompts. Our method achieves state-of-the-art performance, improving performance by 2% to 29% across different metrics on 14 datasets. This demonstrates its effectiveness in both image-level and pixel-level anomaly detection.

Title: UKDM: Underwater keypoint detection and matching using underwater image enhancement techniques

Authors: Pedro Diaz-Garcia, Felix Escalona, Miguel Cazorla
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11063
Pdf URL: https://arxiv.org/pdf/2504.11063
Copy Paste: [[2504.11063]] UKDM: Underwater keypoint detection and matching using underwater image enhancement techniques(https://arxiv.org/abs/2504.11063)
Keywords: generative
Abstract: The purpose of this paper is to explore the use of underwater image enhancement techniques to improve keypoint detection and matching. By applying advanced deep learning models, including generative adversarial networks and convolutional neural networks, we aim to find the best method which improves the accuracy of keypoint detection and the robustness of matching algorithms. We evaluate the performance of these techniques on various underwater datasets, demonstrating significant improvements over traditional methods.

Title: Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting

Authors: Jiaxin Huang, Sheng Miao, BangBang Yang, Yuewen Ma, Yiyi Liao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11092
Pdf URL: https://arxiv.org/pdf/2504.11092
Copy Paste: [[2504.11092]] Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting(https://arxiv.org/abs/2504.11092)
Keywords: generative
Abstract: Reconstructing 4D dynamic scenes from casually captured monocular videos is valuable but highly challenging, as each timestamp is observed from a single viewpoint. We introduce Vivid4D, a novel approach that enhances 4D monocular video synthesis by augmenting observation views - synthesizing multi-view videos from a monocular input. Unlike existing methods that either solely leverage geometric priors for supervision or use generative priors while overlooking geometry, we integrate both. This reformulates view augmentation as a video inpainting task, where observed views are warped into new viewpoints based on monocular depth priors. To achieve this, we train a video inpainting model on unposed web videos with synthetically generated masks that mimic warping occlusions, ensuring spatially and temporally consistent completion of missing regions. To further mitigate inaccuracies in monocular depth priors, we introduce an iterative view augmentation strategy and a robust reconstruction loss. Experiments demonstrate that our method effectively improves monocular 4D scene reconstruction and completion.

Title: Using LLMs as prompt modifier to avoid biases in AI image generators

Authors: René Peinl
Subjects: cs.CL, cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2504.11104
Pdf URL: https://arxiv.org/pdf/2504.11104
Copy Paste: [[2504.11104]] Using LLMs as prompt modifier to avoid biases in AI image generators(https://arxiv.org/abs/2504.11104)
Keywords: diffusion
Abstract: This study examines how Large Language Models (LLMs) can reduce biases in text-to-image generation systems by modifying user prompts. We define bias as a model's unfair deviation from population statistics given neutral prompts. Our experiments with Stable Diffusion XL, 3.5 and Flux demonstrate that LLM-modified prompts significantly increase image diversity and reduce bias without the need to change the image generators themselves. While occasionally producing results that diverge from original user intent for elaborate prompts, this approach generally provides more varied interpretations of underspecified requests rather than superficial variations. The method works particularly well for less advanced image generators, though limitations persist for certain contexts like disability representation. All prompts and generated images are available at this https URL

Title: Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models

Authors: Jiangtao Liu, Zhaoxin Wang, Handing Wang, Cong Tian, Yaochu Jin
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2504.11106
Pdf URL: https://arxiv.org/pdf/2504.11106
Copy Paste: [[2504.11106]] Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models(https://arxiv.org/abs/2504.11106)
Keywords: generative
Abstract: Recent advancements in Text-to-Image (T2I) generation have significantly enhanced the realism and creativity of generated images. However, such powerful generative capabilities pose risks related to the production of inappropriate or harmful content. Existing defense mechanisms, including prompt checkers and post-hoc image checkers, are vulnerable to sophisticated adversarial attacks. In this work, we propose TCBS-Attack, a novel query-based black-box jailbreak attack that searches for tokens located near the decision boundaries defined by text and image checkers. By iteratively optimizing tokens near these boundaries, TCBS-Attack generates semantically coherent adversarial prompts capable of bypassing multiple defensive layers in T2I models. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art jailbreak attacks across various T2I models, including securely trained open-source models and commercial online services like DALL-E 3. TCBS-Attack achieves an ASR-4 of 45% and an ASR-1 of 21% on jailbreaking full-chain T2I models, significantly surpassing baseline methods.

Title: Taming Consistency Distillation for Accelerated Human Image Animation

Authors: Xiang Wang, Shiwei Zhang, Hangjie Yuan, Yujie Wei, Yingya Zhang, Changxin Gao, Yuehuan Wang, Nong Sang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11143
Pdf URL: https://arxiv.org/pdf/2504.11143
Copy Paste: [[2504.11143]] Taming Consistency Distillation for Accelerated Human Image Animation(https://arxiv.org/abs/2504.11143)
Keywords: diffusion
Abstract: Recent advancements in human image animation have been propelled by video diffusion models, yet their reliance on numerous iterative denoising steps results in high inference costs and slow speeds. An intuitive solution involves adopting consistency models, which serve as an effective acceleration paradigm through consistency distillation. However, simply employing this strategy in human image animation often leads to quality decline, including visual blurring, motion degradation, and facial distortion, particularly in dynamic regions. In this paper, we propose the DanceLCM approach, complemented by several enhancements to improve visual quality and motion continuity in the low-step regime: (1) segmented consistency distillation with an auxiliary light-weight head to incorporate supervision from real video latents, mitigating cumulative errors resulting from single full-trajectory generation; (2) a motion-focused loss to centre on motion regions, and explicit injection of facial fidelity features to improve face authenticity. Extensive qualitative and quantitative experiments demonstrate that DanceLCM achieves results comparable to state-of-the-art video diffusion models with a mere 2-4 inference steps, significantly reducing the inference burden without compromising video quality. The code and models will be made publicly available.

Title: SAR-to-RGB Translation with Latent Diffusion for Earth Observation

Authors: Kaan Aydin, Joelle Hanna, Damian Borth
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.11154
Pdf URL: https://arxiv.org/pdf/2504.11154
Copy Paste: [[2504.11154]] SAR-to-RGB Translation with Latent Diffusion for Earth Observation(https://arxiv.org/abs/2504.11154)
Keywords: diffusion
Abstract: Earth observation satellites like Sentinel-1 (S1) and Sentinel-2 (S2) provide complementary remote sensing (RS) data, but S2 images are often unavailable due to cloud cover or data gaps. To address this, we propose a diffusion model (DM)-based approach for SAR-to-RGB translation, generating synthetic optical images from SAR inputs. We explore three different setups: two using Standard Diffusion, which reconstruct S2 images by adding and removing noise (one without and one with class conditioning), and one using Cold Diffusion, which blends S2 with S1 before removing the SAR signal. We evaluate the generated images in downstream tasks, including land cover classification and cloud removal. While generated images may not perfectly replicate real S2 data, they still provide valuable information. Our results show that class conditioning improves classification accuracy, while cloud removal performance remains competitive despite our approach not being optimized for it. Interestingly, despite exhibiting lower perceptual quality, the Cold Diffusion setup performs well in land cover classification, suggesting that traditional quantitative evaluation metrics may not fully reflect the practical utility of generated images. Our findings highlight the potential of DMs for SAR-to-RGB translation in RS applications where RGB images are missing.

Title: TerraMind: Large-Scale Generative Multimodality for Earth Observation

Authors: Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, Rahul Ramachandran, Paolo Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabe-Moreno, Nicolas Longépé
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11171
Pdf URL: https://arxiv.org/pdf/2504.11171
Copy Paste: [[2504.11171]] TerraMind: Large-Scale Generative Multimodality for Earth Observation(https://arxiv.org/abs/2504.11171)
Keywords: foundation model, generative
Abstract: We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.

Title: TerraMesh: A Planetary Mosaic of Multimodal Earth Observation Data

Authors: Benedikt Blumenstiel, Paolo Fraccaro, Valerio Marsocci, Johannes Jakubik, Stefano Maurogiovanni, Mikolaj Czerkawski, Rocco Sedona, Gabriele Cavallaro, Thomas Brunschwiler, Juan Bernabe-Moreno, Nicolas Longépé
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11172
Pdf URL: https://arxiv.org/pdf/2504.11172
Copy Paste: [[2504.11172]] TerraMesh: A Planetary Mosaic of Multimodal Earth Observation Data(https://arxiv.org/abs/2504.11172)
Keywords: foundation model
Abstract: Large-scale foundation models in Earth Observation can learn versatile, label-efficient representations by leveraging massive amounts of unlabeled data. However, existing public datasets are often limited in scale, geographic coverage, or sensor variety. We introduce TerraMesh, a new globally diverse, multimodal dataset combining optical, synthetic aperture radar, elevation, and land-cover modalities in an Analysis-Ready Data format. TerraMesh includes over 9 million samples with eight spatiotemporally aligned modalities, enabling large-scale pre-training and fostering robust cross-modal correlation learning. We provide detailed data processing steps, comprehensive statistics, and empirical evidence demonstrating improved model performance when pre-trained on TerraMesh. The dataset will be made publicly available with a permissive license.

Title: R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning

Authors: Lijun Sheng, Jian Liang, Zilei Wang, Ran He
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2504.11195
Pdf URL: https://arxiv.org/pdf/2504.11195
Copy Paste: [[2504.11195]] R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning(https://arxiv.org/abs/2504.11195)
Keywords: foundation model
Abstract: Vision-language models (VLMs), such as CLIP, have gained significant popularity as foundation models, with numerous fine-tuning methods developed to enhance performance on downstream tasks. However, due to their inherent vulnerability and the common practice of selecting from a limited set of open-source models, VLMs suffer from a higher risk of adversarial attacks than traditional vision models. Existing defense techniques typically rely on adversarial fine-tuning during training, which requires labeled data and lacks flexibility for downstream tasks. To address these limitations, we propose robust test-time prompt tuning (R-TPT), which mitigates the impact of adversarial attacks during the inference stage. We first reformulate the classic marginal entropy objective by eliminating the term that introduces conflicts under adversarial conditions, retaining only the pointwise entropy minimization. Furthermore, we introduce a plug-and-play reliability-based weighted ensembling strategy, which aggregates useful information from reliable augmented views to strengthen the defense. R-TPT enhances defense against adversarial attacks without requiring labeled training data while offering high flexibility for inference tasks. Extensive experiments on widely used benchmarks with various attacks demonstrate the effectiveness of R-TPT. The code is available at this https URL.
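
Illustrative note (not from the paper): the marginal entropy of predictions averaged over augmented views splits exactly into the mean pointwise entropy plus the mean KL divergence of each view from the average. The sketch below checks that identity numerically. The abstract does not name the term R-TPT eliminates; reading it as the divergence term is an assumption here, and the function names and the three example views are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl(p, q):
    """KL divergence KL(p || q) for discrete distributions."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def marginal_entropy_terms(views):
    """Return (marginal entropy, mean pointwise entropy, mean KL to the average)."""
    n = len(views)
    # Average the class probabilities over all augmented views.
    mean = [sum(col) / n for col in zip(*views)]
    marginal = entropy(mean)
    pointwise = sum(entropy(p) for p in views) / n
    divergence = sum(kl(p, mean) for p in views) / n
    return marginal, pointwise, divergence


# Hypothetical class probabilities for three augmented views of one image.
views = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
marginal, pointwise, divergence = marginal_entropy_terms(views)

# The identity: marginal entropy = mean pointwise entropy + mean divergence.
identity_gap = abs(marginal - (pointwise + divergence))
```

Under this reading, minimizing only the pointwise term lowers each view's uncertainty without forcing views to agree with their average.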

Title: Single-Input Multi-Output Model Merging: Leveraging Foundation Models for Dense Multi-Task Learning

Authors: Juan Garcia Giraldo, Nikolaos Dimitriadis, Ke Wang, Pascal Frossard
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11268
Pdf URL: https://arxiv.org/pdf/2504.11268
Copy Paste: [[2504.11268]] Single-Input Multi-Output Model Merging: Leveraging Foundation Models for Dense Multi-Task Learning(https://arxiv.org/abs/2504.11268)
Keywords: foundation model
Abstract: Model merging is a flexible and computationally tractable approach to merge single-task checkpoints into a multi-task model. Prior work has solely focused on constrained multi-task settings where there is a one-to-one mapping between a sample and a task, overlooking the paradigm where multiple tasks may operate on the same sample, e.g., scene understanding. In this paper, we focus on the multi-task setting with single-input-multiple-outputs (SIMO) and show that it qualitatively differs from the single-input-single-output model merging settings studied in the literature due to the existence of task-specific decoders and diverse loss objectives. We identify that existing model merging methods lead to significant performance degradation, primarily due to representation misalignment between the merged encoder and task-specific decoders. We propose two simple and efficient fixes for the SIMO setting to re-align the feature representation after merging. Compared to joint fine-tuning, our approach is computationally effective and flexible, and sheds light on identifying task relationships in an offline manner. Experiments on NYUv2, Cityscapes, and a subset of the Taskonomy dataset demonstrate: (1) task arithmetic suffices to enable multi-task capabilities; however, the representations generated by the merged encoder have to be re-aligned with the task-specific heads; (2) the proposed architecture rivals traditional multi-task learning in performance but requires fewer samples and training steps by leveraging the existence of task-specific models.

Title: UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer

Authors: Xiang Wang, Shiwei Zhang, Longxiang Tang, Yingya Zhang, Changxin Gao, Yuehuan Wang, Nong Sang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11289
Pdf URL: https://arxiv.org/pdf/2504.11289
Copy Paste: [[2504.11289]] UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer(https://arxiv.org/abs/2504.11289)
Keywords: diffusion, generative
Abstract: This report presents UniAnimate-DiT, an advanced project that leverages the cutting-edge and powerful capabilities of the open-source Wan2.1 model for consistent human image animation. Specifically, to preserve the robust generative capabilities of the original Wan2.1 model, we implement the Low-Rank Adaptation (LoRA) technique to fine-tune a minimal set of parameters, significantly reducing training memory overhead. A lightweight pose encoder consisting of multiple stacked 3D convolutional layers is designed to encode motion information of driving poses. Furthermore, we adopt a simple concatenation operation to integrate the reference appearance into the model and incorporate the pose information of the reference image for enhanced pose alignment. Experimental results show that our approach achieves visually appealing and temporally consistent high-fidelity animations. Trained on 480p (832x480) videos, UniAnimate-DiT demonstrates strong generalization capabilities to seamlessly upscale to 720p (1280x720) during inference. The training and inference code is publicly available at this https URL.
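
Illustrative note (not from the report): LoRA freezes a weight matrix of shape d_out x d_in and trains two low-rank factors, B (d_out x r) and A (r x d_in), so the trainable parameter count per adapted matrix drops from d_out * d_in to r * (d_out + d_in). The sketch below does that arithmetic. The layer size and rank are hypothetical placeholders, not values used by UniAnimate-DiT.

```python
def full_param_count(d_out, d_in):
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def lora_param_count(d_out, d_in, rank):
    """Trainable parameters for LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 16

full = full_param_count(d_out, d_in)
lora = lora_param_count(d_out, d_in, rank)
fraction = lora / full  # share of the full matrix that LoRA trains

print(f"full fine-tuning: {full} parameters")
print(f"LoRA (rank {rank}): {lora} parameters ({fraction:.2%} of full)")
```

Because optimizer state is kept only for trainable parameters, this reduction is one reason low-rank adaptation lowers training memory.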

Title: Autoregressive Distillation of Diffusion Transformers

Authors: Yeongmin Kim, Sotiris Anagnostidis, Yuming Du, Edgar Schönfeld, Jonas Kohler, Markos Georgopoulos, Albert Pumarola, Ali Thabet, Artsiom Sanakoyeu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11295
Pdf URL: https://arxiv.org/pdf/2504.11295
Copy Paste: [[2504.11295]] Autoregressive Distillation of Diffusion Transformers(https://arxiv.org/abs/2504.11295)
Keywords: diffusion
Abstract: Diffusion models with transformer architectures have demonstrated promising capabilities in generating high-fidelity images and scalability for high resolution. However, the iterative sampling process required for synthesis is very resource-intensive. A line of work has focused on distilling solutions to probability flow ODEs into few-step student models. Nevertheless, existing methods have been limited by their reliance on the most recent denoised samples as input, rendering them susceptible to exposure bias. To address this limitation, we propose AutoRegressive Distillation (ARD), a novel approach that leverages the historical trajectory of the ODE to predict future steps. ARD offers two key benefits: 1) it mitigates exposure bias by utilizing a predicted historical trajectory that is less susceptible to accumulated errors, and 2) it leverages the previous history of the ODE trajectory as a more effective source of coarse-grained information. ARD modifies the teacher transformer architecture by adding token-wise time embedding to mark each input from the trajectory history and employs a block-wise causal attention mask for training. Furthermore, incorporating historical inputs only in lower transformer layers enhances performance and efficiency. We validate the effectiveness of ARD in class-conditioned generation on ImageNet and T2I synthesis. Our model achieves a 5x reduction in FID degradation compared to the baseline methods while requiring only 1.1% extra FLOPs on ImageNet-256. Moreover, ARD reaches an FID of 1.84 on ImageNet-256 in merely 4 steps and outperforms the publicly available 1024p text-to-image distilled models in prompt adherence score with a minimal drop in FID compared to the teacher. Project page: this https URL.

Title: Seedream 3.0 Technical Report

Authors: Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xuanda Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, Weilin Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11346
Pdf URL: https://arxiv.org/pdf/2504.11346
Copy Paste: [[2504.11346]] Seedream 3.0 Technical Report(https://arxiv.org/abs/2504.11346)
Keywords: foundation model
Abstract: We present Seedream 3.0, a high-performance Chinese-English bilingual image generation foundation model. We develop several technical improvements to address existing challenges in Seedream 2.0, including alignment with complicated prompts, fine-grained typography generation, suboptimal visual aesthetics and fidelity, and limited image resolutions. Specifically, the advancements of Seedream 3.0 stem from improvements across the entire pipeline, from data construction to model deployment. At the data stratum, we double the dataset using a defect-aware training paradigm and a dual-axis collaborative data-sampling framework. Furthermore, we adopt several effective techniques such as mixed-resolution training, cross-modality RoPE, representation alignment loss, and resolution-aware timestep sampling in the pre-training phase. During the post-training stage, we utilize diversified aesthetic captions in SFT, and a VLM-based reward model with scaling, thereby achieving outputs that align well with human preferences. Furthermore, Seedream 3.0 pioneers a novel acceleration paradigm. By employing consistent noise expectation and importance-aware timestep sampling, we achieve a 4 to 8 times speedup while maintaining image quality. Seedream 3.0 demonstrates significant improvements over Seedream 2.0: it enhances overall capabilities, in particular for text rendering of complicated Chinese characters, which is important to professional typography generation. In addition, it provides native high-resolution output (up to 2K), allowing it to generate images with high visual quality.

Title: DeepWheel: Generating a 3D Synthetic Wheel Dataset for Design and Performance Evaluation

Authors: Soyoung Yoo, Namwoo Kang
Subjects: cs.CV, physics.app-ph
Abstract URL: https://arxiv.org/abs/2504.11347
Pdf URL: https://arxiv.org/pdf/2504.11347
Copy Paste: [[2504.11347]] DeepWheel: Generating a 3D Synthetic Wheel Dataset for Design and Performance Evaluation(https://arxiv.org/abs/2504.11347)
Keywords: diffusion, generative
Abstract: Data-driven design is emerging as a powerful strategy to accelerate engineering innovation. However, its application to vehicle wheel design remains limited due to the lack of large-scale, high-quality datasets that include 3D geometry and physical performance metrics. To address this gap, this study proposes a synthetic design-performance dataset generation framework using generative AI. The proposed framework first generates 2D rendered images using Stable Diffusion, and then reconstructs the 3D geometry through 2.5D depth estimation. Structural simulations are subsequently performed to extract engineering performance data. To further expand the design and performance space, topology optimization is applied, enabling the generation of a more diverse set of wheel designs. The final dataset, named DeepWheel, consists of over 6,000 photo-realistic images and 900 structurally analyzed 3D models. This multi-modal dataset serves as a valuable resource for surrogate model training, data-driven inverse design, and design space exploration. The proposed methodology is also applicable to other complex design domains. The dataset is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license and is available at this https URL

Title: OpenTuringBench: An Open-Model-based Benchmark and Framework for Machine-Generated Text Detection and Attribution

Authors: Lucio La Cava, Andrea Tagarelli
Subjects: cs.CL, cs.AI, cs.CY, cs.HC, physics.soc-ph
Abstract URL: https://arxiv.org/abs/2504.11369
Pdf URL: https://arxiv.org/pdf/2504.11369
Copy Paste: [[2504.11369]] OpenTuringBench: An Open-Model-based Benchmark and Framework for Machine-Generated Text Detection and Attribution(https://arxiv.org/abs/2504.11369)
Keywords: generative
Abstract: Open Large Language Models (OLLMs) are increasingly leveraged in generative AI applications, posing new challenges for detecting their outputs. We propose OpenTuringBench, a new benchmark based on OLLMs, designed to train and evaluate machine-generated text detectors on the Turing Test and Authorship Attribution problems. OpenTuringBench focuses on a representative set of OLLMs, and features a number of challenging evaluation tasks, including human/machine-manipulated texts, out-of-domain texts, and texts from previously unseen models. We also provide OTBDetector, a contrastive learning framework to detect and attribute OLLM-based machine-generated texts. Results highlight the relevance and varying degrees of difficulty of the OpenTuringBench tasks, with our detector achieving remarkable capabilities across the various tasks and outperforming most existing detectors. Resources are available on the OpenTuringBench Hugging Face repository at this https URL

Title: ADT: Tuning Diffusion Models with Adversarial Supervision

Authors: Dazhong Shen, Guanglu Song, Yi Zhang, Bingqi Ma, Lujundong Li, Dongzhi Jiang, Zhuofan Zong, Yu Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11423
Pdf URL: https://arxiv.org/pdf/2504.11423
Copy Paste: [[2504.11423]] ADT: Tuning Diffusion Models with Adversarial Supervision(https://arxiv.org/abs/2504.11423)
Keywords: diffusion
Abstract: Diffusion models have achieved outstanding image generation by reversing a forward noising process to approximate true data distributions. During training, these models predict diffusion scores from noised versions of true samples in a single forward pass, while inference requires iterative denoising starting from white noise. This training-inference divergence hinders the alignment between inference and training data distributions, due to potential prediction biases and error accumulation. To address this problem, we propose an intuitive but effective fine-tuning framework, called Adversarial Diffusion Tuning (ADT), by simulating the inference process during optimization and aligning the final outputs with training data by adversarial supervision. Specifically, to achieve robust adversarial training, ADT features a siamese-network discriminator with a fixed pre-trained backbone and lightweight trainable parameters, incorporates an image-to-image sampling strategy to smooth discriminative difficulties, and preserves the original diffusion loss to prevent discriminator hacking. In addition, we carefully constrain the backward-flowing path for back-propagating gradients along the inference path without incurring memory overload or gradient explosion. Finally, extensive experiments on Stable Diffusion models (v1.5, XL, and v3) demonstrate that ADT significantly improves both distribution alignment and image quality.

Title: NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors

Authors: Yanrui Bin, Wenbo Hu, Haoyuan Wang, Xinya Chen, Bing Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11427
Pdf URL: https://arxiv.org/pdf/2504.11427
Copy Paste: [[2504.11427]] NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors(https://arxiv.org/abs/2504.11427)
Keywords: diffusion
Abstract: Surface normal estimation serves as a cornerstone for a spectrum of computer vision applications. While numerous efforts have been devoted to static image scenarios, ensuring temporal coherence in video-based normal estimation remains a formidable challenge. Instead of merely augmenting existing methods with temporal components, we present NormalCrafter to leverage the inherent temporal priors of video diffusion models. To secure high-fidelity normal estimation across sequences, we propose Semantic Feature Regularization (SFR), which aligns diffusion features with semantic cues, encouraging the model to concentrate on the intrinsic semantics of the scene. Moreover, we introduce a two-stage training protocol that leverages both latent and pixel space learning to preserve spatial accuracy while maintaining long temporal context. Extensive evaluations demonstrate the efficacy of our method, showcasing a superior performance in generating temporally consistent normal sequences with intricate details from diverse videos.

Title: Diffusion Distillation With Direct Preference Optimization For Efficient 3D LiDAR Scene Completion

Authors: An Zhao, Shengyuan Zhang, Ling Yang, Zejian Li, Jiale Wu, Haoran Xu, AnYang Wei, Perry Pengyun GU, Lingyun Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11447
Pdf URL: https://arxiv.org/pdf/2504.11447
Copy Paste: [[2504.11447]] Diffusion Distillation With Direct Preference Optimization For Efficient 3D LiDAR Scene Completion(https://arxiv.org/abs/2504.11447)
Keywords: diffusion
Abstract: The application of diffusion models in 3D LiDAR scene completion is limited due to diffusion's slow sampling speed. Score distillation accelerates diffusion sampling but with performance degradation, while post-training with direct preference optimization (DPO) boosts performance using preference data. This paper proposes Distillation-DPO, a novel diffusion distillation framework for LiDAR scene completion with preference alignment. First, the student model generates paired completion scenes with different initial noises. Second, using LiDAR scene evaluation metrics as preference, we construct winning and losing sample pairs. Such construction is reasonable, since most LiDAR scene metrics are informative but non-differentiable and so cannot be optimized directly. Third, Distillation-DPO optimizes the student model by exploiting the difference in score functions between the teacher and student models on the paired completion scenes. This procedure is repeated until convergence. Extensive experiments demonstrate that, compared to state-of-the-art LiDAR scene completion diffusion models, Distillation-DPO achieves higher-quality scene completion while accelerating the completion speed by more than 5-fold. To the best of our knowledge, our method is the first to explore adopting preference learning in distillation, and it provides insights into preference-aligned distillation. Our code is publicly available at this https URL.

Title: Elucidating the Design Space of Multimodal Protein Language Models

Authors: Cheng-Yen (Wesley) Hsieh, Xinyou Wang, Daiheng Zhang, Dongyu Xue, Fei Ye, Shujian Huang, Zaixiang Zheng, Quanquan Gu
Subjects: cs.LG, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2504.11454
Pdf URL: https://arxiv.org/pdf/2504.11454
Copy Paste: [[2504.11454]] Elucidating the Design Space of Multimodal Protein Language Models(https://arxiv.org/abs/2504.11454)
Keywords: generative
Abstract: Multimodal protein language models (PLMs) integrate sequence and token-based structural information, serving as a powerful foundation for protein modeling, generation, and design. However, the reliance on tokenizing 3D structures into discrete tokens causes substantial loss of fidelity about fine-grained structural details and correlations. In this paper, we systematically elucidate the design space of multimodal PLMs to overcome their limitations. We identify tokenization loss and inaccurate structure token predictions by the PLMs as major bottlenecks. To address these, our proposed design space covers improved generative modeling, structure-aware architectures and representation learning, and data exploration. Our advancements approach finer-grained supervision, demonstrating that token-based multimodal PLMs can achieve robust structural modeling. The effective design methods dramatically improve the structure generation diversity and, notably, the folding abilities of our 650M model, reducing the RMSD from 5.52 to 2.36 on the PDB test set, even outperforming 3B baselines and performing on par with specialized folding models.

Title: Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception

Authors: Ziqi Pang, Xin Xu, Yu-Xiong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11457
Pdf URL: https://arxiv.org/pdf/2504.11457
Copy Paste: [[2504.11457]] Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception(https://arxiv.org/abs/2504.11457)
Keywords: diffusion, generative
Abstract: With the success of image generation, generative diffusion models are increasingly adopted for discriminative tasks, as pixel generation provides a unified perception interface. However, directly repurposing the generative denoising process for discriminative objectives reveals critical gaps rarely addressed previously. Generative models tolerate intermediate sampling errors if the final distribution remains plausible, but discriminative tasks require rigorous accuracy throughout, as evidenced in challenging multi-modal tasks like referring image segmentation. Motivated by this gap, we analyze and enhance alignment between generative diffusion processes and perception tasks, focusing on how perception quality evolves during denoising. We find: (1) earlier denoising steps contribute disproportionately to perception quality, prompting us to propose tailored learning objectives reflecting varying timestep contributions; (2) later denoising steps show unexpected perception degradation, highlighting sensitivity to training-denoising distribution shifts, addressed by our diffusion-tailored data augmentation; and (3) generative processes uniquely enable interactivity, serving as controllable user interfaces adaptable to correctional prompts in multi-round interactions. Our insights significantly improve diffusion-based perception models without architectural changes, achieving state-of-the-art performance on depth estimation, referring image segmentation, and generalist perception tasks. Code available at this https URL.