2025-04-15

Title: Embedding Hidden Adversarial Capabilities in Pre-Trained Diffusion Models

Authors: Lucas Beerens, Desmond J. Higham
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2504.08782
Pdf URL: https://arxiv.org/pdf/2504.08782
Copy Paste: [[2504.08782]] Embedding Hidden Adversarial Capabilities in Pre-Trained Diffusion Models(https://arxiv.org/abs/2504.08782)
Keywords: diffusion, generative
Abstract: We introduce a new attack paradigm that embeds hidden adversarial capabilities directly into diffusion models via fine-tuning, without altering their observable behavior or requiring modifications during inference. Unlike prior approaches that target specific images or adjust the generation process to produce adversarial outputs, our method integrates adversarial functionality into the model itself. The resulting tampered model generates high-quality images indistinguishable from those of the original, yet these images cause misclassification in downstream classifiers at a high rate. The misclassification can be targeted to specific output classes. Users can employ this compromised model unaware of its embedded adversarial nature, as it functions identically to a standard diffusion model. We demonstrate the effectiveness and stealthiness of our approach, uncovering a covert attack vector that raises new security concerns. These findings expose a risk arising from the use of externally-supplied models and highlight the urgent need for robust model verification and defense mechanisms against hidden threats in generative models. The code is available at this https URL .

Title: Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation

Authors: Toqeer Ehsan, Thamar Solorio
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2504.08792
Pdf URL: https://arxiv.org/pdf/2504.08792
Copy Paste: [[2504.08792]] Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation(https://arxiv.org/abs/2504.08792)
Keywords: generative
Abstract: Named Entity Recognition (NER), a fundamental task in Natural Language Processing (NLP), has shown significant advancements for high-resource languages. However, due to a lack of annotated datasets and limited representation in Pre-trained Language Models (PLMs), it remains understudied and challenging for low-resource languages. To address these challenges, we propose a data augmentation technique that generates culturally plausible sentences and experiments on four low-resource Pakistani languages; Urdu, Shahmukhi, Sindhi, and Pashto. By fine-tuning multilingual masked Large Language Models (LLMs), our approach demonstrates significant improvements in NER performance for Shahmukhi and Pashto. We further explore the capability of generative LLMs for NER and data augmentation using few-shot learning.

Title: InfoGain Wavelets: Furthering the Design of Diffusion Wavelets for Graph-Structured Data

Authors: David R. Johnson, Smita Krishnaswamy, Michael Perlmutter
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.08802
Pdf URL: https://arxiv.org/pdf/2504.08802
Copy Paste: [[2504.08802]] InfoGain Wavelets: Furthering the Design of Diffusion Wavelets for Graph-Structured Data(https://arxiv.org/abs/2504.08802)
Keywords: diffusion
Abstract: Diffusion wavelets extract information from graph signals at different scales of resolution by utilizing graph diffusion operators raised to various powers, known as diffusion scales. Traditionally, the diffusion scales are chosen to be dyadic integers, $\mathbf{2^j}$. Here, we propose a novel, unsupervised method for selecting the diffusion scales based on ideas from information theory. We then show that our method can be incorporated into wavelet-based GNNs via graph classification experiments.

Title: Generative AI in Live Operations: Evidence of Productivity Gains in Cybersecurity and Endpoint Management

Authors: James Bono, Justin Grana, Kleanthis Karakolios, Pruthvi Hanumanthapura Ramakrishna, Ankit Srivastava
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2504.08805
Pdf URL: https://arxiv.org/pdf/2504.08805
Copy Paste: [[2504.08805]] Generative AI in Live Operations: Evidence of Productivity Gains in Cybersecurity and Endpoint Management(https://arxiv.org/abs/2504.08805)
Keywords: generative
Abstract: We measure the association between generative AI (GAI) tool adoption and four metrics spanning security operations, information protection, and endpoint management: 1) number of security alerts per incident, 2) probability of security incident reopenings, 3) time to classify a data loss prevention alert, and 4) time to resolve device policy conflicts. We find that GAI is associated with robust and statistically and practically significant improvements in the four metrics. Although unobserved confounders inhibit causal identification, these results are among the first to use observational data from live operations to investigate the relationship between GAI adoption and security operations, data loss prevention, and device policy management.

Title: Mechanistic Anomaly Detection for "Quirky" Language Models

Authors: David O. Johnston, Arkajyoti Chakraborty, Nora Belrose
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2504.08812
Pdf URL: https://arxiv.org/pdf/2504.08812
Copy Paste: [[2504.08812]] Mechanistic Anomaly Detection for "Quirky" Language Models(https://arxiv.org/abs/2504.08812)
Keywords: anomaly
Abstract: As LLMs grow in capability, the task of supervising LLMs becomes more challenging. Supervision failures can occur if LLMs are sensitive to factors that supervisors are unaware of. We investigate Mechanistic Anomaly Detection (MAD) as a technique to augment supervision of capable models; we use internal model features to identify anomalous training signals so they can be investigated or discarded. We train detectors to flag points from the test environment that differ substantially from the training environment, and experiment with a large variety of detector features and scoring rules to detect anomalies in a set of ``quirky'' language models. We find that detectors can achieve high discrimination on some tasks, but no detector is effective across all models and tasks. MAD techniques may be effective in low-stakes applications, but advances in both detection and evaluation are likely needed if they are to be used in high stakes settings.

Title: Probabilistic QoS Metric Forecasting in Delay-Tolerant Networks Using Conditional Diffusion Models on Latent Dynamics

Authors: Enming Zhang, Zheng Liu, Yu Xiang, Yanwen Qu
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.08821
Pdf URL: https://arxiv.org/pdf/2504.08821
Copy Paste: [[2504.08821]] Probabilistic QoS Metric Forecasting in Delay-Tolerant Networks Using Conditional Diffusion Models on Latent Dynamics(https://arxiv.org/abs/2504.08821)
Keywords: diffusion
Abstract: Active QoS metric prediction, commonly employed in the maintenance and operation of DTN, could enhance network performance regarding latency, throughput, energy consumption, and dependability. Naturally formulated as a multivariate time series forecasting problem, it attracts substantial research efforts. Traditional mean regression methods for time series forecasting cannot capture the data complexity adequately, resulting in deteriorated performance in operational tasks in DTNs such as routing. This paper formulates the prediction of QoS metrics in DTN as a probabilistic forecasting problem on multivariate time series, where one could quantify the uncertainty of forecasts by characterizing the distribution of these samples. The proposed approach hires diffusion models and incorporates the latent temporal dynamics of non-stationary and multi-mode data into them. Extensive experiments demonstrate the efficacy of the proposed approach by showing that it outperforms the popular probabilistic time series forecasting methods.

Title: PatchTrAD: A Patch-Based Transformer focusing on Patch-Wise Reconstruction Error for Time Series Anomaly Detection

Authors: Samy-Melwan Vilhes (LITIS), Gilles Gasso (LITIS), Mokhtar Z Alaya (LMAC)
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08827
Pdf URL: https://arxiv.org/pdf/2504.08827
Copy Paste: [[2504.08827]] PatchTrAD: A Patch-Based Transformer focusing on Patch-Wise Reconstruction Error for Time Series Anomaly Detection(https://arxiv.org/abs/2504.08827)
Keywords: anomaly
Abstract: Time series anomaly detection (TSAD) focuses on identifying whether observations in streaming data deviate significantly from normal patterns. With the prevalence of connected devices, anomaly detection on time series has become paramount, as it enables real-time monitoring and early detection of irregular behaviors across various application domains. In this work, we introduce PatchTrAD, a Patch-based Transformer model for time series anomaly detection. Our approach leverages a Transformer encoder along with the use of patches under a reconstructionbased framework for anomaly detection. Empirical evaluations on multiple benchmark datasets show that PatchTrAD is on par, in terms of detection performance, with state-of-the-art deep learning models for anomaly detection while being time efficient during inference.

Title: Datum-wise Transformer for Synthetic Tabular Data Detection in the Wild

Authors: G. Charbel N. Kindji (IRISA, MALT), Elisa Fromont (MALT, IRISA), Lina Maria Rojas-Barahona, Tanguy Urvoy
Subjects: cs.LG, cs.AI, cs.DB, cs.NE
Abstract URL: https://arxiv.org/abs/2504.08829
Pdf URL: https://arxiv.org/pdf/2504.08829
Copy Paste: [[2504.08829]] Datum-wise Transformer for Synthetic Tabular Data Detection in the Wild(https://arxiv.org/abs/2504.08829)
Keywords: generative
Abstract: The growing power of generative models raises major concerns about the authenticity of published content. To address this problem, several synthetic content detection methods have been proposed for uniformly structured media such as image or text. However, little work has been done on the detection of synthetic tabular data, despite its importance in industry and government. This form of data is complex to handle due to the diversity of its structures: the number and types of the columns may vary wildly from one table to another. We tackle the tough problem of detecting synthetic tabular data ''in the wild'', i.e. when the model is deployed on table structures it has never seen before. We introduce a novel datum-wise transformer architecture and show that it outperforms existing models. Furthermore, we investigate the application of domain adaptation techniques to enhance the effectiveness of our model, thereby providing a more robust data-forgery detection solution.

Title: Mimic In-Context Learning for Multimodal Tasks

Authors: Yuchu Jiang, Jiale Fu, Chenduo Hao, Xinting Hu, Yingzhe Peng, Xin Geng, Xu Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08851
Pdf URL: https://arxiv.org/pdf/2504.08851
Copy Paste: [[2504.08851]] Mimic In-Context Learning for Multimodal Tasks(https://arxiv.org/abs/2504.08851)
Keywords: in-context
Abstract: Recently, In-context Learning (ICL) has become a significant inference paradigm in Large Multimodal Models (LMMs), utilizing a few in-context demonstrations (ICDs) to prompt LMMs for new tasks. However, the synergistic effects in multimodal data increase the sensitivity of ICL performance to the configurations of ICDs, stimulating the need for a more stable and general mapping function. Mathematically, in Transformer-based models, ICDs act as ``shift vectors'' added to the hidden states of query tokens. Inspired by this, we introduce Mimic In-Context Learning (MimIC) to learn stable and generalizable shift effects from ICDs. Specifically, compared with some previous shift vector-based methods, MimIC more strictly approximates the shift effects by integrating lightweight learnable modules into LMMs with four key enhancements: 1) inserting shift vectors after attention layers, 2) assigning a shift vector to each attention head, 3) making shift magnitude query-dependent, and 4) employing a layer-wise alignment loss. Extensive experiments on two LMMs (Idefics-9b and Idefics2-8b-base) across three multimodal tasks (VQAv2, OK-VQA, Captioning) demonstrate that MimIC outperforms existing shift vector-based methods. The code is available at this https URL.

Title: Knowledge Graph-extended Retrieval Augmented Generation for Question Answering

Authors: Jasper Linders, Jakub M. Tomczak
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.08893
Pdf URL: https://arxiv.org/pdf/2504.08893
Copy Paste: [[2504.08893]] Knowledge Graph-extended Retrieval Augmented Generation for Question Answering(https://arxiv.org/abs/2504.08893)
Keywords: in-context
Abstract: Large Language Models (LLMs) and Knowledge Graphs (KGs) offer a promising approach to robust and explainable Question Answering (QA). While LLMs excel at natural language understanding, they suffer from knowledge gaps and hallucinations. KGs provide structured knowledge but lack natural language interaction. Ideally, an AI system should be both robust to missing facts as well as easy to communicate with. This paper proposes such a system that integrates LLMs and KGs without requiring training, ensuring adaptability across different KGs with minimal human effort. The resulting approach can be classified as a specific form of a Retrieval Augmented Generation (RAG) with a KG, thus, it is dubbed Knowledge Graph-extended Retrieval Augmented Generation (KG-RAG). It includes a question decomposition module to enhance multi-hop information retrieval and answer explainability. Using In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting, it generates explicit reasoning chains processed separately to improve truthfulness. Experiments on the MetaQA benchmark show increased accuracy for multi-hop questions, though with a slight trade-off in single-hop performance compared to LLM with KG baselines. These findings demonstrate KG-RAG's potential to improve transparency in QA by bridging unstructured language understanding with structured knowledge retrieval.

Title: Position: Beyond Euclidean -- Foundation Models Should Embrace Non-Euclidean Geometries

Authors: Neil He, Jiahong Liu, Buze Zhang, Ngoc Bui, Ali Maatouk, Menglin Yang, Irwin King, Melanie Weber, Rex Ying
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08896
Pdf URL: https://arxiv.org/pdf/2504.08896
Copy Paste: [[2504.08896]] Position: Beyond Euclidean -- Foundation Models Should Embrace Non-Euclidean Geometries(https://arxiv.org/abs/2504.08896)
Keywords: foundation model
Abstract: In the era of foundation models and Large Language Models (LLMs), Euclidean space has been the de facto geometric setting for machine learning architectures. However, recent literature has demonstrated that this choice comes with fundamental limitations. At a large scale, real-world data often exhibit inherently non-Euclidean structures, such as multi-way relationships, hierarchies, symmetries, and non-isotropic scaling, in a variety of domains, such as languages, vision, and the natural sciences. It is challenging to effectively capture these structures within the constraints of Euclidean spaces. This position paper argues that moving beyond Euclidean geometry is not merely an optional enhancement but a necessity to maintain the scaling law for the next-generation of foundation models. By adopting these geometries, foundation models could more efficiently leverage the aforementioned structures. Task-aware adaptability that dynamically reconfigures embeddings to match the geometry of downstream applications could further enhance efficiency and expressivity. Our position is supported by a series of theoretical and empirical investigations of prevalent foundation this http URL, we outline a roadmap for integrating non-Euclidean geometries into foundation models, including strategies for building geometric foundation models via fine-tuning, training from scratch, and hybrid approaches.

Title: LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping

Authors: Pascal Chang, Sergio Sancho, Jingwei Tang, Markus Gross, Vinicius C. Azevedo
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.08902
Pdf URL: https://arxiv.org/pdf/2504.08902
Copy Paste: [[2504.08902]] LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping(https://arxiv.org/abs/2504.08902)
Keywords: generative
Abstract: Anamorphosis refers to a category of images that are intentionally distorted, making them unrecognizable when viewed directly. Their true form only reveals itself when seen from a specific viewpoint, which can be through some catadioptric device like a mirror or a lens. While the construction of these mathematical devices can be traced back to as early as the 17th century, they are only interpretable when viewed from a specific vantage point and tend to lose meaning when seen normally. In this paper, we revisit these famous optical illusions with a generative twist. With the help of latent rectified flow models, we propose a method to create anamorphic images that still retain a valid interpretation when viewed directly. To this end, we introduce Laplacian Pyramid Warping, a frequency-aware image warping technique key to generating high-quality visuals. Our work extends Visual Anagrams (arXiv:2311.17919) to latent space models and to a wider range of spatial transforms, enabling the creation of novel generative perceptual illusions.

Title: Robust SAM: On the Adversarial Robustness of Vision Foundation Models

Authors: Jiahuan Long, Zhengqin Xu, Tingsong Jiang, Wen Yao, Shuai Jia, Chao Ma, Xiaoqian Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08906
Pdf URL: https://arxiv.org/pdf/2504.08906
Copy Paste: [[2504.08906]] Robust SAM: On the Adversarial Robustness of Vision Foundation Models(https://arxiv.org/abs/2504.08906)
Keywords: foundation model
Abstract: The Segment Anything Model (SAM) is a widely used vision foundation model with diverse applications, including image segmentation, detection, and tracking. Given SAM's wide applications, understanding its robustness against adversarial attacks is crucial for real-world deployment. However, research on SAM's robustness is still in its early stages. Existing attacks often overlook the role of prompts in evaluating SAM's robustness, and there has been insufficient exploration of defense methods to balance the robustness and accuracy. To address these gaps, this paper proposes an adversarial robustness framework designed to evaluate and enhance the robustness of SAM. Specifically, we introduce a cross-prompt attack method to enhance the attack transferability across different prompt types. Besides attacking, we propose a few-parameter adaptation strategy to defend SAM against various adversarial attacks. To balance robustness and accuracy, we use the singular value decomposition (SVD) to constrain the space of trainable parameters, where only singular values are adaptable. Experiments demonstrate that our cross-prompt attack method outperforms previous approaches in terms of attack success rate on both SAM and SAM 2. By adapting only 512 parameters, we achieve at least a 15\% improvement in mean intersection over union (mIoU) against various adversarial attacks. Compared to previous defense methods, our approach enhances the robustness of SAM while maximally maintaining its original performance.

Title: HyperCore: The Core Framework for Building Hyperbolic Foundation Models with Comprehensive Modules

Authors: Neil He, Menglin Yang, Rex Ying
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08912
Pdf URL: https://arxiv.org/pdf/2504.08912
Copy Paste: [[2504.08912]] HyperCore: The Core Framework for Building Hyperbolic Foundation Models with Comprehensive Modules(https://arxiv.org/abs/2504.08912)
Keywords: foundation model
Abstract: Hyperbolic neural networks have emerged as a powerful tool for modeling hierarchical data across diverse modalities. Recent studies show that token distributions in foundation models exhibit scale-free properties, suggesting that hyperbolic space is a more suitable ambient space than Euclidean space for many pre-training and downstream tasks. However, existing tools lack essential components for building hyperbolic foundation models, making it difficult to leverage recent advancements. We introduce HyperCore, a comprehensive open-source framework that provides core modules for constructing hyperbolic foundation models across multiple modalities. HyperCore's modules can be effortlessly combined to develop novel hyperbolic foundation models, eliminating the need to extensively modify Euclidean modules from scratch and possible redundant research efforts. To demonstrate its versatility, we build and test the first fully hyperbolic vision transformers (LViT) with a fine-tuning pipeline, the first fully hyperbolic multimodal CLIP model (L-CLIP), and a hybrid Graph RAG with a hyperbolic graph encoder. Our experiments demonstrate that LViT outperforms its Euclidean counterpart. Additionally, we benchmark and reproduce experiments across hyperbolic GNNs, CNNs, Transformers, and vision Transformers to highlight HyperCore's advantages.

Title: Parameter-Free Fine-tuning via Redundancy Elimination for Vision Foundation Models

Authors: Jiahuan Long, Tingsong Jiang, Wen Yao, Yizhe Xiong, Zhengqin Xu, Shuai Jia, Chao Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08915
Pdf URL: https://arxiv.org/pdf/2504.08915
Copy Paste: [[2504.08915]] Parameter-Free Fine-tuning via Redundancy Elimination for Vision Foundation Models(https://arxiv.org/abs/2504.08915)
Keywords: foundation model
Abstract: Vision foundation models (VFMs) are large pre-trained models that form the backbone of various vision tasks. Fine-tuning VFMs can further unlock their potential for downstream tasks or scenarios. However, VFMs often contain significant feature redundancy, which may limit their adaptability to new tasks. In this paper, we investigate the redundancies in the segment anything model (SAM) and then propose a parameter-free fine-tuning method to address this issue. Unlike traditional fine-tuning methods that adjust parameters, our method emphasizes selecting, reusing, and enhancing pre-trained features, offering a new perspective on model fine-tuning. Specifically, we introduce a channel selection algorithm based on the model's output difference to identify redundant and effective channels. By selectively replacing the redundant channels with more effective ones, we filter out less useful features and reuse the more relevant features to downstream tasks, thereby enhancing the task-specific feature representation. Experiments on both out-of-domain and in-domain datasets demonstrate the efficiency and effectiveness of our method. Notably, our approach can seamlessly integrate with existing fine-tuning strategies (e.g., LoRA, Adapter), further boosting the performance of already fine-tuned models. Moreover, since our channel selection involves only model inference, our method significantly reduces computational and GPU memory overhead.

Title: Long Context In-Context Compression by Getting to the Gist of Gisting

Authors: Aleksandar Petrov, Mark Sandler, Andrey Zhmoginov, Nolan Miller, Max Vladymyrov
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.08934
Pdf URL: https://arxiv.org/pdf/2504.08934
Copy Paste: [[2504.08934]] Long Context In-Context Compression by Getting to the Gist of Gisting(https://arxiv.org/abs/2504.08934)
Keywords: in-context
Abstract: Long context processing is critical for the adoption of LLMs, but existing methods often introduce architectural complexity that hinders their practical adoption. Gisting, an in-context compression method with no architectural modification to the decoder transformer, is a promising approach due to its simplicity and compatibility with existing frameworks. While effective for short instructions, we demonstrate that gisting struggles with longer contexts, with significant performance drops even at minimal compression rates. Surprisingly, a simple average pooling baseline consistently outperforms gisting. We analyze the limitations of gisting, including information flow interruptions, capacity limitations and the inability to restrict its attention to subsets of the context. Motivated by theoretical insights into the performance gap between gisting and average pooling, and supported by extensive experimentation, we propose GistPool, a new in-context compression method. GistPool preserves the simplicity of gisting, while significantly boosting its performance on long context compression tasks.

Title: MotionDreamer: One-to-Many Motion Synthesis with Localized Generative Masked Transformer

Authors: Yilin Wang, Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Xinxin Zuo, Juwei Lu, Hai Jiang, Li Cheng
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2504.08959
Pdf URL: https://arxiv.org/pdf/2504.08959
Copy Paste: [[2504.08959]] MotionDreamer: One-to-Many Motion Synthesis with Localized Generative Masked Transformer(https://arxiv.org/abs/2504.08959)
Keywords: diffusion, generative
Abstract: Generative masked transformers have demonstrated remarkable success across various content generation tasks, primarily due to their ability to effectively model large-scale dataset distributions with high consistency. However, in the animation domain, large datasets are not always available. Applying generative masked modeling to generate diverse instances from a single MoCap reference may lead to overfitting, a challenge that remains unexplored. In this work, we present MotionDreamer, a localized masked modeling paradigm designed to learn internal motion patterns from a given motion with arbitrary topology and duration. By embedding the given motion into quantized tokens with a novel distribution regularization method, MotionDreamer constructs a robust and informative codebook for local motion patterns. Moreover, a sliding window local attention is introduced in our masked transformer, enabling the generation of natural yet diverse animations that closely resemble the reference motion patterns. As demonstrated through comprehensive experiments, MotionDreamer outperforms the state-of-the-art methods that are typically GAN or Diffusion-based in both faithfulness and diversity. Thanks to the consistency and robustness of the quantization-based approach, MotionDreamer can also effectively perform downstream tasks such as temporal motion editing, \textcolor{update}{crowd animation}, and beat-aligned dance generation, all using a single reference motion. Visit our project page: this https URL

Title: Sculpting Memory: Multi-Concept Forgetting in Diffusion Models via Dynamic Mask and Concept-Aware Optimization

Authors: Gen Li, Yang Xiao, Jie Ji, Kaiyuan Deng, Bo Hui, Linke Guo, Xiaolong Ma
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09039
Pdf URL: https://arxiv.org/pdf/2504.09039
Copy Paste: [[2504.09039]] Sculpting Memory: Multi-Concept Forgetting in Diffusion Models via Dynamic Mask and Concept-Aware Optimization(https://arxiv.org/abs/2504.09039)
Keywords: diffusion, generative
Abstract: Text-to-image (T2I) diffusion models have achieved remarkable success in generating high-quality images from textual prompts. However, their ability to store vast amounts of knowledge raises concerns in scenarios where selective forgetting is necessary, such as removing copyrighted content, reducing biases, or eliminating harmful concepts. While existing unlearning methods can remove certain concepts, they struggle with multi-concept forgetting due to instability, residual knowledge persistence, and generation quality degradation. To address these challenges, we propose \textbf{Dynamic Mask coupled with Concept-Aware Loss}, a novel unlearning framework designed for multi-concept forgetting in diffusion models. Our \textbf{Dynamic Mask} mechanism adaptively updates gradient masks based on current optimization states, allowing selective weight modifications that prevent interference with unrelated knowledge. Additionally, our \textbf{Concept-Aware Loss} explicitly guides the unlearning process by enforcing semantic consistency through superclass alignment, while a regularization loss based on knowledge distillation ensures that previously unlearned concepts remain forgotten during sequential unlearning. We conduct extensive experiments to evaluate our approach. Results demonstrate that our method outperforms existing unlearning techniques in forgetting effectiveness, output fidelity, and semantic coherence, particularly in multi-concept scenarios. Our work provides a principled and flexible framework for stable and high-fidelity unlearning in generative models. The code will be released publicly.

Title: Multimodal 3D Genome Pre-training

Authors: Minghao Yang, Pengteng Li, Yan Liang, Qianyi Cai, Zhihang Zheng, Shichen Zhang, Pengfei Zhang, Zhi-An Huang, Hui Xiong
Subjects: cs.LG, cs.AI, q-bio.GN
Abstract URL: https://arxiv.org/abs/2504.09060
Pdf URL: https://arxiv.org/pdf/2504.09060
Copy Paste: [[2504.09060]] Multimodal 3D Genome Pre-training(https://arxiv.org/abs/2504.09060)
Keywords: foundation model
Abstract: Deep learning techniques have driven significant progress in various analytical tasks within 3D genomics in computational biology. However, a holistic understanding of 3D genomics knowledge remains underexplored. Here, we propose MIX-HIC, the first multimodal foundation model of 3D genome that integrates both 3D genome structure and epigenomic tracks, which obtains unified and comprehensive semantics. For accurate heterogeneous semantic fusion, we design the cross-modal interaction and mapping blocks for robust unified representation, yielding the accurate aggregation of 3D genome knowledge. Besides, we introduce the first large-scale dataset comprising over 1 million pairwise samples of Hi-C contact maps and epigenomic tracks for high-quality pre-training, enabling the exploration of functional implications in 3D genomics. Extensive experiments show that MIX-HIC can significantly surpass existing state-of-the-art methods in diverse downstream tasks. This work provides a valuable resource for advancing 3D genomics research.

Title: Privacy Preservation in Gen AI Applications

Authors: Swetha S, Ram Sundhar K Shaju, Rakshana M, Ganesh R, Balavedhaa S, Thiruvaazhi U
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09095
Pdf URL: https://arxiv.org/pdf/2504.09095
Copy Paste: [[2504.09095]] Privacy Preservation in Gen AI Applications(https://arxiv.org/abs/2504.09095)
Keywords: generative
Abstract: The ability of machines to comprehend and produce language that is similar to that of humans has revolutionized sectors like customer service, healthcare, and finance thanks to the quick advances in Natural Language Processing (NLP), which are fueled by Generative Artificial Intelligence (AI) and Large Language Models (LLMs). However, because LLMs trained on large datasets may unintentionally absorb and reveal Personally Identifiable Information (PII) from user interactions, these capabilities also raise serious privacy concerns. Deep neural networks' intricacy makes it difficult to track down or stop the inadvertent storing and release of private information, which raises serious concerns about the privacy and security of AI-driven data. This study tackles these issues by detecting Generative AI weaknesses through attacks such as data extraction, model inversion, and membership inference. A privacy-preserving Generative AI application that is resistant to these assaults is then developed. It ensures privacy without sacrificing functionality by using methods to identify, alter, or remove PII before to dealing with LLMs. In order to determine how well cloud platforms like Microsoft Azure, Google Cloud, and AWS provide privacy tools for protecting AI applications, the study also examines these technologies. In the end, this study offers a fundamental privacy paradigm for generative AI systems, focusing on data security and moral AI implementation, and opening the door to a more secure and conscientious use of these tools.

Title: BIGS: Bimanual Category-agnostic Interaction Reconstruction from Monocular Videos via 3D Gaussian Splatting

Authors: Jeongwan On, Kyeonghwan Gwak, Gunyoung Kang, Junuk Cha, Soohyun Hwang, Hyein Hwang, Seungryul Baek
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09097
Pdf URL: https://arxiv.org/pdf/2504.09097
Copy Paste: [[2504.09097]] BIGS: Bimanual Category-agnostic Interaction Reconstruction from Monocular Videos via 3D Gaussian Splatting(https://arxiv.org/abs/2504.09097)
Keywords: diffusion
Abstract: Reconstructing 3Ds of hand-object interaction (HOI) is a fundamental problem that can find numerous applications. Despite recent advances, there is no comprehensive pipeline yet for bimanual class-agnostic interaction reconstruction from a monocular RGB video, where two hands and an unknown object are interacting with each other. Previous works tackled the limited hand-object interaction case, where object templates are pre-known or only one hand is involved in the interaction. The bimanual interaction reconstruction exhibits severe occlusions introduced by complex interactions between two hands and an object. To solve this, we first introduce BIGS (Bimanual Interaction 3D Gaussian Splatting), a method that reconstructs 3D Gaussians of hands and an unknown object from a monocular video. To robustly obtain object Gaussians avoiding severe occlusions, we leverage prior knowledge of pre-trained diffusion model with score distillation sampling (SDS) loss, to reconstruct unseen object parts. For hand Gaussians, we exploit the 3D priors of hand model (i.e., MANO) and share a single Gaussian for two hands to effectively accumulate hand 3D information, given limited views. To further consider the 3D alignment between hands and objects, we include the interacting-subjects optimization step during Gaussian optimization. Our method achieves the state-of-the-art accuracy on two challenging datasets, in terms of 3D hand pose estimation (MPJPE), 3D object reconstruction (CDh, CDo, F10), and rendering quality (PSNR, SSIM, LPIPS), respectively.

Title: CAShift: Benchmarking Log-Based Cloud Attack Detection under Normality Shift

Authors: Jiongchi Yu, Xiaofei Xie, Qiang Hu, Bowen Zhang, Ziming Zhao, Yun Lin, Lei Ma, Ruitao Feng, Frank Liau
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2504.09115
Pdf URL: https://arxiv.org/pdf/2504.09115
Copy Paste: [[2504.09115]] CAShift: Benchmarking Log-Based Cloud Attack Detection under Normality Shift(https://arxiv.org/abs/2504.09115)
Keywords: anomaly
Abstract: With the rapid advancement of cloud-native computing, securing cloud environments has become an important task. Log-based Anomaly Detection (LAD) is the most representative technique used in different systems for attack detection and safety guarantee, where multiple LAD methods and relevant datasets have been proposed. However, even though some of these datasets are specifically prepared for cloud systems, they only cover limited cloud behaviors and lack information from a whole-system perspective. Besides, another critical issue to consider is normality shift, which implies the test distribution could differ from the training distribution and highly affects the performance of LAD. Unfortunately, existing works only focus on simple shift types such as chronological changes, while other important and cloud-specific shift types are ignored, e.g., the distribution shift introduced by different deployed cloud architectures. Therefore, creating a new dataset that covers diverse behaviors of cloud systems and normality shift types is necessary. To fill the gap in evaluating LAD under real-world conditions, we present CAShift, the first normality shift-aware dataset for cloud systems. CAShift captures three shift types, including application, version, and cloud architecture shifts, and includes 20 diverse attack scenarios across various cloud components. Using CAShift, we conduct an empirical study showing that (1) all LAD methods are significantly affected by normality shifts, with performance drops of up to 34%, and (2) continuous learning techniques can improve F1-scores by up to 27%, depending on data usage and algorithm choice. Based on our findings, we offer valuable implications for future research in designing more robust LAD models and methods for LAD shift adaptation.

Title: Self-Supervised Autoencoder Network for Robust Heart Rate Extraction from Noisy Photoplethysmogram: Applying Blind Source Separation to Biosignal Analysis

Authors: Matthew B. Webster, Dongheon Lee, Joonnyong Lee
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2504.09132
Pdf URL: https://arxiv.org/pdf/2504.09132
Copy Paste: [[2504.09132]] Self-Supervised Autoencoder Network for Robust Heart Rate Extraction from Noisy Photoplethysmogram: Applying Blind Source Separation to Biosignal Analysis(https://arxiv.org/abs/2504.09132)
Keywords: self-supervised
Abstract: Biosignals can be viewed as mixtures measuring particular physiological events, and blind source separation (BSS) aims to extract underlying source signals from mixtures. This paper proposes a self-supervised multi-encoder autoencoder (MEAE) to separate heartbeat-related source signals from photoplethysmogram (PPG), enhancing heart rate (HR) detection in noisy PPG data. The MEAE is trained on PPG signals from a large open polysomnography database without any pre-processing or data selection. The trained network is then applied to a noisy PPG dataset collected during the daily activities of nine subjects. The extracted heartbeat-related source signal significantly improves HR detection as compared to the original PPG. The absence of pre-processing and the self-supervised nature of the proposed method, combined with its strong performance, highlight the potential of BSS in biosignal analysis.

Title: MatWheel: Addressing Data Scarcity in Materials Science Through Synthetic Data

Authors: Wentao Li, Yizhe Chen, Jiangjie Qiu, Xiaonan Wang
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2504.09152
Pdf URL: https://arxiv.org/pdf/2504.09152
Copy Paste: [[2504.09152]] MatWheel: Addressing Data Scarcity in Materials Science Through Synthetic Data(https://arxiv.org/abs/2504.09152)
Keywords: generative
Abstract: Data scarcity and the high cost of annotation have long been persistent challenges in the field of materials science. Inspired by its potential in other fields like computer vision, we propose the MatWheel framework, which train the material property prediction model using the synthetic data generated by the conditional generative model. We explore two scenarios: fully-supervised and semi-supervised learning. Using CGCNN for property prediction and Con-CDVAE as the conditional generative model, experiments on two data-scarce material property datasets from Matminer database are conducted. Results show that synthetic data has potential in extreme data-scarce scenarios, achieving performance close to or exceeding that of real samples in all two tasks. We also find that pseudo-labels have little impact on generated data quality. Future work will integrate advanced models and optimize generation conditions to boost the effectiveness of the materials data flywheel.

Title: Secure Physical Layer Communications for Low-Altitude Economy Networking: A Survey

Authors: Lingyi Cai, Jiacheng Wang, Ruichen Zhang, Yu Zhang, Tao Jiang, Dusit Niyato, Xianbin Wang, Abbas Jamalipour, Xuemin Shen
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2504.09153
Pdf URL: https://arxiv.org/pdf/2504.09153
Copy Paste: [[2504.09153]] Secure Physical Layer Communications for Low-Altitude Economy Networking: A Survey(https://arxiv.org/abs/2504.09153)
Keywords: anomaly
Abstract: The Low-Altitude Economy Networking (LAENet) is emerging as a transformative paradigm that enables an integrated and sophisticated communication infrastructure to support aerial vehicles in carrying out a wide range of economic activities within low-altitude airspace. However, the physical layer communications in the LAENet face growing security threats due to inherent characteristics of aerial communication environments, such as signal broadcast nature and channel openness. These challenges highlight the urgent need for safeguarding communication confidentiality, availability, and integrity. In view of the above, this survey comprehensively reviews existing secure countermeasures for physical layer communication in the LAENet. We explore core methods focusing on anti-eavesdropping and authentication for ensuring communication confidentiality. Subsequently, availability-enhancing techniques are thoroughly discussed for anti-jamming and spoofing defense. Then, we review approaches for safeguarding integrity through anomaly detection and injection protection. Furthermore, we discuss future research directions, emphasizing energy-efficient physical layer security, multi-drone collaboration for secure communication, AI-driven security defense strategy, space-air-ground integrated security architecture, and 6G-enabled secure UAV communication. This survey may provide valuable references and new insights for researchers in the field of secure physical layer communication for the LAENet.

Title: Evolved Hierarchical Masking for Self-Supervised Learning

Authors: Zhanzhou Feng, Shiliang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09155
Pdf URL: https://arxiv.org/pdf/2504.09155
Copy Paste: [[2504.09155]] Evolved Hierarchical Masking for Self-Supervised Learning(https://arxiv.org/abs/2504.09155)
Keywords: self-supervised
Abstract: Existing Masked Image Modeling methods apply fixed mask patterns to guide the self-supervised training. As those mask patterns resort to different criteria to depict image contents, sticking to a fixed pattern leads to a limited vision cues modeling this http URL paper introduces an evolved hierarchical masking method to pursue general visual cues modeling in self-supervised learning. The proposed method leverages the vision model being trained to parse the input visual cues into a hierarchy structure, which is hence adopted to generate masks accordingly. The accuracy of hierarchy is on par with the capability of the model being trained, leading to evolved mask patterns at different training stages. Initially, generated masks focus on low-level visual cues to grasp basic textures, then gradually evolve to depict higher-level cues to reinforce the learning of more complicated object semantics and contexts. Our method does not require extra pre-trained models or annotations and ensures training efficiency by evolving the training difficulty. We conduct extensive experiments on seven downstream tasks including partial-duplicate image retrieval relying on low-level details, as well as image classification and semantic segmentation that require semantic parsing capability. Experimental results demonstrate that it substantially boosts performance across these tasks. For instance, it surpasses the recent MAE by 1.1\% in imageNet-1K classification and 1.4\% in ADE20K segmentation with the same training epochs. We also align the proposed method with the current research focus on LLMs. The proposed approach bridges the gap with large-scale pre-training on semantic demanding tasks and enhances intricate detail perception in tasks requiring low-level feature recognition.

Title: Can postgraduate translation students identify machine-generated text?

Authors: Michael Farrell
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09164
Pdf URL: https://arxiv.org/pdf/2504.09164
Copy Paste: [[2504.09164]] Can postgraduate translation students identify machine-generated text?(https://arxiv.org/abs/2504.09164)
Keywords: generative
Abstract: Given the growing use of generative artificial intelligence as a tool for creating multilingual content and bypassing both machine and traditional translation methods, this study explores the ability of linguistically trained individuals to discern machine-generated output from human-written text (HT). After brief training sessions on the textual anomalies typically found in synthetic text (ST), twenty-three postgraduate translation students analysed excerpts of Italian prose and assigned likelihood scores to indicate whether they believed they were human-written or AI-generated (ChatGPT-4o). The results show that, on average, the students struggled to distinguish between HT and ST, with only two participants achieving notable accuracy. Closer analysis revealed that the students often identified the same textual anomalies in both HT and ST, although features such as low burstiness and self-contradiction were more frequently associated with ST. These findings suggest the need for improvements in the preparatory training. Moreover, the study raises questions about the necessity of editing synthetic text to make it sound more human-like and recommends further research to determine whether AI-generated text is already sufficiently natural-sounding not to require further refinement.

Title: From Visual Explanations to Counterfactual Explanations with Latent Diffusion

Authors: Tung Luu, Nam Le, Duc Le, Bac Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09202
Pdf URL: https://arxiv.org/pdf/2504.09202
Copy Paste: [[2504.09202]] From Visual Explanations to Counterfactual Explanations with Latent Diffusion(https://arxiv.org/abs/2504.09202)
Keywords: diffusion
Abstract: Visual counterfactual explanations are ideal hypothetical images that change the decision-making of the classifier with high confidence toward the desired class while remaining visually plausible and close to the initial image. In this paper, we propose a new approach to tackle two key challenges in recent prominent works: i) determining which specific counterfactual features are crucial for distinguishing the "concept" of the target class from the original class, and ii) supplying valuable explanations for the non-robust classifier without relying on the support of an adversarially robust model. Our method identifies the essential region for modification through algorithms that provide visual explanations, and then our framework generates realistic counterfactual explanations by combining adversarial attacks based on pruning the adversarial gradient of the target classifier and the latent diffusion model. The proposed method outperforms previous state-of-the-art results on various evaluation criteria on ImageNet and CelebA-HQ datasets. In general, our method can be applied to arbitrary classifiers, highlight the strong association between visual and counterfactual explanations, make semantically meaningful changes from the target classifier, and provide observers with subtle counterfactual images.

Title: VideoAds for Fast-Paced Video Understanding: Where Opensource Foundation Models Beat GPT-4o & Gemini-1.5 Pro

Authors: Zheyuan Zhang, Monica Dou, Linkai Peng, Hongyi Pan, Ulas Bagci, Boqing Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09282
Pdf URL: https://arxiv.org/pdf/2504.09282
Copy Paste: [[2504.09282]] VideoAds for Fast-Paced Video Understanding: Where Opensource Foundation Models Beat GPT-4o & Gemini-1.5 Pro(https://arxiv.org/abs/2504.09282)
Keywords: foundation model
Abstract: Advertisement videos serve as a rich and valuable source of purpose-driven information, encompassing high-quality visual, textual, and contextual cues designed to engage viewers. They are often more complex than general videos of similar duration due to their structured narratives and rapid scene transitions, posing significant challenges to multi-modal large language models (MLLMs). In this work, we introduce VideoAds, the first dataset tailored for benchmarking the performance of MLLMs on advertisement videos. VideoAds comprises well-curated advertisement videos with complex temporal structures, accompanied by \textbf{manually} annotated diverse questions across three core tasks: visual finding, video summary, and visual reasoning. We propose a quantitative measure to compare VideoAds against existing benchmarks in terms of video complexity. Through extensive experiments, we find that Qwen2.5-VL-72B, an opensource MLLM, achieves 73.35\% accuracy on VideoAds, outperforming GPT-4o (66.82\%) and Gemini-1.5 Pro (69.66\%); the two proprietary models especially fall behind the opensource model in video summarization and reasoning, but perform the best in visual finding. Notably, human experts easily achieve a remarkable accuracy of 94.27\%. These results underscore the necessity of advancing MLLMs' temporal modeling capabilities and highlight VideoAds as a potentially pivotal benchmark for future research in understanding video that requires high FPS sampling. The dataset and evaluation code will be publicly available at this https URL.

Title: Enhancing Contrastive Demonstration Selection with Semantic Diversity for Robust In-Context Machine Translation

Authors: Owen Patterson, Chee Ng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09305
Pdf URL: https://arxiv.org/pdf/2504.09305
Copy Paste: [[2504.09305]] Enhancing Contrastive Demonstration Selection with Semantic Diversity for Robust In-Context Machine Translation(https://arxiv.org/abs/2504.09305)
Keywords: in-context
Abstract: In-Context Learning (ICL) empowers large language models to perform tasks by conditioning on a few input-output examples. However, the performance of ICL is highly sensitive to the selection of these demonstrations. While existing methods focus on similarity or contrastive selection, they often overlook the importance of diversity among the chosen examples. In this paper, we propose DiverseConE (Diversity-Enhanced Contrastive Example Selection), a novel approach for demonstration selection in in-context learning for machine translation. Our method builds upon contrastive selection by incorporating a diversity enhancement step based on embedding space dissimilarity. We conduct extensive experiments on the Llama2-7b model across four language pairs (English-Chinese, Chinese-English, Russian-German, German-Russian) in 1-shot and 3-shot settings, using COMET20 and COMET22 for evaluation. Our results demonstrate that DiverseConE consistently outperforms strong baseline methods, including random selection, BM25, TopK, and a state-of-the-art contrastive selection method. Further analysis, including diversity metrics and human evaluation, validates the effectiveness of our approach and highlights the benefits of considering demonstration diversity for improved translation quality.

Title: MedIL: Implicit Latent Spaces for Generating Heterogeneous Medical Images at Arbitrary Resolutions

Authors: Tyler Spears, Shen Zhu, Yinzhu Jin, Aman Shrivastava, P. Thomas Fletcher
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09322
Pdf URL: https://arxiv.org/pdf/2504.09322
Copy Paste: [[2504.09322]] MedIL: Implicit Latent Spaces for Generating Heterogeneous Medical Images at Arbitrary Resolutions(https://arxiv.org/abs/2504.09322)
Keywords: diffusion, generative
Abstract: In this work, we introduce MedIL, a first-of-its-kind autoencoder built for encoding medical images with heterogeneous sizes and resolutions for image generation. Medical images are often large and heterogeneous, where fine details are of vital clinical importance. Image properties change drastically when considering acquisition equipment, patient demographics, and pathology, making realistic medical image generation challenging. Recent work in latent diffusion models (LDMs) has shown success in generating images resampled to a fixed-size. However, this is a narrow subset of the resolutions native to image acquisition, and resampling discards fine anatomical details. MedIL utilizes implicit neural representations to treat images as continuous signals, where encoding and decoding can be performed at arbitrary resolutions without prior resampling. We quantitatively and qualitatively show how MedIL compresses and preserves clinically-relevant features over large multi-site, multi-resolution datasets of both T1w brain MRIs and lung CTs. We further demonstrate how MedIL can influence the quality of images generated with a diffusion model, and discuss how MedIL can enhance generative models to resemble raw clinical acquisitions.

Title: Text To 3D Object Generation For Scalable Room Assembly

Authors: Sonia Laguna, Alberto Garcia-Garcia, Marie-Julie Rakotosaona, Stylianos Moschoglou, Leonhard Helminger, Sergio Orts-Escolano
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09328
Pdf URL: https://arxiv.org/pdf/2504.09328
Copy Paste: [[2504.09328]] Text To 3D Object Generation For Scalable Room Assembly(https://arxiv.org/abs/2504.09328)
Keywords: diffusion
Abstract: Modern machine learning models for scene understanding, such as depth estimation and object tracking, rely on large, high-quality datasets that mimic real-world deployment scenarios. To address data scarcity, we propose an end-to-end system for synthetic data generation for scalable, high-quality, and customizable 3D indoor scenes. By integrating and adapting text-to-image and multi-view diffusion models with Neural Radiance Field-based meshing, this system generates highfidelity 3D object assets from text prompts and incorporates them into pre-defined floor plans using a rendering tool. By introducing novel loss functions and training strategies into existing methods, the system supports on-demand scene generation, aiming to alleviate the scarcity of current available data, generally manually crafted by artists. This system advances the role of synthetic data in addressing machine learning training limitations, enabling more robust and generalizable models for real-world applications.

Title: REMEMBER: Retrieval-based Explainable Multimodal Evidence-guided Modeling for Brain Evaluation and Reasoning in Zero- and Few-shot Neurodegenerative Diagnosis

Authors: Duy-Cat Can, Quang-Huy Tang, Huong Ha, Binh T. Nguyen, Oliver Y. Chén
Subjects: cs.CV, cs.AI, cs.CL, cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2504.09354
Pdf URL: https://arxiv.org/pdf/2504.09354
Copy Paste: [[2504.09354]] REMEMBER: Retrieval-based Explainable Multimodal Evidence-guided Modeling for Brain Evaluation and Reasoning in Zero- and Few-shot Neurodegenerative Diagnosis(https://arxiv.org/abs/2504.09354)
Keywords: generative
Abstract: Timely and accurate diagnosis of neurodegenerative disorders, such as Alzheimer's disease, is central to disease management. Existing deep learning models require large-scale annotated datasets and often function as "black boxes". Additionally, datasets in clinical practice are frequently small or unlabeled, restricting the full potential of deep learning methods. Here, we introduce REMEMBER -- Retrieval-based Explainable Multimodal Evidence-guided Modeling for Brain Evaluation and Reasoning -- a new machine learning framework that facilitates zero- and few-shot Alzheimer's diagnosis using brain MRI scans through a reference-based reasoning process. Specifically, REMEMBER first trains a contrastively aligned vision-text model using expert-annotated reference data and extends pseudo-text modalities that encode abnormality types, diagnosis labels, and composite clinical descriptions. Then, at inference time, REMEMBER retrieves similar, human-validated cases from a curated dataset and integrates their contextual information through a dedicated evidence encoding module and attention-based inference head. Such an evidence-guided design enables REMEMBER to imitate real-world clinical decision-making process by grounding predictions in retrieved imaging and textual context. Specifically, REMEMBER outputs diagnostic predictions alongside an interpretable report, including reference images and explanations aligned with clinical workflows. Experimental results demonstrate that REMEMBER achieves robust zero- and few-shot performance and offers a powerful and explainable framework to neuroimaging-based diagnosis in the real world, especially under limited data.

Title: Structure-Accurate Medical Image Translation based on Dynamic Frequency Balance and Knowledge Guidance

Authors: Jiahua Xu, Dawei Zhou, Lei Hu, Zaiyi Liu, Nannan Wang, Xinbo Gao
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.09441
Pdf URL: https://arxiv.org/pdf/2504.09441
Copy Paste: [[2504.09441]] Structure-Accurate Medical Image Translation based on Dynamic Frequency Balance and Knowledge Guidance(https://arxiv.org/abs/2504.09441)
Keywords: diffusion
Abstract: Multimodal medical images play a crucial role in the precise and comprehensive clinical diagnosis. Diffusion model is a powerful strategy to synthesize the required medical images. However, existing approaches still suffer from the problem of anatomical structure distortion due to the overfitting of high-frequency information and the weakening of low-frequency information. Thus, we propose a novel method based on dynamic frequency balance and knowledge guidance. Specifically, we first extract the low-frequency and high-frequency components by decomposing the critical features of the model using wavelet transform. Then, a dynamic frequency balance module is designed to adaptively adjust frequency for enhancing global low-frequency features and effective high-frequency details as well as suppressing high-frequency noise. To further overcome the challenges posed by the large differences between different medical modalities, we construct a knowledge-guided mechanism that fuses the prior clinical knowledge from a visual language model with visual features, to facilitate the generation of accurate anatomical structures. Experimental evaluations on multiple datasets show the proposed method achieves significant improvements in qualitative and quantitative assessments, verifying its effectiveness and superiority.

Title: D$^2$iT: Dynamic Diffusion Transformer for Accurate Image Generation

Authors: Weinan Jia, Mengqi Huang, Nan Chen, Lei Zhang, Zhendong Mao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09454
Pdf URL: https://arxiv.org/pdf/2504.09454
Copy Paste: [[2504.09454]] D$^2$iT: Dynamic Diffusion Transformer for Accurate Image Generation(https://arxiv.org/abs/2504.09454)
Keywords: diffusion
Abstract: Diffusion models are widely recognized for their ability to generate high-fidelity images. Despite the excellent performance and scalability of the Diffusion Transformer (DiT) architecture, it applies fixed compression across different image regions during the diffusion process, disregarding the naturally varying information densities present in these regions. However, large compression leads to limited local realism, while small compression increases computational complexity and compromises global consistency, ultimately impacting the quality of generated images. To address these limitations, we propose dynamically compressing different image regions by recognizing the importance of different regions, and introduce a novel two-stage framework designed to enhance the effectiveness and efficiency of image generation: (1) Dynamic VAE (DVAE) at first stage employs a hierarchical encoder to encode different image regions at different downsampling rates, tailored to their specific information densities, thereby providing more accurate and natural latent codes for the diffusion process. (2) Dynamic Diffusion Transformer (D$^2$iT) at second stage generates images by predicting multi-grained noise, consisting of coarse-grained (less latent code in smooth regions) and fine-grained (more latent codes in detailed regions), through an novel combination of the Dynamic Grain Transformer and the Dynamic Content Transformer. The strategy of combining rough prediction of noise with detailed regions correction achieves a unification of global consistency and local realism. Comprehensive experiments on various generation tasks validate the effectiveness of our approach. Code will be released at this https URL.

Title: CamMimic: Zero-Shot Image To Camera Motion Personalized Video Generation Using Diffusion Models

Authors: Pooja Guhan, Divya Kothandaraman, Tsung-Wei Huang, Guan-Ming Su, Dinesh Manocha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09472
Pdf URL: https://arxiv.org/pdf/2504.09472
Copy Paste: [[2504.09472]] CamMimic: Zero-Shot Image To Camera Motion Personalized Video Generation Using Diffusion Models(https://arxiv.org/abs/2504.09472)
Keywords: diffusion
Abstract: We introduce CamMimic, an innovative algorithm tailored for dynamic video editing needs. It is designed to seamlessly transfer the camera motion observed in a given reference video onto any scene of the user's choice in a zero-shot manner without requiring any additional data. Our algorithm achieves this using a two-phase strategy by leveraging a text-to-video diffusion model. In the first phase, we develop a multi-concept learning method using a combination of LoRA layers and an orthogonality loss to capture and understand the underlying spatial-temporal characteristics of the reference video as well as the spatial features of the user's desired scene. The second phase proposes a unique homography-based refinement strategy to enhance the temporal and spatial alignment of the generated video. We demonstrate the efficacy of our method through experiments conducted on a dataset containing combinations of diverse scenes and reference videos containing a variety of camera motions. In the absence of an established metric for assessing camera motion transfer between unrelated scenes, we propose CameraScore, a novel metric that utilizes homography representations to measure camera motion similarity between the reference and generated videos. Extensive quantitative and qualitative evaluations demonstrate that our approach generates high-quality, motion-enhanced videos. Additionally, a user study reveals that 70.31% of participants preferred our method for scene preservation, while 90.45% favored it for motion transfer. We hope this work lays the foundation for future advancements in camera motion transfer across different scenes.

Title: Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation

Authors: Yongchao Feng, Yajie Liu, Shuai Yang, Wenrui Cai, Jinqing Zhang, Qiqi Zhan, Ziyue Huang, Hongxi Yan, Qiao Wan, Chenguang Liu, Junzhe Wang, Jiahui Lv, Ziqi Liu, Tengyuan Shi, Qingjie Liu, Yunhong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09480
Pdf URL: https://arxiv.org/pdf/2504.09480
Copy Paste: [[2504.09480]] Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation(https://arxiv.org/abs/2504.09480)
Keywords: foundation model
Abstract: Vision-Language Model (VLM) have gained widespread adoption in Open-Vocabulary (OV) object detection and segmentation tasks. Despite they have shown promise on OV-related tasks, their effectiveness in conventional vision tasks has thus far been unevaluated. In this work, we present the systematic review of VLM-based detection and segmentation, view VLM as the foundational model and conduct comprehensive evaluations across multiple downstream tasks for the first time: 1) The evaluation spans eight detection scenarios (closed-set detection, domain adaptation, crowded objects, etc.) and eight segmentation scenarios (few-shot, open-world, small object, etc.), revealing distinct performance advantages and limitations of various VLM architectures across tasks. 2) As for detection tasks, we evaluate VLMs under three finetuning granularities: \textit{zero prediction}, \textit{visual fine-tuning}, and \textit{text prompt}, and further analyze how different finetuning strategies impact performance under varied task. 3) Based on empirical findings, we provide in-depth analysis of the correlations between task characteristics, model architectures, and training methodologies, offering insights for future VLM design. 4) We believe that this work shall be valuable to the pattern recognition experts working in the fields of computer vision, multimodal learning, and vision foundation models by introducing them to the problem, and familiarizing them with the current status of the progress while providing promising directions for future research. A project associated with this review and evaluation has been created at this https URL.

Title: GenEDA: Unleashing Generative Reasoning on Netlist via Multimodal Encoder-Decoder Aligned Foundation Model

Authors: Wenji Fang, Jing Wang, Yao Lu, Shang Liu, Zhiyao Xie
Subjects: cs.LG, cs.AR
Abstract URL: https://arxiv.org/abs/2504.09485
Pdf URL: https://arxiv.org/pdf/2504.09485
Copy Paste: [[2504.09485]] GenEDA: Unleashing Generative Reasoning on Netlist via Multimodal Encoder-Decoder Aligned Foundation Model(https://arxiv.org/abs/2504.09485)
Keywords: foundation model, generative
Abstract: The success of foundation AI has motivated the research of circuit foundation models, which are customized to assist the integrated circuit (IC) design process. However, existing pre-trained circuit models are typically limited to standalone encoders for predictive tasks or decoders for generative tasks. These two model types are developed independently, operate on different circuit modalities, and reside in separate latent spaces, which restricts their ability to complement each other for more advanced applications. In this work, we present GenEDA, the first framework that aligns circuit encoders with decoders within a shared latent space. GenEDA bridges the gap between graph-based circuit representations and text-based large language models (LLMs), enabling communication between their respective latent spaces. To achieve the alignment, we propose two paradigms that support both open-source trainable LLMs and commercial frozen LLMs. Built on this aligned architecture, GenEDA enables three unprecedented generative reasoning tasks over netlists, where the model reversely generates the high-level functionality from low-level netlists in different granularities. These tasks extend traditional gate-type prediction to direct generation of full-circuit functionality. Experiments demonstrate that GenEDA significantly boosts advanced LLMs' (e.g., GPT-4o and DeepSeek-V3) performance in all tasks.

Title: MADLLM: Multivariate Anomaly Detection via Pre-trained LLMs

Authors: Wei Tao, Xiaoyang Qu, Kai Lu, Jiguang Wan, Guokuan Li, Jianzong Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09504
Pdf URL: https://arxiv.org/pdf/2504.09504
Copy Paste: [[2504.09504]] MADLLM: Multivariate Anomaly Detection via Pre-trained LLMs(https://arxiv.org/abs/2504.09504)
Keywords: anomaly
Abstract: When applying pre-trained large language models (LLMs) to address anomaly detection tasks, the multivariate time series (MTS) modality of anomaly detection does not align with the text modality of LLMs. Existing methods simply transform the MTS data into multiple univariate time series sequences, which can cause many problems. This paper introduces MADLLM, a novel multivariate anomaly detection method via pre-trained LLMs. We design a new triple encoding technique to align the MTS modality with the text modality of LLMs. Specifically, this technique integrates the traditional patch embedding method with two novel embedding approaches: Skip Embedding, which alters the order of patch processing in traditional methods to help LLMs retain knowledge of previous features, and Feature Embedding, which leverages contrastive learning to allow the model to better understand the correlations between different features. Experimental results demonstrate that our method outperforms state-of-the-art methods in various public anomaly detection datasets.

Title: DiffuMural: Restoring Dunhuang Murals with Multi-scale Diffusion

Authors: Puyu Han, Jiaju Kang, Yuhang Pan, Erting Pan, Zeyu Zhang, Qunchao Jin, Juntao Jiang, Zhichen Liu, Luqi Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09513
Pdf URL: https://arxiv.org/pdf/2504.09513
Copy Paste: [[2504.09513]] DiffuMural: Restoring Dunhuang Murals with Multi-scale Diffusion(https://arxiv.org/abs/2504.09513)
Keywords: diffusion
Abstract: Large-scale pre-trained diffusion models have produced excellent results in the field of conditional image generation. However, restoration of ancient murals, as an important downstream task in this field, poses significant challenges to diffusion model-based restoration methods due to its large defective area and scarce training samples. Conditional restoration tasks are more concerned with whether the restored part meets the aesthetic standards of mural restoration in terms of overall style and seam detail, and such metrics for evaluating heuristic image complements are lacking in current research. We therefore propose DiffuMural, a combined Multi-scale convergence and Collaborative Diffusion mechanism with ControlNet and cyclic consistency loss to optimise the matching between the generated images and the conditional control. DiffuMural demonstrates outstanding capabilities in mural restoration, leveraging training data from 23 large-scale Dunhuang murals that exhibit consistent visual aesthetics. The model excels in restoring intricate details, achieving a coherent overall appearance, and addressing the unique challenges posed by incomplete murals lacking factual grounding. Our evaluation framework incorporates four key metrics to quantitatively assess incomplete murals: factual accuracy, textural detail, contextual semantics, and holistic visual coherence. Furthermore, we integrate humanistic value assessments to ensure the restored murals retain their cultural and artistic significance. Extensive experiments validate that our method outperforms state-of-the-art (SOTA) approaches in both qualitative and quantitative metrics.

Title: Causal integration of chemical structures improves representations of microscopy images for morphological profiling

Authors: Yemin Yu, Neil Tenenholtz, Lester Mackey, Ying Wei, David Alvarez-Melis, Ava P. Amini, Alex X. Lu
Subjects: cs.LG, cs.CE, cs.CV
Abstract URL: https://arxiv.org/abs/2504.09544
Pdf URL: https://arxiv.org/pdf/2504.09544
Copy Paste: [[2504.09544]] Causal integration of chemical structures improves representations of microscopy images for morphological profiling(https://arxiv.org/abs/2504.09544)
Keywords: self-supervised
Abstract: Recent advances in self-supervised deep learning have improved our ability to quantify cellular morphological changes in high-throughput microscopy screens, a process known as morphological profiling. However, most current methods only learn from images, despite many screens being inherently multimodal, as they involve both a chemical or genetic perturbation as well as an image-based readout. We hypothesized that incorporating chemical compound structure during self-supervised pre-training could improve learned representations of images in high-throughput microscopy screens. We introduce a representation learning framework, MICON (Molecular-Image Contrastive Learning), that models chemical compounds as treatments that induce counterfactual transformations of cell phenotypes. MICON significantly outperforms classical hand-crafted features such as CellProfiler and existing deep-learning-based representation learning methods in challenging evaluation settings where models must identify reproducible effects of drugs across independent replicates and data-generating centers. We demonstrate that incorporating chemical compound information into the learning process provides consistent improvements in our evaluation setting and that modeling compounds specifically as treatments in a causal framework outperforms approaches that directly align images and compounds in a single representation space. Our findings point to a new direction for representation learning in morphological profiling, suggesting that methods should explicitly account for the multimodal nature of microscopy screening data.

Title: SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification

Authors: Xiang Hu, Pingping Zhang, Yuhao Wang, Bin Yan, Huchuan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09549
Pdf URL: https://arxiv.org/pdf/2504.09549
Copy Paste: [[2504.09549]] SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification(https://arxiv.org/abs/2504.09549)
Keywords: diffusion, generative
Abstract: Aerial-Ground Person Re-IDentification (AG-ReID) aims to retrieve specific persons across cameras with different viewpoints. Previous works focus on designing discriminative ReID models to maintain identity consistency despite drastic changes in camera viewpoints. The core idea behind these methods is quite natural, but designing a view-robust network is a very challenging task. Moreover, they overlook the contribution of view-specific features in enhancing the model's capability to represent persons. To address these issues, we propose a novel two-stage feature learning framework named SD-ReID for AG-ReID, which takes advantage of the powerful understanding capacity of generative models, e.g., Stable Diffusion (SD), to generate view-specific features between different viewpoints. In the first stage, we train a simple ViT-based model to extract coarse-grained representations and controllable conditions. Then, in the second stage, we fine-tune the SD model to learn complementary representations guided by the controllable conditions. Furthermore, we propose the View-Refine Decoder (VRD) to obtain additional controllable conditions to generate missing cross-view features. Finally, we use the coarse-grained representations and all-view features generated by SD to retrieve target persons. Extensive experiments on the AG-ReID benchmarks demonstrate the effectiveness of our proposed SD-ReID. The source code will be available upon acceptance.

Title: Mitigating Long-tail Distribution in Oracle Bone Inscriptions: Dataset, Model, and Benchmark

Authors: Jinhao Li, Zijian Chen, Runze Dong, Tingzhu Chen, Changbo Wang, Guangtao Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09555
Pdf URL: https://arxiv.org/pdf/2504.09555
Copy Paste: [[2504.09555]] Mitigating Long-tail Distribution in Oracle Bone Inscriptions: Dataset, Model, and Benchmark(https://arxiv.org/abs/2504.09555)
Keywords: diffusion, generative
Abstract: The oracle bone inscription (OBI) recognition plays a significant role in understanding the history and culture of ancient China. However, the existing OBI datasets suffer from a long-tail distribution problem, leading to biased performance of OBI recognition models across majority and minority classes. With recent advancements in generative models, OBI synthesis-based data augmentation has become a promising avenue to expand the sample size of minority classes. Unfortunately, current OBI datasets lack large-scale structure-aligned image pairs for generative model training. To address these problems, we first present the Oracle-P15K, a structure-aligned OBI dataset for OBI generation and denoising, consisting of 14,542 images infused with domain knowledge from OBI experts. Second, we propose a diffusion model-based pseudo OBI generator, called OBIDiff, to achieve realistic and controllable OBI generation. Given a clean glyph image and a target rubbing-style image, it can effectively transfer the noise style of the original rubbing to the glyph image. Extensive experiments on OBI downstream tasks and user preference studies show the effectiveness of the proposed Oracle-P15K dataset and demonstrate that OBIDiff can accurately preserve inherent glyph structures while transferring authentic rubbing styles effectively.

Title: TextSplat: Text-Guided Semantic Fusion for Generalizable Gaussian Splatting

Authors: Zhicong Wu, Hongbin Xu, Gang Xu, Ping Nie, Zhixin Yan, Jinkai Zheng, Liangqiong Qu, Ming Li, Liqiang Nie
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09588
Pdf URL: https://arxiv.org/pdf/2504.09588
Copy Paste: [[2504.09588]] TextSplat: Text-Guided Semantic Fusion for Generalizable Gaussian Splatting(https://arxiv.org/abs/2504.09588)
Keywords: diffusion
Abstract: Recent advancements in Generalizable Gaussian Splatting have enabled robust 3D reconstruction from sparse input views by utilizing feed-forward Gaussian Splatting models, achieving superior cross-scene generalization. However, while many methods focus on geometric consistency, they often neglect the potential of text-driven guidance to enhance semantic understanding, which is crucial for accurately reconstructing fine-grained details in complex scenes. To address this limitation, we propose TextSplat--the first text-driven Generalizable Gaussian Splatting framework. By employing a text-guided fusion of diverse semantic cues, our framework learns robust cross-modal feature representations that improve the alignment of geometric and semantic information, producing high-fidelity 3D reconstructions. Specifically, our framework employs three parallel modules to obtain complementary representations: the Diffusion Prior Depth Estimator for accurate depth information, the Semantic Aware Segmentation Network for detailed semantic information, and the Multi-View Interaction Network for refined cross-view features. Then, in the Text-Guided Semantic Fusion Module, these representations are integrated via the text-guided and attention-based feature aggregation mechanism, resulting in enhanced 3D Gaussian parameters enriched with detailed semantic cues. Experimental results on various benchmark datasets demonstrate improved performance compared to existing methods across multiple evaluation metrics, validating the effectiveness of our framework. The code will be publicly available.

Title: Mixture-of-Shape-Experts (MoSE): End-to-End Shape Dictionary Framework to Prompt SAM for Generalizable Medical Segmentation

Authors: Jia Wei, Xiaoqi Zhao, Jonghye Woo, Jinsong Ouyang, Georges El Fakhri, Qingyu Chen, Xiaofeng Liu
Subjects: cs.CV, cs.LG, cs.MM, eess.IV, physics.med-ph
Abstract URL: https://arxiv.org/abs/2504.09601
Pdf URL: https://arxiv.org/pdf/2504.09601
Copy Paste: [[2504.09601]] Mixture-of-Shape-Experts (MoSE): End-to-End Shape Dictionary Framework to Prompt SAM for Generalizable Medical Segmentation(https://arxiv.org/abs/2504.09601)
Keywords: foundation model
Abstract: Single domain generalization (SDG) has recently attracted growing attention in medical image segmentation. One promising strategy for SDG is to leverage consistent semantic shape priors across different imaging protocols, scanner vendors, and clinical sites. However, existing dictionary learning methods that encode shape priors often suffer from limited representational power with a small set of offline computed shape elements, or overfitting when the dictionary size grows. Moreover, they are not readily compatible with large foundation models such as the Segment Anything Model (SAM). In this paper, we propose a novel Mixture-of-Shape-Experts (MoSE) framework that seamlessly integrates the idea of mixture-of-experts (MoE) training into dictionary learning to efficiently capture diverse and robust shape priors. Our method conceptualizes each dictionary atom as a shape expert, which specializes in encoding distinct semantic shape information. A gating network dynamically fuses these shape experts into a robust shape map, with sparse activation guided by SAM encoding to prevent overfitting. We further provide this shape map as a prompt to SAM, utilizing the powerful generalization capability of SAM through bidirectional integration. All modules, including the shape dictionary, are trained in an end-to-end manner. Extensive experiments on multiple public datasets demonstrate its effectiveness.

Title: Mitigating Many-Shot Jailbreaking

Authors: Christopher M. Ackerman, Nina Panickssery
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2504.09604
Pdf URL: https://arxiv.org/pdf/2504.09604
Copy Paste: [[2504.09604]] Mitigating Many-Shot Jailbreaking(https://arxiv.org/abs/2504.09604)
Keywords: in-context
Abstract: Many-shot jailbreaking (MSJ) is an adversarial technique that exploits the long context windows of modern LLMs to circumvent model safety training by including in the prompt many examples of a ``fake'' assistant responding inappropriately before the final request. With enough examples, the model's in-context learning abilities override its safety training, and it responds as if it were the ``fake'' assistant. In this work, we probe the effectiveness of different fine tuning and input sanitization approaches on mitigating MSJ attacks, alone and in combination. We find incremental mitigation effectiveness for each, and we show that the combined techniques significantly reduce the effectiveness of MSJ attacks, while retaining model performance in benign in-context learning and conversational tasks. We suggest that our approach could meaningfully ameliorate this vulnerability if incorporated into model safety post-training.

Title: Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training

Authors: Lexington Whalen, Zhenbang Du, Haoran You, Chaojian Li, Sixu Li, Yingyan (Celine)Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09606
Pdf URL: https://arxiv.org/pdf/2504.09606
Copy Paste: [[2504.09606]] Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training(https://arxiv.org/abs/2504.09606)
Keywords: diffusion, generative
Abstract: Training diffusion models (DMs) requires substantial computational resources due to multiple forward and backward passes across numerous timesteps, motivating research into efficient training techniques. In this paper, we propose EB-Diff-Train, a new efficient DM training approach that is orthogonal to other methods of accelerating DM training, by investigating and leveraging Early-Bird (EB) tickets -- sparse subnetworks that manifest early in the training process and maintain high generation quality. We first investigate the existence of traditional EB tickets in DMs, enabling competitive generation quality without fully training a dense model. Then, we delve into the concept of diffusion-dedicated EB tickets, drawing on insights from varying importance of different timestep regions. These tickets adapt their sparsity levels according to the importance of corresponding timestep regions, allowing for aggressive sparsity during non-critical regions while conserving computational resources for crucial timestep regions. Building on this, we develop an efficient DM training technique that derives timestep-aware EB tickets, trains them in parallel, and combines them during inference for image generation. Extensive experiments validate the existence of both traditional and timestep-aware EB tickets, as well as the effectiveness of our proposed EB-Diff-Train method. This approach can significantly reduce training time both spatially and temporally -- achieving 2.9$\times$ to 5.8$\times$ speedups over training unpruned dense models, and up to 10.3$\times$ faster training compared to standard train-prune-finetune pipelines -- without compromising generative quality. Our code is available at this https URL.

Title: Iterative Self-Training for Code Generation via Reinforced Re-Ranking

Authors: Nikita Sorokin, Ivan Sedykh, Valentin Malykh
Subjects: cs.CL, cs.IR, cs.SE
Abstract URL: https://arxiv.org/abs/2504.09643
Pdf URL: https://arxiv.org/pdf/2504.09643
Copy Paste: [[2504.09643]] Iterative Self-Training for Code Generation via Reinforced Re-Ranking(https://arxiv.org/abs/2504.09643)
Keywords: generative
Abstract: Generating high-quality code that solves complex programming tasks is challenging, especially with current decoder-based models that produce highly stochastic outputs. In code generation, even minor errors can easily break the entire solution. Leveraging multiple sampled solutions can significantly improve the overall output quality. One effective way to enhance code generation is by pairing a code generation model with a reranker model, which selects the best solution from the generated samples. We propose a novel iterative self-training approach for self-training reranker models using Proximal Policy Optimization (PPO), aimed at improving both reranking accuracy and the overall code generation process. Unlike traditional PPO approaches, where the focus is on optimizing a generative model with a reward model, our approach emphasizes the development of a robust reward/reranking model. This model improves the quality of generated code through reranking and addresses problems and errors that the reward model might overlook during PPO alignment with the reranker. Our method iteratively refines the training dataset by re-evaluating outputs, identifying high-scoring negative examples, and incorporating them into the training loop, that boosting model performance. Our evaluation on the MultiPL-E dataset demonstrates that our 13.4B parameter model outperforms a 33B model in code generation quality while being three times faster. Moreover, it achieves performance comparable to GPT-4 and surpasses it in one programming language.

Title: KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation

Authors: Xingrui Wang, Jiang Liu, Ze Wang, Xiaodong Yu, Jialian Wu, Ximeng Sun, Yusheng Su, Alan Yuille, Zicheng Liu, Emad Barsoum
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09656
Pdf URL: https://arxiv.org/pdf/2504.09656
Copy Paste: [[2504.09656]] KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation(https://arxiv.org/abs/2504.09656)
Keywords: diffusion
Abstract: Generating video from various conditions, such as text, image, and audio, enables both spatial and temporal control, leading to high-quality generation results. Videos with dramatic motions often require a higher frame rate to ensure smooth motion. Currently, most audio-to-visual animation models use uniformly sampled frames from video clips. However, these uniformly sampled frames fail to capture significant key moments in dramatic motions at low frame rates and require significantly more memory when increasing the number of frames directly. In this paper, we propose KeyVID, a keyframe-aware audio-to-visual animation framework that significantly improves the generation quality for key moments in audio signals while maintaining computation efficiency. Given an image and an audio input, we first localize keyframe time steps from the audio. Then, we use a keyframe generator to generate the corresponding visual keyframes. Finally, we generate all intermediate frames using the motion interpolator. Through extensive experiments, we demonstrate that KeyVID significantly improves audio-video synchronization and video quality across multiple datasets, particularly for highly dynamic motions. The code is released in this https URL.

Title: Computer-Aided Layout Generation for Building Design: A Review

Authors: Jiachen Liu, Yuan Xue, Haomiao Ni, Rui Yu, Zihan Zhou, Sharon X. Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09694
Pdf URL: https://arxiv.org/pdf/2504.09694
Copy Paste: [[2504.09694]] Computer-Aided Layout Generation for Building Design: A Review(https://arxiv.org/abs/2504.09694)
Keywords: generative
Abstract: Generating realistic building layouts for automatic building design has been studied in both the computer vision and architecture domains. Traditional approaches from the architecture domain, which are based on optimization techniques or heuristic design guidelines, can synthesize desirable layouts, but usually require post-processing and involve human interaction in the design pipeline, making them costly and timeconsuming. The advent of deep generative models has significantly improved the fidelity and diversity of the generated architecture layouts, reducing the workload by designers and making the process much more efficient. In this paper, we conduct a comprehensive review of three major research topics of architecture layout design and generation: floorplan layout generation, scene layout synthesis, and generation of some other formats of building layouts. For each topic, we present an overview of the leading paradigms, categorized either by research domains (architecture or machine learning) or by user input conditions or constraints. We then introduce the commonly-adopted benchmark datasets that are used to verify the effectiveness of the methods, as well as the corresponding evaluation metrics. Finally, we identify the well-solved problems and limitations of existing approaches, then propose new perspectives as promising directions for future research in this important research area. A project associated with this survey to maintain the resources is available at awesome-building-layout-generation.

Title: ToolTipNet: A Segmentation-Driven Deep Learning Baseline for Surgical Instrument Tip Detection

Authors: Zijian Wu, Shuojue Yang, Yueming Jin, Septimiu E Salcudean
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09700
Pdf URL: https://arxiv.org/pdf/2504.09700
Copy Paste: [[2504.09700]] ToolTipNet: A Segmentation-Driven Deep Learning Baseline for Surgical Instrument Tip Detection(https://arxiv.org/abs/2504.09700)
Keywords: foundation model
Abstract: In robot-assisted laparoscopic radical prostatectomy (RALP), the location of the instrument tip is important to register the ultrasound frame with the laparoscopic camera frame. A long-standing limitation is that the instrument tip position obtained from the da Vinci API is inaccurate and requires hand-eye calibration. Thus, directly computing the position of the tool tip in the camera frame using the vision-based method becomes an attractive solution. Besides, surgical instrument tip detection is the key component of other tasks, like surgical skill assessment and surgery automation. However, this task is challenging due to the small size of the tool tip and the articulation of the surgical instrument. Surgical instrument segmentation becomes relatively easy due to the emergence of the Segmentation Foundation Model, i.e., Segment Anything. Based on this advancement, we explore the deep learning-based surgical instrument tip detection approach that takes the part-level instrument segmentation mask as input. Comparison experiments with a hand-crafted image-processing approach demonstrate the superiority of the proposed method on simulated and real datasets.

Title: Dynamical symmetries in the fluctuation-driven regime: an application of Noether's theorem to noisy dynamical systems

Authors: John J. Vastola
Subjects: cs.LG, cond-mat.stat-mech
Abstract URL: https://arxiv.org/abs/2504.09761
Pdf URL: https://arxiv.org/pdf/2504.09761
Copy Paste: [[2504.09761]] Dynamical symmetries in the fluctuation-driven regime: an application of Noether's theorem to noisy dynamical systems(https://arxiv.org/abs/2504.09761)
Keywords: diffusion, generative
Abstract: Noether's theorem provides a powerful link between continuous symmetries and conserved quantities for systems governed by some variational principle. Perhaps unfortunately, most dynamical systems of interest in neuroscience and artificial intelligence cannot be described by any such principle. On the other hand, nonequilibrium physics provides a variational principle that describes how fairly generic noisy dynamical systems are most likely to transition between two states; in this work, we exploit this principle to apply Noether's theorem, and hence learn about how the continuous symmetries of dynamical systems constrain their most likely trajectories. We identify analogues of the conservation of energy, momentum, and angular momentum, and briefly discuss examples of each in the context of models of decision-making, recurrent neural networks, and diffusion generative models.

Title: Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems

Authors: Zaid Khan, Elias Stengel-Eskin, Archiki Prasad, Jaemin Cho, Mohit Bansal
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09763
Pdf URL: https://arxiv.org/pdf/2504.09763
Copy Paste: [[2504.09763]] Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems(https://arxiv.org/abs/2504.09763)
Keywords: generative
Abstract: Scientists often infer abstract procedures from specific instances of problems and use the abstractions to generate new, related instances. For example, programs encoding the formal rules and properties of a system have been useful in fields ranging from RL (procedural environments) to physics (simulation engines). These programs can be seen as functions which execute to different outputs based on their parameterizations (e.g., gridworld configuration or initial physical conditions). We introduce the term EFA (Executable Functional Abstraction) to denote such programs for math problems. EFA-like constructs have been shown to be useful for math reasoning as problem generators for stress-testing models. However, prior work has been limited to abstractions for grade-school math (whose simple rules are easy to encode in programs), while generating EFAs for advanced math has thus far required human engineering. We explore the automatic construction of EFAs for advanced math problems. We operationalize the task of automatically constructing EFAs as a program synthesis task, and develop EFAGen, which conditions an LLM on a seed math problem and its step-by-step solution to generate candidate EFA programs that are faithful to the generalized problem and solution class underlying the seed problem. Furthermore, we formalize properties any valid EFA must possess in terms of executable unit tests, and show how the tests can be used as verifiable rewards to train LLMs to become better writers of EFAs. We demonstrate that EFAs constructed by EFAGen behave rationally by remaining faithful to seed problems, produce learnable problem variations, and that EFAGen can infer EFAs across multiple diverse sources of competition-level math problems. Finally, we show downstream uses of model-written EFAs e.g. finding problem variations that are harder or easier for a learner to solve, as well as data generation.

Title: EquiVDM: Equivariant Video Diffusion Models with Temporally Consistent Noise

Authors: Chao Liu, Arash Vahdat
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.09789
Pdf URL: https://arxiv.org/pdf/2504.09789
Copy Paste: [[2504.09789]] EquiVDM: Equivariant Video Diffusion Models with Temporally Consistent Noise(https://arxiv.org/abs/2504.09789)
Keywords: diffusion
Abstract: Temporally consistent video-to-video generation is essential for applications of video diffusion models in areas such as sim-to-real, style-transfer, video upsampling, etc. In this paper, we propose a video diffusion framework that leverages temporally consistent noise to generate coherent video frames without specialized modules or additional constraints. We show that the standard training objective of diffusion models, when applied with temporally consistent noise, encourages the model to be equivariant to spatial transformations in input video and noise. This enables our model to better follow motion patterns from the input video, producing aligned motion and high-fidelity frames. Furthermore, we extend our approach to 3D-consistent video generation by attaching noise as textures on 3D meshes, ensuring 3D consistency in sim-to-real applications. Experimental results demonstrate that our method surpasses state-of-the-art baselines in motion alignment, 3D consistency, and video quality while requiring only a few sampling steps in practice.

Title: VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents

Authors: Ryota Tanaka, Taichi Iki, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito, Jun Suzuki
Subjects: cs.CL, cs.AI, cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2504.09795
Pdf URL: https://arxiv.org/pdf/2504.09795
Copy Paste: [[2504.09795]] VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents(https://arxiv.org/abs/2504.09795)
Keywords: self-supervised
Abstract: We aim to develop a retrieval-augmented generation (RAG) framework that answers questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we introduce a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format to prevent missing information that occurs by parsing documents to obtain text. To improve the performance, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and has strong generalization capability, highlighting the potential of an effective RAG paradigm for real-world documents.

Title: Enhanced Semantic Extraction and Guidance for UGC Image Super Resolution

Authors: Yiwen Wang, Ying Liang, Yuxuan Zhang, Xinning Chai, Zhengxue Cheng, Yingsheng Qin, Yucai Yang, Rong Xie, Li Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.09887
Pdf URL: https://arxiv.org/pdf/2504.09887
Copy Paste: [[2504.09887]] Enhanced Semantic Extraction and Guidance for UGC Image Super Resolution(https://arxiv.org/abs/2504.09887)
Keywords: diffusion
Abstract: Due to the disparity between real-world degradations in user-generated content(UGC) images and synthetic degradations, traditional super-resolution methods struggle to generalize effectively, necessitating a more robust approach to model real-world distortions. In this paper, we propose a novel approach to UGC image super-resolution by integrating semantic guidance into a diffusion framework. Our method addresses the inconsistency between degradations in wild and synthetic datasets by separately simulating the degradation processes on the LSDIR dataset and combining them with the official paired training set. Furthermore, we enhance degradation removal and detail generation by incorporating a pretrained semantic extraction model (SAM2) and fine-tuning key hyperparameters for improved perceptual fidelity. Extensive experiments demonstrate the superiority of our approach against state-of-the-art methods. Additionally, the proposed model won second place in the CVPR NTIRE 2025 Short-form UGC Image Super-Resolution Challenge, further validating its effectiveness. The code is available at https://github.c10pom/Moonsofang/NTIRE-2025-SRlab.

Title: TWSSenti: A Novel Hybrid Framework for Topic-Wise Sentiment Analysis on Social Media Using Transformer Models

Authors: Aish Albladi, Md Kaosar Uddin, Minarul Islam, Cheryl Seals
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09896
Pdf URL: https://arxiv.org/pdf/2504.09896
Copy Paste: [[2504.09896]] TWSSenti: A Novel Hybrid Framework for Topic-Wise Sentiment Analysis on Social Media Using Transformer Models(https://arxiv.org/abs/2504.09896)
Keywords: generative
Abstract: Sentiment analysis is a crucial task in natural language processing (NLP) that enables the extraction of meaningful insights from textual data, particularly from dynamic platforms like Twitter and IMDB. This study explores a hybrid framework combining transformer-based models, specifically BERT, GPT-2, RoBERTa, XLNet, and DistilBERT, to improve sentiment classification accuracy and robustness. The framework addresses challenges such as noisy data, contextual ambiguity, and generalization across diverse datasets by leveraging the unique strengths of these models. BERT captures bidirectional context, GPT-2 enhances generative capabilities, RoBERTa optimizes contextual understanding with larger corpora and dynamic masking, XLNet models dependency through permutation-based learning, and DistilBERT offers efficiency with reduced computational overhead while maintaining high accuracy. We demonstrate text cleaning, tokenization, and feature extraction using Term Frequency Inverse Document Frequency (TF-IDF) and Bag of Words (BoW), ensure high-quality input data for the models. The hybrid approach was evaluated on benchmark datasets Sentiment140 and IMDB, achieving superior accuracy rates of 94\% and 95\%, respectively, outperforming standalone models. The results validate the effectiveness of combining multiple transformer models in ensemble-like setups to address the limitations of individual architectures. This research highlights its applicability to real-world tasks such as social media monitoring, customer sentiment analysis, and public opinion tracking which offers a pathway for future advancements in hybrid NLP frameworks.

Title: Learning to Erase Private Knowledge from Multi-Documents for Retrieval-Augmented Large Language Models

Authors: Yujing Wang, Hainan Zhang, Liang Pang, Yongxin Tong, Binghui Guo, Hongwei Zheng, Zhiming Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.09910
Pdf URL: https://arxiv.org/pdf/2504.09910
Copy Paste: [[2504.09910]] Learning to Erase Private Knowledge from Multi-Documents for Retrieval-Augmented Large Language Models(https://arxiv.org/abs/2504.09910)
Keywords: generative
Abstract: Retrieval-Augmented Generation (RAG) is a promising technique for applying LLMs to proprietary domains. However, retrieved documents may contain sensitive knowledge, posing risks of privacy leakage in generative results. Thus, effectively erasing private information from retrieved documents is a key challenge for RAG. Unlike traditional text anonymization, RAG should consider: (1) the inherent multi-document reasoning may face de-anonymization attacks; (2) private knowledge varies by scenarios, so users should be allowed to customize which information to erase; (3) preserving sufficient publicly available knowledge for generation tasks. This paper introduces the privacy erasure task for RAG and proposes Eraser4RAG, a private knowledge eraser which effectively removes user-defined private knowledge from documents while preserving sufficient public knowledge for generation. Specifically, we first construct a global knowledge graph to identify potential knowledge across documents, aiming to defend against de-anonymization attacks. Then we randomly split it into private and public sub-graphs, and fine-tune Flan-T5 to rewrite the retrieved documents excluding private triples. Finally, PPO algorithm optimizes the rewriting model to minimize private triples and maximize public triples retention. Experiments on four QA datasets demonstrate that Eraser4RAG achieves superior erase performance than GPT-4o.

Title: Enhancing Multi-task Learning Capability of Medical Generalist Foundation Model via Image-centric Multi-annotation Data

Authors: Xun Zhu, Fanbin Mo, Zheng Zhang, Jiaxi Wang, Yiming Shi, Ming Wu, Chuang Zhang, Miao Li, Ji Wu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.09967
Pdf URL: https://arxiv.org/pdf/2504.09967
Copy Paste: [[2504.09967]] Enhancing Multi-task Learning Capability of Medical Generalist Foundation Model via Image-centric Multi-annotation Data(https://arxiv.org/abs/2504.09967)
Keywords: foundation model
Abstract: The emergence of medical generalist foundation models has revolutionized conventional task-specific model development paradigms, aiming to better handle multiple tasks through joint training on large-scale medical datasets. However, recent advances prioritize simple data scaling or architectural component enhancement, while neglecting to re-examine multi-task learning from a data-centric perspective. Critically, simply aggregating existing data resources leads to decentralized image-task alignment, which fails to cultivate comprehensive image understanding or align with clinical needs for multi-dimensional image interpretation. In this paper, we introduce the image-centric multi-annotation X-ray dataset (IMAX), the first attempt to enhance the multi-task learning capabilities of medical multi-modal large language models (MLLMs) from the data construction level. To be specific, IMAX is featured from the following attributes: 1) High-quality data curation. A comprehensive collection of more than 354K entries applicable to seven different medical tasks. 2) Image-centric dense annotation. Each X-ray image is associated with an average of 4.10 tasks and 7.46 training entries, ensuring multi-task representation richness per image. Compared to the general decentralized multi-annotation X-ray dataset (DMAX), IMAX consistently demonstrates significant multi-task average performance gains ranging from 3.20% to 21.05% across seven open-source state-of-the-art medical MLLMs. Moreover, we investigate differences in statistical patterns exhibited by IMAX and DMAX training processes, exploring potential correlations between optimization dynamics and multi-task performance. Finally, leveraging the core concept of IMAX data construction, we propose an optimized DMAX-based training strategy to alleviate the dilemma of obtaining high-quality IMAX data in practical scenarios.

Title: GaussVideoDreamer: 3D Scene Generation with Video Diffusion and Inconsistency-Aware Gaussian Splatting

Authors: Junlin Hao, Peiheng Wang, Haoyang Wang, Xinggong Zhang, Zongming Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10001
Pdf URL: https://arxiv.org/pdf/2504.10001
Copy Paste: [[2504.10001]] GaussVideoDreamer: 3D Scene Generation with Video Diffusion and Inconsistency-Aware Gaussian Splatting(https://arxiv.org/abs/2504.10001)
Keywords: diffusion, generative
Abstract: Single-image 3D scene reconstruction presents significant challenges due to its inherently ill-posed nature and limited input constraints. Recent advances have explored two promising directions: multiview generative models that train on 3D consistent datasets but struggle with out-of-distribution generalization, and 3D scene inpainting and completion frameworks that suffer from cross-view inconsistency and suboptimal error handling, as they depend exclusively on depth data or 3D smoothness, which ultimately degrades output quality and computational performance. Building upon these approaches, we present GaussVideoDreamer, which advances generative multimedia approaches by bridging the gap between image, video, and 3D generation, integrating their strengths through two key innovations: (1) A progressive video inpainting strategy that harnesses temporal coherence for improved multiview consistency and faster convergence. (2) A 3D Gaussian Splatting consistency mask to guide the video diffusion with 3D consistent multiview evidence. Our pipeline combines three core components: a geometry-aware initialization protocol, Inconsistency-Aware Gaussian Splatting, and a progressive video inpainting strategy. Experimental results demonstrate that our approach achieves 32% higher LLaVA-IQA scores and at least 2x speedup compared to existing methods while maintaining robust performance across diverse scenes.

Title: Investigating the Role of Bilateral Symmetry for Inpainting Brain MRI

Authors: Sergey Kuznetsov, Sanduni Pinnawala, Peter A. Wijeratne, Ivor J. A. Simpson
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10039
Pdf URL: https://arxiv.org/pdf/2504.10039
Copy Paste: [[2504.10039]] Investigating the Role of Bilateral Symmetry for Inpainting Brain MRI(https://arxiv.org/abs/2504.10039)
Keywords: diffusion, anomaly
Abstract: Inpainting has recently emerged as a valuable and interesting technology to employ in the analysis of medical imaging data, in particular brain MRI. A wide variety of methodologies for inpainting MRI have been proposed and demonstrated on tasks including anomaly detection. In this work we investigate the statistical relationship between inpainted brain structures and the amount of subject-specific conditioning information, i.e. the other areas of the image that are masked. In particular, we analyse the distribution of inpainting results when masking additional regions of the image, specifically the contra-lateral structure. This allows us to elucidate where in the brain the model is drawing information from, and in particular, what is the importance of hemispherical symmetry? Our experiments interrogate a diffusion inpainting model through analysing the inpainting of subcortical brain structures based on intensity and estimated area change. We demonstrate that some structures show a strong influence of symmetry in the conditioning of the inpainting process.

Title: A Computational Cognitive Model for Processing Repetitions of Hierarchical Relations

Authors: Zeng Ren, Xinyi Guan, Martin Rohrmeier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2504.10065
Pdf URL: https://arxiv.org/pdf/2504.10065
Copy Paste: [[2504.10065]] A Computational Cognitive Model for Processing Repetitions of Hierarchical Relations(https://arxiv.org/abs/2504.10065)
Keywords: generative
Abstract: Patterns are fundamental to human cognition, enabling the recognition of structure and regularity across diverse domains. In this work, we focus on structural repeats, patterns that arise from the repetition of hierarchical relations within sequential data, and develop a candidate computational model of how humans detect and understand such structural repeats. Based on a weighted deduction system, our model infers the minimal generative process of a given sequence in the form of a Template program, a formalism that enriches the context-free grammar with repetition combinators. Such representation efficiently encodes the repetition of sub-computations in a recursive manner. As a proof of concept, we demonstrate the expressiveness of our model on short sequences from music and action planning. The proposed model offers broader insights into the mental representations and cognitive mechanisms underlying human pattern recognition.

Title: AGO: Adaptive Grounding for Open World 3D Occupancy Prediction

Authors: Peizheng Li, Shuxiao Ding, You Zhou, Qingwen Zhang, Onat Inak, Larissa Triess, Niklas Hanselmann, Marius Cordts, Andreas Zell
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10117
Pdf URL: https://arxiv.org/pdf/2504.10117
Copy Paste: [[2504.10117]] AGO: Adaptive Grounding for Open World 3D Occupancy Prediction(https://arxiv.org/abs/2504.10117)
Keywords: self-supervised
Abstract: Open-world 3D semantic occupancy prediction aims to generate a voxelized 3D representation from sensor inputs while recognizing both known and unknown objects. Transferring open-vocabulary knowledge from vision-language models (VLMs) offers a promising direction but remains challenging. However, methods based on VLM-derived 2D pseudo-labels with traditional supervision are limited by a predefined label space and lack general prediction capabilities. Direct alignment with pretrained image embeddings, on the other hand, fails to achieve reliable performance due to often inconsistent image and text representations in VLMs. To address these challenges, we propose AGO, a novel 3D occupancy prediction framework with adaptive grounding to handle diverse open-world scenarios. AGO first encodes surrounding images and class prompts into 3D and text embeddings, respectively, leveraging similarity-based grounding training with 3D pseudo-labels. Additionally, a modality adapter maps 3D embeddings into a space aligned with VLM-derived image embeddings, reducing modality gaps. Experiments on Occ3D-nuScenes show that AGO improves unknown object prediction in zero-shot and few-shot transfer while achieving state-of-the-art closed-world self-supervised performance, surpassing prior methods by 4.09 mIoU.

Title: Hierarchical and Step-Layer-Wise Tuning of Attention Specialty for Multi-Instance Synthesis in Diffusion Transformers

Authors: Chunyang Zhang, Zhenhong Sun, Zhicheng Zhang, Junyan Wang, Yu Zhang, Dong Gong, Huadong Mo, Daoyi Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10148
Pdf URL: https://arxiv.org/pdf/2504.10148
Copy Paste: [[2504.10148]] Hierarchical and Step-Layer-Wise Tuning of Attention Specialty for Multi-Instance Synthesis in Diffusion Transformers(https://arxiv.org/abs/2504.10148)
Keywords: diffusion
Abstract: Text-to-image (T2I) generation models often struggle with multi-instance synthesis (MIS), where they must accurately depict multiple distinct instances in a single image based on complex prompts detailing individual features. Traditional MIS control methods for UNet architectures like SD v1.5/SDXL fail to adapt to DiT-based models like FLUX and SD v3.5, which rely on integrated attention between image and text tokens rather than text-image cross-attention. To enhance MIS in DiT, we first analyze the mixed attention mechanism in DiT. Our token-wise and layer-wise analysis of attention maps reveals a hierarchical response structure: instance tokens dominate early layers, background tokens in middle layers, and attribute tokens in later layers. Building on this observation, we propose a training-free approach for enhancing MIS in DiT-based models with hierarchical and step-layer-wise attention specialty tuning (AST). AST amplifies key regions while suppressing irrelevant areas in distinct attention maps across layers and steps, guided by the hierarchical structure. This optimizes multimodal interactions by hierarchically decoupling the complex prompts with instance-based sketches. We evaluate our approach using upgraded sketch-based layouts for the T2I-CompBench and customized complex scenes. Both quantitative and qualitative results confirm our method enhances complex layout generation, ensuring precise instance placement and attribute representation in MIS.

Title: Efficient Generative Model Training via Embedded Representation Warmup

Authors: Deyuan Liu, Peng Sun, Xufeng Li, Tao Lin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.10188
Pdf URL: https://arxiv.org/pdf/2504.10188
Copy Paste: [[2504.10188]] Efficient Generative Model Training via Embedded Representation Warmup(https://arxiv.org/abs/2504.10188)
Keywords: diffusion, self-supervised, generative
Abstract: Diffusion models excel at generating high-dimensional data but fall short in training efficiency and representation quality compared to self-supervised methods. We identify a key bottleneck: the underutilization of high-quality, semantically rich representations during training notably slows down convergence. Our systematic analysis reveals a critical representation processing region -- primarily in the early layers -- where semantic and structural pattern learning takes place before generation can occur. To address this, we propose Embedded Representation Warmup (ERW), a plug-and-play framework where in the first stage we get the ERW module serves as a warmup that initializes the early layers of the diffusion model with high-quality, pretrained representations. This warmup minimizes the burden of learning representations from scratch, thereby accelerating convergence and boosting performance. Our theoretical analysis demonstrates that ERW's efficacy depends on its precise integration into specific neural network layers -- termed the representation processing region -- where the model primarily processes and transforms feature representations for later generation. We further establish that ERW not only accelerates training convergence but also enhances representation quality: empirically, our method achieves a 40$\times$ acceleration in training speed compared to REPA, the current state-of-the-art methods. Code is available at this https URL.

Title: A Model Zoo of Vision Transformers

Authors: Damian Falk, Léo Meynent, Florence Pfammatter, Konstantin Schürholt, Damian Borth
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.10231
Pdf URL: https://arxiv.org/pdf/2504.10231
Copy Paste: [[2504.10231]] A Model Zoo of Vision Transformers(https://arxiv.org/abs/2504.10231)
Keywords: generative
Abstract: The availability of large, structured populations of neural networks - called 'model zoos' - has led to the development of a multitude of downstream tasks ranging from model analysis, to representation learning on model weights or generative modeling of neural network parameters. However, existing model zoos are limited in size and architecture and neglect the transformer, which is among the currently most successful neural network architectures. We address this gap by introducing the first model zoo of vision transformers (ViT). To better represent recent training approaches, we develop a new blueprint for model zoo generation that encompasses both pre-training and fine-tuning steps, and publish 250 unique models. They are carefully generated with a large span of generating factors, and their diversity is validated using a thorough choice of weight-space and behavioral metrics. To further motivate the utility of our proposed dataset, we suggest multiple possible applications grounded in both extensive exploratory experiments and a number of examples from the existing literature. By extending previous lines of similar work, our model zoo allows researchers to push their model population-based methods from the small model regime to state-of-the-art architectures. We make our model zoo available at this http URL.

Title: DiffMOD: Progressive Diffusion Point Denoising for Moving Object Detection in Remote Sensing

Authors: Jinyue Zhang, Xiangrong Zhang, Zhongjian Huang, Tianyang Zhang, Yifei Jiang, Licheng Jiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10278
Pdf URL: https://arxiv.org/pdf/2504.10278
Copy Paste: [[2504.10278]] DiffMOD: Progressive Diffusion Point Denoising for Moving Object Detection in Remote Sensing(https://arxiv.org/abs/2504.10278)
Keywords: diffusion
Abstract: Moving object detection (MOD) in remote sensing is significantly challenged by low resolution, extremely small object sizes, and complex noise interference. Current deep learning-based MOD methods rely on probability density estimation, which restricts flexible information interaction between objects and across temporal frames. To flexibly capture high-order inter-object and temporal relationships, we propose a point-based MOD in remote sensing. Inspired by diffusion models, the network optimization is formulated as a progressive denoising process that iteratively recovers moving object centers from sparse noisy points. Specifically, we sample scattered features from the backbone outputs as atomic units for subsequent processing, while global feature embeddings are aggregated to compensate for the limited coverage of sparse point features. By modeling spatial relative positions and semantic affinities, Spatial Relation Aggregation Attention is designed to enable high-order interactions among point-level features for enhanced object representation. To enhance temporal consistency, the Temporal Propagation and Global Fusion module is designed, which leverages an implicit memory reasoning mechanism for robust cross-frame feature integration. To align with the progressive denoising process, we propose a progressive MinK optimal transport assignment strategy that establishes specialized learning objectives at each denoising level. Additionally, we introduce a missing loss function to counteract the clustering tendency of denoised points around salient objects. Experiments on the RsData remote sensing MOD dataset show that our MOD method based on scattered point denoising can more effectively explore potential relationships between sparse moving objects and improve the detection capability and temporal consistency.

Title: $α$-Flow: A Unified Framework for Continuous-State Discrete Flow Matching Models

Authors: Chaoran Cheng, Jiahan Li, Jiajun Fan, Ge Liu
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2504.10283
Pdf URL: https://arxiv.org/pdf/2504.10283
Copy Paste: [[2504.10283]] $α$-Flow: A Unified Framework for Continuous-State Discrete Flow Matching Models(https://arxiv.org/abs/2504.10283)
Keywords: generative
Abstract: Recent efforts have extended the flow-matching framework to discrete generative modeling. One strand of models directly works with the continuous probabilities instead of discrete tokens, which we colloquially refer to as Continuous-State Discrete Flow Matching (CS-DFM). Existing CS-DFM models differ significantly in their representations and geometric assumptions. This work presents a unified framework for CS-DFM models, under which the existing variants can be understood as operating on different $\alpha$-representations of probabilities. Building upon the theory of information geometry, we introduce $\alpha$-Flow, a family of CS-DFM models that adheres to the canonical $\alpha$-geometry of the statistical manifold, and demonstrate its optimality in minimizing the generalized kinetic energy. Theoretically, we show that the flow matching loss for $\alpha$-flow establishes a unified variational bound for the discrete negative log-likelihood. We comprehensively evaluate different instantiations of $\alpha$-flow on various discrete generation domains to demonstrate their effectiveness in discrete generative modeling, including intermediate values whose geometries have never been explored before. $\alpha$-flow significantly outperforms its discrete-state counterpart in image and protein sequence generation and better captures the entropy in language modeling.

Title: Noise2Ghost: Self-supervised deep convolutional reconstruction for ghost imaging

Authors: Mathieu Manni, Dmitry Karpov, K. Joost Batenburg, Sharon Shwartz, Nicola Viganò
Subjects: cs.CV, cs.LG, physics.data-an
Abstract URL: https://arxiv.org/abs/2504.10288
Pdf URL: https://arxiv.org/pdf/2504.10288
Copy Paste: [[2504.10288]] Noise2Ghost: Self-supervised deep convolutional reconstruction for ghost imaging(https://arxiv.org/abs/2504.10288)
Keywords: self-supervised
Abstract: We present a new self-supervised deep-learning-based Ghost Imaging (GI) reconstruction method, which provides unparalleled reconstruction performance for noisy acquisitions among unsupervised methods. We present the supporting mathematical framework and results from theoretical and real data use cases. Self-supervision removes the need for clean reference data while offering strong noise reduction. This provides the necessary tools for addressing signal-to-noise ratio concerns for GI acquisitions in emerging and cutting-edge low-light GI scenarios. Notable examples include micro- and nano-scale x-ray emission imaging, e.g., x-ray fluorescence imaging of dose-sensitive samples. Their applications include in-vivo and in-operando case studies for biological samples and batteries.

Title: Analysis of Attention in Video Diffusion Transformers

Authors: Yuxin Wen, Jim Wu, Ajay Jain, Tom Goldstein, Ashwinee Panda
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10317
Pdf URL: https://arxiv.org/pdf/2504.10317
Copy Paste: [[2504.10317]] Analysis of Attention in Video Diffusion Transformers(https://arxiv.org/abs/2504.10317)
Keywords: diffusion
Abstract: We conduct an in-depth analysis of attention in video diffusion transformers (VDiTs) and report a number of novel findings. We identify three key properties of attention in VDiTs: Structure, Sparsity, and Sinks. Structure: We observe that attention patterns across different VDiTs exhibit similar structure across different prompts, and that we can make use of the similarity of attention patterns to unlock video editing via self-attention map transfer. Sparse: We study attention sparsity in VDiTs, finding that proposed sparsity methods do not work for all VDiTs, because some layers that are seemingly sparse cannot be sparsified. Sinks: We make the first study of attention sinks in VDiTs, comparing and contrasting them to attention sinks in language models. We propose a number of future directions that can make use of our insights to improve the efficiency-quality Pareto frontier for VDiTs.

Title: SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model

Authors: Zongcan Ding, Haodong Zhang, Peng Wu, Guansong Pang, Zhiwei Yang, Peng Wang, Yanning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10320
Pdf URL: https://arxiv.org/pdf/2504.10320
Copy Paste: [[2504.10320]] SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model(https://arxiv.org/abs/2504.10320)
Keywords: anomaly
Abstract: Video anomaly detection (VAD) aims to identify unexpected events in videos and has wide applications in safety-critical domains. While semi-supervised methods trained on only normal samples have gained traction, they often suffer from high false alarm rates and poor interpretability. Recently, vision-language models (VLMs) have demonstrated strong multimodal reasoning capabilities, offering new opportunities for explainable anomaly detection. However, their high computational cost and lack of domain adaptation hinder real-time deployment and reliability. Inspired by dual complementary pathways in human visual perception, we propose SlowFastVAD, a hybrid framework that integrates a fast anomaly detector with a slow anomaly detector (namely a retrieval augmented generation (RAG) enhanced VLM), to address these limitations. Specifically, the fast detector first provides coarse anomaly confidence scores, and only a small subset of ambiguous segments, rather than the entire video, is further analyzed by the slower yet more interpretable VLM for elaborate detection and reasoning. Furthermore, to adapt VLMs to domain-specific VAD scenarios, we construct a knowledge base including normal patterns based on few normal samples and abnormal patterns inferred by VLMs. During inference, relevant patterns are retrieved and used to augment prompts for anomaly reasoning. Finally, we smoothly fuse the anomaly confidence of fast and slow detectors to enhance robustness of anomaly detection. Extensive experiments on four benchmarks demonstrate that SlowFastVAD effectively combines the strengths of both fast and slow detectors, and achieves remarkable detection accuracy and interpretability with significantly reduced computational overhead, making it well-suited for real-world VAD applications with high reliability requirements.

Title: LL-Gaussian: Low-Light Scene Reconstruction and Enhancement via Gaussian Splatting for Novel View Synthesis

Authors: Hao Sun, Fenggen Yu, Huiyao Xu, Tao Zhang, Changqing Zou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10331
Pdf URL: https://arxiv.org/pdf/2504.10331
Copy Paste: [[2504.10331]] LL-Gaussian: Low-Light Scene Reconstruction and Enhancement via Gaussian Splatting for Novel View Synthesis(https://arxiv.org/abs/2504.10331)
Keywords: diffusion
Abstract: Novel view synthesis (NVS) in low-light scenes remains a significant challenge due to degraded inputs characterized by severe noise, low dynamic range (LDR) and unreliable initialization. While recent NeRF-based approaches have shown promising results, most suffer from high computational costs, and some rely on carefully captured or pre-processed data--such as RAW sensor inputs or multi-exposure sequences--which severely limits their practicality. In contrast, 3D Gaussian Splatting (3DGS) enables real-time rendering with competitive visual fidelity; however, existing 3DGS-based methods struggle with low-light sRGB inputs, resulting in unstable Gaussian initialization and ineffective noise suppression. To address these challenges, we propose LL-Gaussian, a novel framework for 3D reconstruction and enhancement from low-light sRGB images, enabling pseudo normal-light novel view synthesis. Our method introduces three key innovations: 1) an end-to-end Low-Light Gaussian Initialization Module (LLGIM) that leverages dense priors from learning-based MVS approach to generate high-quality initial point clouds; 2) a dual-branch Gaussian decomposition model that disentangles intrinsic scene properties (reflectance and illumination) from transient interference, enabling stable and interpretable optimization; 3) an unsupervised optimization strategy guided by both physical constrains and diffusion prior to jointly steer decomposition and enhancement. Additionally, we contribute a challenging dataset collected in extreme low-light environments and demonstrate the effectiveness of LL-Gaussian. Compared to state-of-the-art NeRF-based methods, LL-Gaussian achieves up to 2,000 times faster inference and reduces training time to just 2%, while delivering superior reconstruction and rendering quality.

Title: Multimodal Representation Learning Techniques for Comprehensive Facial State Analysis

Authors: Kaiwen Zheng, Xuri Ge, Junchen Fu, Jun Peng, Joemon M. Jose
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10351
Pdf URL: https://arxiv.org/pdf/2504.10351
Copy Paste: [[2504.10351]] Multimodal Representation Learning Techniques for Comprehensive Facial State Analysis(https://arxiv.org/abs/2504.10351)
Keywords: foundation model
Abstract: Multimodal foundation models have significantly improved feature representation by integrating information from multiple modalities, making them highly suitable for a broader set of applications. However, the exploration of multimodal facial representation for understanding perception has been limited. Understanding and analyzing facial states, such as Action Units (AUs) and emotions, require a comprehensive and robust framework that bridges visual and linguistic modalities. In this paper, we present a comprehensive pipeline for multimodal facial state analysis. First, we compile a new Multimodal Face Dataset (MFA) by generating detailed multilevel language descriptions of face, incorporating Action Unit (AU) and emotion descriptions, by leveraging GPT-4o. Second, we introduce a novel Multilevel Multimodal Face Foundation model (MF^2) tailored for Action Unit (AU) and emotion recognition. Our model incorporates comprehensive visual feature modeling at both local and global levels of face image, enhancing its ability to represent detailed facial appearances. This design aligns visual representations with structured AU and emotion descriptions, ensuring effective cross-modal integration. Third, we develop a Decoupled Fine-Tuning Network (DFN) that efficiently adapts MF^2 across various tasks and datasets. This approach not only reduces computational overhead but also broadens the applicability of the foundation model to diverse scenarios. Experimentation show superior performance for AU and emotion detection tasks.

Title: Satellite Federated Fine-Tuning for Foundation Models in Space Computing Power Networks

Authors: Yan zhu, Jingyang zhu, Ting Wang, Yuanming Shi, Chunxiao Jiang, Khaled Ben Letaief
Subjects: cs.LG, cs.DC, cs.NI
Abstract URL: https://arxiv.org/abs/2504.10403
Pdf URL: https://arxiv.org/pdf/2504.10403
Copy Paste: [[2504.10403]] Satellite Federated Fine-Tuning for Foundation Models in Space Computing Power Networks(https://arxiv.org/abs/2504.10403)
Keywords: foundation model
Abstract: Advancements in artificial intelligence (AI) and low-earth orbit (LEO) satellites have promoted the application of large remote sensing foundation models for various downstream tasks. However, direct downloading of these models for fine-tuning on the ground is impeded by privacy concerns and limited bandwidth. Satellite federated learning (FL) offers a solution by enabling model fine-tuning directly on-board satellites and aggregating model updates without data downloading. Nevertheless, for large foundation models, the computational capacity of satellites is insufficient to support effective on-board fine-tuning in traditional satellite FL frameworks. To address these challenges, we propose a satellite-ground collaborative federated fine-tuning framework. The key of the framework lies in how to reasonably decompose and allocate model components to alleviate insufficient on-board computation capabilities. During fine-tuning, satellites exchange intermediate results with ground stations or other satellites for forward propagation and back propagation, which brings communication challenges due to the special communication topology of space transmission networks, such as intermittent satellite-ground communication, short duration of satellite-ground communication windows, and unstable inter-orbit inter-satellite links (ISLs). To reduce transmission delays, we further introduce tailored communication strategies that integrate both communication and computing resources. Specifically, we propose a parallel intra-orbit communication strategy, a topology-aware satellite-ground communication strategy, and a latency-minimalization inter-orbit communication strategy to reduce space communication costs. Simulation results demonstrate significant reductions in training time with improvements of approximately 33%.

Title: Foundation models for electronic health records: representation dynamics and transferability

Authors: Michael C. Burkhart, Bashar Ramadan, Zewei Liao, Kaveri Chhikara, Juan C. Rojas, William F. Parker, Brett K. Beaulieu-Jones
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2504.10422
Pdf URL: https://arxiv.org/pdf/2504.10422
Copy Paste: [[2504.10422]] Foundation models for electronic health records: representation dynamics and transferability(https://arxiv.org/abs/2504.10422)
Keywords: foundation model
Abstract: Foundation models (FMs) trained on electronic health records (EHRs) have shown strong performance on a range of clinical prediction tasks. However, adapting these models to local health systems remains challenging due to limited data availability and resource constraints. In this study, we investigated what these models learn and evaluated the transferability of an FM trained on MIMIC-IV to an institutional EHR dataset at the University of Chicago Medical Center. We assessed their ability to identify outlier patients and examined representation-space patient trajectories in relation to future clinical outcomes. We also evaluated the performance of supervised fine-tuned classifiers on both source and target datasets. Our findings offer insights into the adaptability of FMs across different healthcare systems, highlight considerations for their effective implementation, and provide an empirical analysis of the underlying factors that contribute to their predictive performance.

Title: MonoDiff9D: Monocular Category-Level 9D Object Pose Estimation via Diffusion Model

Authors: Jian Liu, Wei Sun, Hui Yang, Jin Zheng, Zichen Geng, Hossein Rahmani, Ajmal Mian
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2504.10433
Pdf URL: https://arxiv.org/pdf/2504.10433
Copy Paste: [[2504.10433]] MonoDiff9D: Monocular Category-Level 9D Object Pose Estimation via Diffusion Model(https://arxiv.org/abs/2504.10433)
Keywords: diffusion
Abstract: Object pose estimation is a core means for robots to understand and interact with their environment. For this task, monocular category-level methods are attractive as they require only a single RGB camera. However, current methods rely on shape priors or CAD models of the intra-class known objects. We propose a diffusion-based monocular category-level 9D object pose generation method, MonoDiff9D. Our motivation is to leverage the probabilistic nature of diffusion models to alleviate the need for shape priors, CAD models, or depth sensors for intra-class unknown object pose estimation. We first estimate coarse depth via DINOv2 from the monocular image in a zero-shot manner and convert it into a point cloud. We then fuse the global features of the point cloud with the input image and use the fused features along with the encoded time step to condition MonoDiff9D. Finally, we design a transformer-based denoiser to recover the object pose from Gaussian noise. Extensive experiments on two popular benchmark datasets show that MonoDiff9D achieves state-of-the-art monocular category-level 9D object pose estimation accuracy without the need for shape priors or CAD models at any stage. Our code will be made public at this https URL.

Title: Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing

Authors: Taihang Hu, Linxuan Li, Kai Wang, Yaxing Wang, Jian Yang, Ming-Ming Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10434
Pdf URL: https://arxiv.org/pdf/2504.10434
Copy Paste: [[2504.10434]] Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing(https://arxiv.org/abs/2504.10434)
Keywords: diffusion, generative
Abstract: Text-to-image generation has seen groundbreaking advancements with diffusion models, enabling high-fidelity synthesis and precise image editing through cross-attention manipulation. Recently, autoregressive (AR) models have re-emerged as powerful alternatives, leveraging next-token generation to match diffusion models. However, existing editing techniques designed for diffusion models fail to translate directly to AR models due to fundamental differences in structural control. Specifically, AR models suffer from spatial poverty of attention maps and sequential accumulation of structural errors during image editing, which disrupt object layouts and global consistency. In this work, we introduce Implicit Structure Locking (ISLock), the first training-free editing strategy for AR visual models. Rather than relying on explicit attention manipulation or fine-tuning, ISLock preserves structural blueprints by dynamically aligning self-attention patterns with reference images through the Anchor Token Matching (ATM) protocol. By implicitly enforcing structural consistency in latent space, our method ISLock enables structure-aware editing while maintaining generative autonomy. Extensive experiments demonstrate that ISLock achieves high-quality, structure-consistent edits without additional training and is superior or comparable to conventional editing techniques. Our findings pioneer the way for efficient and flexible AR-based image editing, further bridging the performance gap between diffusion and autoregressive generative models. The code will be publicly available at this https URL

Title: Art3D: Training-Free 3D Generation from Flat-Colored Illustration

Authors: Xiaoyan Cong, Jiayi Shen, Zekun Li, Rao Fu, Tao Lu, Srinath Sridhar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10466
Pdf URL: https://arxiv.org/pdf/2504.10466
Copy Paste: [[2504.10466]] Art3D: Training-Free 3D Generation from Flat-Colored Illustration(https://arxiv.org/abs/2504.10466)
Keywords: generative
Abstract: Large-scale pre-trained image-to-3D generative models have exhibited remarkable capabilities in diverse shape generations. However, most of them struggle to synthesize plausible 3D assets when the reference image is flat-colored like hand drawings due to the lack of 3D illusion, which are often the most user-friendly input modalities in art content creation. To this end, we propose Art3D, a training-free method that can lift flat-colored 2D designs into 3D. By leveraging structural and semantic features with pre- trained 2D image generation models and a VLM-based realism evaluation, Art3D successfully enhances the three-dimensional illusion in reference images, thus simplifying the process of generating 3D from 2D, and proves adaptable to a wide range of painting styles. To benchmark the generalization performance of existing image-to-3D models on flat-colored images without 3D feeling, we collect a new dataset, Flat-2D, with over 100 samples. Experimental results demonstrate the performance and robustness of Art3D, exhibiting superior generalizable capacity and promising practical applicability. Our source code and dataset will be publicly available on our project page: this https URL .

Title: REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers

Authors: Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, Liang Zheng
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.10483
Pdf URL: https://arxiv.org/pdf/2504.10483
Copy Paste: [[2504.10483]] REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers(https://arxiv.org/abs/2504.10483)
Keywords: diffusion
Abstract: In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that end-to-end training both VAE and diffusion-model using standard diffusion-loss is ineffective, even causing a degradation in final performance. We show that while diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss -- allowing both VAE and diffusion model to be jointly tuned during the training process. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance; speeding up diffusion model training by over 17x and 45x over REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself; leading to improved latent space structure and downstream generation performance. In terms of final performance, our approach sets a new state-of-the-art; achieving FID of 1.26 and 1.83 with and without classifier-free guidance on ImageNet 256 x 256. Code is available at this https URL.

Title: Decoupled Diffusion Sparks Adaptive Scene Generation

Authors: Yunsong Zhou, Naisheng Ye, William Ljungbergh, Tianyu Li, Jiazhi Yang, Zetong Yang, Hongzi Zhu, Christoffer Petersson, Hongyang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.10485
Pdf URL: https://arxiv.org/pdf/2504.10485
Copy Paste: [[2504.10485]] Decoupled Diffusion Sparks Adaptive Scene Generation(https://arxiv.org/abs/2504.10485)
Keywords: diffusion
Abstract: Controllable scene generation could reduce the cost of diverse data collection substantially for autonomous driving. Prior works formulate the traffic layout generation as predictive progress, either by denoising entire sequences at once or by iteratively predicting the next frame. However, full sequence denoising hinders online reaction, while the latter's short-sighted next-frame prediction lacks precise goal-state guidance. Further, the learned model struggles to generate complex or challenging scenarios due to a large number of safe and ordinal driving behaviors from open datasets. To overcome these, we introduce Nexus, a decoupled scene generation framework that improves reactivity and goal conditioning by simulating both ordinal and challenging scenarios from fine-grained tokens with independent noise states. At the core of the decoupled pipeline is the integration of a partial noise-masking training strategy and a noise-aware schedule that ensures timely environmental updates throughout the denoising process. To complement challenging scenario generation, we collect a dataset consisting of complex corner cases. It covers 540 hours of simulated data, including high-risk interactions such as cut-in, sudden braking, and collision. Nexus achieves superior generation realism while preserving reactivity and goal orientation, with a 40% reduction in displacement error. We further demonstrate that Nexus improves closed-loop planning by 20% through data augmentation and showcase its capability in safety-critical data generation.