2025-04-17

Title: LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation

Authors: Wei-Jer Chang, Wei Zhan, Masayoshi Tomizuka, Manmohan Chandraker, Francesco Pittaluga
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2504.11521
Pdf URL: https://arxiv.org/pdf/2504.11521
Copy Paste: [[2504.11521]] LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation(https://arxiv.org/abs/2504.11521)
Keywords: diffusion
Abstract: Evaluating autonomous vehicles with controllability enables scalable testing in counterfactual or structured settings, enhancing both efficiency and safety. We introduce LangTraj, a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios. By conditioning on natural language inputs, LangTraj provides flexible and intuitive control over interactive behaviors, generating nuanced and realistic scenarios. Unlike prior approaches that depend on domain-specific guidance functions, LangTraj incorporates language conditioning during training, facilitating more intuitive traffic simulation control. We propose a novel closed-loop training strategy for diffusion models, explicitly tailored to enhance stability and realism during closed-loop simulation. To support language-conditioned simulation, we develop Inter-Drive, a large-scale dataset with diverse and interactive labels for training language-conditioned diffusion models. Our dataset is built upon a scalable pipeline for annotating agent-agent interactions and single-agent behaviors, ensuring rich and varied supervision. Validated on the Waymo Motion Dataset, LangTraj demonstrates strong performance in realism, language controllability, and language-conditioned safety-critical simulation, establishing a new paradigm for flexible and scalable autonomous vehicle testing.

Title: Possibility for Proactive Anomaly Detection

Authors: Jinsung Jeon, Jaehyeon Park, Sewon Park, Jeongwhan Choi, Minjung Kim, Noseong Park
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11623
Pdf URL: https://arxiv.org/pdf/2504.11623
Copy Paste: [[2504.11623]] Possibility for Proactive Anomaly Detection(https://arxiv.org/abs/2504.11623)
Keywords: anomaly
Abstract: Time-series anomaly detection, which detects errors and failures in a workflow, is one of the most important topics in real-world applications. The purpose of time-series anomaly detection is to reduce potential damages or losses. However, existing anomaly detection models detect anomalies through the error between the model output and the ground truth (observed) value, which makes them impractical. In this work, we present a \textit{proactive} approach for time-series anomaly detection based on a time-series forecasting model specialized for anomaly detection and a data-driven anomaly detection model. Our proactive approach establishes an anomaly threshold from training data with a data-driven anomaly detection model, and anomalies are subsequently detected by identifying predicted values that exceed the anomaly threshold. In addition, we extensively evaluated the model using four anomaly detection benchmarks and analyzed both predictable and unpredictable anomalies. We attached the source code as supplementary material.

Title: Improving Instruct Models for Free: A Study on Partial Adaptation

Authors: Ozan İrsoy, Pengxiang Cheng, Jennifer L. Chen, Daniel Preoţiuc-Pietro, Shiyue Zhang, Duccio Pappadopulo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11626
Pdf URL: https://arxiv.org/pdf/2504.11626
Copy Paste: [[2504.11626]] Improving Instruct Models for Free: A Study on Partial Adaptation(https://arxiv.org/abs/2504.11626)
Keywords: in-context
Abstract: Instruct models, obtained from various instruction tuning or post-training steps, are commonly deemed superior and more usable than their base counterpart. While the model gains instruction following ability, instruction tuning may lead to forgetting the knowledge from pre-training or it may encourage the model being overly conversational or verbose. This, in turn, can lead to degradation of in-context few-shot learning performance. In this work, we study the performance trajectory between base and instruct models by scaling down the strength of instruction-tuning via the partial adaption method. We show that, across several model families and model sizes, reducing the strength of instruction-tuning results in material improvement on a few-shot in-context learning benchmark covering a variety of classic natural language tasks. This comes at the cost of losing some degree of instruction following ability as measured by AlpacaEval. Our study shines light on the potential trade-off between in-context learning and instruction following abilities that is worth considering in practice.

Title: Can GPT tell us why these images are synthesized? Empowering Multimodal Large Language Models for Forensics

Authors: Yiran He, Yun Cao, Bowen Yang, Zeyu Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11686
Pdf URL: https://arxiv.org/pdf/2504.11686
Copy Paste: [[2504.11686]] Can GPT tell us why these images are synthesized? Empowering Multimodal Large Language Models for Forensics(https://arxiv.org/abs/2504.11686)
Keywords: generative
Abstract: The rapid development of generative AI facilitates content creation and makes image manipulation easier and more difficult to detect. While multimodal Large Language Models (LLMs) have encoded rich world knowledge, they are not inherently tailored for combating AI-generated Content (AIGC) and struggle to comprehend local forgery details. In this work, we investigate the application of multimodal LLMs in forgery detection. We propose a framework capable of evaluating image authenticity, localizing tampered regions, providing evidence, and tracing generation methods based on semantic tampering clues. Our method demonstrates that the potential of LLMs in forgery analysis can be effectively unlocked through meticulous prompt engineering and the application of few-shot learning techniques. We conduct qualitative and quantitative experiments and show that GPT4V can achieve an accuracy of 92.1% in Autosplice and 86.3% in LaMa, which is competitive with state-of-the-art AIGC detection methods. We further discuss the limitations of multimodal LLMs in such tasks and propose potential improvements.

Title: H$^3$GNNs: Harmonizing Heterophily and Homophily in GNNs via Joint Structural Node Encoding and Self-Supervised Learning

Authors: Rui Xue, Tianfu Wu
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2504.11699
Pdf URL: https://arxiv.org/pdf/2504.11699
Copy Paste: [[2504.11699]] H$^3$GNNs: Harmonizing Heterophily and Homophily in GNNs via Joint Structural Node Encoding and Self-Supervised Learning(https://arxiv.org/abs/2504.11699)
Keywords: self-supervised
Abstract: Graph Neural Networks (GNNs) struggle to balance heterophily and homophily in representation learning, a challenge further amplified in self-supervised settings. We propose H$^3$GNNs, an end-to-end self-supervised learning framework that harmonizes both structural properties through two key innovations: (i) Joint Structural Node Encoding. We embed nodes into a unified space combining linear and non-linear feature projections with K-hop structural representations via a Weighted Graph Convolution Network(WGCN). A cross-attention mechanism enhances awareness and adaptability to heterophily and homophily. (ii) Self-Supervised Learning Using Teacher-Student Predictive Architectures with Node-Difficulty Driven Dynamic Masking Strategies. We use a teacher-student model, the student sees the masked input graph and predicts node features inferred by the teacher that sees the full input graph in the joint encoding space. To enhance learning difficulty, we introduce two novel node-predictive-difficulty-based masking strategies. Experiments on seven benchmarks (four heterophily datasets and three homophily datasets) confirm the effectiveness and efficiency of H$^3$GNNs across diverse graph types. Our H$^3$GNNs achieves overall state-of-the-art performance on the four heterophily datasets, while retaining on-par performance to previous state-of-the-art methods on the three homophily datasets.

Title: Learning What NOT to Count

Authors: Adriano D'Alessandro, Ali Mahdavi-Amiri, Ghassan Hamarneh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11705
Pdf URL: https://arxiv.org/pdf/2504.11705
Copy Paste: [[2504.11705]] Learning What NOT to Count(https://arxiv.org/abs/2504.11705)
Keywords: generative
Abstract: Few/zero-shot object counting methods reduce the need for extensive annotations but often struggle to distinguish between fine-grained categories, especially when multiple similar objects appear in the same scene. To address this limitation, we propose an annotation-free approach that enables the seamless integration of new fine-grained categories into existing few/zero-shot counting models. By leveraging latent generative models, we synthesize high-quality, category-specific crowded scenes, providing a rich training source for adapting to new categories without manual labeling. Our approach introduces an attention prediction network that identifies fine-grained category boundaries trained using only synthetic pseudo-annotated data. At inference, these fine-grained attention estimates refine the output of existing few/zero-shot counting networks. To benchmark our method, we further introduce the FGTC dataset, a taxonomy-specific fine-grained object counting dataset for natural images. Our method substantially enhances pre-trained state-of-the-art models on fine-grained taxon counting tasks, while using only synthetic data. Code and data to be released upon acceptance.

Title: Towards Safe Synthetic Image Generation On the Web: A Multimodal Robust NSFW Defense and Million Scale Dataset

Authors: Muhammad Shahid Muneer, Simon S. Woo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11707
Pdf URL: https://arxiv.org/pdf/2504.11707
Copy Paste: [[2504.11707]] Towards Safe Synthetic Image Generation On the Web: A Multimodal Robust NSFW Defense and Million Scale Dataset(https://arxiv.org/abs/2504.11707)
Keywords: diffusion
Abstract: In the past years, we have witnessed the remarkable success of Text-to-Image (T2I) models and their widespread use on the web. Extensive research in making T2I models produce hyper-realistic images has led to new concerns, such as generating Not-Safe-For-Work (NSFW) web content and polluting the web society. To help prevent misuse of T2I models and create a safer web environment for users features like NSFW filters and post-hoc security checks are used in these models. However, recent work unveiled how these methods can easily fail to prevent misuse. In particular, adversarial attacks on text and image modalities can easily outplay defensive measures. %Exploiting such leads to the growing concern of preventing adversarial attacks on text and image modalities. Moreover, there is currently no robust multimodal NSFW dataset that includes both prompt and image pairs and adversarial examples. This work proposes a million-scale prompt and image dataset generated using open-source diffusion models. Second, we develop a multimodal defense to distinguish safe and NSFW text and images, which is robust against adversarial attacks and directly alleviates current challenges. Our extensive experiments show that our model performs well against existing SOTA NSFW detection methods in terms of accuracy and recall, drastically reducing the Attack Success Rate (ASR) in multimodal adversarial attack scenarios. Code: this https URL.

Title: Adjoint Sampling: Highly Scalable Diffusion Samplers via Adjoint Matching

Authors: Aaron Havens, Benjamin Kurt Miller, Bing Yan, Carles Domingo-Enrich, Anuroop Sriram, Brandon Wood, Daniel Levine, Bin Hu, Brandon Amos, Brian Karrer, Xiang Fu, Guan-Horng Liu, Ricky T. Q. Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11713
Pdf URL: https://arxiv.org/pdf/2504.11713
Copy Paste: [[2504.11713]] Adjoint Sampling: Highly Scalable Diffusion Samplers via Adjoint Matching(https://arxiv.org/abs/2504.11713)
Keywords: diffusion
Abstract: We introduce Adjoint Sampling, a highly scalable and efficient algorithm for learning diffusion processes that sample from unnormalized densities, or energy functions. It is the first on-policy approach that allows significantly more gradient updates than the number of energy evaluations and model samples, allowing us to scale to much larger problem settings than previously explored by similar methods. Our framework is theoretically grounded in stochastic optimal control and shares the same theoretical guarantees as Adjoint Matching, being able to train without the need for corrective measures that push samples towards the target distribution. We show how to incorporate key symmetries, as well as periodic boundary conditions, for modeling molecules in both cartesian and torsional coordinates. We demonstrate the effectiveness of our approach through extensive experiments on classical energy functions, and further scale up to neural network-based energy models where we perform amortized conformer generation across many molecular systems. To encourage further research in developing highly scalable sampling methods, we plan to open source these challenging benchmarks, where successful methods can directly impact progress in computational chemistry.

Title: EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos

Authors: Jilan Xu, Yifei Huang, Baoqi Pei, Junlin Hou, Qingqiu Li, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11732
Pdf URL: https://arxiv.org/pdf/2504.11732
Copy Paste: [[2504.11732]] EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos(https://arxiv.org/abs/2504.11732)
Keywords: diffusion, foundation model
Abstract: Generating videos in the first-person perspective has broad application prospects in the field of augmented reality and embodied intelligence. In this work, we explore the cross-view video prediction task, where given an exo-centric video, the first frame of the corresponding ego-centric video, and textual instructions, the goal is to generate futur frames of the ego-centric video. Inspired by the notion that hand-object interactions (HOI) in ego-centric videos represent the primary intentions and actions of the current actor, we present EgoExo-Gen that explicitly models the hand-object dynamics for cross-view video prediction. EgoExo-Gen consists of two stages. First, we design a cross-view HOI mask prediction model that anticipates the HOI masks in future ego-frames by modeling the spatio-temporal ego-exo correspondence. Next, we employ a video diffusion model to predict future ego-frames using the first ego-frame and textual instructions, while incorporating the HOI masks as structural guidance to enhance prediction quality. To facilitate training, we develop an automated pipeline to generate pseudo HOI masks for both ego- and exo-videos by exploiting vision foundation models. Extensive experiments demonstrate that our proposed EgoExo-Gen achieves better prediction performance compared to previous video prediction models on the Ego-Exo4D and H2O benchmark datasets, with the HOI masks significantly improving the generation of hands and interactive objects in the ego-centric videos.

Title: The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation

Authors: Bingjie Gao, Xinyu Gao, Xiaoxue Wu, Yujie Zhou, Yu Qiao, Li Niu, Xinyuan Chen, Yaohui Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2504.11739
Pdf URL: https://arxiv.org/pdf/2504.11739
Copy Paste: [[2504.11739]] The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation(https://arxiv.org/abs/2504.11739)
Keywords: generative
Abstract: The evolution of Text-to-video (T2V) generative models, trained on large-scale datasets, has been marked by significant progress. However, the sensitivity of T2V generative models to input prompts highlights the critical role of prompt design in influencing generative outcomes. Prior research has predominantly relied on Large Language Models (LLMs) to align user-provided prompts with the distribution of training prompts, albeit without tailored guidance encompassing prompt vocabulary and sentence structure nuances. To this end, we introduce \textbf{RAPO}, a novel \textbf{R}etrieval-\textbf{A}ugmented \textbf{P}rompt \textbf{O}ptimization framework. In order to address potential inaccuracies and ambiguous details generated by LLM-generated prompts. RAPO refines the naive prompts through dual optimization branches, selecting the superior prompt for T2V generation. The first branch augments user prompts with diverse modifiers extracted from a learned relational graph, refining them to align with the format of training prompts via a fine-tuned LLM. Conversely, the second branch rewrites the naive prompt using a pre-trained LLM following a well-defined instruction set. Extensive experiments demonstrate that RAPO can effectively enhance both the static and dynamic dimensions of generated videos, demonstrating the significance of prompt optimization for user-provided prompts. Project website: \href{this https URL}{GitHub}.

Title: GrabS: Generative Embodied Agent for 3D Object Segmentation without Scene Supervision

Authors: Zihui Zhang, Yafei Yang, Hongtao Wen, Bo Yang
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2504.11754
Pdf URL: https://arxiv.org/pdf/2504.11754
Copy Paste: [[2504.11754]] GrabS: Generative Embodied Agent for 3D Object Segmentation without Scene Supervision(https://arxiv.org/abs/2504.11754)
Keywords: generative
Abstract: We study the hard problem of 3D object segmentation in complex point clouds without requiring human labels of 3D scenes for supervision. By relying on the similarity of pretrained 2D features or external signals such as motion to group 3D points as objects, existing unsupervised methods are usually limited to identifying simple objects like cars or their segmented objects are often inferior due to the lack of objectness in pretrained features. In this paper, we propose a new two-stage pipeline called GrabS. The core concept of our method is to learn generative and discriminative object-centric priors as a foundation from object datasets in the first stage, and then design an embodied agent to learn to discover multiple objects by querying against the pretrained generative priors in the second stage. We extensively evaluate our method on two real-world datasets and a newly created synthetic dataset, demonstrating remarkable segmentation performance, clearly surpassing all existing unsupervised methods.

Title: PCDiff: Proactive Control for Ownership Protection in Diffusion Models with Watermark Compatibility

Authors: Keke Gai, Ziyue Shen, Jing Yu, Liehuang Zhu, Qi Wu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11774
Pdf URL: https://arxiv.org/pdf/2504.11774
Copy Paste: [[2504.11774]] PCDiff: Proactive Control for Ownership Protection in Diffusion Models with Watermark Compatibility(https://arxiv.org/abs/2504.11774)
Keywords: diffusion
Abstract: With the growing demand for protecting the intellectual property (IP) of text-to-image diffusion models, we propose PCDiff -- a proactive access control framework that redefines model authorization by regulating generation quality. At its core, PCDIFF integrates a trainable fuser module and hierarchical authentication layers into the decoder architecture, ensuring that only users with valid encrypted credentials can generate high-fidelity images. In the absence of valid keys, the system deliberately degrades output quality, effectively preventing unauthorized this http URL, while the primary mechanism enforces active access control through architectural intervention, its decoupled design retains compatibility with existing watermarking techniques. This satisfies the need of model owners to actively control model ownership while preserving the traceability capabilities provided by traditional watermarking this http URL experimental evaluations confirm a strong dependency between credential verification and image quality across various attack scenarios. Moreover, when combined with typical post-processing operations, PCDIFF demonstrates powerful performance alongside conventional watermarking methods. This work shifts the paradigm from passive detection to proactive enforcement of authorization, laying the groundwork for IP management of diffusion models.

Title: ACMamba: Fast Unsupervised Anomaly Detection via An Asymmetrical Consensus State Space Model

Authors: Guanchun Wang, Xiangrong Zhang, Yifei Zhang, Zelin Peng, Tianyang Zhang, Xu Tang, Licheng Jiao
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.11781
Pdf URL: https://arxiv.org/pdf/2504.11781
Copy Paste: [[2504.11781]] ACMamba: Fast Unsupervised Anomaly Detection via An Asymmetrical Consensus State Space Model(https://arxiv.org/abs/2504.11781)
Keywords: anomaly
Abstract: Unsupervised anomaly detection in hyperspectral images (HSI), aiming to detect unknown targets from backgrounds, is challenging for earth surface monitoring. However, current studies are hindered by steep computational costs due to the high-dimensional property of HSI and dense sampling-based training paradigm, constraining their rapid deployment. Our key observation is that, during training, not all samples within the same homogeneous area are indispensable, whereas ingenious sampling can provide a powerful substitute for reducing costs. Motivated by this, we propose an Asymmetrical Consensus State Space Model (ACMamba) to significantly reduce computational costs without compromising accuracy. Specifically, we design an asymmetrical anomaly detection paradigm that utilizes region-level instances as an efficient alternative to dense pixel-level samples. In this paradigm, a low-cost Mamba-based module is introduced to discover global contextual attributes of regions that are essential for HSI reconstruction. Additionally, we develop a consensus learning strategy from the optimization perspective to simultaneously facilitate background reconstruction and anomaly compression, further alleviating the negative impact of anomaly reconstruction. Theoretical analysis and extensive experiments across eight benchmarks verify the superiority of ACMamba, demonstrating a faster speed and stronger performance over the state-of-the-art.

Title: Real-World Depth Recovery via Structure Uncertainty Modeling and Inaccurate GT Depth Fitting

Authors: Delong Suzhang, Meng Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11820
Pdf URL: https://arxiv.org/pdf/2504.11820
Copy Paste: [[2504.11820]] Real-World Depth Recovery via Structure Uncertainty Modeling and Inaccurate GT Depth Fitting(https://arxiv.org/abs/2504.11820)
Keywords: foundation model
Abstract: The low-quality structure in raw depth maps is prevalent in real-world RGB-D datasets, which makes real-world depth recovery a critical task in recent years. However, the lack of paired raw-ground truth (raw-GT) data in the real world poses challenges for generalized depth recovery. Existing methods insufficiently consider the diversity of structure misalignment in raw depth maps, which leads to poor generalization in real-world depth recovery. Notably, random structure misalignments are not limited to raw depth data but also affect GT depth in real-world datasets. In the proposed method, we tackle the generalization problem from both input and output perspectives. For input, we enrich the diversity of structure misalignment in raw depth maps by designing a new raw depth generation pipeline, which helps the network avoid overfitting to a specific condition. Furthermore, a structure uncertainty module is designed to explicitly identify the misaligned structure for input raw depth maps to better generalize in unseen scenarios. Notably the well-trained depth foundation model (DFM) can help the structure uncertainty module estimate the structure uncertainty better. For output, a robust feature alignment module is designed to precisely align with the accurate structure of RGB images avoiding the interference of inaccurate GT depth. Extensive experiments on multiple datasets demonstrate the proposed method achieves competitive accuracy and generalization capabilities across various challenging raw depth maps.

Title: Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation

Authors: Julia Kreutzer, Eleftheria Briakou, Sweta Agrawal, Marzieh Fadaee, Kocmi Tom
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11829
Pdf URL: https://arxiv.org/pdf/2504.11829
Copy Paste: [[2504.11829]] Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation(https://arxiv.org/abs/2504.11829)
Keywords: generative
Abstract: Generation capabilities and language coverage of multilingual large language models (mLLMs) are advancing rapidly. However, evaluation practices for generative abilities of mLLMs are still lacking comprehensiveness, scientific rigor, and consistent adoption across research labs, which undermines their potential to meaningfully guide mLLM development. We draw parallels with machine translation (MT) evaluation, a field that faced similar challenges and has, over decades, developed transparent reporting standards and reliable evaluations for multilingual generative models. Through targeted experiments across key stages of the generative evaluation pipeline, we demonstrate how best practices from MT evaluation can deepen the understanding of quality differences between models. Additionally, we identify essential components for robust meta-evaluation of mLLMs, ensuring the evaluation methods themselves are rigorously assessed. We distill these insights into a checklist of actionable recommendations for mLLM research and development.

Title: Boosting Multi-View Stereo with Depth Foundation Model in the Absence of Real-World Labels

Authors: Jie Zhu, Bo Peng, Zhe Zhang, Bingzheng Liu, Jianjun Lei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11845
Pdf URL: https://arxiv.org/pdf/2504.11845
Copy Paste: [[2504.11845]] Boosting Multi-View Stereo with Depth Foundation Model in the Absence of Real-World Labels(https://arxiv.org/abs/2504.11845)
Keywords: foundation model
Abstract: Learning-based Multi-View Stereo (MVS) methods have made remarkable progress in recent years. However, how to effectively train the network without using real-world labels remains a challenging problem. In this paper, driven by the recent advancements of vision foundation models, a novel method termed DFM-MVS, is proposed to leverage the depth foundation model to generate the effective depth prior, so as to boost MVS in the absence of real-world labels. Specifically, a depth prior-based pseudo-supervised training mechanism is developed to simulate realistic stereo correspondences using the generated depth prior, thereby constructing effective supervision for the MVS network. Besides, a depth prior-guided error correction strategy is presented to leverage the depth prior as guidance to mitigate the error propagation problem inherent in the widely-used coarse-to-fine network structure. Experimental results on DTU and Tanks & Temples datasets demonstrate that the proposed DFM-MVS significantly outperforms existing MVS methods without using real-world labels.

Title: ACE: Attentional Concept Erasure in Diffusion Models

Authors: Finn Carter
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11850
Pdf URL: https://arxiv.org/pdf/2504.11850
Copy Paste: [[2504.11850]] ACE: Attentional Concept Erasure in Diffusion Models(https://arxiv.org/abs/2504.11850)
Keywords: diffusion
Abstract: Large text-to-image diffusion models have demonstrated remarkable image synthesis capabilities, but their indiscriminate training on Internet-scale data has led to learned concepts that enable harmful, copyrighted, or otherwise undesirable content generation. We address the task of concept erasure in diffusion models, i.e., removing a specified concept from a pre-trained model such that prompting the concept (or related synonyms) no longer yields its depiction, while preserving the model's ability to generate other content. We propose a novel method, Attentional Concept Erasure (ACE), that integrates a closed-form attention manipulation with lightweight fine-tuning. Theoretically, we formulate concept erasure as aligning the model's conditional distribution on the target concept with a neutral distribution. Our approach identifies and nullifies concept-specific latent directions in the cross-attention modules via a gated low-rank adaptation, followed by adversarially augmented fine-tuning to ensure thorough erasure of the concept and its synonyms. Empirically, we demonstrate on multiple benchmarks, including object classes, celebrity faces, explicit content, and artistic styles, that ACE achieves state-of-the-art concept removal efficacy and robustness. Compared to prior methods, ACE better balances generality (erasing concept and related terms) and specificity (preserving unrelated content), scales to dozens of concepts, and is efficient, requiring only a few seconds of adaptation per concept. We will release our code to facilitate safer deployment of diffusion models.

Title: Search is All You Need for Few-shot Anomaly Detection

Authors: Qishan Wang, Jia Guo, Shuyong Gao, Haofen Wang, Li Xiong, Junjie Hu, Hanqi Guo, Wenqiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11895
Pdf URL: https://arxiv.org/pdf/2504.11895
Copy Paste: [[2504.11895]] Search is All You Need for Few-shot Anomaly Detection(https://arxiv.org/abs/2504.11895)
Keywords: foundation model, anomaly
Abstract: Few-shot anomaly detection (FSAD) has emerged as a crucial yet challenging task in industrial inspection, where normal distribution modeling must be accomplished with only a few normal images. While existing approaches typically employ multi-modal foundation models combining language and vision modalities for prompt-guided anomaly detection, these methods often demand sophisticated prompt engineering and extensive manual tuning. In this paper, we demonstrate that a straightforward nearest-neighbor search framework can surpass state-of-the-art performance in both single-class and multi-class FSAD scenarios. Our proposed method, VisionAD, consists of four simple yet essential components: (1) scalable vision foundation models that extract universal and discriminative features; (2) dual augmentation strategies - support augmentation to enhance feature matching adaptability and query augmentation to address the oversights of single-view prediction; (3) multi-layer feature integration that captures both low-frequency global context and high-frequency local details with minimal computational overhead; and (4) a class-aware visual memory bank enabling efficient one-for-all multi-class detection. Extensive evaluations across MVTec-AD, VisA, and Real-IAD benchmarks demonstrate VisionAD's exceptional performance. Using only 1 normal images as support, our method achieves remarkable image-level AUROC scores of 97.4%, 94.8%, and 70.8% respectively, outperforming current state-of-the-art approaches by significant margins (+1.6%, +3.2%, and +1.4%). The training-free nature and superior few-shot capabilities of VisionAD make it particularly appealing for real-world applications where samples are scarce or expensive to obtain. Code is available at this https URL.

Title: AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection

Authors: Yuhao Chao, Jie Liu, Jie Tang, Gangshan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11914
Pdf URL: https://arxiv.org/pdf/2504.11914
Copy Paste: [[2504.11914]] AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection(https://arxiv.org/abs/2504.11914)
Keywords: anomaly
Abstract: Industrial Anomaly Detection (IAD) poses a formidable challenge due to the scarcity of defective samples, making it imperative to deploy models capable of robust generalization to detect unseen anomalies effectively. Traditional approaches, often constrained by hand-crafted features or domain-specific expert models, struggle to address this limitation, underscoring the need for a paradigm shift. We introduce AnomalyR1, a pioneering framework that leverages VLM-R1, a Multimodal Large Language Model (MLLM) renowned for its exceptional generalization and interpretability, to revolutionize IAD. By integrating MLLM with Group Relative Policy Optimization (GRPO), enhanced by our novel Reasoned Outcome Alignment Metric (ROAM), AnomalyR1 achieves a fully end-to-end solution that autonomously processes inputs of image and domain knowledge, reasons through analysis, and generates precise anomaly localizations and masks. Based on the latest multimodal IAD benchmark, our compact 3-billion-parameter model outperforms existing methods, establishing state-of-the-art results. As MLLM capabilities continue to advance, this study is the first to deliver an end-to-end VLM-based IAD solution that demonstrates the transformative potential of ROAM-enhanced GRPO, positioning our framework as a forward-looking cornerstone for next-generation intelligent anomaly detection systems in industrial applications with limited defective data.

Title: SemDiff: Generating Natural Unrestricted Adversarial Examples via Semantic Attributes Optimization in Diffusion Models

Authors: Zeyu Dai, Shengcai Liu, Rui He, Jiahao Wu, Ning Lu, Wenqi Fan, Qing Li, Ke Tang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2504.11923
Pdf URL: https://arxiv.org/pdf/2504.11923
Copy Paste: [[2504.11923]] SemDiff: Generating Natural Unrestricted Adversarial Examples via Semantic Attributes Optimization in Diffusion Models(https://arxiv.org/abs/2504.11923)
Keywords: diffusion
Abstract: Unrestricted adversarial examples (UAEs), allow the attacker to create non-constrained adversarial examples without given clean samples, posing a severe threat to the safety of deep learning models. Recent works utilize diffusion models to generate UAEs. However, these UAEs often lack naturalness and imperceptibility due to simply optimizing in intermediate latent noises. In light of this, we propose SemDiff, a novel unrestricted adversarial attack that explores the semantic latent space of diffusion models for meaningful attributes, and devises a multi-attributes optimization approach to ensure attack success while maintaining the naturalness and imperceptibility of generated UAEs. We perform extensive experiments on four tasks on three high-resolution datasets, including CelebA-HQ, AFHQ and ImageNet. The results demonstrate that SemDiff outperforms state-of-the-art methods in terms of attack success rate and imperceptibility. The generated UAEs are natural and exhibit semantically meaningful changes, in accord with the attributes' weights. In addition, SemDiff is found capable of evading different defenses, which further validates its effectiveness and threatening.

Title: Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning

Authors: Hairui Ren, Fan Tang, He Zhao, Zixuan Wang, Dandan Guo, Yi Chang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11930
Pdf URL: https://arxiv.org/pdf/2504.11930
Copy Paste: [[2504.11930]] Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning(https://arxiv.org/abs/2504.11930)
Keywords: diffusion
Abstract: Fine-tuning vision-language models (VLMs) with large amounts of unlabeled data has recently garnered significant interest. However, a key challenge remains the lack of high-quality pseudo-labeled data. Current pseudo-labeling strategies often struggle with mismatches between semantic and visual information, leading to sub-optimal performance of unsupervised prompt learning (UPL) methods. In this paper, we introduce a simple yet effective approach called \textbf{A}ugmenting D\textbf{i}scriminative \textbf{R}ichness via Diffusions (AiR), toward learning a richer discriminating way to represent the class comprehensively and thus facilitate classification. Specifically, our approach includes a pseudo-label generation module that leverages high-fidelity synthetic samples to create an auxiliary classifier, which captures richer visual variation, bridging text-image-pair classification to a more robust image-image-pair classification. Additionally, we exploit the diversity of diffusion-based synthetic samples to enhance prompt learning, providing greater information for semantic-visual alignment. Extensive experiments on five public benchmarks, including RESISC45 and Flowers102, and across three learning paradigms-UL, SSL, and TRZSL-demonstrate that AiR achieves substantial and consistent performance improvements over state-of-the-art unsupervised prompt learning methods.

Title: VIPO: Value Function Inconsistency Penalized Offline Reinforcement Learning

Authors: Xuyang Chen, Guojian Wang, Keyu Yan, Lin Zhao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11944
Pdf URL: https://arxiv.org/pdf/2504.11944
Copy Paste: [[2504.11944]] VIPO: Value Function Inconsistency Penalized Offline Reinforcement Learning(https://arxiv.org/abs/2504.11944)
Keywords: self-supervised
Abstract: Offline reinforcement learning (RL) learns effective policies from pre-collected datasets, offering a practical solution for applications where online interactions are risky or costly. Model-based approaches are particularly advantageous for offline RL, owing to their data efficiency and generalizability. However, due to inherent model errors, model-based methods often artificially introduce conservatism guided by heuristic uncertainty estimation, which can be unreliable. In this paper, we introduce VIPO, a novel model-based offline RL algorithm that incorporates self-supervised feedback from value estimation to enhance model training. Specifically, the model is learned by additionally minimizing the inconsistency between the value learned directly from the offline data and the one estimated from the model. We perform comprehensive evaluations from multiple perspectives to show that VIPO can learn a highly accurate model efficiently and consistently outperform existing methods. It offers a general framework that can be readily integrated into existing model-based offline RL algorithms to systematically enhance model accuracy. As a result, VIPO achieves state-of-the-art performance on almost all tasks in both D4RL and NeoRL benchmarks.

Title: R-Meshfusion: Reinforcement Learning Powered Sparse-View Mesh Reconstruction with Diffusion Priors

Authors: Haoyang Wang, Liming Liu, Peiheng Wang, Junlin Hao, Jiangkai Wu, Xinggong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11946
Pdf URL: https://arxiv.org/pdf/2504.11946
Copy Paste: [[2504.11946]] R-Meshfusion: Reinforcement Learning Powered Sparse-View Mesh Reconstruction with Diffusion Priors(https://arxiv.org/abs/2504.11946)
Keywords: diffusion
Abstract: Mesh reconstruction from multi-view images is a fundamental problem in computer vision, but its performance degrades significantly under sparse-view conditions, especially in unseen regions where no ground-truth observations are available. While recent advances in diffusion models have demonstrated strong capabilities in synthesizing novel views from limited inputs, their outputs often suffer from visual artifacts and lack 3D consistency, posing challenges for reliable mesh optimization. In this paper, we propose a novel framework that leverages diffusion models to enhance sparse-view mesh reconstruction in a principled and reliable manner. To address the instability of diffusion outputs, we propose a Consensus Diffusion Module that filters unreliable generations via interquartile range (IQR) analysis and performs variance-aware image fusion to produce robust pseudo-supervision. Building on this, we design an online reinforcement learning strategy based on the Upper Confidence Bound (UCB) to adaptively select the most informative viewpoints for enhancement, guided by diffusion loss. Finally, the fused images are used to jointly supervise a NeRF-based model alongside sparse-view ground truth, ensuring consistency across both geometry and appearance. Extensive experiments demonstrate that our method achieves significant improvements in both geometric quality and rendering quality.

Title: Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions

Authors: Yifei Dong, Fengyi Wu, Sanjian Zhang, Guangyu Chen, Yuzhi Hu, Masumi Yano, Jingdong Sun, Siyu Huang, Feng Liu, Qi Dai, Zhi-Qi Cheng
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2504.11967
Pdf URL: https://arxiv.org/pdf/2504.11967
Copy Paste: [[2504.11967]] Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions(https://arxiv.org/abs/2504.11967)
Keywords: diffusion, self-supervised
Abstract: Unmanned Aerial Vehicles (UAVs) are indispensable for infrastructure inspection, surveillance, and related tasks, yet they also introduce critical security challenges. This survey provides a wide-ranging examination of the anti-UAV domain, centering on three core objectives-classification, detection, and tracking-while detailing emerging methodologies such as diffusion-based data synthesis, multi-modal fusion, vision-language modeling, self-supervised learning, and reinforcement learning. We systematically evaluate state-of-the-art solutions across both single-modality and multi-sensor pipelines (spanning RGB, infrared, audio, radar, and RF) and discuss large-scale as well as adversarially oriented benchmarks. Our analysis reveals persistent gaps in real-time performance, stealth detection, and swarm-based scenarios, underscoring pressing needs for robust, adaptive anti-UAV systems. By highlighting open research directions, we aim to foster innovation and guide the development of next-generation defense strategies in an era marked by the extensive use of UAVs.

Title: Language Models as Quasi-Crystalline Thought: Structure, Constraint, and Emergence in Generative Systems

Authors: Jose Manuel Guevara-Vela
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.11986
Pdf URL: https://arxiv.org/pdf/2504.11986
Copy Paste: [[2504.11986]] Language Models as Quasi-Crystalline Thought: Structure, Constraint, and Emergence in Generative Systems(https://arxiv.org/abs/2504.11986)
Keywords: generative
Abstract: This essay proposes an analogy between large language models (LLMs) and quasicrystals: systems that exhibit global coherence without periodic repetition and that are generated through local constraints. While LLMs are often evaluated in terms of predictive accuracy, factuality, or alignment, this structural perspective suggests that their most characteristic behavior is the production of internally resonant linguistic patterns. Just as quasicrystals forced a redefinition of order in physical systems, viewing LLMs as generators of quasi-structured language opens new paths for evaluation and design: privileging propagation of constraint over token-level accuracy, and coherence of form over fixed meaning. LLM outputs should be read not only for what they say, but for the patterns of constraint and coherence that organize them. This shift reframes generative language as a space of emergent patterning: LLMs are neither fully random nor strictly rule-based, but defined by a logic of constraint, resonance, and structural depth.

Title: A Complex-valued SAR Foundation Model Based on Physically Inspired Representation Learning

Authors: Mengyu Wang, Hanbo Bi, Yingchao Feng, Linlin Xin, Shuo Gong, Tianqi Wang, Zhiyuan Yan, Peijin Wang, Wenhui Diao, Xian Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.11999
Pdf URL: https://arxiv.org/pdf/2504.11999
Copy Paste: [[2504.11999]] A Complex-valued SAR Foundation Model Based on Physically Inspired Representation Learning(https://arxiv.org/abs/2504.11999)
Keywords: foundation model
Abstract: Vision foundation models in remote sensing have been extensively studied due to their superior generalization on various downstream tasks. Synthetic Aperture Radar (SAR) offers all-day, all-weather imaging capabilities, providing significant advantages for Earth observation. However, establishing a foundation model for SAR image interpretation inevitably encounters the challenges of insufficient information utilization and poor interpretability. In this paper, we propose a remote sensing foundation model based on complex-valued SAR data, which simulates the polarimetric decomposition process for pre-training, i.e., characterizing pixel scattering intensity as a weighted combination of scattering bases and scattering coefficients, thereby endowing the foundation model with physical interpretability. Specifically, we construct a series of scattering queries, each representing an independent and meaningful scattering basis, which interact with SAR features in the scattering query decoder and output the corresponding scattering coefficient. To guide the pre-training process, polarimetric decomposition loss and power self-supervision loss are constructed. The former aligns the predicted coefficients with Yamaguchi coefficients, while the latter reconstructs power from the predicted coefficients and compares it to the input image's power. The performance of our foundation model is validated on six typical downstream tasks, achieving state-of-the-art results. Notably, the foundation model can extract stable feature representations and exhibits strong generalization, even in data-scarce conditions.

Title: Balancing Graph Embedding Smoothness in Self-Supervised Learning via Information-Theoretic Decomposition

Authors: Heesoo Jung, Hogun Park
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12011
Pdf URL: https://arxiv.org/pdf/2504.12011
Copy Paste: [[2504.12011]] Balancing Graph Embedding Smoothness in Self-Supervised Learning via Information-Theoretic Decomposition(https://arxiv.org/abs/2504.12011)
Keywords: self-supervised
Abstract: Self-supervised learning (SSL) in graphs has garnered significant attention, particularly in employing Graph Neural Networks (GNNs) with pretext tasks initially designed for other domains, such as contrastive learning and feature reconstruction. However, it remains uncertain whether these methods effectively reflect essential graph properties, precisely representation similarity with its neighbors. We observe that existing methods position opposite ends of a spectrum driven by the graph embedding smoothness, with each end corresponding to outperformance on specific downstream tasks. Decomposing the SSL objective into three terms via an information-theoretic framework with a neighbor representation variable reveals that this polarization stems from an imbalance among the terms, which existing methods may not effectively maintain. Further insights suggest that balancing between the extremes can lead to improved performance across a wider range of downstream tasks. A framework, BSG (Balancing Smoothness in Graph SSL), introduces novel loss functions designed to supplement the representation quality in graph-based SSL by balancing the derived three terms: neighbor loss, minimal loss, and divergence loss. We present a theoretical analysis of the effects of these loss functions, highlighting their significance from both the SSL and graph smoothness perspectives. Extensive experiments on multiple real-world datasets across node classification and link prediction consistently demonstrate that BSG achieves state-of-the-art performance, outperforming existing methods. Our implementation code is available at this https URL.

Title: Understanding Attention Mechanism in Video Diffusion Models

Authors: Bingyan Liu, Chengyu Wang, Tongtong Su, Huan Ten, Jun Huang, Kailing Guo, Kui Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.12027
Pdf URL: https://arxiv.org/pdf/2504.12027
Copy Paste: [[2504.12027]] Understanding Attention Mechanism in Video Diffusion Models(https://arxiv.org/abs/2504.12027)
Keywords: diffusion
Abstract: Text-to-video (T2V) synthesis models, such as OpenAI's Sora, have garnered significant attention due to their ability to generate high-quality videos from a text prompt. In diffusion-based T2V models, the attention mechanism is a critical component. However, it remains unclear what intermediate features are learned and how attention blocks in T2V models affect various aspects of video synthesis, such as image quality and temporal consistency. In this paper, we conduct an in-depth perturbation analysis of the spatial and temporal attention blocks of T2V models using an information-theoretic approach. Our results indicate that temporal and spatial attention maps affect not only the timing and layout of the videos but also the complexity of spatiotemporal elements and the aesthetic quality of the synthesized videos. Notably, high-entropy attention maps are often key elements linked to superior video quality, whereas low-entropy attention maps are associated with the video's intra-frame structure. Based on our findings, we propose two novel methods to enhance video quality and enable text-guided video editing. These methods rely entirely on lightweight manipulation of the attention matrices in T2V models. The efficacy and effectiveness of our methods are further validated through experimental evaluation across multiple datasets.

Title: Modular-Cam: Modular Dynamic Camera-view Video Generation with LLM

Authors: Zirui Pan, Xin Wang, Yipeng Zhang, Hong Chen, Kwan Man Cheng, Yaofei Wu, Wenwu Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.12048
Pdf URL: https://arxiv.org/pdf/2504.12048
Copy Paste: [[2504.12048]] Modular-Cam: Modular Dynamic Camera-view Video Generation with LLM(https://arxiv.org/abs/2504.12048)
Keywords: diffusion
Abstract: Text-to-Video generation, which utilizes the provided text prompt to generate high-quality videos, has drawn increasing attention and achieved great success due to the development of diffusion models recently. Existing methods mainly rely on a pre-trained text encoder to capture the semantic information and perform cross attention with the encoded text prompt to guide the generation of video. However, when it comes to complex prompts that contain dynamic scenes and multiple camera-view transformations, these methods can not decompose the overall information into separate scenes, as well as fail to smoothly change scenes based on the corresponding camera-views. To solve these problems, we propose a novel method, i.e., Modular-Cam. Specifically, to better understand a given complex prompt, we utilize a large language model to analyze user instructions and decouple them into multiple scenes together with transition actions. To generate a video containing dynamic scenes that match the given camera-views, we incorporate the widely-used temporal transformer into the diffusion model to ensure continuity within a single scene and propose CamOperator, a modular network based module that well controls the camera movements. Moreover, we propose AdaControlNet, which utilizes ControlNet to ensure consistency across scenes and adaptively adjusts the color tone of the generated video. Extensive qualitative and quantitative experiments prove our proposed Modular-Cam's strong capability of generating multi-scene videos together with its ability to achieve fine-grained control of camera movements. Generated results are available at this https URL.

Title: Generative Deep Learning Framework for Inverse Design of Fuels

Authors: Kiran K. Yalamanchi, Pinaki Pal, Balaji Mohan, Abdullah S. AlRamadan, Jihad A. Badra, Yuanjiang Pei
Subjects: cs.LG, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2504.12075
Pdf URL: https://arxiv.org/pdf/2504.12075
Copy Paste: [[2504.12075]] Generative Deep Learning Framework for Inverse Design of Fuels(https://arxiv.org/abs/2504.12075)
Keywords: generative
Abstract: In the present work, a generative deep learning framework combining a Co-optimized Variational Autoencoder (Co-VAE) architecture with quantitative structure-property relationship (QSPR) techniques is developed to enable accelerated inverse design of fuels. The Co-VAE integrates a property prediction component coupled with the VAE latent space, enhancing molecular reconstruction and accurate estimation of Research Octane Number (RON) (chosen as the fuel property of interest). A subset of the GDB-13 database, enriched with a curated RON database, is used for model training. Hyperparameter tuning is further utilized to optimize the balance among reconstruction fidelity, chemical validity, and RON prediction. An independent regression model is then used to refine RON prediction, while a differential evolution algorithm is employed to efficiently navigate the VAE latent space and identify promising fuel molecule candidates with high RON. This methodology addresses the limitations of traditional fuel screening approaches by capturing complex structure-property relationships within a comprehensive latent representation. The generative model provides a flexible tool for systematically exploring vast chemical spaces, paving the way for discovering fuels with superior anti-knock properties. The demonstrated approach can be readily extended to incorporate additional fuel properties and synthesizability criteria to enhance applicability and reliability for de novo design of new fuels.

Title: DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency

Authors: Mengshi Qi, Pengfei Zhu, Xiangtai Li, Xiaoyang Bi, Lu Qi, Huadong Ma, Ming-Hsuan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.12080
Pdf URL: https://arxiv.org/pdf/2504.12080
Copy Paste: [[2504.12080]] DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency(https://arxiv.org/abs/2504.12080)
Keywords: in-context
Abstract: Given a single labeled example, in-context segmentation aims to segment corresponding objects. This setting, known as one-shot segmentation in few-shot learning, explores the segmentation model's generalization ability and has been applied to various vision tasks, including scene understanding and image/video editing. While recent Segment Anything Models have achieved state-of-the-art results in interactive segmentation, these approaches are not directly applicable to in-context segmentation. In this work, we propose the Dual Consistency SAM (DC-SAM) method based on prompt-tuning to adapt SAM and SAM2 for in-context segmentation of both images and videos. Our key insights are to enhance the features of the SAM's prompt encoder in segmentation by providing high-quality visual prompts. When generating a mask prior, we fuse the SAM features to better align the prompt encoder. Then, we design a cycle-consistent cross-attention on fused features and initial visual prompts. Next, a dual-branch design is provided by using the discriminative positive and negative prompts in the prompt encoder. Furthermore, we design a simple mask-tube training strategy to adopt our proposed dual consistency method into the mask tube. Although the proposed DC-SAM is primarily designed for images, it can be seamlessly extended to the video domain with the support of SAM2. Given the absence of in-context segmentation in the video domain, we manually curate and construct the first benchmark from existing video segmentation datasets, named In-Context Video Object Segmentation (IC-VOS), to better assess the in-context capability of the model. Extensive experiments demonstrate that our method achieves 55.5 (+1.4) mIoU on COCO-20i, 73.0 (+1.1) mIoU on PASCAL-5i, and a J&F score of 71.52 on the proposed IC-VOS benchmark. Our source code and benchmark are available at this https URL.

Title: Selective Demonstration Retrieval for Improved Implicit Hate Speech Detection

Authors: Yumin Kim, Hwanhee Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2504.12082
Pdf URL: https://arxiv.org/pdf/2504.12082
Copy Paste: [[2504.12082]] Selective Demonstration Retrieval for Improved Implicit Hate Speech Detection(https://arxiv.org/abs/2504.12082)
Keywords: in-context
Abstract: Hate speech detection is a crucial area of research in natural language processing, essential for ensuring online community safety. However, detecting implicit hate speech, where harmful intent is conveyed in subtle or indirect ways, remains a major challenge. Unlike explicit hate speech, implicit expressions often depend on context, cultural subtleties, and hidden biases, making them more challenging to identify consistently. Additionally, the interpretation of such speech is influenced by external knowledge and demographic biases, resulting in varied detection results across different language models. Furthermore, Large Language Models often show heightened sensitivity to toxic language and references to vulnerable groups, which can lead to misclassifications. This over-sensitivity results in false positives (incorrectly identifying harmless statements as hateful) and false negatives (failing to detect genuinely harmful content). Addressing these issues requires methods that not only improve detection precision but also reduce model biases and enhance robustness. To address these challenges, we propose a novel method, which utilizes in-context learning without requiring model fine-tuning. By adaptively retrieving demonstrations that focus on similar groups or those with the highest similarity scores, our approach enhances contextual comprehension. Experimental results show that our method outperforms current state-of-the-art techniques. Implementation details and code are available at TBD.

Title: Generalized Visual Relation Detection with Diffusion Models

Authors: Kaifeng Gao, Siqi Chen, Hanwang Zhang, Jun Xiao, Yueting Zhuang, Qianru Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.12100
Pdf URL: https://arxiv.org/pdf/2504.12100
Copy Paste: [[2504.12100]] Generalized Visual Relation Detection with Diffusion Models(https://arxiv.org/abs/2504.12100)
Keywords: diffusion, generative
Abstract: Visual relation detection (VRD) aims to identify relationships (or interactions) between object pairs in an image. Although recent VRD models have achieved impressive performance, they are all restricted to pre-defined relation categories, while failing to consider the semantic ambiguity characteristic of visual relations. Unlike objects, the appearance of visual relations is always subtle and can be described by multiple predicate words from different perspectives, e.g., ``ride'' can be depicted as ``race'' and ``sit on'', from the sports and spatial position views, respectively. To this end, we propose to model visual relations as continuous embeddings, and design diffusion models to achieve generalized VRD in a conditional generative manner, termed Diff-VRD. We model the diffusion process in a latent space and generate all possible relations in the image as an embedding sequence. During the generation, the visual and text embeddings of subject-object pairs serve as conditional signals and are injected via cross-attention. After the generation, we design a subsequent matching stage to assign the relation words to subject-object pairs by considering their semantic similarities. Benefiting from the diffusion-based generative process, our Diff-VRD is able to generate visual relations beyond the pre-defined category labels of datasets. To properly evaluate this generalized VRD task, we introduce two evaluation metrics, i.e., text-to-image retrieval and SPICE PR Curve inspired by image captioning. Extensive experiments in both human-object interaction (HOI) detection and scene graph generation (SGG) benchmarks attest to the superiority and effectiveness of Diff-VRD.

Title: A Diffusion-Based Framework for Terrain-Aware Remote Sensing Image Reconstruction

Authors: Zhenyu Yu, Mohd Yamani Inda Idris, Pei Wang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.12112
Pdf URL: https://arxiv.org/pdf/2504.12112
Copy Paste: [[2504.12112]] A Diffusion-Based Framework for Terrain-Aware Remote Sensing Image Reconstruction(https://arxiv.org/abs/2504.12112)
Keywords: diffusion
Abstract: Remote sensing imagery is essential for environmental monitoring, agricultural management, and disaster response. However, data loss due to cloud cover, sensor failures, or incomplete acquisition-especially in high-resolution and high-frequency tasks-severely limits satellite imagery's effectiveness. Traditional interpolation methods struggle with large missing areas and complex structures. Remote sensing imagery consists of multiple bands, each with distinct meanings, and ensuring consistency across bands is critical to avoid anomalies in the combined images. This paper proposes SatelliteMaker, a diffusion-based method that reconstructs missing data across varying levels of data loss while maintaining spatial, spectral, and temporal consistency. We also propose Digital Elevation Model (DEM) as a conditioning input and use tailored prompts to generate realistic images, making diffusion models applicable to quantitative remote sensing tasks. Additionally, we propose a VGG-Adapter module based on Distribution Loss, which reduces distribution discrepancy and ensures style consistency. Extensive experiments show that SatelliteMaker achieves state-of-the-art performance across multiple tasks.

Title: Anti-Aesthetics: Protecting Facial Privacy against Customized Text-to-Image Synthesis

Authors: Songping Wang, Yueming Lyu, Shiqi Liu, Ning Li, Tong Tong, Hao Sun, Caifeng Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.12129
Pdf URL: https://arxiv.org/pdf/2504.12129
Copy Paste: [[2504.12129]] Anti-Aesthetics: Protecting Facial Privacy against Customized Text-to-Image Synthesis(https://arxiv.org/abs/2504.12129)
Keywords: diffusion
Abstract: The rise of customized diffusion models has spurred a boom in personalized visual content creation, but also poses risks of malicious misuse, severely threatening personal privacy and copyright protection. Some studies show that the aesthetic properties of images are highly positively correlated with human perception of image quality. Inspired by this, we approach the problem from a novel and intriguing aesthetic perspective to degrade the generation quality of maliciously customized models, thereby achieving better protection of facial identity. Specifically, we propose a Hierarchical Anti-Aesthetic (HAA) framework to fully explore aesthetic cues, which consists of two key branches: 1) Global Anti-Aesthetics: By establishing a global anti-aesthetic reward mechanism and a global anti-aesthetic loss, it can degrade the overall aesthetics of the generated content; 2) Local Anti-Aesthetics: A local anti-aesthetic reward mechanism and a local anti-aesthetic loss are designed to guide adversarial perturbations to disrupt local facial identity. By seamlessly integrating both branches, our HAA effectively achieves the goal of anti-aesthetics from a global to a local level during customized generation. Extensive experiments show that HAA outperforms existing SOTA methods largely in identity removal, providing a powerful tool for protecting facial privacy and copyright.

Title: RADLER: Radar Object Detection Leveraging Semantic 3D City Models and Self-Supervised Radar-Image Learning

Authors: Yuan Luo, Rudolf Hoffmann, Yan Xia, Olaf Wysocki, Benedikt Schwab, Thomas H. Kolbe, Daniel Cremers
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2504.12167
Pdf URL: https://arxiv.org/pdf/2504.12167
Copy Paste: [[2504.12167]] RADLER: Radar Object Detection Leveraging Semantic 3D City Models and Self-Supervised Radar-Image Learning(https://arxiv.org/abs/2504.12167)
Keywords: self-supervised
Abstract: Semantic 3D city models are worldwide easy-accessible, providing accurate, object-oriented, and semantic-rich 3D priors. To date, their potential to mitigate the noise impact on radar object detection remains under-explored. In this paper, we first introduce a unique dataset, RadarCity, comprising 54K synchronized radar-image pairs and semantic 3D city models. Moreover, we propose a novel neural network, RADLER, leveraging the effectiveness of contrastive self-supervised learning (SSL) and semantic 3D city models to enhance radar object detection of pedestrians, cyclists, and cars. Specifically, we first obtain the robust radar features via a SSL network in the radar-image pretext task. We then use a simple yet effective feature fusion strategy to incorporate semantic-depth features from semantic 3D city models. Having prior 3D information as guidance, RADLER obtains more fine-grained details to enhance radar object detection. We extensively evaluate RADLER on the collected RadarCity dataset and demonstrate average improvements of 5.46% in mean avarage precision (mAP) and 3.51% in mean avarage recall (mAR) over previous radar object detection methods. We believe this work will foster further research on semantic-guided and map-supported radar object detection. Our project page is publicly available athttps://gppthis http URL .

Title: Towards a General-Purpose Zero-Shot Synthetic Low-Light Image and Video Pipeline

Authors: Joanne Lin, Crispian Morris, Ruirui Lin, Fan Zhang, David Bull, Nantheera Anantrasirichai
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.12169
Pdf URL: https://arxiv.org/pdf/2504.12169
Copy Paste: [[2504.12169]] Towards a General-Purpose Zero-Shot Synthetic Low-Light Image and Video Pipeline(https://arxiv.org/abs/2504.12169)
Keywords: self-supervised
Abstract: Low-light conditions pose significant challenges for both human and machine annotation. This in turn has led to a lack of research into machine understanding for low-light images and (in particular) videos. A common approach is to apply annotations obtained from high quality datasets to synthetically created low light versions. In addition, these approaches are often limited through the use of unrealistic noise models. In this paper, we propose a new Degradation Estimation Network (DEN), which synthetically generates realistic standard RGB (sRGB) noise without the requirement for camera metadata. This is achieved by estimating the parameters of physics-informed noise distributions, trained in a self-supervised manner. This zero-shot approach allows our method to generate synthetic noisy content with a diverse range of realistic noise characteristics, unlike other methods which focus on recreating the noise characteristics of the training data. We evaluate our proposed synthetic pipeline using various methods trained on its synthetic data for typical low-light tasks including synthetic noise replication, video enhancement, and object detection, showing improvements of up to 24\% KLD, 21\% LPIPS, and 62\% AP$_{50-95}$, respectively.

Title: d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning

Authors: Siyan Zhao, Devaansh Gupta, Qinqing Zheng, Aditya Grover
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2504.12216
Pdf URL: https://arxiv.org/pdf/2504.12216
Copy Paste: [[2504.12216]] d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning(https://arxiv.org/abs/2504.12216)
Keywords: diffusion
Abstract: Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefits from online reinforcement learning (RL). These capabilities have primarily been demonstrated within the left-to-right autoregressive (AR) generation paradigm. In contrast, non-autoregressive paradigms based on diffusion generate text in a coarse-to-fine manner. Although recent diffusion-based large language models (dLLMs) have achieved competitive language modeling performance compared to their AR counterparts, it remains unclear if dLLMs can also leverage recent advances in LLM reasoning. To this end, we propose d1, a framework to adapt pre-trained masked dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL. Specifically, we develop and extend techniques to improve reasoning in pretrained dLLMs: (a) we utilize a masked SFT technique to distill knowledge and instill self-improvement behavior directly from existing datasets, and (b) we introduce a novel critic-free, policy-gradient based RL algorithm called diffu-GRPO. Through empirical studies, we investigate the performance of different post-training recipes on multiple mathematical and logical reasoning benchmarks. We find that d1 yields the best performance and significantly improves performance of a state-of-the-art dLLM.

Title: Coding-Prior Guided Diffusion Network for Video Deblurring

Authors: Yike Liu, Jianhui Zhang, Haipeng Li, Shuaicheng Liu, Bing Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.12222
Pdf URL: https://arxiv.org/pdf/2504.12222
Copy Paste: [[2504.12222]] Coding-Prior Guided Diffusion Network for Video Deblurring(https://arxiv.org/abs/2504.12222)
Keywords: diffusion, generative
Abstract: While recent video deblurring methods have advanced significantly, they often overlook two valuable prior information: (1) motion vectors (MVs) and coding residuals (CRs) from video codecs, which provide efficient inter-frame alignment cues, and (2) the rich real-world knowledge embedded in pre-trained diffusion generative models. We present CPGDNet, a novel two-stage framework that effectively leverages both coding priors and generative diffusion priors for high-quality deblurring. First, our coding-prior feature propagation (CPFP) module utilizes MVs for efficient frame alignment and CRs to generate attention masks, addressing motion inaccuracies and texture variations. Second, a coding-prior controlled generation (CPC) module network integrates coding priors into a pretrained diffusion model, guiding it to enhance critical regions and synthesize realistic details. Experiments demonstrate our method achieves state-of-the-art perceptual quality with up to 30% improvement in IQA metrics. Both the code and the codingprior-augmented dataset will be open-sourced.

Title: Cobra: Efficient Line Art COlorization with BRoAder References

Authors: Junhao Zhuang, Lingen Li, Xuan Ju, Zhaoyang Zhang, Chun Yuan, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.12240
Pdf URL: https://arxiv.org/pdf/2504.12240
Copy Paste: [[2504.12240]] Cobra: Efficient Line Art COlorization with BRoAder References(https://arxiv.org/abs/2504.12240)
Keywords: diffusion
Abstract: The comic production industry requires reference-based line art colorization with high accuracy, efficiency, contextual consistency, and flexible control. A comic page often involves diverse characters, objects, and backgrounds, which complicates the coloring process. Despite advancements in diffusion models for image generation, their application in line art colorization remains limited, facing challenges related to handling extensive reference images, time-consuming inference, and flexible control. We investigate the necessity of extensive contextual image guidance on the quality of line art colorization. To address these challenges, we introduce Cobra, an efficient and versatile method that supports color hints and utilizes over 200 reference images while maintaining low latency. Central to Cobra is a Causal Sparse DiT architecture, which leverages specially designed positional encodings, causal sparse attention, and Key-Value Cache to effectively manage long-context references and ensure color identity consistency. Results demonstrate that Cobra achieves accurate line art colorization through extensive contextual reference, significantly enhancing inference speed and interactivity, thereby meeting critical industrial demands. We release our codes and models on our project page: this https URL.

Title: SIDME: Self-supervised Image Demoiréing via Masked Encoder-Decoder Reconstruction

Authors: Xia Wang, Haiyang Sun, Tiantian Cao, Yueying Sun, Min Feng
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2504.12245
Pdf URL: https://arxiv.org/pdf/2504.12245
Copy Paste: [[2504.12245]] SIDME: Self-supervised Image Demoiréing via Masked Encoder-Decoder Reconstruction(https://arxiv.org/abs/2504.12245)
Keywords: self-supervised
Abstract: Moiré patterns, resulting from aliasing between object light signals and camera sampling frequencies, often degrade image quality during capture. Traditional demoiréing methods have generally treated images as a whole for processing and training, neglecting the unique signal characteristics of different color channels. Moreover, the randomness and variability of moiré pattern generation pose challenges to the robustness of existing methods when applied to real-world data. To address these issues, this paper presents SIDME (Self-supervised Image Demoiréing via Masked Encoder-Decoder Reconstruction), a novel model designed to generate high-quality visual images by effectively processing moiré patterns. SIDME combines a masked encoder-decoder architecture with self-supervised learning, allowing the model to reconstruct images using the inherent properties of camera sampling frequencies. A key innovation is the random masked image reconstructor, which utilizes an encoder-decoder structure to handle the reconstruction task. Furthermore, since the green channel in camera sampling has a higher sampling frequency compared to red and blue channels, a specialized self-supervised loss function is designed to improve the training efficiency and effectiveness. To ensure the generalization ability of the model, a self-supervised moiré image generation method has been developed to produce a dataset that closely mimics real-world conditions. Extensive experiments demonstrate that SIDME outperforms existing methods in processing real moiré pattern data, showing its superior generalization performance and robustness.

Title: VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate

Authors: Zhihang Yuan, Rui Xie, Yuzhang Shang, Hanling Zhang, Siyuan Wang, Shengen Yan, Guohao Dai, Yu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2504.12259
Pdf URL: https://arxiv.org/pdf/2504.12259
Copy Paste: [[2504.12259]] VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate(https://arxiv.org/abs/2504.12259)
Keywords: diffusion
Abstract: Diffusion Transformer(DiT)-based generation models have achieved remarkable success in video generation. However, their inherent computational demands pose significant efficiency challenges. In this paper, we exploit the inherent temporal non-uniformity of real-world videos and observe that videos exhibit dynamic information density, with high-motion segments demanding greater detail preservation than static scenes. Inspired by this temporal non-uniformity, we propose VGDFR, a training-free approach for Diffusion-based Video Generation with Dynamic Latent Frame Rate. VGDFR adaptively adjusts the number of elements in latent space based on the motion frequency of the latent space content, using fewer tokens for low-frequency segments while preserving detail in high-frequency segments. Specifically, our key contributions are: (1) A dynamic frame rate scheduler for DiT video generation that adaptively assigns frame rates for video segments. (2) A novel latent-space frame merging method to align latent representations with their denoised counterparts before merging those redundant in low-resolution space. (3) A preference analysis of Rotary Positional Embeddings (RoPE) across DiT layers, informing a tailored RoPE strategy optimized for semantic and local information capture. Experiments show that VGDFR can achieve a speedup up to 3x for video generation with minimal quality degradation.

Title: How Do I Do That? Synthesizing 3D Hand Motion and Contacts for Everyday Interactions

Authors: Aditya Prakash, Benjamin Lundell, Dmitry Andreychuk, David Forsyth, Saurabh Gupta, Harpreet Sawhney
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.12284
Pdf URL: https://arxiv.org/pdf/2504.12284
Copy Paste: [[2504.12284]] How Do I Do That? Synthesizing 3D Hand Motion and Contacts for Everyday Interactions(https://arxiv.org/abs/2504.12284)
Keywords: diffusion
Abstract: We tackle the novel problem of predicting 3D hand motion and contact maps (or Interaction Trajectories) given a single RGB view, action text, and a 3D contact point on the object as input. Our approach consists of (1) Interaction Codebook: a VQVAE model to learn a latent codebook of hand poses and contact points, effectively tokenizing interaction trajectories, (2) Interaction Predictor: a transformer-decoder module to predict the interaction trajectory from test time inputs by using an indexer module to retrieve a latent affordance from the learned codebook. To train our model, we develop a data engine that extracts 3D hand poses and contact trajectories from the diverse HoloAssist dataset. We evaluate our model on a benchmark that is 2.5-10X larger than existing works, in terms of diversity of objects and interactions observed, and test for generalization of the model across object categories, action categories, tasks, and scenes. Experimental results show the effectiveness of our approach over transformer & diffusion baselines across all settings.

Title: SHeaP: Self-Supervised Head Geometry Predictor Learned via 2D Gaussians

Authors: Liam Schoneveld, Zhe Chen, Davide Davoli, Jiapeng Tang, Saimon Terazawa, Ko Nishino, Matthias Nießner
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2504.12292
Pdf URL: https://arxiv.org/pdf/2504.12292
Copy Paste: [[2504.12292]] SHeaP: Self-Supervised Head Geometry Predictor Learned via 2D Gaussians(https://arxiv.org/abs/2504.12292)
Keywords: self-supervised
Abstract: Accurate, real-time 3D reconstruction of human heads from monocular images and videos underlies numerous visual applications. As 3D ground truth data is hard to come by at scale, previous methods have sought to learn from abundant 2D videos in a self-supervised manner. Typically, this involves the use of differentiable mesh rendering, which is effective but faces limitations. To improve on this, we propose SHeaP (Self-supervised Head Geometry Predictor Learned via 2D Gaussians). Given a source image, we predict a 3DMM mesh and a set of Gaussians that are rigged to this mesh. We then reanimate this rigged head avatar to match a target frame, and backpropagate photometric losses to both the 3DMM and Gaussian prediction networks. We find that using Gaussians for rendering substantially improves the effectiveness of this self-supervised approach. Training solely on 2D data, our method surpasses existing self-supervised approaches in geometric evaluations on the NoW benchmark for neutral faces and a new benchmark for non-neutral expressions. Our method also produces highly expressive meshes, outperforming state-of-the-art in emotion classification.