2025-03-18

Title: Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning

Authors: Donghao Huang, Zhaoxia Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11655
Pdf URL: https://arxiv.org/pdf/2503.11655
Copy Paste: [[2503.11655]] Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning(https://arxiv.org/abs/2503.11655)
Keywords: in-context
Abstract: Recent advancements in large language models (LLMs) have significantly enhanced sentiment analysis capabilities. However, the trade-offs between model performance, efficiency, and explainability of some latest models remain underexplored. This study presents the first comprehensive evaluation of the DeepSeek-R1 series of models, reasoning open-source LLMs, for sentiment analysis, comparing them against OpenAI's GPT-4 and GPT-4-mini. We systematically analyze their performance under few-shot prompting conditions, scaling up to 50-shot configurations to assess in-context learning effectiveness. Our experiments reveal that DeepSeek-R1 demonstrates competitive accuracy, particularly in multi-class sentiment tasks, while offering enhanced interpretability through its detailed reasoning process. Additionally, we highlight the impact of increasing few-shot examples on model performance and discuss key trade-offs between explainability and computational efficiency.

Title: Automating Mathematical Proof Generation Using Large Language Model Agents and Knowledge Graphs

Authors: Vincent Li, Yule Fu, Tim Knappe, Kevin Han, Kevin Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11657
Pdf URL: https://arxiv.org/pdf/2503.11657
Copy Paste: [[2503.11657]] Automating Mathematical Proof Generation Using Large Language Model Agents and Knowledge Graphs(https://arxiv.org/abs/2503.11657)
Keywords: foundation model
Abstract: Large Language Models have demonstrated remarkable capabilities in natural language processing tasks, including mathematical problem-solving that requires multi-step logical reasoning. However, challenges persist in automating the identification of key mathematical concepts, understanding their interrelations, and formalizing proofs within a rigorous framework. We present a novel framework that leverages knowledge graphs to augment LLMs to construct and formalize mathematical proofs. Our results demonstrate significant performance improvements across multiple datasets, with using knowledge graphs, achieving up to a 34% success rate on the MUSTARDSAUCE dataset on o1-mini and consistently outperforming baseline approaches by 2-11% across different models. We show how this approach bridges the gap between natural language understanding and formal logic proof systems and achieve elevated results for foundation models over baseline.

Title: A Survey of Direct Preference Optimization

Authors: Shunyu Liu, Wenkai Fang, Zetian Hu, Junjie Zhang, Yang Zhou, Kongcheng Zhang, Rongcheng Tu, Ting-En Lin, Fei Huang, Mingli Song, Yongbin Li, Dacheng Tao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11701
Pdf URL: https://arxiv.org/pdf/2503.11701
Copy Paste: [[2503.11701]] A Survey of Direct Preference Optimization(https://arxiv.org/abs/2503.11701)
Keywords: generative
Abstract: Large Language Models (LLMs) have demonstrated unprecedented generative capabilities, yet their alignment with human values remains critical for ensuring helpful and harmless deployments. While Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful paradigm for aligning LLMs with human preferences, its reliance on complex reward modeling introduces inherent trade-offs in computational efficiency and training stability. In this context, Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative that directly optimizes LLMs using human preferences, thereby circumventing the need for explicit reward modeling. Owing to its theoretical elegance and computational efficiency, DPO has rapidly attracted substantial research efforts exploring its various implementations and applications. However, this field currently lacks systematic organization and comparative analysis. In this survey, we conduct a comprehensive overview of DPO and introduce a novel taxonomy, categorizing previous works into four key dimensions: data strategy, learning framework, constraint mechanism, and model property. We further present a rigorous empirical analysis of DPO variants across standardized benchmarks. Additionally, we discuss real-world applications, open challenges, and future directions for DPO. This work delivers both a conceptual framework for understanding DPO and practical guidance for practitioners, aiming to advance robust and generalizable alignment paradigms. All collected resources are available and will be continuously updated at this https URL.

Title: Fine-Tuning Diffusion Generative Models via Rich Preference Optimization

Authors: Hanyang Zhao, Haoxian Chen, Yucheng Guo, Genta Indra Winata, Tingting Ou, Ziyu Huang, David D. Yao, Wenpin Tang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11720
Pdf URL: https://arxiv.org/pdf/2503.11720
Copy Paste: [[2503.11720]] Fine-Tuning Diffusion Generative Models via Rich Preference Optimization(https://arxiv.org/abs/2503.11720)
Keywords: diffusion, generative
Abstract: We introduce Rich Preference Optimization (RPO), a novel pipeline that leverages rich feedback signals to improve the curation of preference pairs for fine-tuning text-to-image diffusion models. Traditional methods, like Diffusion-DPO, often rely solely on reward model labeling, which can be opaque, offer limited insights into the rationale behind preferences, and are prone to issues such as reward hacking or overfitting. In contrast, our approach begins with generating detailed critiques of synthesized images to extract reliable and actionable image editing instructions. By implementing these instructions, we create refined images, resulting in synthetic, informative preference pairs that serve as enhanced tuning datasets. We demonstrate the effectiveness of our pipeline and the resulting datasets in fine-tuning state-of-the-art diffusion models.

Title: BACE-RUL: A Bi-directional Adversarial Network with Covariate Encoding for Machine Remaining Useful Life Prediction

Authors: Zekai Zhang, Dan Li, Shunyu Wu, Junya Cai, Bo Zhang, See Kiong Ng, Zibin Zheng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11730
Pdf URL: https://arxiv.org/pdf/2503.11730
Copy Paste: [[2503.11730]] BACE-RUL: A Bi-directional Adversarial Network with Covariate Encoding for Machine Remaining Useful Life Prediction(https://arxiv.org/abs/2503.11730)
Keywords: generative
Abstract: Prognostic and Health Management (PHM) are crucial ways to avoid unnecessary maintenance for Cyber-Physical Systems (CPS) and improve system reliability. Predicting the Remaining Useful Life (RUL) is one of the most challenging tasks for PHM. Existing methods require prior knowledge about the system, contrived assumptions, or temporal mining to model the life cycles of machine equipment/devices, resulting in diminished accuracy and limited applicability in real-world scenarios. This paper proposes a Bi-directional Adversarial network with Covariate Encoding for machine Remaining Useful Life (BACE-RUL) prediction, which only adopts sensor measurements from the current life cycle to predict RUL rather than relying on previous consecutive cycle recordings. The current sensor measurements of mechanical devices are encoded to a conditional space to better understand the implicit inner mechanical status. The predictor is trained as a conditional generative network with the encoded sensor measurements as its conditions. Various experiments on several real-world datasets, including the turbofan aircraft engine dataset and the dataset collected from degradation experiments of Li-Ion battery cells, show that the proposed model is a general framework and outperforms state-of-the-art methods.

Title: Industrial-Grade Sensor Simulation via Gaussian Splatting: A Modular Framework for Scalable Editing and Full-Stack Validation

Authors: Xianming Zeng, Sicong Du, Qifeng Chen, Lizhe Liu, Haoyu Shu, Jiaxuan Gao, Jiarun Liu, Jiulong Xu, Jianyun Xu, Mingxia Chen, Yiru Zhao, Peng Chen, Yapeng Xue, Chunming Zhao, Sheng Yang, Qiang Li
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.11731
Pdf URL: https://arxiv.org/pdf/2503.11731
Copy Paste: [[2503.11731]] Industrial-Grade Sensor Simulation via Gaussian Splatting: A Modular Framework for Scalable Editing and Full-Stack Validation(https://arxiv.org/abs/2503.11731)
Keywords: diffusion
Abstract: Sensor simulation is pivotal for scalable validation of autonomous driving systems, yet existing Neural Radiance Fields (NeRF) based methods face applicability and efficiency challenges in industrial workflows. This paper introduces a Gaussian Splatting (GS) based system to address these challenges: We first break down sensor simulator components and analyze the possible advantages of GS over NeRF. Then in practice, we refactor three crucial components through GS, to leverage its explicit scene representation and real-time rendering: (1) choosing the 2D neural Gaussian representation for physics-compliant scene and sensor modeling, (2) proposing a scene editing pipeline to leverage Gaussian primitives library for data augmentation, and (3) coupling a controllable diffusion model for scene expansion and harmonization. We implement this framework on a proprietary autonomous driving dataset supporting cameras and LiDAR sensors. We demonstrate through ablation studies that our approach reduces frame-wise simulation latency, achieves better geometric and photometric consistency, and enables interpretable explicit scene editing and expansion. Furthermore, we showcase how integrating such a GS-based sensor simulator with traffic and dynamic simulators enables full-stack testing of end-to-end autonomy algorithms. Our work provides both algorithmic insights and practical validation, establishing GS as a cornerstone for industrial-grade sensor simulation.

Title: UBMF: Uncertainty-Aware Bayesian Meta-Learning Framework for Fault Diagnosis with Imbalanced Industrial Data

Authors: Zhixuan Lian, Shangyu Li, Qixuan Huang, Zijian Huang, Haifei Liu, Jianan Qiu, Puyu Yang, Laifa Tao
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.11774
Pdf URL: https://arxiv.org/pdf/2503.11774
Copy Paste: [[2503.11774]] UBMF: Uncertainty-Aware Bayesian Meta-Learning Framework for Fault Diagnosis with Imbalanced Industrial Data(https://arxiv.org/abs/2503.11774)
Keywords: self-supervised
Abstract: Fault diagnosis of mechanical equipment involves data collection, feature extraction, and pattern recognition but is often hindered by the imbalanced nature of industrial data, introducing significant uncertainty and reducing diagnostic reliability. To address these challenges, this study proposes the Uncertainty-Aware Bayesian Meta-Learning Framework (UBMF), which integrates four key modules: data perturbation injection for enhancing feature robustness, cross-task self-supervised feature extraction for improving transferability, uncertainty-based sample filtering for robust out-of-domain generalization, and Bayesian meta-knowledge integration for fine-grained classification. Experimental results on ten open-source datasets under various imbalanced conditions, including cross-task, small-sample, and unseen-sample scenarios, demonstrate the superiority of UBMF, achieving an average improvement of 42.22% across ten Any-way 1-5-shot diagnostic tasks. This integrated framework effectively enhances diagnostic accuracy, generalization, and adaptability, providing a reliable solution for complex industrial fault diagnosis.

Title: StyleMorpheus: A Style-Based 3D-Aware Morphable Face Model

Authors: Peizhi Yan, Rabab K. Ward, Dan Wang, Qiang Tang, Shan Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11792
Pdf URL: https://arxiv.org/pdf/2503.11792
Copy Paste: [[2503.11792]] StyleMorpheus: A Style-Based 3D-Aware Morphable Face Model(https://arxiv.org/abs/2503.11792)
Keywords: generative
Abstract: For 3D face modeling, the recently developed 3D-aware neural rendering methods are able to render photorealistic face images with arbitrary viewing directions. The training of the parametric controllable 3D-aware face models, however, still relies on a large-scale dataset that is lab-collected. To address this issue, this paper introduces "StyleMorpheus", the first style-based neural 3D Morphable Face Model (3DMM) that is trained on in-the-wild images. It inherits 3DMM's disentangled controllability (over face identity, expression, and appearance) but without the need for accurately reconstructed explicit 3D shapes. StyleMorpheus employs an auto-encoder structure. The encoder aims at learning a representative disentangled parametric code space and the decoder improves the disentanglement using shape and appearance-related style codes in the different sub-modules of the network. Furthermore, we fine-tune the decoder through style-based generative adversarial learning to achieve photorealistic 3D rendering quality. The proposed style-based design enables StyleMorpheus to achieve state-of-the-art 3D-aware face reconstruction results, while also allowing disentangled control of the reconstructed face. Our model achieves real-time rendering speed, allowing its use in virtual reality applications. We also demonstrate the capability of the proposed style-based design in face editing applications such as style mixing and color editing. Project homepage: this https URL.

Title: Bridging the LLM Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open LLMs for Automated Essay Scoring

Authors: Kezia Oketch, John P. Lalor, Yi Yang, Ahmed Abbasi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11827
Pdf URL: https://arxiv.org/pdf/2503.11827
Copy Paste: [[2503.11827]] Bridging the LLM Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open LLMs for Automated Essay Scoring(https://arxiv.org/abs/2503.11827)
Keywords: generative
Abstract: Closed large language models (LLMs) such as GPT-4 have set state-of-the-art results across a number of NLP tasks and have become central to NLP and machine learning (ML)-driven solutions. Closed LLMs' performance and wide adoption has sparked considerable debate about their accessibility in terms of availability, cost, and transparency. In this study, we perform a rigorous comparative analysis of nine leading LLMs, spanning closed, open, and open-source LLM ecosystems, across text assessment and generation tasks related to automated essay scoring. Our findings reveal that for few-shot learning-based assessment of human generated essays, open LLMs such as Llama 3 and Qwen2.5 perform comparably to GPT-4 in terms of predictive performance, with no significant differences in disparate impact scores when considering age- or race-related fairness. Moreover, Llama 3 offers a substantial cost advantage, being up to 37 times more cost-efficient than GPT-4. For generative tasks, we find that essays generated by top open LLMs are comparable to closed LLMs in terms of their semantic composition/embeddings and ML assessed scores. Our findings challenge the dominance of closed LLMs and highlight the democratizing potential of open LLMs, suggesting they can effectively bridge accessibility divides while maintaining competitive performance and fairness.

Title: How Can Time Series Analysis Benefit From Multiple Modalities? A Survey and Outlook

Authors: Haoxin Liu, Harshavardhan Kamarthi, Zhiyuan Zhao, Shangqing Xu, Shiyu Wang, Qingsong Wen, Tom Hartvigsen, Fei Wang, B. Aditya Prakash
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.11835
Pdf URL: https://arxiv.org/pdf/2503.11835
Copy Paste: [[2503.11835]] How Can Time Series Analysis Benefit From Multiple Modalities? A Survey and Outlook(https://arxiv.org/abs/2503.11835)
Keywords: foundation model
Abstract: Time series analysis (TSA) is a longstanding research topic in the data mining community and has wide real-world significance. Compared to "richer" modalities such as language and vision, which have recently experienced explosive development and are densely connected, the time-series modality remains relatively underexplored and isolated. We notice that many recent TSA works have formed a new research field, i.e., Multiple Modalities for TSA (MM4TSA). In general, these MM4TSA works follow a common motivation: how TSA can benefit from multiple modalities. This survey is the first to offer a comprehensive review and a detailed outlook for this emerging field. Specifically, we systematically discuss three benefits: (1) reusing foundation models of other modalities for efficient TSA, (2) multimodal extension for enhanced TSA, and (3) cross-modality interaction for advanced TSA. We further group the works by the introduced modality type, including text, images, audio, tables, and others, within each perspective. Finally, we identify the gaps with future opportunities, including the reused modalities selections, heterogeneous modality combinations, and unseen tasks generalizations, corresponding to the three benefits. We release an up-to-date GitHub repository that includes key papers and resources.

Title: Trust Under Siege: Label Spoofing Attacks against Machine Learning for Android Malware Detection

Authors: Tianwei Lan, Luca Demetrio, Farid Nait-Abdesselam, Yufei Han, Simone Aonzo
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11841
Pdf URL: https://arxiv.org/pdf/2503.11841
Copy Paste: [[2503.11841]] Trust Under Siege: Label Spoofing Attacks against Machine Learning for Android Malware Detection(https://arxiv.org/abs/2503.11841)
Keywords: anomaly
Abstract: Machine learning (ML) malware detectors rely heavily on crowd-sourced AntiVirus (AV) labels, with platforms like VirusTotal serving as a trusted source of malware annotations. But what if attackers could manipulate these labels to classify benign software as malicious? We introduce label spoofing attacks, a new threat that contaminates crowd-sourced datasets by embedding minimal and undetectable malicious patterns into benign samples. These patterns coerce AV engines into misclassifying legitimate files as harmful, enabling poisoning attacks against ML-based malware classifiers trained on those data. We demonstrate this scenario by developing AndroVenom, a methodology for polluting realistic data sources, causing consequent poisoning attacks against ML malware detectors. Experiments show that not only state-of-the-art feature extractors are unable to filter such injection, but also various ML models experience Denial of Service already with 1% poisoned samples. Additionally, attackers can flip decisions of specific unaltered benign samples by modifying only 0.015% of the training data, threatening their reputation and market share and being unable to be stopped by anomaly detectors on training data. We conclude our manuscript by raising the alarm on the trustworthiness of the training process based on AV annotations, requiring further investigation on how to produce proper labels for ML malware detectors.

Title: Test-Time Training Provably Improves Transformers as In-context Learners

Authors: Halil Alperen Gozeten, M. Emrullah Ildiz, Xuechen Zhang, Mahdi Soltanolkotabi, Marco Mondelli, Samet Oymak
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.11842
Pdf URL: https://arxiv.org/pdf/2503.11842
Copy Paste: [[2503.11842]] Test-Time Training Provably Improves Transformers as In-context Learners(https://arxiv.org/abs/2503.11842)
Keywords: foundation model, in-context
Abstract: Test-time training (TTT) methods explicitly update the weights of a model to adapt to the specific test instance, and they have found success in a variety of settings, including most recently language modeling and reasoning. To demystify this success, we investigate a gradient-based TTT algorithm for in-context learning, where we train a transformer model on the in-context demonstrations provided in the test prompt. Specifically, we provide a comprehensive theoretical characterization of linear transformers when the update rule is a single gradient step. Our theory (i) delineates the role of alignment between pretraining distribution and target task, (ii) demystifies how TTT can alleviate distribution shift, and (iii) quantifies the sample complexity of TTT including how it can significantly reduce the eventual sample size required for in-context learning. As our empirical contribution, we study the benefits of TTT for TabPFN, a tabular foundation model. In line with our theory, we demonstrate that TTT significantly reduces the required sample size for tabular classification (3 to 5 times fewer) unlocking substantial inference efficiency with a negligible training cost.

Title: Towards a Unified Copernicus Foundation Model for Earth Vision

Authors: Yi Wang, Zhitong Xiong, Chenying Liu, Adam J. Stewart, Thomas Dujardin, Nikolaos Ioannis Bountos, Angelos Zavras, Franziska Gerken, Ioannis Papoutsis, Laura Leal-Taixé, Xiao Xiang Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11849
Pdf URL: https://arxiv.org/pdf/2503.11849
Copy Paste: [[2503.11849]] Towards a Unified Copernicus Foundation Model for Earth Vision(https://arxiv.org/abs/2503.11849)
Keywords: foundation model
Abstract: Advances in Earth observation (EO) foundation models have unlocked the potential of big satellite data to learn generic representations from space, benefiting a wide range of downstream applications crucial to our planet. However, most existing efforts remain limited to fixed spectral sensors, focus solely on the Earth's surface, and overlook valuable metadata beyond imagery. In this work, we take a step towards next-generation EO foundation models with three key components: 1) Copernicus-Pretrain, a massive-scale pretraining dataset that integrates 18.7M aligned images from all major Copernicus Sentinel missions, spanning from the Earth's surface to its atmosphere; 2) Copernicus-FM, a unified foundation model capable of processing any spectral or non-spectral sensor modality using extended dynamic hypernetworks and flexible metadata encoding; and 3) Copernicus-Bench, a systematic evaluation benchmark with 15 hierarchical downstream tasks ranging from preprocessing to specialized applications for each Sentinel mission. Our dataset, model, and benchmark greatly improve the scalability, versatility, and multimodal adaptability of EO foundation models, while also creating new opportunities to connect EO, weather, and climate research. Codes, datasets and models are available at this https URL.

Title: Spatio-temporal Fourier Transformer (StFT) for Long-term Dynamics Prediction

Authors: Da Long, Shandian Zhe, Samuel Williams, Leonid Oliker, Zhe Bai
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2503.11899
Pdf URL: https://arxiv.org/pdf/2503.11899
Copy Paste: [[2503.11899]] Spatio-temporal Fourier Transformer (StFT) for Long-term Dynamics Prediction(https://arxiv.org/abs/2503.11899)
Keywords: generative
Abstract: Simulating the long-term dynamics of multi-scale and multi-physics systems poses a significant challenge in understanding complex phenomena across science and engineering. The complexity arises from the intricate interactions between scales and the interplay of diverse physical processes. Neural operators have emerged as promising models for predicting such dynamics due to their flexibility and computational efficiency. However, they often fail to effectively capture multi-scale interactions or quantify the uncertainties inherent in the predictions. These limitations lead to rapid error accumulation, particularly in long-term forecasting of systems characterized by complex and coupled dynamics. To address these challenges, we propose a spatio-temporal Fourier transformer (StFT), in which each transformer block is designed to learn dynamics at a specific scale. By leveraging a structured hierarchy of StFT blocks, the model explicitly captures dynamics across both macro- and micro- spatial scales. Furthermore, a generative residual correction mechanism is integrated to estimate and mitigate predictive uncertainties, enhancing both the accuracy and reliability of long-term forecasts. Evaluations conducted on three benchmark datasets (plasma, fluid, and atmospheric dynamics) demonstrate the advantages of our approach over state-of-the-art ML methods.

Title: Upcycling Text-to-Image Diffusion Models for Multi-Task Capabilities

Authors: Ruchika Chavhan, Abhinav Mehrotra, Malcolm Chadwick, Alberto Gil Ramos, Luca Morreale, Mehdi Noroozi, Sourav Bhattacharya
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11905
Pdf URL: https://arxiv.org/pdf/2503.11905
Copy Paste: [[2503.11905]] Upcycling Text-to-Image Diffusion Models for Multi-Task Capabilities(https://arxiv.org/abs/2503.11905)
Keywords: diffusion
Abstract: Text-to-image synthesis has witnessed remarkable advancements in recent years. Many attempts have been made to adopt text-to-image models to support multiple tasks. However, existing approaches typically require resource-intensive re-training or additional parameters to accommodate for the new tasks, which makes the model inefficient for on-device deployment. We propose Multi-Task Upcycling (MTU), a simple yet effective recipe that extends the capabilities of a pre-trained text-to-image diffusion model to support a variety of image-to-image generation tasks. MTU replaces Feed-Forward Network (FFN) layers in the diffusion model with smaller FFNs, referred to as experts, and combines them with a dynamic routing mechanism. To the best of our knowledge, MTU is the first multi-task diffusion modeling approach that seamlessly blends multi-tasking with on-device compatibility, by mitigating the issue of parameter inflation. We show that the performance of MTU is on par with the single-task fine-tuned diffusion models across several tasks including image editing, super-resolution, and inpainting, while maintaining similar latency and computational load (GFLOPs) as the single-task fine-tuned models.

Title: REGEN: A Dataset and Benchmarks with Natural Language Critiques and Narratives

Authors: Kun Su, Krishna Sayana, Hubert Pham, James Pine, Yuri Vasilevski, Raghavendra Vasudeva, Marialena Kyriakidi, Liam Hebert, Ambarish Jash, Anushya Subbiah, Sukhdeep Sodhi
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11924
Pdf URL: https://arxiv.org/pdf/2503.11924
Copy Paste: [[2503.11924]] REGEN: A Dataset and Benchmarks with Natural Language Critiques and Narratives(https://arxiv.org/abs/2503.11924)
Keywords: generative
Abstract: This paper introduces a novel dataset REGEN (Reviews Enhanced with GEnerative Narratives), designed to benchmark the conversational capabilities of recommender Large Language Models (LLMs), addressing the limitations of existing datasets that primarily focus on sequential item prediction. REGEN extends the Amazon Product Reviews dataset by inpainting two key natural language features: (1) user critiques, representing user "steering" queries that lead to the selection of a subsequent item, and (2) narratives, rich textual outputs associated with each recommended item taking into account prior context. The narratives include product endorsements, purchase explanations, and summaries of user preferences. Further, we establish an end-to-end modeling benchmark for the task of conversational recommendation, where models are trained to generate both recommendations and corresponding narratives conditioned on user history (items and critiques). For this joint task, we introduce a modeling framework LUMEN (LLM-based Unified Multi-task Model with Critiques, Recommendations, and Narratives) which uses an LLM as a backbone for critiquing, retrieval and generation. We also evaluate the dataset's quality using standard auto-rating techniques and benchmark it by training both traditional and LLM-based recommender models. Our results demonstrate that incorporating critiques enhances recommendation quality by enabling the recommender to learn language understanding and integrate it with recommendation signals. Furthermore, LLMs trained on our dataset effectively generate both recommendations and contextual narratives, achieving performance comparable to state-of-the-art recommenders and language models.

Title: Generating a Biometrically Unique and Realistic Iris Database

Authors: Jingxuan Zhang, Robert J. Hart, Ziqian Bi, Shiaofen Fang, Susan Walsh
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11930
Pdf URL: https://arxiv.org/pdf/2503.11930
Copy Paste: [[2503.11930]] Generating a Biometrically Unique and Realistic Iris Database(https://arxiv.org/abs/2503.11930)
Keywords: diffusion
Abstract: The use of the iris as a biometric identifier has increased dramatically over the last 30 years, prompting privacy and security concerns about the use of iris images in research. It can be difficult to acquire iris image databases due to ethical concerns, and this can be a barrier for those performing biometrics research. In this paper, we describe and show how to create a database of realistic, biometrically unidentifiable colored iris images by training a diffusion model within an open-source diffusion framework. Not only were we able to verify that our model is capable of creating iris textures that are biometrically unique from the training data, but we were also able to verify that our model output creates a full distribution of realistic iris pigmentations. We highlight the fact that the utility of diffusion networks to achieve these criteria with relative ease, warrants additional research in its use within the context of iris database generation and presentation attack security.

Title: Att-Adapter: A Robust and Precise Domain-Specific Multi-Attributes T2I Diffusion Adapter via Conditional Variational Autoencoder

Authors: Wonwoong Cho, Yan-Ying Chen, Matthew Klenk, David I. Inouye, Yanxia Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11937
Pdf URL: https://arxiv.org/pdf/2503.11937
Copy Paste: [[2503.11937]] Att-Adapter: A Robust and Precise Domain-Specific Multi-Attributes T2I Diffusion Adapter via Conditional Variational Autoencoder(https://arxiv.org/abs/2503.11937)
Keywords: diffusion
Abstract: Text-to-Image (T2I) Diffusion Models have achieved remarkable performance in generating high quality images. However, enabling precise control of continuous attributes, especially multiple attributes simultaneously, in a new domain (e.g., numeric values like eye openness or car width) with text-only guidance remains a significant challenge. To address this, we introduce the Attribute (Att) Adapter, a novel plug-and-play module designed to enable fine-grained, multi-attributes control in pretrained diffusion models. Our approach learns a single control adapter from a set of sample images that can be unpaired and contain multiple visual attributes. The Att-Adapter leverages the decoupled cross attention module to naturally harmonize the multiple domain attributes with text conditioning. We further introduce Conditional Variational Autoencoder (CVAE) to the Att-Adapter to mitigate overfitting, matching the diverse nature of the visual world. Evaluations on two public datasets show that Att-Adapter outperforms all LoRA-based baselines in controlling continuous attributes. Additionally, our method enables a broader control range and also improves disentanglement across multiple attributes, surpassing StyleGAN-based techniques. Notably, Att-Adapter is flexible, requiring no paired synthetic data for training, and is easily scalable to multiple attributes within a single model.

Title: Your Text Encoder Can Be An Object-Level Watermarking Controller

Authors: Naresh Kumar Devulapally, Mingzhen Huang, Vishal Asnani, Shruti Agarwal, Siwei Lyu, Vishnu Suresh Lokhande
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11945
Pdf URL: https://arxiv.org/pdf/2503.11945
Copy Paste: [[2503.11945]] Your Text Encoder Can Be An Object-Level Watermarking Controller(https://arxiv.org/abs/2503.11945)
Keywords: diffusion
Abstract: Invisible watermarking of AI-generated images can help with copyright protection, enabling detection and identification of AI-generated media. In this work, we present a novel approach to watermark images of T2I Latent Diffusion Models (LDMs). By only fine-tuning text token embeddings $W_*$, we enable watermarking in selected objects or parts of the image, offering greater flexibility compared to traditional full-image watermarking. Our method leverages the text encoder's compatibility across various LDMs, allowing plug-and-play integration for different LDMs. Moreover, introducing the watermark early in the encoding stage improves robustness to adversarial perturbations in later stages of the pipeline. Our approach achieves $99\%$ bit accuracy ($48$ bits) with a $10^5 \times$ reduction in model parameters, enabling efficient watermarking.

Title: Winning the MIDST Challenge: New Membership Inference Attacks on Diffusion Models for Tabular Data Synthesis

Authors: Xiaoyu Wu, Yifei Pang, Terrance Liu, Steven Wu
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2503.12008
Pdf URL: https://arxiv.org/pdf/2503.12008
Copy Paste: [[2503.12008]] Winning the MIDST Challenge: New Membership Inference Attacks on Diffusion Models for Tabular Data Synthesis(https://arxiv.org/abs/2503.12008)
Keywords: diffusion
Abstract: Tabular data synthesis using diffusion models has gained significant attention for its potential to balance data utility and privacy. However, existing privacy evaluations often rely on heuristic metrics or weak membership inference attacks (MIA), leaving privacy risks inadequately assessed. In this work, we conduct a rigorous MIA study on diffusion-based tabular synthesis, revealing that state-of-the-art attacks designed for image models fail in this setting. We identify noise initialization as a key factor influencing attack efficacy and propose a machine-learning-driven approach that leverages loss features across different noises and time steps. Our method, implemented with a lightweight MLP, effectively learns membership signals, eliminating the need for manual optimization. Experimental results from the MIDST Challenge @ SaTML 2025 demonstrate the effectiveness of our approach, securing first place across all tracks. Code is available at this https URL.

Title: QDM: Quadtree-Based Region-Adaptive Sparse Diffusion Models for Efficient Image Super-Resolution

Authors: Donglin Yang, Paul Vicol, Xiaojuan Qi, Renjie Liao, Xiaofan Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12015
Pdf URL: https://arxiv.org/pdf/2503.12015
Copy Paste: [[2503.12015]] QDM: Quadtree-Based Region-Adaptive Sparse Diffusion Models for Efficient Image Super-Resolution(https://arxiv.org/abs/2503.12015)
Keywords: diffusion
Abstract: Deep learning-based super-resolution (SR) methods often perform pixel-wise computations uniformly across entire images, even in homogeneous regions where high-resolution refinement is redundant. We propose the Quadtree Diffusion Model (QDM), a region-adaptive diffusion framework that leverages a quadtree structure to selectively enhance detail-rich regions while reducing computations in homogeneous areas. By guiding the diffusion with a quadtree derived from the low-quality input, QDM identifies key regions-represented by leaf nodes-where fine detail is essential and applies minimal refinement elsewhere. This mask-guided, two-stream architecture adaptively balances quality and efficiency, producing high-fidelity outputs with low computational redundancy. Experiments demonstrate QDM's effectiveness in high-resolution SR tasks across diverse image types, particularly in medical imaging (e.g., CT scans), where large homogeneous regions are prevalent. Furthermore, QDM outperforms or is comparable to state-of-the-art SR methods on standard benchmarks while significantly reducing computational costs, highlighting its efficiency and suitability for resource-limited environments. Our code is available at this https URL.

Title: Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art

Authors: Zhe Jin, Tat-Seng Chua
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12018
Pdf URL: https://arxiv.org/pdf/2503.12018
Copy Paste: [[2503.12018]] Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art(https://arxiv.org/abs/2503.12018)
Keywords: diffusion
Abstract: Text-to-Image (T2I) diffusion models (DM) have garnered widespread adoption due to their capability in generating high-fidelity outputs and accessibility to anyone able to put imagination into words. However, DMs are often predisposed to generate unappealing outputs, much like the random images on the internet they were trained on. Existing approaches to address this are founded on the implicit premise that visual aesthetics is universal, which is limiting. Aesthetics in the T2I context should be about personalization and we propose the novel task of aesthetics alignment which seeks to align user-specified aesthetics with the T2I generation output. Inspired by how artworks provide an invaluable perspective to approach aesthetics, we codify visual aesthetics using the compositional framework artists employ, known as the Principles of Art (PoA). To facilitate this study, we introduce CompArt, a large-scale compositional art dataset building on top of WikiArt with PoA analysis annotated by a capable Multimodal LLM. Leveraging the expressive power of LLMs and training a lightweight and transferrable adapter, we demonstrate that T2I DMs can effectively offer 10 compositional controls through user-specified PoA conditions. Additionally, we design an appropriate evaluation framework to assess the efficacy of our approach.

Title: Leveraging Motion Information for Better Self-Supervised Video Correspondence Learning

Authors: Zihan Zhoua, Changrui Daia, Aibo Songa, Xiaolin Fang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12026
Pdf URL: https://arxiv.org/pdf/2503.12026
Copy Paste: [[2503.12026]] Leveraging Motion Information for Better Self-Supervised Video Correspondence Learning(https://arxiv.org/abs/2503.12026)
Keywords: self-supervised
Abstract: Self-supervised video correspondence learning depends on the ability to accurately associate pixels between video frames that correspond to the same visual object. However, achieving reliable pixel matching without supervision remains a major challenge. To address this issue, recent research has focused on feature learning techniques that aim to encode unique pixel representations for matching. Despite these advances, existing methods still struggle to achieve exact pixel correspondences and often suffer from false matches, limiting their effectiveness in self-supervised settings. To this end, we explore an efficient self-supervised Video Correspondence Learning framework (MER) that aims to accurately extract object details from unlabeled videos. First, we design a dedicated Motion Enhancement Engine that emphasizes capturing the dynamic motion of objects in videos. In addition, we introduce a flexible sampling strategy for inter-pixel correspondence information (Multi-Cluster Sampler) that enables the model to pay more attention to the pixel changes of important objects in motion. Through experiments, our algorithm outperforms the state-of-the-art competitors on video correspondence learning tasks such as video object segmentation and video object keypoint tracking.

Title: Unsupervised Graph Anomaly Detection via Multi-Hypersphere Heterophilic Graph Learning

Authors: Hang Ni, Jindong Han, Nengjun Zhu, Hao Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12037
Pdf URL: https://arxiv.org/pdf/2503.12037
Copy Paste: [[2503.12037]] Unsupervised Graph Anomaly Detection via Multi-Hypersphere Heterophilic Graph Learning(https://arxiv.org/abs/2503.12037)
Keywords: anomaly
Abstract: Graph Anomaly Detection (GAD) plays a vital role in various data mining applications such as e-commerce fraud prevention and malicious user detection. Recently, Graph Neural Network (GNN) based approach has demonstrated great effectiveness in GAD by first encoding graph data into low-dimensional representations and then identifying anomalies under the guidance of supervised or unsupervised signals. However, existing GNN-based approaches implicitly follow the homophily principle (i.e., the "like attracts like" phenomenon) and fail to learn discriminative embedding for anomalies that connect vast normal nodes. Moreover, such approaches identify anomalies in a unified global perspective but overlook diversified abnormal patterns conditioned on local graph context, leading to suboptimal performance. To overcome the aforementioned limitations, in this paper, we propose a Multi-hypersphere Heterophilic Graph Learning (MHetGL) framework for unsupervised GAD. Specifically, we first devise a Heterophilic Graph Encoding (HGE) module to learn distinguishable representations for potential anomalies by purifying and augmenting their neighborhood in a fully unsupervised manner. Then, we propose a Multi-Hypersphere Learning (MHL) module to enhance the detection capability for context-dependent anomalies by jointly incorporating critical patterns from both global and local perspectives. Extensive experiments on ten real-world datasets show that MHetGL outperforms 14 baselines. Our code is publicly available at this https URL.

Title: TACO: Taming Diffusion for in-the-wild Video Amodal Completion

Authors: Ruijie Lu, Yixin Chen, Yu Liu, Jiaxiang Tang, Junfeng Ni, Diwen Wan, Gang Zeng, Siyuan Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12049
Pdf URL: https://arxiv.org/pdf/2503.12049
Copy Paste: [[2503.12049]] TACO: Taming Diffusion for in-the-wild Video Amodal Completion(https://arxiv.org/abs/2503.12049)
Keywords: diffusion
Abstract: Humans can infer complete shapes and appearances of objects from limited visual cues, relying on extensive prior knowledge of the physical world. However, completing partially observable objects while ensuring consistency across video frames remains challenging for existing models, especially for unstructured, in-the-wild videos. This paper tackles the task of Video Amodal Completion (VAC), which aims to generate the complete object consistently throughout the video given a visual prompt specifying the object of interest. Leveraging the rich, consistent manifolds learned by pre-trained video diffusion models, we propose a conditional diffusion model, TACO, that repurposes these manifolds for VAC. To enable its effective and robust generalization to challenging in-the-wild scenarios, we curate a large-scale synthetic dataset with multiple difficulty levels by systematically imposing occlusions onto un-occluded videos. Building on this, we devise a progressive fine-tuning paradigm that starts with simpler recovery tasks and gradually advances to more complex ones. We demonstrate TACO's versatility on a wide range of in-the-wild videos from Internet, as well as on diverse, unseen datasets commonly used in autonomous driving, robotic manipulation, and scene understanding. Moreover, we show that TACO can be effectively applied to various downstream tasks like object reconstruction and pose estimation, highlighting its potential to facilitate physical world understanding and reasoning. Our project page is available at this https URL.

Title: Tailor: An Integrated Text-Driven CG-Ready Human and Garment Generation System

Authors: Zhiyao Sun, Yu-Hui Wen, Matthieu Lin, Ho-Jui Fang, Sheng Ye, Tian Lv, Yong-Jin Liu
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2503.12052
Pdf URL: https://arxiv.org/pdf/2503.12052
Copy Paste: [[2503.12052]] Tailor: An Integrated Text-Driven CG-Ready Human and Garment Generation System(https://arxiv.org/abs/2503.12052)
Keywords: diffusion, generative
Abstract: Creating detailed 3D human avatars with garments typically requires specialized expertise and labor-intensive processes. Although recent advances in generative AI have enabled text-to-3D human/clothing generation, current methods fall short in offering accessible, integrated pipelines for producing ready-to-use clothed avatars. To solve this, we introduce Tailor, an integrated text-to-avatar system that generates high-fidelity, customizable 3D humans with simulation-ready garments. Our system includes a three-stage pipeline. We first employ a large language model to interpret textual descriptions into parameterized body shapes and semantically matched garment templates. Next, we develop topology-preserving deformation with novel geometric losses to adapt garments precisely to body geometries. Furthermore, an enhanced texture diffusion module with a symmetric local attention mechanism ensures both view consistency and photorealistic details. Quantitative and qualitative evaluations demonstrate that Tailor outperforms existing SoTA methods in terms of fidelity, usability, and diversity. Code will be available for academic use.

Title: A Comprehensive Survey on Knowledge Distillation

Authors: Amir M. Mansourian, Rozhan Ahmadi, Masoud Ghafouri, Amir Mohammad Babaei, Elaheh Badali Golezani, Zeynab Yasamani Ghamchi, Vida Ramezanian, Alireza Taherian, Kimia Dinashi, Amirali Miri, Shohreh Kasaei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12067
Pdf URL: https://arxiv.org/pdf/2503.12067
Copy Paste: [[2503.12067]] A Comprehensive Survey on Knowledge Distillation(https://arxiv.org/abs/2503.12067)
Keywords: diffusion, foundation model
Abstract: Deep Neural Networks (DNNs) have achieved notable performance in the fields of computer vision and natural language processing with various applications in both academia and industry. However, with recent advancements in DNNs and transformer models with a tremendous number of parameters, deploying these large models on edge devices causes serious issues such as high runtime and memory consumption. This is especially concerning with the recent large-scale foundation models, Vision-Language Models (VLMs), and Large Language Models (LLMs). Knowledge Distillation (KD) is one of the prominent techniques proposed to address the aforementioned problems using a teacher-student architecture. More specifically, a lightweight student model is trained using additional knowledge from a cumbersome teacher model. In this work, a comprehensive survey of knowledge distillation methods is proposed. This includes reviewing KD from different aspects: distillation sources, distillation schemes, distillation algorithms, distillation by modalities, applications of distillation, and comparison among existing methods. In contrast to most existing surveys, which are either outdated or simply update former surveys, this work proposes a comprehensive survey with a new point of view and representation structure that categorizes and investigates the most recent methods in knowledge distillation. This survey considers various critically important subcategories, including KD for diffusion models, 3D inputs, foundational models, transformers, and LLMs. Furthermore, existing challenges in KD and possible future research directions are discussed. Github page of the project: this https URL

Title: Temporally Consistent Mitral Annulus Measurements from Sparse Annotations in Echocardiographic Videos

Authors: Gino E. Jansen, Mark J. Schuuring, Berto J. Bouma, Ivana Išgum
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12087
Pdf URL: https://arxiv.org/pdf/2503.12087
Copy Paste: [[2503.12087]] Temporally Consistent Mitral Annulus Measurements from Sparse Annotations in Echocardiographic Videos(https://arxiv.org/abs/2503.12087)
Keywords: self-supervised
Abstract: This work presents a novel approach to achieving temporally consistent mitral annulus landmark localization in echocardiography videos using sparse annotations. Our method introduces a self-supervised loss term that enforces temporal consistency between neighboring frames, which smooths the position of landmarks and enhances measurement accuracy over time. Additionally, we incorporate realistic field-of-view augmentations to improve the recognition of missing anatomical landmarks. We evaluate our approach on both a public and private dataset, and demonstrate significant improvements in Mitral Annular Plane Systolic Excursion (MAPSE) calculations and overall landmark tracking stability. The method achieves a mean absolute MAPSE error of 1.81 $\pm$ 0.14 mm, an annulus size error of 2.46 $\pm$ 0.31 mm, and a landmark localization error of 2.48 $\pm$ 0.07 mm. Finally, it achieves a 0.99 ROC-AUC for recognition of missing landmarks.

Title: A Speech-to-Video Synthesis Approach Using Spatio-Temporal Diffusion for Vocal Tract MRI

Authors: Paula Andrea Pérez-Toro, Tomás Arias-Vergara, Fangxu Xing, Xiaofeng Liu, Maureen Stone, Jiachen Zhuo, Juan Rafael Orozco-Arroyave, Elmar Nöth, Jana Hutter, Jerry L. Prince, Andreas Maier, Jonghye Woo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12102
Pdf URL: https://arxiv.org/pdf/2503.12102
Copy Paste: [[2503.12102]] A Speech-to-Video Synthesis Approach Using Spatio-Temporal Diffusion for Vocal Tract MRI(https://arxiv.org/abs/2503.12102)
Keywords: diffusion
Abstract: Understanding the relationship between vocal tract motion during speech and the resulting acoustic signal is crucial for aided clinical assessment and developing personalized treatment and rehabilitation strategies. Toward this goal, we introduce an audio-to-video generation framework for creating Real Time/cine-Magnetic Resonance Imaging (RT-/cine-MRI) visuals of the vocal tract from speech signals. Our framework first preprocesses RT-/cine-MRI sequences and speech samples to achieve temporal alignment, ensuring synchronization between visual and audio data. We then employ a modified stable diffusion model, integrating structural and temporal blocks, to effectively capture movement characteristics and temporal dynamics in the synchronized data. This process enables the generation of MRI sequences from new speech inputs, improving the conversion of audio into visual data. We evaluated our framework on healthy controls and tongue cancer patients by analyzing and comparing the vocal tract movements in synthesized videos. Our framework demonstrated adaptability to new speech inputs and effective generalization. In addition, positive human evaluations confirmed its effectiveness, with realistic and accurate visualizations, suggesting its potential for outpatient therapy and personalized simulation of vocal tract visualizations.

Title: Robust Isolation Forest using Soft Sparse Random Projection and Valley Emphasis Method

Authors: Hun Kang, Kyoungok Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12125
Pdf URL: https://arxiv.org/pdf/2503.12125
Copy Paste: [[2503.12125]] Robust Isolation Forest using Soft Sparse Random Projection and Valley Emphasis Method(https://arxiv.org/abs/2503.12125)
Keywords: anomaly
Abstract: Isolation Forest (iForest) is an unsupervised anomaly detection algorithm designed to effectively detect anomalies under the assumption that anomalies are ``few and different." Various studies have aimed to enhance iForest, but the resulting algorithms often exhibited significant performance disparities across datasets. Additionally, the challenge of isolating rare and widely distributed anomalies persisted in research focused on improving splits. To address these challenges, we introduce Robust iForest (RiForest). RiForest leverages both existing features and random hyperplanes obtained through soft sparse random projection to identify superior split features for anomaly detection, independent of datasets. It utilizes the underutilized valley emphasis method for optimal split point determination and incorporates sparsity randomization in soft sparse random projection for enhanced anomaly detection robustness. Across 24 benchmark datasets, experiments demonstrate RiForest's consistent outperformance of existing algorithms in anomaly detection, emphasizing stability and robustness to noise variables.

Title: DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap

Authors: Shentong Mo, Zehua Chen, Fan Bao, Jun Zhu
Subjects: cs.CV, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.12131
Pdf URL: https://arxiv.org/pdf/2503.12131
Copy Paste: [[2503.12131]] DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap(https://arxiv.org/abs/2503.12131)
Keywords: diffusion, generative
Abstract: Recent works in cross-modal understanding and generation, notably through models like CLAP (Contrastive Language-Audio Pretraining) and CAVP (Contrastive Audio-Visual Pretraining), have significantly enhanced the alignment of text, video, and audio embeddings via a single contrastive loss. However, these methods often overlook the bidirectional interactions and inherent noises present in each modality, which can crucially impact the quality and efficacy of cross-modal integration. To address this limitation, we introduce DiffGAP, a novel approach incorporating a lightweight generative module within the contrastive space. Specifically, our DiffGAP employs a bidirectional diffusion process tailored to bridge the cross-modal gap more effectively. This involves a denoising process on text and video embeddings conditioned on audio embeddings and vice versa, thus facilitating a more nuanced and robust cross-modal interaction. Our experimental results on VGGSound and AudioCaps datasets demonstrate that DiffGAP significantly improves performance in video/text-audio generation and retrieval tasks, confirming its effectiveness in enhancing cross-modal understanding and generation capabilities.

Title: Probabilistic Graph Circuits: Deep Generative Models for Tractable Probabilistic Inference over Graphs

Authors: Milan Papež, Martin Rektoris, Václav Šmídl, Tomáš Pevný
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12162
Pdf URL: https://arxiv.org/pdf/2503.12162
Copy Paste: [[2503.12162]] Probabilistic Graph Circuits: Deep Generative Models for Tractable Probabilistic Inference over Graphs(https://arxiv.org/abs/2503.12162)
Keywords: generative, anomaly
Abstract: Deep generative models (DGMs) have recently demonstrated remarkable success in capturing complex probability distributions over graphs. Although their excellent performance is attributed to powerful and scalable deep neural networks, it is, at the same time, exactly the presence of these highly non-linear transformations that makes DGMs intractable. Indeed, despite representing probability distributions, intractable DGMs deny probabilistic foundations by their inability to answer even the most basic inference queries without approximations or design choices specific to a very narrow range of queries. To address this limitation, we propose probabilistic graph circuits (PGCs), a framework of tractable DGMs that provide exact and efficient probabilistic inference over (arbitrary parts of) graphs. Nonetheless, achieving both exactness and efficiency is challenging in the permutation-invariant setting of graphs. We design PGCs that are inherently invariant and satisfy these two requirements, yet at the cost of low expressive power. Therefore, we investigate two alternative strategies to achieve the invariance: the first sacrifices the efficiency, and the second sacrifices the exactness. We demonstrate that ignoring the permutation invariance can have severe consequences in anomaly detection, and that the latter approach is competitive with, and sometimes better than, existing intractable DGMs in the context of molecular graph generation.

Title: SEAL: Semantic Aware Image Watermarking

Authors: Kasra Arabi, R. Teal Witter, Chinmay Hegde, Niv Cohen
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2503.12172
Pdf URL: https://arxiv.org/pdf/2503.12172
Copy Paste: [[2503.12172]] SEAL: Semantic Aware Image Watermarking(https://arxiv.org/abs/2503.12172)
Keywords: diffusion, generative
Abstract: Generative models have rapidly evolved to generate realistic outputs. However, their synthetic outputs increasingly challenge the clear distinction between natural and AI-generated content, necessitating robust watermarking techniques. Watermarks are typically expected to preserve the integrity of the target image, withstand removal attempts, and prevent unauthorized replication onto unrelated images. To address this need, recent methods embed persistent watermarks into images produced by diffusion models using the initial noise. Yet, to do so, they either distort the distribution of generated images or rely on searching through a long dictionary of used keys for detection. In this paper, we propose a novel watermarking method that embeds semantic information about the generated image directly into the watermark, enabling a distortion-free watermark that can be verified without requiring a database of key patterns. Instead, the key pattern can be inferred from the semantic embedding of the image using locality-sensitive hashing. Furthermore, conditioning the watermark detection on the original image content improves robustness against forgery attacks. To demonstrate that, we consider two largely overlooked attack strategies: (i) an attacker extracting the initial noise and generating a novel image with the same pattern; (ii) an attacker inserting an unrelated (potentially harmful) object into a watermarked image, possibly while preserving the watermark. We empirically validate our method's increased robustness to these attacks. Taken together, our results suggest that content-aware watermarks can mitigate risks arising from image-generative models.

Title: STAY Diffusion: Styled Layout Diffusion Model for Diverse Layout-to-Image Generation

Authors: Ruyu Wang, Xuefeng Hou, Sabrina Schmedding, Marco F. Huber
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12213
Pdf URL: https://arxiv.org/pdf/2503.12213
Copy Paste: [[2503.12213]] STAY Diffusion: Styled Layout Diffusion Model for Diverse Layout-to-Image Generation(https://arxiv.org/abs/2503.12213)
Keywords: diffusion, self-supervised
Abstract: In layout-to-image (L2I) synthesis, controlled complex scenes are generated from coarse information like bounding boxes. Such a task is exciting to many downstream applications because the input layouts offer strong guidance to the generation process while remaining easily reconfigurable by humans. In this paper, we proposed STyled LAYout Diffusion (STAY Diffusion), a diffusion-based model that produces photo-realistic images and provides fine-grained control of stylized objects in scenes. Our approach learns a global condition for each layout, and a self-supervised semantic map for weight modulation using a novel Edge-Aware Normalization (EA Norm). A new Styled-Mask Attention (SM Attention) is also introduced to cross-condition the global condition and image feature for capturing the objects' relationships. These measures provide consistent guidance through the model, enabling more accurate and controllable image generation. Extensive benchmarking demonstrates that our STAY Diffusion presents high-quality images while surpassing previous state-of-the-art methods in generation diversity, accuracy, and controllability.

Title: Cross-Modal Diffusion for Biomechanical Dynamical Systems Through Local Manifold Alignment

Authors: Sharmita Dey, Sarath Ravindran Nair
Subjects: cs.LG, math.DS
Abstract URL: https://arxiv.org/abs/2503.12214
Pdf URL: https://arxiv.org/pdf/2503.12214
Copy Paste: [[2503.12214]] Cross-Modal Diffusion for Biomechanical Dynamical Systems Through Local Manifold Alignment(https://arxiv.org/abs/2503.12214)
Keywords: diffusion
Abstract: We present a mutually aligned diffusion framework for cross-modal biomechanical motion generation, guided by a dynamical systems perspective. By treating each modality, e.g., observed joint angles ($X$) and ground reaction forces ($Y$), as complementary observations of a shared underlying locomotor dynamical system, our method aligns latent representations at each diffusion step, so that one modality can help denoise and disambiguate the other. Our alignment approach is motivated by the fact that local time windows of $X$ and $Y$ represent the same phase of an underlying dynamical system, thereby benefiting from a shared latent manifold. We introduce a simple local latent manifold alignment (LLMA) strategy that incorporates first-order and second-order alignment within the latent space for robust cross-modal biomechanical generation without bells and whistles. Through experiments on multimodal human biomechanics data, we show that aligning local latent dynamics across modalities improves generation fidelity and yields better representations.

Title: Minuscule Cell Detection in AS-OCT Images with Progressive Field-of-View Focusing

Authors: Boyu Chen, Ameenat L. Solebo, Daqian Shi, Jinge Wu, Paul Taylor
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12249
Pdf URL: https://arxiv.org/pdf/2503.12249
Copy Paste: [[2503.12249]] Minuscule Cell Detection in AS-OCT Images with Progressive Field-of-View Focusing(https://arxiv.org/abs/2503.12249)
Keywords: foundation model
Abstract: Anterior Segment Optical Coherence Tomography (AS-OCT) is an emerging imaging technique with great potential for diagnosing anterior uveitis, a vision-threatening ocular inflammatory condition. A hallmark of this condition is the presence of inflammatory cells in the eye's anterior chamber, and detecting these cells using AS-OCT images has attracted research interest. While recent efforts aim to replace manual cell detection with automated computer vision approaches, detecting extremely small (minuscule) objects in high-resolution images, such as AS-OCT, poses substantial challenges: (1) each cell appears as a minuscule particle, representing less than 0.005\% of the image, making the detection difficult, and (2) OCT imaging introduces pixel-level noise that can be mistaken for cells, leading to false positive detections. To overcome these challenges, we propose a minuscule cell detection framework through a progressive field-of-view focusing strategy. This strategy systematically refines the detection scope from the whole image to a target region where cells are likely to be present, and further to minuscule regions potentially containing individual cells. Our framework consists of two modules. First, a Field-of-Focus module uses a vision foundation model to segment the target region. Subsequently, a Fine-grained Object Detection module introduces a specialized Minuscule Region Proposal followed by a Spatial Attention Network to distinguish individual cells from noise within the segmented region. Experimental results demonstrate that our framework outperforms state-of-the-art methods for cell detection, providing enhanced efficacy for clinical applications. Our code is publicly available at: this https URL.

Title: Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection

Authors: Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Arsh Koneru, Yusuke Kato, Kazuki Kozuka, Aditya Grover
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12271
Pdf URL: https://arxiv.org/pdf/2503.12271
Copy Paste: [[2503.12271]] Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection(https://arxiv.org/abs/2503.12271)
Keywords: diffusion, in-context
Abstract: The predominant approach to advancing text-to-image generation has been training-time scaling, where larger models are trained on more data using greater computational resources. While effective, this approach is computationally expensive, leading to growing interest in inference-time scaling to improve performance. Currently, inference-time scaling for text-to-image diffusion models is largely limited to best-of-N sampling, where multiple images are generated per prompt and a selection model chooses the best output. Inspired by the recent success of reasoning models like DeepSeek-R1 in the language domain, we introduce an alternative to naive best-of-N sampling by equipping text-to-image Diffusion Transformers with in-context reflection capabilities. We propose Reflect-DiT, a method that enables Diffusion Transformers to refine their generations using in-context examples of previously generated images alongside textual feedback describing necessary improvements. Instead of passively relying on random sampling and hoping for a better result in a future generation, Reflect-DiT explicitly tailors its generations to address specific aspects requiring enhancement. Experimental results demonstrate that Reflect-DiT improves performance on the GenEval benchmark (+0.19) using SANA-1.0-1.6B as a base model. Additionally, it achieves a new state-of-the-art score of 0.81 on GenEval while generating only 20 samples per prompt, surpassing the previous best score of 0.80, which was obtained using a significantly larger model (SANA-1.5-4.8B) with 2048 samples under the best-of-N approach.

Title: Toward Foundation Models for Online Complex Event Detection in CPS-IoT: A Case Study

Authors: Liying Han, Gaofeng Dong, Xiaomin Ouyang, Lance Kaplan, Federico Cerutti, Mani Srivastava
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12282
Pdf URL: https://arxiv.org/pdf/2503.12282
Copy Paste: [[2503.12282]] Toward Foundation Models for Online Complex Event Detection in CPS-IoT: A Case Study(https://arxiv.org/abs/2503.12282)
Keywords: foundation model
Abstract: Complex events (CEs) play a crucial role in CPS-IoT applications, enabling high-level decision-making in domains such as smart monitoring and autonomous systems. However, most existing models focus on short-span perception tasks, lacking the long-term reasoning required for CE detection. CEs consist of sequences of short-time atomic events (AEs) governed by spatiotemporal dependencies. Detecting them is difficult due to long, noisy sensor data and the challenge of filtering out irrelevant AEs while capturing meaningful patterns. This work explores CE detection as a case study for CPS-IoT foundation models capable of long-term reasoning. We evaluate three approaches: (1) leveraging large language models (LLMs), (2) employing various neural architectures that learn CE rules from data, and (3) adopting a neurosymbolic approach that integrates neural models with symbolic engines embedding human knowledge. Our results show that the state-space model, Mamba, which belongs to the second category, outperforms all methods in accuracy and generalization to longer, unseen sensor traces. These findings suggest that state-space models could be a strong backbone for CPS-IoT foundation models for long-span reasoning tasks.

Title: Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes

Authors: Da Wu, Zhanliang Wang, Quan Nguyen, Kai Wang
Subjects: cs.CL, cs.AI, q-bio.GN, q-bio.QM
Abstract URL: https://arxiv.org/abs/2503.12286
Pdf URL: https://arxiv.org/pdf/2503.12286
Copy Paste: [[2503.12286]] Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes(https://arxiv.org/abs/2503.12286)
Keywords: foundation model
Abstract: Background: Several studies show that large language models (LLMs) struggle with phenotype-driven gene prioritization for rare diseases. These studies typically use Human Phenotype Ontology (HPO) terms to prompt foundation models like GPT and LLaMA to predict candidate genes. However, in real-world settings, foundation models are not optimized for domain-specific tasks like clinical diagnosis, yet inputs are unstructured clinical notes rather than standardized terms. How LLMs can be instructed to predict candidate genes or disease diagnosis from unstructured clinical notes remains a major challenge. Methods: We introduce RAG-driven CoT and CoT-driven RAG, two methods that combine Chain-of-Thought (CoT) and Retrieval Augmented Generation (RAG) to analyze clinical notes. A five-question CoT protocol mimics expert reasoning, while RAG retrieves data from sources like HPO and OMIM (Online Mendelian Inheritance in Man). We evaluated these approaches on rare disease datasets, including 5,980 Phenopacket-derived notes, 255 literature-based narratives, and 220 in-house clinical notes from Childrens Hospital of Philadelphia. Results: We found that recent foundations models, including Llama 3.3-70B-Instruct and DeepSeek-R1-Distill-Llama-70B, outperformed earlier versions such as Llama 2 and GPT-3.5. We also showed that RAG-driven CoT and CoT-driven RAG both outperform foundation models in candidate gene prioritization from clinical notes; in particular, both methods with DeepSeek backbone resulted in a top-10 gene accuracy of over 40% on Phenopacket-derived clinical notes. RAG-driven CoT works better for high-quality notes, where early retrieval can anchor the subsequent reasoning steps in domain-specific evidence, while CoT-driven RAG has advantage when processing lengthy and noisy notes.

Title: The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation

Authors: Olivier Gouvert, Julie Hunter, Jérôme Louradour, Christophe Cerisara, Evan Dufraisse, Yaya Sy, Laura Rivière, Jean-Pierre Lorré, OpenLLM-France community
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12294
Pdf URL: https://arxiv.org/pdf/2503.12294
Copy Paste: [[2503.12294]] The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation(https://arxiv.org/abs/2503.12294)
Keywords: foundation model
Abstract: We present both the Lucie Training Dataset and the Lucie-7B foundation model. The Lucie Training Dataset is a multilingual collection of textual corpora centered around French and designed to offset anglo-centric biases found in many datasets for large language model pretraining. Its French data is pulled not only from traditional web sources, but also from French cultural heritage documents, filling an important gap in modern datasets. Beyond French, which makes up the largest share of the data, we added documents to support several other European languages, including English, Spanish, German, and Italian. Apart from its value as a resource for French language and culture, an important feature of this dataset is that it prioritizes data rights by minimizing copyrighted material. In addition, building on the philosophy of past open projects, it is redistributed in the form used for training and its processing is described on Hugging Face and GitHub. The Lucie-7B foundation model is trained on equal amounts of data in French and English -- roughly 33% each -- in an effort to better represent cultural aspects of French-speaking communities. We also describe two instruction fine-tuned models, Lucie-7B-Instruct-v1.1 and Lucie-7B-Instruct-human-data, which we release as demonstrations of Lucie-7B in use. These models achieve promising results compared to state-of-the-art models, demonstrating that an open approach prioritizing data rights can still deliver strong performance. We see these models as an initial step toward developing more performant, aligned models in the near future. Model weights for Lucie-7B and the Lucie instruct models, along with intermediate checkpoints for the former, are published on Hugging Face, while model training and data preparation code is available on GitHub. This makes Lucie-7B one of the first OSI compliant language models according to the new OSI definition.

Title: ProbDiffFlow: An Efficient Learning-Free Framework for Probabilistic Single-Image Optical Flow Estimation

Authors: Mo Zhou, Jianwei Wang, Xuanmeng Zhang, Dylan Campbell, Kai Wang, Long Yuan, Wenjie Zhang, Xuemin Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12348
Pdf URL: https://arxiv.org/pdf/2503.12348
Copy Paste: [[2503.12348]] ProbDiffFlow: An Efficient Learning-Free Framework for Probabilistic Single-Image Optical Flow Estimation(https://arxiv.org/abs/2503.12348)
Keywords: diffusion
Abstract: This paper studies optical flow estimation, a critical task in motion analysis with applications in autonomous navigation, action recognition, and film production. Traditional optical flow methods require consecutive frames, which are often unavailable due to limitations in data acquisition or real-world scene disruptions. Thus, single-frame optical flow estimation is emerging in the literature. However, existing single-frame approaches suffer from two major limitations: (1) they rely on labeled training data, making them task-specific, and (2) they produce deterministic predictions, failing to capture motion uncertainty. To overcome these challenges, we propose ProbDiffFlow, a training-free framework that estimates optical flow distributions from a single image. Instead of directly predicting motion, ProbDiffFlow follows an estimation-by-synthesis paradigm: it first generates diverse plausible future frames using a diffusion-based model, then estimates motion from these synthesized samples using a pre-trained optical flow model, and finally aggregates the results into a probabilistic flow distribution. This design eliminates the need for task-specific training while capturing multiple plausible motions. Experiments on both synthetic and real-world datasets demonstrate that ProbDiffFlow achieves superior accuracy, diversity, and efficiency, outperforming existing single-image and two-frame baselines.

Title: Localized Concept Erasure for Text-to-Image Diffusion Models Using Training-Free Gated Low-Rank Adaptation

Authors: Byung Hyun Lee, Sungjin Lim, Se Young Chun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12356
Pdf URL: https://arxiv.org/pdf/2503.12356
Copy Paste: [[2503.12356]] Localized Concept Erasure for Text-to-Image Diffusion Models Using Training-Free Gated Low-Rank Adaptation(https://arxiv.org/abs/2503.12356)
Keywords: diffusion
Abstract: Fine-tuning based concept erasing has demonstrated promising results in preventing generation of harmful contents from text-to-image diffusion models by removing target concepts while preserving remaining concepts. To maintain the generation capability of diffusion models after concept erasure, it is necessary to remove only the image region containing the target concept when it locally appears in an image, leaving other regions intact. However, prior arts often compromise fidelity of the other image regions in order to erase the localized target concept appearing in a specific area, thereby reducing the overall performance of image generation. To address these limitations, we first introduce a framework called localized concept erasure, which allows for the deletion of only the specific area containing the target concept in the image while preserving the other regions. As a solution for the localized concept erasure, we propose a training-free approach, dubbed Gated Low-rank adaptation for Concept Erasure (GLoCE), that injects a lightweight module into the diffusion model. GLoCE consists of low-rank matrices and a simple gate, determined only by several generation steps for concepts without training. By directly applying GLoCE to image embeddings and designing the gate to activate only for target concepts, GLoCE can selectively remove only the region of the target concepts, even when target and remaining concepts coexist within an image. Extensive experiments demonstrated GLoCE not only improves the image fidelity to text prompts after erasing the localized target concepts, but also outperforms prior arts in efficacy, specificity, and robustness by large margin and can be extended to mass concept erasure.

Title: Pathology Image Restoration via Mixture of Prompts

Authors: Jiangdong Cai, Yan Chen, Zhenrong Shen, Haotian Jiang, Honglin Xiong, Kai Xuan, Lichi Zhang, Qian Wang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.12399
Pdf URL: https://arxiv.org/pdf/2503.12399
Copy Paste: [[2503.12399]] Pathology Image Restoration via Mixture of Prompts(https://arxiv.org/abs/2503.12399)
Keywords: diffusion, foundation model
Abstract: In digital pathology, acquiring all-in-focus images is essential to high-quality imaging and high-efficient clinical workflow. Traditional scanners achieve this by scanning at multiple focal planes of varying depths and then merging them, which is relatively slow and often struggles with complex tissue defocus. Recent prevailing image restoration technique provides a means to restore high-quality pathology images from scans of single focal planes. However, existing image restoration methods are inadequate, due to intricate defocus patterns in pathology images and their domain-specific semantic complexities. In this work, we devise a two-stage restoration solution cascading a transformer and a diffusion model, to benefit from their powers in preserving image fidelity and perceptual quality, respectively. We particularly propose a novel mixture of prompts for the two-stage solution. Given initial prompt that models defocus in microscopic imaging, we design two prompts that describe the high-level image semantics from pathology foundation model and the fine-grained tissue structures via edge extraction. We demonstrate that, by feeding the prompt mixture to our method, we can restore high-quality pathology images from single-focal-plane scans, implying high potentials of the mixture of prompts to clinical usage. Code will be publicly available at this https URL.

Title: MExD: An Expert-Infused Diffusion Model for Whole-Slide Image Classification

Authors: Jianwei Zhao, Xin Li, Fan Yang, Qiang Zhai, Ao Luo, Yang Zhao, Hong Cheng, Huazhu Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12401
Pdf URL: https://arxiv.org/pdf/2503.12401
Copy Paste: [[2503.12401]] MExD: An Expert-Infused Diffusion Model for Whole-Slide Image Classification(https://arxiv.org/abs/2503.12401)
Keywords: diffusion, generative
Abstract: Whole Slide Image (WSI) classification poses unique challenges due to the vast image size and numerous non-informative regions, which introduce noise and cause data imbalance during feature aggregation. To address these issues, we propose MExD, an Expert-Infused Diffusion Model that combines the strengths of a Mixture-of-Experts (MoE) mechanism with a diffusion model for enhanced classification. MExD balances patch feature distribution through a novel MoE-based aggregator that selectively emphasizes relevant information, effectively filtering noise, addressing data imbalance, and extracting essential features. These features are then integrated via a diffusion-based generative process to directly yield the class distribution for the WSI. Moving beyond conventional discriminative approaches, MExD represents the first generative strategy in WSI classification, capturing fine-grained details for robust and precise results. Our MExD is validated on three widely-used benchmarks-Camelyon16, TCGA-NSCLC, and BRACS consistently achieving state-of-the-art performance in both binary and multi-class tasks.

Title: SAM2-ELNet: Label Enhancement and Automatic Annotation for Remote Sensing Segmentation

Authors: Jianhao Yang, Wenshuo Yu, Yuanchao Lv, Jiance Sun, Bokang Sun, Mingyang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12404
Pdf URL: https://arxiv.org/pdf/2503.12404
Copy Paste: [[2503.12404]] SAM2-ELNet: Label Enhancement and Automatic Annotation for Remote Sensing Segmentation(https://arxiv.org/abs/2503.12404)
Keywords: self-supervised
Abstract: Remote sensing image segmentation is crucial for environmental monitoring, disaster assessment, and resource management, directly affecting the accuracy and efficiency of surface information extraction. The performance of existing supervised models in remote sensing image segmentation tasks highly depends on the quality of label data. However, current label data mainly relies on manual annotation, which comes with high time costs and is subject to subjective interference, resulting in distortion of label boundaries and often a loss of detail. To solve the above problems, our work proposes an Edge-enhanced Labeling Network, called SAM2-ELNet, which incorporates a labeling module and an edge attention mechanism. This model effectively addresses issues such as label detail loss, fragmentation, and inaccurate boundaries. Due to the scarcity of manually annotated remote sensing data, the feature extraction capabilities of traditional neural networks are limited. Our method uses the Hiera backbone of the pre-trained self-supervised large model segment anything model 2 (SAM2) as the encoder, achieves high-quality and efficient feature extraction even with small samples by fine-tuning on downstream tasks. This study compared the training effects of original and enhanced labels on the manually annotated Deep-SAR Oil Spill (SOS) dataset. Results showed that the model trained with enhanced labels performed better and had a lower final loss, indicating closer alignment with the real data distribution. Our work also explores the potential of extending the model into an efficient automatic annotation framework through generalization experiments, facilitating large-scale remote sensing image interpretation and intelligent recognition.

Title: Diffusion-based Synthetic Data Generation for Visible-Infrared Person Re-Identification

Authors: Wenbo Dai, Lijing Lu, Zhihang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12472
Pdf URL: https://arxiv.org/pdf/2503.12472
Copy Paste: [[2503.12472]] Diffusion-based Synthetic Data Generation for Visible-Infrared Person Re-Identification(https://arxiv.org/abs/2503.12472)
Keywords: diffusion
Abstract: The performance of models is intricately linked to the abundance of training data. In Visible-Infrared person Re-IDentification (VI-ReID) tasks, collecting and annotating large-scale images of each individual under various cameras and modalities is tedious, time-expensive, costly and must comply with data protection laws, posing a severe challenge in meeting dataset requirements. Current research investigates the generation of synthetic data as an efficient and privacy-ensuring alternative to collecting real data in the field. However, a specific data synthesis technique tailored for VI-ReID models has yet to be explored. In this paper, we present a novel data generation framework, dubbed Diffusion-based VI-ReID data Expansion (DiVE), that automatically obtain massive RGB-IR paired images with identity preserving by decoupling identity and modality to improve the performance of VI-ReID models. Specifically, identity representation is acquired from a set of samples sharing the same ID, whereas the modality of images is learned by fine-tuning the Stable Diffusion (SD) on modality-specific data. DiVE extend the text-driven image synthesis to identity-preserving RGB-IR multimodal image synthesis. This approach significantly reduces data collection and annotation costs by directly incorporating synthetic data into ReID model training. Experiments have demonstrated that VI-ReID models trained on synthetic data produced by DiVE consistently exhibit notable enhancements. In particular, the state-of-the-art method, CAJ, trained with synthetic images, achieves an improvement of about $9\%$ in mAP over the baseline on the LLCM dataset. Code: this https URL

Title: KDSelector: A Knowledge-Enhanced and Data-Efficient Model Selector Learning Framework for Time Series Anomaly Detection

Authors: Zhiyu Liang, Dongrui Cai, Chenyuan Zhang, Zheng Liang, Chen Liang, Bo Zheng, Shi Qiu, Jin Wang, Hongzhi Wang
Subjects: cs.LG, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2503.12478
Pdf URL: https://arxiv.org/pdf/2503.12478
Copy Paste: [[2503.12478]] KDSelector: A Knowledge-Enhanced and Data-Efficient Model Selector Learning Framework for Time Series Anomaly Detection(https://arxiv.org/abs/2503.12478)
Keywords: anomaly
Abstract: Model selection has been raised as an essential problem in the area of time series anomaly detection (TSAD), because there is no single best TSAD model for the highly heterogeneous time series in real-world applications. However, despite the success of existing model selection solutions that train a classification model (especially neural network, NN) using historical data as a selector to predict the correct TSAD model for each series, the NN-based selector learning methods used by existing solutions do not make full use of the knowledge in the historical data and require iterating over all training samples, which limits the accuracy and training speed of the selector. To address these limitations, we propose KDSelector, a novel knowledge-enhanced and data-efficient framework for learning the NN-based TSAD model selector, of which three key components are specifically designed to integrate available knowledge into the selector and dynamically prune less important and redundant samples during the learning. We develop a TSAD model selection system with KDSelector as the internal, to demonstrate how users improve the accuracy and training speed of their selectors by using KDSelector as a plug-and-play module. Our demonstration video is hosted at this https URL.

Title: Cross-Modal Consistency Learning for Sign Language Recognition

Authors: Kepeng Wu, Zecheng Li, Weichao Zhao, Hezhen Hu, Wengang Zhou, Houqiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12485
Pdf URL: https://arxiv.org/pdf/2503.12485
Copy Paste: [[2503.12485]] Cross-Modal Consistency Learning for Sign Language Recognition(https://arxiv.org/abs/2503.12485)
Keywords: self-supervised
Abstract: Pre-training has been proven to be effective in boosting the performance of Isolated Sign Language Recognition (ISLR). Existing pre-training methods solely focus on the compact pose data, which eliminate background perturbation but inevitably suffer from insufficient semantic cues compared to raw RGB videos. Nevertheless, direct representation learning only from RGB videos remains challenging due to the presence of sign-independent visual features. To address this dilemma, we propose a Cross-modal Consistency Learning framework (CCL-SLR), which leverages the cross-modal consistency from both RGB and pose modalities based on self-supervised pre-training. First, CCL-SLR employs contrastive learning for instance discrimination within and across modalities. Through the single-modal and cross-modal contrastive learning, CCL-SLR gradually aligns the feature spaces of RGB and pose modalities, thereby extracting consistent sign representations. Second, we further introduce Motion-Preserving Masking (MPM) and Semantic Positive Mining (SPM) techniques to improve cross-modal consistency from the perspective of data augmentation and sample similarity, respectively. Extensive experiments on four ISLR benchmarks show that CCL-SLR achieves impressive performance, demonstrating its effectiveness. The code will be released to the public.

Title: Segment Any-Quality Images with Generative Latent Space Enhancement

Authors: Guangqian Guo, Yoong Guo, Xuehui Yu, Wenbo Li, Yaoxing Wang, Shan Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12507
Pdf URL: https://arxiv.org/pdf/2503.12507
Copy Paste: [[2503.12507]] Segment Any-Quality Images with Generative Latent Space Enhancement(https://arxiv.org/abs/2503.12507)
Keywords: diffusion, generative
Abstract: Despite their success, Segment Anything Models (SAMs) experience significant performance drops on severely degraded, low-quality images, limiting their effectiveness in real-world scenarios. To address this, we propose GleSAM, which utilizes Generative Latent space Enhancement to boost robustness on low-quality images, thus enabling generalization across various image qualities. Specifically, we adapt the concept of latent diffusion to SAM-based segmentation frameworks and perform the generative diffusion process in the latent space of SAM to reconstruct high-quality representation, thereby improving segmentation. Additionally, we introduce two techniques to improve compatibility between the pre-trained diffusion model and the segmentation framework. Our method can be applied to pre-trained SAM and SAM2 with only minimal additional learnable parameters, allowing for efficient optimization. We also construct the LQSeg dataset with a greater diversity of degradation types and levels for training and evaluating the model. Extensive experiments demonstrate that GleSAM significantly improves segmentation robustness on complex degradations while maintaining generalization to clear images. Furthermore, GleSAM also performs well on unseen degradations, underscoring the versatility of our approach and dataset.

Title: Multi Activity Sequence Alignment via Implicit Clustering

Authors: Taein Kwon, Zador Pataki, Mahdi Rad, Marc Pollefeys
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12519
Pdf URL: https://arxiv.org/pdf/2503.12519
Copy Paste: [[2503.12519]] Multi Activity Sequence Alignment via Implicit Clustering(https://arxiv.org/abs/2503.12519)
Keywords: self-supervised
Abstract: Self-supervised temporal sequence alignment can provide rich and effective representations for a wide range of applications. However, existing methods for achieving optimal performance are mostly limited to aligning sequences of the same activity only and require separate models to be trained for each activity. We propose a novel framework that overcomes these limitations using sequence alignment via implicit clustering. Specifically, our key idea is to perform implicit clip-level clustering while aligning frames in sequences. This coupled with our proposed dual augmentation technique enhances the network's ability to learn generalizable and discriminative representations. Our experiments show that our proposed method outperforms state-of-the-art results and highlight the generalization capability of our framework with multi activity and different modalities on three diverse datasets, H2O, PennAction, and IKEA ASM. We will release our code upon acceptance.

Title: Towards Suturing World Models: Learning Predictive Models for Robotic Surgical Tasks

Authors: Mehmet Kerem Turkcan, Mattia Ballo, Filippo Filicori, Zoran Kostic
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12531
Pdf URL: https://arxiv.org/pdf/2503.12531
Copy Paste: [[2503.12531]] Towards Suturing World Models: Learning Predictive Models for Robotic Surgical Tasks(https://arxiv.org/abs/2503.12531)
Keywords: diffusion, generative
Abstract: We introduce specialized diffusion-based generative models that capture the spatiotemporal dynamics of fine-grained robotic surgical sub-stitch actions through supervised learning on annotated laparoscopic surgery footage. The proposed models form a foundation for data-driven world models capable of simulating the biomechanical interactions and procedural dynamics of surgical suturing with high temporal fidelity. Annotating a dataset of $\sim2K$ clips extracted from simulation videos, we categorize surgical actions into fine-grained sub-stitch classes including ideal and non-ideal executions of needle positioning, targeting, driving, and withdrawal. We fine-tune two state-of-the-art video diffusion models, LTX-Video and HunyuanVideo, to generate high-fidelity surgical action sequences at $\ge$768x512 resolution and $\ge$49 frames. For training our models, we explore both Low-Rank Adaptation (LoRA) and full-model fine-tuning approaches. Our experimental results demonstrate that these world models can effectively capture the dynamics of suturing, potentially enabling improved training simulators, surgical skill assessment tools, and autonomous surgical systems. The models also display the capability to differentiate between ideal and non-ideal technique execution, providing a foundation for building surgical training and evaluation systems. We release our models for testing and as a foundation for future research. Project Page: this https URL

Title: Time-EAPCR-T: A Universal Deep Learning Approach for Anomaly Detection in Industrial Equipment

Authors: Huajie Liang, Di Wang, Yuchao Lu, Mengke Song, Lei Liu, Ling An, Ying Liang, Xingjie Ma, Zhenyu Zhang, Chichun Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.12534
Pdf URL: https://arxiv.org/pdf/2503.12534
Copy Paste: [[2503.12534]] Time-EAPCR-T: A Universal Deep Learning Approach for Anomaly Detection in Industrial Equipment(https://arxiv.org/abs/2503.12534)
Keywords: anomaly
Abstract: With the advancement of Industry 4.0, intelligent manufacturing extensively employs sensors for real-time multidimensional data collection, playing a crucial role in equipment monitoring, process optimisation, and efficiency enhancement. Industrial data exhibit characteristics such as multi-source heterogeneity, nonlinearity, strong coupling, and temporal interactions, while also being affected by noise interference. These complexities make it challenging for traditional anomaly detection methods to extract key features, impacting detection accuracy and stability. Traditional machine learning approaches often struggle with such complex data due to limitations in processing capacity and generalisation ability, making them inadequate for practical applications. While deep learning feature extraction modules have demonstrated remarkable performance in image and text processing, they remain ineffective when applied to multi-source heterogeneous industrial data lacking explicit correlations. Moreover, existing multi-source heterogeneous data processing techniques still rely on dimensionality reduction and feature selection, which can lead to information loss and difficulty in capturing high-order interactions. To address these challenges, this study applies the EAPCR and Time-EAPCR models proposed in previous research and introduces a new model, Time-EAPCR-T, where Transformer replaces the LSTM module in the time-series processing component of Time-EAPCR. This modification effectively addresses multi-source data heterogeneity, facilitates efficient multi-source feature fusion, and enhances the temporal feature extraction capabilities of multi-source industrial this http URL results demonstrate that the proposed method outperforms existing approaches across four industrial datasets, highlighting its broad application potential.

Title: Debiasing Diffusion Model: Enhancing Fairness through Latent Representation Learning in Stable Diffusion Model

Authors: Lin-Chun Huang, Ching Chieh Tsao, Fang-Yi Su, Jung-Hsien Chiang
Subjects: cs.LG, cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2503.12536
Pdf URL: https://arxiv.org/pdf/2503.12536
Copy Paste: [[2503.12536]] Debiasing Diffusion Model: Enhancing Fairness through Latent Representation Learning in Stable Diffusion Model(https://arxiv.org/abs/2503.12536)
Keywords: diffusion, generative
Abstract: Image generative models, particularly diffusion-based models, have surged in popularity due to their remarkable ability to synthesize highly realistic images. However, since these models are data-driven, they inherit biases from the training datasets, frequently leading to disproportionate group representations that exacerbate societal inequities. Traditionally, efforts to debiase these models have relied on predefined sensitive attributes, classifiers trained on such attributes, or large language models to steer outputs toward fairness. However, these approaches face notable drawbacks: predefined attributes do not adequately capture complex and continuous variations among groups. To address these issues, we introduce the Debiasing Diffusion Model (DDM), which leverages an indicator to learn latent representations during training, promoting fairness through balanced representations without requiring predefined sensitive attributes. This approach not only demonstrates its effectiveness in scenarios previously addressed by conventional techniques but also enhances fairness without relying on predefined sensitive attributes as conditions. In this paper, we discuss the limitations of prior bias mitigation techniques in diffusion-based models, elaborate on the architecture of the DDM, and validate the effectiveness of our approach through experiments.

Title: Diffusion on Graph: Augmentation of Graph Structure for Node Classification

Authors: Yancheng Wang, Changyu Liu, Yingzhen Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.12563
Pdf URL: https://arxiv.org/pdf/2503.12563
Copy Paste: [[2503.12563]] Diffusion on Graph: Augmentation of Graph Structure for Node Classification(https://arxiv.org/abs/2503.12563)
Keywords: diffusion
Abstract: Graph diffusion models have recently been proposed to synthesize entire graphs, such as molecule graphs. Although existing methods have shown great performance in generating entire graphs for graph-level learning tasks, no graph diffusion models have been developed to generate synthetic graph structures, that is, synthetic nodes and associated edges within a given graph, for node-level learning tasks. Inspired by the research in the computer vision literature using synthetic data for enhanced performance, we propose Diffusion on Graph (DoG), which generates synthetic graph structures to boost the performance of GNNs. The synthetic graph structures generated by DoG are combined with the original graph to form an augmented graph for the training of node-level learning tasks, such as node classification and graph contrastive learning (GCL). To improve the efficiency of the generation process, a Bi-Level Neighbor Map Decoder (BLND) is introduced in DoG. To mitigate the adverse effect of the noise introduced by the synthetic graph structures, a low-rank regularization method is proposed for the training of graph neural networks (GNNs) on the augmented graphs. Extensive experiments on various graph datasets for semi-supervised node classification and graph contrastive learning have been conducted to demonstrate the effectiveness of DoG with low-rank regularization. The code of DoG is available at this https URL.

Title: GAN-Based Single-Stage Defense for Traffic Sign Classification Under Adversarial Patch Attack

Authors: Abyad Enan, Mashrur Chowdhury
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12567
Pdf URL: https://arxiv.org/pdf/2503.12567
Copy Paste: [[2503.12567]] GAN-Based Single-Stage Defense for Traffic Sign Classification Under Adversarial Patch Attack(https://arxiv.org/abs/2503.12567)
Keywords: generative
Abstract: Computer Vision plays a critical role in ensuring the safe navigation of autonomous vehicles (AVs). An AV perception module is responsible for capturing and interpreting the surrounding environment to facilitate safe navigation. This module enables AVs to recognize traffic signs, traffic lights, and various road users. However, the perception module is vulnerable to adversarial attacks, which can compromise their accuracy and reliability. One such attack is the adversarial patch attack (APA), a physical attack in which an adversary strategically places a specially crafted sticker on an object to deceive object classifiers. In APA, an adversarial patch is positioned on a target object, leading the classifier to misidentify it. Such an APA can cause AVs to misclassify traffic signs, leading to catastrophic incidents. To enhance the security of an AV perception system against APAs, this study develops a Generative Adversarial Network (GAN)-based single-stage defense strategy for traffic sign classification. This approach is tailored to defend against APAs on different classes of traffic signs without prior knowledge of a patch's design. This study found this approach to be effective against patches of varying sizes. Our experimental analysis demonstrates that the defense strategy presented in this paper improves the classifier's accuracy under APA conditions by up to 80.8% and enhances overall classification accuracy for all the traffic signs considered in this study by 58%, compared to a classifier without any defense mechanism. Our defense strategy is model-agnostic, making it applicable to any traffic sign classifier, regardless of the underlying classification model.

Title: BalancedDPO: Adaptive Multi-Metric Alignment

Authors: Dipesh Tamboli, Souradip Chakraborty, Aditya Malusare, Biplab Banerjee, Amrit Singh Bedi, Vaneet Aggarwal
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12575
Pdf URL: https://arxiv.org/pdf/2503.12575
Copy Paste: [[2503.12575]] BalancedDPO: Adaptive Multi-Metric Alignment(https://arxiv.org/abs/2503.12575)
Keywords: diffusion
Abstract: Text-to-image (T2I) diffusion models have made remarkable advancements, yet aligning them with diverse preferences remains a persistent challenge. Current methods often optimize single metrics or depend on narrowly curated datasets, leading to overfitting and limited generalization across key visual quality metrics. We present BalancedDPO, a novel extension of Direct Preference Optimization (DPO) that addresses these limitations by simultaneously aligning T2I diffusion models with multiple metrics, including human preference, CLIP score, and aesthetic quality. Our key novelty lies in aggregating consensus labels from diverse metrics in the preference distribution space as compared to existing reward mixing approaches, enabling robust and scalable multi-metric alignment while maintaining the simplicity of the standard DPO pipeline that we refer to as BalancedDPO. Our evaluations on the Pick-a-Pic, PartiPrompt and HPD datasets show that BalancedDPO achieves state-of-the-art results, outperforming existing approaches across all major metrics. BalancedDPO improves the average win rates by 15%, 7.1%, and 10.3% on Pick-a-pic, PartiPrompt and HPD, respectively, from the DiffusionDPO.

Title: Personalize Anything for Free with Diffusion Transformer

Authors: Haoran Feng, Zehuan Huang, Lin Li, Hairong Lv, Lu Sheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12590
Pdf URL: https://arxiv.org/pdf/2503.12590
Copy Paste: [[2503.12590]] Personalize Anything for Free with Diffusion Transformer(https://arxiv.org/abs/2503.12590)
Keywords: diffusion
Abstract: Personalized image generation aims to produce images of user-specified concepts while enabling flexible editing. Recent training-free approaches, while exhibit higher computational efficiency than training-based methods, struggle with identity preservation, applicability, and compatibility with diffusion transformers (DiTs). In this paper, we uncover the untapped potential of DiT, where simply replacing denoising tokens with those of a reference subject achieves zero-shot subject reconstruction. This simple yet effective feature injection technique unlocks diverse scenarios, from personalization to image editing. Building upon this observation, we propose \textbf{Personalize Anything}, a training-free framework that achieves personalized image generation in DiT through: 1) timestep-adaptive token replacement that enforces subject consistency via early-stage injection and enhances flexibility through late-stage regularization, and 2) patch perturbation strategies to boost structural diversity. Our method seamlessly supports layout-guided generation, multi-subject personalization, and mask-controlled editing. Evaluations demonstrate state-of-the-art performance in identity preservation and versatility. Our work establishes new insights into DiTs while delivering a practical paradigm for efficient personalization.

Title: SynLlama: Generating Synthesizable Molecules and Their Analogs with Large Language Models

Authors: Kunyang Sun, Dorian Bagni, Joseph M. Cavanagh, Yingze Wang, Jacob M. Sawyer, Andrew Gritsevskiy, Teresa Head-Gordon
Subjects: cs.LG, physics.bio-ph
Abstract URL: https://arxiv.org/abs/2503.12602
Pdf URL: https://arxiv.org/pdf/2503.12602
Copy Paste: [[2503.12602]] SynLlama: Generating Synthesizable Molecules and Their Analogs with Large Language Models(https://arxiv.org/abs/2503.12602)
Keywords: generative
Abstract: Generative machine learning models for small molecule drug discovery have shown immense promise, but many molecules generated by this approach are too difficult to synthesize to be worth further investigation or further development. We present a novel approach by fine-tuning Meta's Llama3 large language models (LLMs) to create SynLlama, which generates full synthetic pathways made of commonly accessible Enamine building blocks and robust organic reaction templates. SynLlama explores a large synthesizable space using significantly less data compared to other state-of-the-art methods, and offers strong performance in bottom-up synthesis, synthesizable analog generation, and hit expansion, offering medicinal chemists a valuable tool for drug discovery developments. We find that SynLlama can effectively generalize to unseen yet purchasable building blocks, meaning that its reconstruction capabilities extend to a broader synthesizable chemical space than the training data.

Title: LATINO-PRO: LAtent consisTency INverse sOlver with PRompt Optimization

Authors: Alessio Spagnoletti, Jean Prost, Andrés Almansa, Nicolas Papadakis, Marcelo Pereyra
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.12615
Pdf URL: https://arxiv.org/pdf/2503.12615
Copy Paste: [[2503.12615]] LATINO-PRO: LAtent consisTency INverse sOlver with PRompt Optimization(https://arxiv.org/abs/2503.12615)
Keywords: diffusion, generative
Abstract: Text-to-image latent diffusion models (LDMs) have recently emerged as powerful generative models with great potential for solving inverse problems in imaging. However, leveraging such models in a Plug & Play (PnP), zero-shot manner remains challenging because it requires identifying a suitable text prompt for the unknown image of interest. Also, existing text-to-image PnP approaches are highly computationally expensive. We herein address these challenges by proposing a novel PnP inference paradigm specifically designed for embedding generative models within stochastic inverse solvers, with special attention to Latent Consistency Models (LCMs), which distill LDMs into fast generators. We leverage our framework to propose LAtent consisTency INverse sOlver (LATINO), the first zero-shot PnP framework to solve inverse problems with priors encoded by LCMs. Our conditioning mechanism avoids automatic differentiation and reaches SOTA quality in as little as 8 neural function evaluations. As a result, LATINO delivers remarkably accurate solutions and is significantly more memory and computationally efficient than previous approaches. We then embed LATINO within an empirical Bayesian framework that automatically calibrates the text prompt from the observed measurements by marginal maximum likelihood estimation. Extensive experiments show that prompt self-calibration greatly improves estimation, allowing LATINO with PRompt Optimization to define new SOTAs in image reconstruction quality and computational efficiency.

Title: FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization

Authors: Hao Mark Chen, Shell Xu Hu, Wayne Luk, Timothy Hospedales, Hongxiang Fan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12649
Pdf URL: https://arxiv.org/pdf/2503.12649
Copy Paste: [[2503.12649]] FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization(https://arxiv.org/abs/2503.12649)
Keywords: foundation model
Abstract: Model merging has emerged as a promising approach for multi-task learning (MTL), offering a data-efficient alternative to conventional fine-tuning. However, with the rapid development of the open-source AI ecosystem and the increasing availability of fine-tuned foundation models, existing model merging methods face two key limitations: (i) They are primarily designed for in-house fine-tuned models, making them less adaptable to diverse model sources with partially unknown model and task information, (ii) They struggle to scale effectively when merging numerous model checkpoints. To address these challenges, we formulate model merging as a constrained optimization problem and introduce a novel approach: Frank-Wolfe Merging (FW-Merging). Inspired by Frank-Wolfe optimization, our approach iteratively selects the most relevant model in the pool to minimize a linear approximation of the objective function and then executes a local merging similar to the Frank-Wolfe update. The objective function is designed to capture the desired behavior of the target-merged model, while the fine-tuned candidate models define the constraint set. More importantly, FW-Merging serves as an orthogonal technique for existing merging methods, seamlessly integrating with them to further enhance accuracy performance. Our experiments show that FW-Merging scales across diverse model sources, remaining stable with 16 irrelevant models and improving by 15.3% with 16 relevant models on 20 CV tasks, while maintaining constant memory overhead, unlike the linear overhead of data-informed merging methods. Compared with the state-of-the-art approaches, FW-Merging surpasses the data-free merging method by 32.8% and outperforms the data-informed Adamerging by 8.39% when merging 20 ViT models.

Title: UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing

Authors: Tsu-Jui Fu, Yusu Qian, Chen Chen, Wenze Hu, Zhe Gan, Yinfei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12652
Pdf URL: https://arxiv.org/pdf/2503.12652
Copy Paste: [[2503.12652]] UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing(https://arxiv.org/abs/2503.12652)
Keywords: diffusion
Abstract: Text-to-Image (T2I) diffusion models have shown impressive results in generating visually compelling images following user prompts. Building on this, various methods further fine-tune the pre-trained T2I model for specific tasks. However, this requires separate model architectures, training designs, and multiple parameter sets to handle different tasks. In this paper, we introduce UniVG, a generalist diffusion model capable of supporting a diverse range of image generation tasks with a single set of weights. UniVG treats multi-modal inputs as unified conditions to enable various downstream applications, ranging from T2I generation, inpainting, instruction-based editing, identity-preserving generation, and layout-guided generation, to depth estimation and referring segmentation. Through comprehensive empirical studies on data mixing and multi-task training, we provide detailed insights into the training processes and decisions that inform our final designs. For example, we show that T2I generation and other tasks, such as instruction-based editing, can coexist without performance trade-offs, while auxiliary tasks like depth estimation and referring segmentation enhance image editing. Notably, our model can even outperform some task-specific models on their respective benchmarks, marking a significant step towards a unified image generation model.

Title: AnyCalib: On-Manifold Learning for Model-Agnostic Single-View Camera Calibration

Authors: Javier Tirado-Garín, Javier Civera
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12701
Pdf URL: https://arxiv.org/pdf/2503.12701
Copy Paste: [[2503.12701]] AnyCalib: On-Manifold Learning for Model-Agnostic Single-View Camera Calibration(https://arxiv.org/abs/2503.12701)
Keywords: foundation model
Abstract: We present AnyCalib, a method for calibrating the intrinsic parameters of a camera from a single in-the-wild image, that is agnostic to the camera model. Current methods are predominantly tailored to specific camera models and/or require extrinsic cues, such as the direction of gravity, to be visible in the image. In contrast, we argue that the perspective and distortion cues inherent in images are sufficient for model-agnostic camera calibration. To demonstrate this, we frame the calibration process as the regression of the rays corresponding to each pixel. We show, for the first time, that this intermediate representation allows for a closed-form recovery of the intrinsics for a wide range of camera models, including but not limited to: pinhole, Brown-Conrady and Kannala-Brandt. Our approach also applies to edited -- cropped and stretched -- images. Experimentally, we demonstrate that AnyCalib consistently outperforms alternative methods, including 3D foundation models, despite being trained on orders of magnitude less data. Code is available at this https URL.

Title: GenStereo: Towards Open-World Generation of Stereo Images and Unsupervised Matching

Authors: Feng Qiao, Zhexiao Xiong, Eric Xing, Nathan Jacobs
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12720
Pdf URL: https://arxiv.org/pdf/2503.12720
Copy Paste: [[2503.12720]] GenStereo: Towards Open-World Generation of Stereo Images and Unsupervised Matching(https://arxiv.org/abs/2503.12720)
Keywords: diffusion
Abstract: Stereo images are fundamental to numerous applications, including extended reality (XR) devices, autonomous driving, and robotics. Unfortunately, acquiring high-quality stereo images remains challenging due to the precise calibration requirements of dual-camera setups and the complexity of obtaining accurate, dense disparity maps. Existing stereo image generation methods typically focus on either visual quality for viewing or geometric accuracy for matching, but not both. We introduce GenStereo, a diffusion-based approach, to bridge this gap. The method includes two primary innovations (1) conditioning the diffusion process on a disparity-aware coordinate embedding and a warped input image, allowing for more precise stereo alignment than previous methods, and (2) an adaptive fusion mechanism that intelligently combines the diffusion-generated image with a warped image, improving both realism and disparity consistency. Through extensive training on 11 diverse stereo datasets, GenStereo demonstrates strong generalization ability. GenStereo achieves state-of-the-art performance in both stereo image generation and unsupervised stereo matching tasks. Our framework eliminates the need for complex hardware setups while enabling high-quality stereo image generation, making it valuable for both real-world applications and unsupervised learning scenarios. Project page is available at this https URL

Title: In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention

Authors: Jianliang He, Xintian Pan, Siyu Chen, Zhuoran Yang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.12734
Pdf URL: https://arxiv.org/pdf/2503.12734
Copy Paste: [[2503.12734]] In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention(https://arxiv.org/abs/2503.12734)
Keywords: in-context
Abstract: We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Through extensive empirical experiments and rigorous theoretical analysis, we demystify the emergence of elegant attention patterns: a diagonal and homogeneous pattern in the key-query (KQ) weights, and a last-entry-only and zero-sum pattern in the output-value (OV) weights. Remarkably, these patterns consistently appear from gradient-based training starting from random initialization. Our analysis reveals that such emergent structures enable multi-head attention to approximately implement a debiased gradient descent predictor -- one that outperforms single-head attention and nearly achieves Bayesian optimality up to proportional factor. Furthermore, compared to linear transformers, the softmax attention readily generalizes to sequences longer than those seen during training. We also extend our study to scenarios with non-isotropic covariates and multi-task linear regression. In the former, multi-head attention learns to implement a form of pre-conditioned gradient descent. In the latter, we uncover an intriguing regime where the interplay between head number and task number triggers a superposition phenomenon that efficiently resolves multi-task in-context learning. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution, paving the way for deeper understanding and broader applications of in-context learning.

Title: VasTSD: Learning 3D Vascular Tree-state Space Diffusion Model for Angiography Synthesis

Authors: Zhifeng Wang, Renjiao Yi, Xin Wen, Chenyang Zhu, Kai Xu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.12758
Pdf URL: https://arxiv.org/pdf/2503.12758
Copy Paste: [[2503.12758]] VasTSD: Learning 3D Vascular Tree-state Space Diffusion Model for Angiography Synthesis(https://arxiv.org/abs/2503.12758)
Keywords: diffusion, generative
Abstract: Angiography imaging is a medical imaging technique that enhances the visibility of blood vessels within the body by using contrast agents. Angiographic images can effectively assist in the diagnosis of vascular diseases. However, contrast agents may bring extra radiation exposure which is harmful to patients with health risks. To mitigate these concerns, in this paper, we aim to automatically generate angiography from non-angiographic inputs, by leveraging and enhancing the inherent physical properties of vascular structures. Previous methods relying on 2D slice-based angiography synthesis struggle with maintaining continuity in 3D vascular structures and exhibit limited effectiveness across different imaging modalities. We propose VasTSD, a 3D vascular tree-state space diffusion model to synthesize angiography from 3D non-angiographic volumes, with a novel state space serialization approach that dynamically constructs vascular tree topologies, integrating these with a diffusion-based generative model to ensure the generation of anatomically continuous vasculature in 3D volumes. A pre-trained vision embedder is employed to construct vascular state space representations, enabling consistent modeling of vascular structures across multiple modalities. Extensive experiments on various angiographic datasets demonstrate the superiority of VasTSD over prior works, achieving enhanced continuity of blood vessels in synthesized angiographic synthesis for multiple modalities and anatomical regions.

Title: RAG-RL: Advancing Retrieval-Augmented Generation via RL and Curriculum Learning

Authors: Jerry Huang, Siddarth Madala, Risham Sidhu, Cheng Niu, Julia Hockenmaier, Tong Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.12759
Pdf URL: https://arxiv.org/pdf/2503.12759
Copy Paste: [[2503.12759]] RAG-RL: Advancing Retrieval-Augmented Generation via RL and Curriculum Learning(https://arxiv.org/abs/2503.12759)
Keywords: generative
Abstract: Recent research highlights the challenges retrieval models face in retrieving useful contexts and the limitations of generation models in effectively utilizing those contexts in retrieval-augmented generation (RAG) settings. To address these challenges, we introduce RAG-RL, the first reasoning language model (RLM) specifically trained for RAG. RAG-RL demonstrates that stronger answer generation models can identify relevant contexts within larger sets of retrieved information -- thereby alleviating the burden on retrievers -- while also being able to utilize those contexts more effectively. Moreover, we show that curriculum design in the reinforcement learning (RL) post-training process is a powerful approach to enhancing model performance. We benchmark our method on two open-domain question-answering datasets and achieve state-of-the-art results, surpassing previous SOTA generative reader models. In addition, we offers empirical insights into various curriculum learning strategies, providing a deeper understanding of their impact on model performance.

Title: A Survey on Human Interaction Motion Generation

Authors: Kewei Sui, Anindita Ghosh, Inwoo Hwang, Jian Wang, Chuan Guo
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.12763
Pdf URL: https://arxiv.org/pdf/2503.12763
Copy Paste: [[2503.12763]] A Survey on Human Interaction Motion Generation(https://arxiv.org/abs/2503.12763)
Keywords: generative
Abstract: Humans inhabit a world defined by interactions -- with other humans, objects, and environments. These interactive movements not only convey our relationships with our surroundings but also demonstrate how we perceive and communicate with the real world. Therefore, replicating these interaction behaviors in digital systems has emerged as an important topic for applications in robotics, virtual reality, and animation. While recent advances in deep generative models and new datasets have accelerated progress in this field, significant challenges remain in modeling the intricate human dynamics and their interactions with entities in the external world. In this survey, we present, for the first time, a comprehensive overview of the literature in human interaction motion generation. We begin by establishing foundational concepts essential for understanding the research background. We then systematically review existing solutions and datasets across three primary interaction tasks -- human-human, human-object, and human-scene interactions -- followed by evaluation metrics. Finally, we discuss open research directions and future opportunities.

Title: TransDiff: Diffusion-Based Method for Manipulating Transparent Objects Using a Single RGB-D Image

Authors: Haoxiao Wang, Kaichen Zhou, Binrui Gu, Zhiyuan Feng, Weijie Wang, Peilin Sun, Yicheng Xiao, Jianhua Zhang, Hao Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12779
Pdf URL: https://arxiv.org/pdf/2503.12779
Copy Paste: [[2503.12779]] TransDiff: Diffusion-Based Method for Manipulating Transparent Objects Using a Single RGB-D Image(https://arxiv.org/abs/2503.12779)
Keywords: diffusion
Abstract: Manipulating transparent objects presents significant challenges due to the complexities introduced by their reflection and refraction properties, which considerably hinder the accurate estimation of their 3D shapes. To address these challenges, we propose a single-view RGB-D-based depth completion framework, TransDiff, that leverages the Denoising Diffusion Probabilistic Models(DDPM) to achieve material-agnostic object grasping in desktop. Specifically, we leverage features extracted from RGB images, including semantic segmentation, edge maps, and normal maps, to condition the depth map generation process. Our method learns an iterative denoising process that transforms a random depth distribution into a depth map, guided by initially refined depth information, ensuring more accurate depth estimation in scenarios involving transparent objects. Additionally, we propose a novel training method to better align the noisy depth and RGB image features, which are used as conditions to refine depth estimation step by step. Finally, we utilized an improved inference process to accelerate the denoising procedure. Through comprehensive experimental validation, we demonstrate that our method significantly outperforms the baselines in both synthetic and real-world benchmarks with acceptable inference time. The demo of our method can be found on this https URL

Title: SAM2 for Image and Video Segmentation: A Comprehensive Survey

Authors: Zhang Jiaxing, Tang Hao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12781
Pdf URL: https://arxiv.org/pdf/2503.12781
Copy Paste: [[2503.12781]] SAM2 for Image and Video Segmentation: A Comprehensive Survey(https://arxiv.org/abs/2503.12781)
Keywords: foundation model
Abstract: Despite significant advances in deep learning for image and video segmentation, existing models continue to face challenges in cross-domain adaptability and generalization. Image and video segmentation are fundamental tasks in computer vision with wide-ranging applications in healthcare, agriculture, industrial inspection, and autonomous driving. With the advent of large-scale foundation models, SAM2 - an improved version of SAM (Segment Anything Model)has been optimized for segmentation tasks, demonstrating enhanced performance in complex scenarios. However, SAM2's adaptability and limitations in specific domains require further investigation. This paper systematically analyzes the application of SAM2 in image and video segmentation and evaluates its performance in various fields. We begin by introducing the foundational concepts of image segmentation, categorizing foundation models, and exploring the technical characteristics of SAM and SAM2. Subsequently, we delve into SAM2's applications in static image and video segmentation, emphasizing its performance in specialized areas such as medical imaging and the challenges of cross-domain adaptability. As part of our research, we reviewed over 200 related papers to provide a comprehensive analysis of the topic. Finally, the paper highlights the strengths and weaknesses of SAM2 in segmentation tasks, identifies the technical challenges it faces, and proposes future development directions. This review provides valuable insights and practical recommendations for optimizing and applying SAM2 in real-world scenarios.

Title: A Reinforcement Learning-Driven Transformer GAN for Molecular Generation

Authors: Chen Li, Huidong Tang, Ye Zhu, Yoshihiro Yamanishi
Subjects: cs.LG, cs.CL, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2503.12796
Pdf URL: https://arxiv.org/pdf/2503.12796
Copy Paste: [[2503.12796]] A Reinforcement Learning-Driven Transformer GAN for Molecular Generation(https://arxiv.org/abs/2503.12796)
Keywords: generative
Abstract: Generating molecules with desired chemical properties presents a critical challenge in fields such as chemical synthesis and drug discovery. Recent advancements in artificial intelligence (AI) and deep learning have significantly contributed to data-driven molecular generation. However, challenges persist due to the inherent sensitivity of simplified molecular input line entry system (SMILES) representations and the difficulties in applying generative adversarial networks (GANs) to discrete data. This study introduces RL-MolGAN, a novel Transformer-based discrete GAN framework designed to address these challenges. Unlike traditional Transformer architectures, RL-MolGAN utilizes a first-decoder-then-encoder structure, facilitating the generation of drug-like molecules from both $de~novo$ and scaffold-based designs. In addition, RL-MolGAN integrates reinforcement learning (RL) and Monte Carlo tree search (MCTS) techniques to enhance the stability of GAN training and optimize the chemical properties of the generated molecules. To further improve the model's performance, RL-MolWGAN, an extension of RL-MolGAN, incorporates Wasserstein distance and mini-batch discrimination, which together enhance the stability of the GAN. Experimental results on two widely used molecular datasets, QM9 and ZINC, validate the effectiveness of our models in generating high-quality molecular structures with diverse and desirable chemical properties.

Title: From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration

Authors: Mingyang Song, Xiaoye Qu, Jiawei Zhou, Yu Cheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12821
Pdf URL: https://arxiv.org/pdf/2503.12821
Copy Paste: [[2503.12821]] From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration(https://arxiv.org/abs/2503.12821)
Keywords: diffusion
Abstract: Large Vision-Language Models (LVLMs) have achieved significant progress in combining visual comprehension with language generation. Despite this success, the training data of LVLMs still suffers from Long-Tail (LT) problems, where the data distribution is highly imbalanced. Previous works have mainly focused on traditional VLM architectures, i.e., CLIP or ViT, and specific tasks such as recognition and classification. Nevertheless, the exploration of LVLM (e.g. LLaVA) and more general tasks (e.g. Visual Question Answering and Visual Reasoning) remains under-explored. In this paper, we first conduct an in-depth analysis of the LT issues in LVLMs and identify two core causes: the overrepresentation of head concepts and the underrepresentation of tail concepts. Based on the above observation, we propose an $\textbf{A}$daptive $\textbf{D}$ata $\textbf{R}$efinement Framework ($\textbf{ADR}$), which consists of two stages: $\textbf{D}$ata $\textbf{R}$ebalancing ($\textbf{DR}$) and $\textbf{D}$ata $\textbf{S}$ynthesis ($\textbf{DS}$). In the DR stage, we adaptively rebalance the redundant data based on entity distributions, while in the DS stage, we leverage Denoising Diffusion Probabilistic Models (DDPMs) and scarce images to supplement underrepresented portions. Through comprehensive evaluations across eleven benchmarks, our proposed ADR effectively mitigates the long-tail problem in the training data, improving the average performance of LLaVA 1.5 relatively by 4.36%, without increasing the training data volume.

Title: DreamLayer: Simultaneous Multi-Layer Generation via Diffusion Mode

Authors: Junjia Huang, Pengxiang Yan, Jinhang Cai, Jiyang Liu, Zhao Wang, Yitong Wang, Xinglong Wu, Guanbin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12838
Pdf URL: https://arxiv.org/pdf/2503.12838
Copy Paste: [[2503.12838]] DreamLayer: Simultaneous Multi-Layer Generation via Diffusion Mode(https://arxiv.org/abs/2503.12838)
Keywords: diffusion
Abstract: Text-driven image generation using diffusion models has recently gained significant attention. To enable more flexible image manipulation and editing, recent research has expanded from single image generation to transparent layer generation and multi-layer compositions. However, existing approaches often fail to provide a thorough exploration of multi-layer structures, leading to inconsistent inter-layer interactions, such as occlusion relationships, spatial layout, and shadowing. In this paper, we introduce DreamLayer, a novel framework that enables coherent text-driven generation of multiple image layers, by explicitly modeling the relationship between transparent foreground and background layers. DreamLayer incorporates three key components, i.e., Context-Aware Cross-Attention (CACA) for global-local information exchange, Layer-Shared Self-Attention (LSSA) for establishing robust inter-layer connections, and Information Retained Harmonization (IRH) for refining fusion details at the latent level. By leveraging a coherent full-image context, DreamLayer builds inter-layer connections through attention mechanisms and applies a harmonization step to achieve seamless layer fusion. To facilitate research in multi-layer generation, we construct a high-quality, diverse multi-layer dataset including 400k samples. Extensive experiments and user studies demonstrate that DreamLayer generates more coherent and well-aligned layers, with broad applicability, including latent-space image editing and image-to-layer decomposition.

Title: Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data

Authors: Haozhe Si, Yuxuan Wan, Minh Do, Deepak Vasisht, Han Zhao, Hendrik F. Hamann
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.12843
Pdf URL: https://arxiv.org/pdf/2503.12843
Copy Paste: [[2503.12843]] Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data(https://arxiv.org/abs/2503.12843)
Keywords: self-supervised, foundation model
Abstract: Geospatial raster (imagery) data, such as that collected by satellite-based imaging systems at different times and spectral bands, hold immense potential for enabling a wide range of high-impact applications. This potential stems from the rich information that is spatially and temporally contextualized across multiple channels and sensing modalities. Recent work has adapted existing self-supervised learning approaches for such geospatial data. However, they fall short of scalable model architectures, leading to inflexibility and computational inefficiencies when faced with an increasing number of channels and modalities. To address these limitations, we introduce Low-rank Efficient Spatial-Spectral Vision Transformer (LESS ViT) with three key innovations: i) the LESS Attention Block that approximates high-dimensional spatial-spectral attention through Kronecker's product of the low-dimensional spatial and spectral attention components; ii) the Continuous Positional-Channel Embedding Layer that preserves both spatial and spectral continuity and physical characteristics of each patch; and iii) the Perception Field Mask that exploits local spatial dependencies by constraining attention to neighboring patches. To evaluate the proposed innovations, we construct a benchmark, GFM-Bench, which serves as a comprehensive benchmark for such geospatial raster data. We pretrain LESS ViT using a Hyperspectral Masked Autoencoder framework with integrated positional and channel masking strategies. Experimental results demonstrate that our proposed method surpasses current state-of-the-art multi-modal geospatial foundation models, achieving superior performance with less computation and fewer parameters. The flexibility and extensibility of our framework make it a promising direction for future geospatial data analysis tasks that involve a wide range of modalities and channels.

Title: UniReg: Foundation Model for Controllable Medical Image Registration

Authors: Zi Li, Jianpeng Zhang, Tai Ma, Tony C. W. Mok, Yan-Jie Zhou, Zeli Chen, Xianghua Ye, Le Lu, Dakai Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12868
Pdf URL: https://arxiv.org/pdf/2503.12868
Copy Paste: [[2503.12868]] UniReg: Foundation Model for Controllable Medical Image Registration(https://arxiv.org/abs/2503.12868)
Keywords: foundation model
Abstract: Learning-based medical image registration has achieved performance parity with conventional methods while demonstrating a substantial advantage in computational efficiency. However, learning-based registration approaches lack generalizability across diverse clinical scenarios, requiring the laborious development of multiple isolated networks for specific registration tasks, e.g., inter-/intra-subject registration or organ-specific alignment. % To overcome this limitation, we propose \textbf{UniReg}, the first interactive foundation model for medical image registration, which combines the precision advantages of task-specific learning methods with the generalization of traditional optimization methods. Our key innovation is a unified framework for diverse registration scenarios, achieved through a conditional deformation field estimation within a unified registration model. This is realized through a dynamic learning paradigm that explicitly encodes: (1) anatomical structure priors, (2) registration type constraints (inter/intra-subject), and (3) instance-specific features, enabling the generation of scenario-optimal deformation fields. % Through comprehensive experiments encompassing $90$ anatomical structures at different body regions, our UniReg model demonstrates comparable performance with contemporary state-of-the-art methodologies while achieving ~50\% reduction in required training iterations relative to the conventional learning-based paradigm. This optimization contributes to a significant reduction in computational resources, such as training time. Code and model will be available.

Title: An interpretable approach to automating the assessment of biofouling in video footage

Authors: Evelyn J. Mannix, Bartholomew A. Woodham
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12875
Pdf URL: https://arxiv.org/pdf/2503.12875
Copy Paste: [[2503.12875]] An interpretable approach to automating the assessment of biofouling in video footage(https://arxiv.org/abs/2503.12875)
Keywords: foundation model
Abstract: Biofouling$\unicode{x2013}$communities of organisms that grow on hard surfaces immersed in water$\unicode{x2013}$provides a pathway for the spread of invasive marine species and diseases. To address this risk, international vessels are increasingly being obligated to provide evidence of their biofouling management practices. Verification that these activities are effective requires underwater inspections, using divers or underwater remotely operated vehicles (ROVs), and the collection and analysis of large amounts of imagery and footage. Automated assessment using computer vision techniques can significantly streamline this process, and this work shows how this challenge can be addressed efficiently and effectively using the interpretable Component Features (ComFe) approach with a DINOv2 Vision Transformer (ViT) foundation model. ComFe is able to obtain improved performance in comparison to previous non-interpretable Convolutional Neural Network (CNN) methods, with significantly fewer weights and greater transparency$\unicode{x2013}$through identifying which regions of the image contribute to the classification, and which images in the training data lead to that conclusion. All code, data and model weights are publicly released.

Title: UCF-Crime-DVS: A Novel Event-Based Dataset for Video Anomaly Detection with Spiking Neural Networks

Authors: Yuanbin Qian, Shuhan Ye, Chong Wang, Xiaojie Cai, Jiangbo Qian, Jiafei Wu
Subjects: cs.CV, cs.NE
Abstract URL: https://arxiv.org/abs/2503.12905
Pdf URL: https://arxiv.org/pdf/2503.12905
Copy Paste: [[2503.12905]] UCF-Crime-DVS: A Novel Event-Based Dataset for Video Anomaly Detection with Spiking Neural Networks(https://arxiv.org/abs/2503.12905)
Keywords: anomaly
Abstract: Video anomaly detection plays a significant role in intelligent surveillance systems. To enhance model's anomaly recognition ability, previous works have typically involved RGB, optical flow, and text features. Recently, dynamic vision sensors (DVS) have emerged as a promising technology, which capture visual information as discrete events with a very high dynamic range and temporal resolution. It reduces data redundancy and enhances the capture capacity of moving objects compared to conventional camera. To introduce this rich dynamic information into the surveillance field, we created the first DVS video anomaly detection benchmark, namely UCF-Crime-DVS. To fully utilize this new data modality, a multi-scale spiking fusion network (MSF) is designed based on spiking neural networks (SNNs). This work explores the potential application of dynamic information from event data in video anomaly detection. Our experiments demonstrate the effectiveness of our framework on UCF-Crime-DVS and its superior performance compared to other models, establishing a new baseline for SNN-based weakly supervised video anomaly detection.

Title: MFP-CLIP: Exploring the Efficacy of Multi-Form Prompts for Zero-Shot Industrial Anomaly Detection

Authors: Jingyi Yuan, Pengyu Jie, Junyin Zhang, Ziao Li, Chenqiang Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12910
Pdf URL: https://arxiv.org/pdf/2503.12910
Copy Paste: [[2503.12910]] MFP-CLIP: Exploring the Efficacy of Multi-Form Prompts for Zero-Shot Industrial Anomaly Detection(https://arxiv.org/abs/2503.12910)
Keywords: anomaly
Abstract: Recently, zero-shot anomaly detection (ZSAD) has emerged as a pivotal paradigm for identifying defects in unseen categories without requiring target samples in training phase. However, existing ZSAD methods struggle with the boundary of small and complex defects due to insufficient representations. Most of them use the single manually designed prompts, failing to work for diverse objects and anomalies. In this paper, we propose MFP-CLIP, a novel prompt-based CLIP framework which explores the efficacy of multi-form prompts for zero-shot industrial anomaly detection. We employ an image to text prompting(I2TP) mechanism to better represent the object in the image. MFP-CLIP enhances perception to multi-scale and complex anomalies by self prompting(SP) and a multi-patch feature aggregation(MPFA) module. To precisely localize defects, we introduce the mask prompting(MP) module to guide model to focus on potential anomaly regions. Extensive experiments are conducted on two wildly used industrial anomaly detection benchmarks, MVTecAD and VisA, demonstrating MFP-CLIP's superiority in ZSAD.

Title: AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction

Authors: Xuying Zhang, Yupeng Zhou, Kai Wang, Yikai Wang, Zhen Li, Xiuli Shao, Daquan Zhou, Qibin Hou, Ming-Ming Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12929
Pdf URL: https://arxiv.org/pdf/2503.12929
Copy Paste: [[2503.12929]] AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction(https://arxiv.org/abs/2503.12929)
Keywords: diffusion
Abstract: Novel view synthesis (NVS) is a cornerstone for image-to-3d creation. However, existing works still struggle to maintain consistency between the generated views and the input views, especially when there is a significant camera pose difference, leading to poor-quality 3D geometries and textures. We attribute this issue to their treatment of all target views with equal priority according to our empirical observation that the target views closer to the input views exhibit higher fidelity. With this inspiration, we propose AR-1-to-3, a novel next-view prediction paradigm based on diffusion models that first generates views close to the input views, which are then utilized as contextual information to progressively synthesize farther views. To encode the generated view subsequences as local and global conditions for the next-view prediction, we accordingly develop a stacked local feature encoding strategy (Stacked-LE) and an LSTM-based global feature encoding strategy (LSTM-GE). Extensive experiments demonstrate that our method significantly improves the consistency between the generated views and the input views, producing high-fidelity 3D assets.

Title: Efficient Action-Constrained Reinforcement Learning via Acceptance-Rejection Method and Augmented MDPs

Authors: Wei Hung, Shao-Hua Sun, Ping-Chun Hsieh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.12932
Pdf URL: https://arxiv.org/pdf/2503.12932
Copy Paste: [[2503.12932]] Efficient Action-Constrained Reinforcement Learning via Acceptance-Rejection Method and Augmented MDPs(https://arxiv.org/abs/2503.12932)
Keywords: generative
Abstract: Action-constrained reinforcement learning (ACRL) is a generic framework for learning control policies with zero action constraint violation, which is required by various safety-critical and resource-constrained applications. The existing ACRL methods can typically achieve favorable constraint satisfaction but at the cost of either high computational burden incurred by the quadratic programs (QP) or increased architectural complexity due to the use of sophisticated generative models. In this paper, we propose a generic and computationally efficient framework that can adapt a standard unconstrained RL method to ACRL through two modifications: (i) To enforce the action constraints, we leverage the classic acceptance-rejection method, where we treat the unconstrained policy as the proposal distribution and derive a modified policy with feasible actions. (ii) To improve the acceptance rate of the proposal distribution, we construct an augmented two-objective Markov decision process (MDP), which include additional self-loop state transitions and a penalty signal for the rejected actions. This augmented MDP incentives the learned policy to stay close to the feasible action sets. Through extensive experiments in both robot control and resource allocation domains, we demonstrate that the proposed framework enjoys faster training progress, better constraint satisfaction, and a lower action inference time simultaneously than the state-of-the-art ACRL methods. We have made the source code publicly available to encourage further research in this direction.

Title: Frame-wise Conditioning Adaptation for Fine-Tuning Diffusion Models in Text-to-Video Prediction

Authors: Zheyuan Liu, Junyan Wang, Zicheng Duan, Cristian Rodriguez-Opazo, Anton van den Hengel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12953
Pdf URL: https://arxiv.org/pdf/2503.12953
Copy Paste: [[2503.12953]] Frame-wise Conditioning Adaptation for Fine-Tuning Diffusion Models in Text-to-Video Prediction(https://arxiv.org/abs/2503.12953)
Keywords: diffusion
Abstract: Text-video prediction (TVP) is a downstream video generation task that requires a model to produce subsequent video frames given a series of initial video frames and text describing the required motion. In practice TVP methods focus on a particular category of videos depicting manipulations of objects carried out by human beings or robot arms. Previous methods adapt models pre-trained on text-to-image tasks, and thus tend to generate video that lacks the required continuity. A natural progression would be to leverage more recent pre-trained text-to-video (T2V) models. This approach is rendered more challenging by the fact that the most common fine-tuning technique, low-rank adaptation (LoRA), yields undesirable results. In this work, we propose an adaptation-based strategy we label Frame-wise Conditioning Adaptation (FCA). Within the module, we devise a sub-module that produces frame-wise text embeddings from the input text, which acts as an additional text condition to aid generation. We use FCA to fine-tune the T2V model, which incorporates the initial frame(s) as an extra condition. We compare and discuss the more effective strategy for injecting such embeddings into the T2V model. We conduct extensive ablation studies on our design choices with quantitative and qualitative performance analysis. Our approach establishes a new state-of-the-art for the task of TVP. The project page is at this https URL .

Title: HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding

Authors: Jiahe Zhao, Ruibing Hou, Zejie Tian, Hong Chang, Shiguang Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12955
Pdf URL: https://arxiv.org/pdf/2503.12955
Copy Paste: [[2503.12955]] HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding(https://arxiv.org/abs/2503.12955)
Keywords: foundation model
Abstract: We propose a new task to benchmark human-in-scene understanding for embodied agents: Human-In-Scene Question Answering (HIS-QA). Given a human motion within a 3D scene, HIS-QA requires the agent to comprehend human states and behaviors, reason about its surrounding environment, and answer human-related questions within the scene. To support this new task, we present HIS-Bench, a multimodal benchmark that systematically evaluates HIS understanding across a broad spectrum, from basic perception to commonsense reasoning and planning. Our evaluation of various vision-language models on HIS-Bench reveals significant limitations in their ability to handle HIS-QA tasks. To this end, we propose HIS-GPT, the first foundation model for HIS understanding. HIS-GPT integrates 3D scene context and human motion dynamics into large language models while incorporating specialized mechanisms to capture human-scene interactions. Extensive experiments demonstrate that HIS-GPT sets a new state-of-the-art on HIS-QA tasks. We hope this work inspires future research on human behavior analysis in 3D scenes, advancing embodied AI and world models.

Title: Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait

Authors: Chaolong Yang, Kai Yao, Yuyao Yan, Chenru Jiang, Weiguang Zhao, Jie Sun, Guangliang Cheng, Yifei Zhang, Bin Dong, Kaizhu Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12963
Pdf URL: https://arxiv.org/pdf/2503.12963
Copy Paste: [[2503.12963]] Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait(https://arxiv.org/abs/2503.12963)
Keywords: diffusion, generative
Abstract: Audio-driven single-image talking portrait generation plays a crucial role in virtual reality, digital human creation, and filmmaking. Existing approaches are generally categorized into keypoint-based and image-based methods. Keypoint-based methods effectively preserve character identity but struggle to capture fine facial details due to the fixed points limitation of the 3D Morphable Model. Moreover, traditional generative networks face challenges in establishing causality between audio and keypoints on limited datasets, resulting in low pose diversity. In contrast, image-based approaches produce high-quality portraits with diverse details using the diffusion network but incur identity distortion and expensive computational costs. In this work, we propose KDTalker, the first framework to combine unsupervised implicit 3D keypoint with a spatiotemporal diffusion model. Leveraging unsupervised implicit 3D keypoints, KDTalker adapts facial information densities, allowing the diffusion process to model diverse head poses and capture fine facial details flexibly. The custom-designed spatiotemporal attention mechanism ensures accurate lip synchronization, producing temporally consistent, high-quality animations while enhancing computational efficiency. Experimental results demonstrate that KDTalker achieves state-of-the-art performance regarding lip synchronization accuracy, head pose diversity, and execution this http URL codes are available at this https URL.

Title: Training Video Foundation Models with NVIDIA NeMo

Authors: Zeeshan Patel, Ethan He, Parth Mannan, Xiaowei Ren, Ryan Wolf, Niket Agarwal, Jacob Huffman, Zhuoyao Wang, Carl Wang, Jack Chang, Yan Bai, Tommy Huang, Linnan Wang, Sahil Jain, Shanmugam Ramasamy, Joseph Jennings, Ekaterina Sirazitdinova, Oleg Sudakov, Mingyuan Ma, Bobby Chen, Forrest Lin, Hao Wang, Vasanth Rao Naik Sabavat, Sriharsha Niverty, Rong Ou, Pallab Bhattacharya, David Page, Nima Tajbakhsh, Ashwath Aithal
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.12964
Pdf URL: https://arxiv.org/pdf/2503.12964
Copy Paste: [[2503.12964]] Training Video Foundation Models with NVIDIA NeMo(https://arxiv.org/abs/2503.12964)
Keywords: diffusion, foundation model
Abstract: Video Foundation Models (VFMs) have recently been used to simulate the real world to train physical AI systems and develop creative visual experiences. However, there are significant challenges in training large-scale, high quality VFMs that can generate high-quality videos. We present a scalable, open-source VFM training pipeline with NVIDIA NeMo, providing accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.

Title: Optimal Denoising in Score-Based Generative Models: The Role of Data Regularity

Authors: Eliot Beyler (SIERRA), Francis Bach (SIERRA)
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.12966
Pdf URL: https://arxiv.org/pdf/2503.12966
Copy Paste: [[2503.12966]] Optimal Denoising in Score-Based Generative Models: The Role of Data Regularity(https://arxiv.org/abs/2503.12966)
Keywords: generative
Abstract: Score-based generative models achieve state-of-the-art sampling performance by denoising a distribution perturbed by Gaussian noise. In this paper, we focus on a single deterministic denoising step, and compare the optimal denoiser for the quadratic loss, we name ''full-denoising'', to the alternative ''half-denoising'' introduced by Hyv{ä}rinen (2024). We show that looking at the performances in term of distance between distribution tells a more nuanced story, with different assumptions on the data leading to very different this http URL prove that half-denoising is better than full-denoising for regular enough densities, while full-denoising is better for singular densities such as mixtures of Dirac measures or densities supported on a low-dimensional subspace. In the latter case, we prove that full-denoising can alleviate the curse of dimensionality under a linear manifold hypothesis.

Title: Prospects for Mitigating Spectral Variability in Tropical Species Classification Using Self-Supervised Learning

Authors: Colin Prieur, Nassim Ait Ali Braham, Paul Tresson, Grégoire Vincent, Jocelyn Chanussot
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.12973
Pdf URL: https://arxiv.org/pdf/2503.12973
Copy Paste: [[2503.12973]] Prospects for Mitigating Spectral Variability in Tropical Species Classification Using Self-Supervised Learning(https://arxiv.org/abs/2503.12973)
Keywords: self-supervised
Abstract: Airborne hyperspectral imaging is a promising method for identifying tropical species, but spectral variability between acquisitions hinders consistent results. This paper proposes using Self-Supervised Learning (SSL) to encode spectral features that are robust to abiotic variability and relevant for species identification. By employing the state-of-the-art Barlow-Twins approach on repeated spectral acquisitions, we demonstrate the ability to develop stable features. For the classification of 40 tropical species, experiments show that these features can outperform typical reflectance products in terms of robustness to spectral variability by 10 points of accuracy across dates.

Title: A Multi-Stage Framework with Taxonomy-Guided Reasoning for Occupation Classification Using Large Language Models

Authors: Palakorn Achananuparp, Ee-Peng Lim
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2503.12989
Pdf URL: https://arxiv.org/pdf/2503.12989
Copy Paste: [[2503.12989]] A Multi-Stage Framework with Taxonomy-Guided Reasoning for Occupation Classification Using Large Language Models(https://arxiv.org/abs/2503.12989)
Keywords: in-context
Abstract: Automatically annotating job data with standardized occupations from taxonomies, known as occupation classification, is crucial for labor market analysis. However, this task is often hindered by data scarcity and the challenges of manual annotations. While large language models (LLMs) hold promise due to their extensive world knowledge and in-context learning capabilities, their effectiveness depends on their knowledge of occupational taxonomies, which remains unclear. In this study, we assess the ability of LLMs to generate precise taxonomic entities from taxonomy, highlighting their limitations. To address these challenges, we propose a multi-stage framework consisting of inference, retrieval, and reranking stages, which integrates taxonomy-guided reasoning examples to enhance performance by aligning outputs with taxonomic knowledge. Evaluations on a large-scale dataset show significant improvements in classification accuracy. Furthermore, we demonstrate the framework's adaptability for multi-label skill classification. Our results indicate that the framework outperforms existing LLM-based methods, offering a practical and scalable solution for occupation classification and related tasks across LLMs.

Title: TFDM: Time-Variant Frequency-Based Point Cloud Diffusion with Mamba

Authors: Jiaxu Liu, Li Li, Hubert P. H. Shum, Toby P. Breckon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13004
Pdf URL: https://arxiv.org/pdf/2503.13004
Copy Paste: [[2503.13004]] TFDM: Time-Variant Frequency-Based Point Cloud Diffusion with Mamba(https://arxiv.org/abs/2503.13004)
Keywords: diffusion, generative
Abstract: Diffusion models currently demonstrate impressive performance over various generative tasks. Recent work on image diffusion highlights the strong capabilities of Mamba (state space models) due to its efficient handling of long-range dependencies and sequential data modeling. Unfortunately, joint consideration of state space models with 3D point cloud generation remains limited. To harness the powerful capabilities of the Mamba model for 3D point cloud generation, we propose a novel diffusion framework containing dual latent Mamba block (DM-Block) and a time-variant frequency encoder (TF-Encoder). The DM-Block apply a space-filling curve to reorder points into sequences suitable for Mamba state-space modeling, while operating in a latent space to mitigate the computational overhead that arises from direct 3D data processing. Meanwhile, the TF-Encoder takes advantage of the ability of the diffusion model to refine fine details in later recovery stages by prioritizing key points within the U-Net architecture. This frequency-based mechanism ensures enhanced detail quality in the final stages of generation. Experimental results on the ShapeNet-v2 dataset demonstrate that our method achieves state-of-the-art performance (ShapeNet-v2: 0.14\% on 1-NNA-Abs50 EMD and 57.90\% on COV EMD) on certain metrics for specific categories while reducing computational parameters and inference time by up to 10$\times$ and 9$\times$, respectively. Source code is available in Supplementary Materials and will be released upon accpetance.

Title: Overview of the NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) Task

Authors: Junjie Chen, Haitao Li, Zhumin Chu, Yiqun Liu, Qingyao Ai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13038
Pdf URL: https://arxiv.org/pdf/2503.13038
Copy Paste: [[2503.13038]] Overview of the NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) Task(https://arxiv.org/abs/2503.13038)
Keywords: generative
Abstract: In this paper, we provide an overview of the NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) task. As large language models (LLMs) grow popular in both academia and industry, how to effectively evaluate the capacity of LLMs becomes an increasingly critical but still challenging issue. Existing methods can be divided into two types: manual evaluation, which is expensive, and automatic evaluation, which faces many limitations including task format (the majority belong to multiple-choice questions) and evaluation criteria (occupied by reference-based metrics). To advance the innovation of automatic evaluation, we propose the AEOLLM task which focuses on generative tasks and encourages reference-free methods. Besides, we set up diverse subtasks such as dialogue generation, text expansion, summary generation and non-factoid question answering to comprehensively test different methods. This year, we received 48 runs from 4 teams in total. This paper will describe the background of the task, the data set, the evaluation measures and the evaluation results, respectively.

Title: InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving

Authors: Ruiqi Song, Xianda Guo, Hangbin Wu, Qinggong Wei, Long Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13047
Pdf URL: https://arxiv.org/pdf/2503.13047
Copy Paste: [[2503.13047]] InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving(https://arxiv.org/abs/2503.13047)
Keywords: foundation model
Abstract: Directly generating planning results from raw sensors has become increasingly prevalent due to its adaptability and robustness in complex scenarios. Scene representation, as a key module in the pipeline, has traditionally relied on conventional perception, which focus on the global scene. However, in driving scenarios, human drivers typically focus only on regions that directly impact driving, which often coincide with those required for end-to-end autonomous driving. In this paper, a novel end-to-end autonomous driving method called InsightDrive is proposed, which organizes perception by language-guided scene representation. We introduce an instance-centric scene tokenizer that transforms the surrounding environment into map- and object-aware instance tokens. Scene attention language descriptions, which highlight key regions and obstacles affecting the ego vehicle's movement, are generated by a vision-language model that leverages the cognitive reasoning capabilities of foundation models. We then align scene descriptions with visual features using the vision-language model, guiding visual attention through these descriptions to give effectively scene representation. Furthermore, we employ self-attention and cross-attention mechanisms to model the ego-agents and ego-map relationships to comprehensively build the topological relationships of the scene. Finally, based on scene understanding, we jointly perform motion prediction and planning. Extensive experiments on the widely used nuScenes benchmark demonstrate that the proposed InsightDrive achieves state-of-the-art performance in end-to-end autonomous driving. The code is available at this https URL

Title: Bitcoin Battle: Burning Bitcoin for Geopolitical Fun and Profit

Authors: Kris Oosthoek, Kelvin Lubbertsen, Georgios Smaragdakis
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.13052
Pdf URL: https://arxiv.org/pdf/2503.13052
Copy Paste: [[2503.13052]] Bitcoin Battle: Burning Bitcoin for Geopolitical Fun and Profit(https://arxiv.org/abs/2503.13052)
Keywords: anomaly
Abstract: This study empirically analyzes the transaction activity of Bitcoin addresses linked to Russian intelligence services, which have liquidated over 7 Bitcoin (BTC), i.e., equivalent to approximately US$300,000 based on the exchange rate at the time. Our investigation begins with an observed anomaly in transaction outputs featuring the Bitcoin Script operation code, tied to input addresses identified by cyber threat intelligence sources and court documents as belonging to Russian intelligence agencies. We explore how an unauthorized entity appears to have gained control of the associated private keys, with messages embedded in the outputs confirming the seizure. Tracing the funds' origins, we connect them to cryptocurrency mixers and establish a link to the Russian ransomware group Conti, implicating intelligence service involvement. This analysis represents one of the first empirical studies of large-scale Bitcoin misuse by nation-state cyber actors.

Title: MaskSDM with Shapley values to improve flexibility, robustness, and explainability in species distribution modeling

Authors: Robin Zbinden, Nina van Tiel, Gencer Sumbul, Chiara Vanalli, Benjamin Kellenberger, Devis Tuia
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.13057
Pdf URL: https://arxiv.org/pdf/2503.13057
Copy Paste: [[2503.13057]] MaskSDM with Shapley values to improve flexibility, robustness, and explainability in species distribution modeling(https://arxiv.org/abs/2503.13057)
Keywords: foundation model
Abstract: Species Distribution Models (SDMs) play a vital role in biodiversity research, conservation planning, and ecological niche modeling by predicting species distributions based on environmental conditions. The selection of predictors is crucial, strongly impacting both model accuracy and how well the predictions reflect ecological patterns. To ensure meaningful insights, input variables must be carefully chosen to match the study objectives and the ecological requirements of the target species. However, existing SDMs, including both traditional and deep learning-based approaches, often lack key capabilities for variable selection: (i) flexibility to choose relevant predictors at inference without retraining; (ii) robustness to handle missing predictor values without compromising accuracy; and (iii) explainability to interpret and accurately quantify each predictor's contribution. To overcome these limitations, we introduce MaskSDM, a novel deep learning-based SDM that enables flexible predictor selection by employing a masked training strategy. This approach allows the model to make predictions with arbitrary subsets of input variables while remaining robust to missing data. It also provides a clearer understanding of how adding or removing a given predictor affects model performance and predictions. Additionally, MaskSDM leverages Shapley values for precise predictor contribution assessments, improving upon traditional approximations. We evaluate MaskSDM on the global sPlotOpen dataset, modeling the distributions of 12,738 plant species. Our results show that MaskSDM outperforms imputation-based methods and approximates models trained on specific subsets of variables. These findings underscore MaskSDM's potential to increase the applicability and adoption of SDMs, laying the groundwork for developing foundation models in SDMs that can be readily applied to diverse ecological applications.

Title: Do Vision Models Develop Human-Like Progressive Difficulty Understanding?

Authors: Zeyi Huang, Utkarsh Ojha, Yuyang Ji, Donghyun Lee, Yong Jae Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13058
Pdf URL: https://arxiv.org/pdf/2503.13058
Copy Paste: [[2503.13058]] Do Vision Models Develop Human-Like Progressive Difficulty Understanding?(https://arxiv.org/abs/2503.13058)
Keywords: generative
Abstract: When a human undertakes a test, their responses likely follow a pattern: if they answered an easy question $(2 \times 3)$ incorrectly, they would likely answer a more difficult one $(2 \times 3 \times 4)$ incorrectly; and if they answered a difficult question correctly, they would likely answer the easy one correctly. Anything else hints at memorization. Do current visual recognition models exhibit a similarly structured learning capacity? In this work, we consider the task of image classification and study if those models' responses follow that pattern. Since real images aren't labeled with difficulty, we first create a dataset of 100 categories, 10 attributes, and 3 difficulty levels using recent generative models: for each category (e.g., dog) and attribute (e.g., occlusion), we generate images of increasing difficulty (e.g., a dog without occlusion, a dog only partly visible). We find that most of the models do in fact behave similarly to the aforementioned pattern around 80-90% of the time. Using this property, we then explore a new way to evaluate those models. Instead of testing the model on every possible test image, we create an adaptive test akin to GRE, in which the model's performance on the current round of images determines the test images in the next round. This allows the model to skip over questions too easy/hard for itself, and helps us get its overall performance in fewer steps.

Title: Rewards Are Enough for Fast Photo-Realistic Text-to-image Generation

Authors: Yihong Luo, Tianyang Hu, Weijian Luo, Kenji Kawaguchi, Jing Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13070
Pdf URL: https://arxiv.org/pdf/2503.13070
Copy Paste: [[2503.13070]] Rewards Are Enough for Fast Photo-Realistic Text-to-image Generation(https://arxiv.org/abs/2503.13070)
Keywords: diffusion, generative
Abstract: Aligning generated images to complicated text prompts and human preferences is a central challenge in Artificial Intelligence-Generated Content (AIGC). With reward-enhanced diffusion distillation emerging as a promising approach that boosts controllability and fidelity of text-to-image models, we identify a fundamental paradigm shift: as conditions become more specific and reward signals stronger, the rewards themselves become the dominant force in generation. In contrast, the diffusion losses serve as an overly expensive form of regularization. To thoroughly validate our hypothesis, we introduce R0, a novel conditional generation approach via regularized reward maximization. Instead of relying on tricky diffusion distillation losses, R0 proposes a new perspective that treats image generations as an optimization problem in data space which aims to search for valid images that have high compositional rewards. By innovative designs of the generator parameterization and proper regularization techniques, we train state-of-the-art few-step text-to-image generative models with R0 at scales. Our results challenge the conventional wisdom of diffusion post-training and conditional generation by demonstrating that rewards play a dominant role in scenarios with complex conditions. We hope our findings can contribute to further research into human-centric and reward-centric generation paradigms across the broader field of AIGC. Code is available at this https URL.

Title: Towards Better Sample Efficiency in Multi-Agent Reinforcement Learning via Exploration

Authors: Amir Baghi, Jens Sjölund, Joakim Bergdahl, Linus Gisslén, Alessandro Sestini
Subjects: cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2503.13077
Pdf URL: https://arxiv.org/pdf/2503.13077
Copy Paste: [[2503.13077]] Towards Better Sample Efficiency in Multi-Agent Reinforcement Learning via Exploration(https://arxiv.org/abs/2503.13077)
Keywords: self-supervised
Abstract: Multi-agent reinforcement learning has shown promise in learning cooperative behaviors in team-based environments. However, such methods often demand extensive training time. For instance, the state-of-the-art method TiZero takes 40 days to train high-quality policies for a football environment. In this paper, we hypothesize that better exploration mechanisms can improve the sample efficiency of multi-agent methods. We propose two different approaches for better exploration in TiZero: a self-supervised intrinsic reward and a random network distillation bonus. Additionally, we introduce architectural modifications to the original algorithm to enhance TiZero's computational efficiency. We evaluate the sample efficiency of these approaches through extensive experiments. Our results show that random network distillation improves training sample efficiency by 18.8% compared to the original TiZero. Furthermore, we evaluate the qualitative behavior of the models produced by both variants against a heuristic AI, with the self-supervised reward encouraging possession and random network distillation leading to a more offensive performance. Our results highlights the applicability of our random network distillation variant in practical settings. Lastly, due to the nature of the proposed method, we acknowledge its use beyond football simulation, especially in environments with strong multi-agent and strategic aspects.

Title: REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities

Authors: Alexander Pugachev, Alena Fenogenova, Vladislav Mikhailov, Ekaterina Artemova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.13102
Pdf URL: https://arxiv.org/pdf/2503.13102
Copy Paste: [[2503.13102]] REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities(https://arxiv.org/abs/2503.13102)
Keywords: generative
Abstract: Recent advances in large language models (LLMs) have introduced the novel paradigm of using LLMs as judges, where an LLM evaluates and scores the outputs of another LLM, which often correlates highly with human preferences. However, the use of LLM-as-a-judge has been primarily studied in English. In this paper, we evaluate this framework in Russian by introducing the Russian Error tyPes Annotation dataset (REPA), a dataset of 1k user queries and 2k LLM-generated responses. Human annotators labeled each response pair expressing their preferences across ten specific error types, as well as selecting an overall preference. We rank six generative LLMs across the error types using three rating systems based on human preferences. We also evaluate responses using eight LLM judges in zero-shot and few-shot settings. We describe the results of analyzing the judges and position and length biases. Our findings reveal a notable gap between LLM judge performance in Russian and English. However, rankings based on human and LLM preferences show partial alignment, suggesting that while current LLM judges struggle with fine-grained evaluation in Russian, there is potential for improvement.

Title: DTGBrepGen: A Novel B-rep Generative Model through Decoupling Topology and Geometry

Authors: Jing Li, Yihang Fu, Falai Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13110
Pdf URL: https://arxiv.org/pdf/2503.13110
Copy Paste: [[2503.13110]] DTGBrepGen: A Novel B-rep Generative Model through Decoupling Topology and Geometry(https://arxiv.org/abs/2503.13110)
Keywords: diffusion, generative
Abstract: Boundary representation (B-rep) of geometric models is a fundamental format in Computer-Aided Design (CAD). However, automatically generating valid and high-quality B-rep models remains challenging due to the complex interdependence between the topology and geometry of the models. Existing methods tend to prioritize geometric representation while giving insufficient attention to topological constraints, making it difficult to maintain structural validity and geometric accuracy. In this paper, we propose DTGBrepGen, a novel topology-geometry decoupled framework for B-rep generation that explicitly addresses both aspects. Our approach first generates valid topological structures through a two-stage process that independently models edge-face and edge-vertex adjacency relationships. Subsequently, we employ Transformer-based diffusion models for sequential geometry generation, progressively generating vertex coordinates, followed by edge geometries and face geometries which are represented as B-splines. Extensive experiments on diverse CAD datasets show that DTGBrepGen significantly outperforms existing methods in both topological validity and geometric accuracy, achieving higher validity rates and producing more diverse and realistic B-reps. Our code is publicly available at this https URL.

Title: 3D Human Interaction Generation: A Survey

Authors: Siyuan Fan, Wenke Huang, Xiantao Cai, Bo Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13120
Pdf URL: https://arxiv.org/pdf/2503.13120
Copy Paste: [[2503.13120]] 3D Human Interaction Generation: A Survey(https://arxiv.org/abs/2503.13120)
Keywords: generative
Abstract: 3D human interaction generation has emerged as a key research area, focusing on producing dynamic and contextually relevant interactions between humans and various interactive entities. Recent rapid advancements in 3D model representation methods, motion capture technologies, and generative models have laid a solid foundation for the growing interest in this domain. Existing research in this field can be broadly categorized into three areas: human-scene interaction, human-object interaction, and human-human interaction. Despite the rapid advancements in this area, challenges remain due to the need for naturalness in human motion generation and the accurate interaction between humans and interactive entities. In this survey, we present a comprehensive literature review of human interaction generation, which, to the best of our knowledge, is the first of its kind. We begin by introducing the foundational technologies, including model representations, motion capture methods, and generative models. Subsequently, we introduce the approaches proposed for the three sub-tasks, along with their corresponding datasets and evaluation metrics. Finally, we discuss potential future research directions in this area and conclude the survey. Through this survey, we aim to offer a comprehensive overview of the current advancements in the field, highlight key challenges, and inspire future research works.

Title: ChainHOI: Joint-based Kinematic Chain Modeling for Human-Object Interaction Generation

Authors: Ling-An Zeng, Guohong Huang, Yi-Lin Wei, Shengbo Gu, Yu-Ming Tang, Jingke Meng, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13130
Pdf URL: https://arxiv.org/pdf/2503.13130
Copy Paste: [[2503.13130]] ChainHOI: Joint-based Kinematic Chain Modeling for Human-Object Interaction Generation(https://arxiv.org/abs/2503.13130)
Keywords: generative
Abstract: We propose ChainHOI, a novel approach for text-driven human-object interaction (HOI) generation that explicitly models interactions at both the joint and kinetic chain levels. Unlike existing methods that implicitly model interactions using full-body poses as tokens, we argue that explicitly modeling joint-level interactions is more natural and effective for generating realistic HOIs, as it directly captures the geometric and semantic relationships between joints, rather than modeling interactions in the latent pose space. To this end, ChainHOI introduces a novel joint graph to capture potential interactions with objects, and a Generative Spatiotemporal Graph Convolution Network to explicitly model interactions at the joint level. Furthermore, we propose a Kinematics-based Interaction Module that explicitly models interactions at the kinetic chain level, ensuring more realistic and biomechanically coherent motions. Evaluations on two public datasets demonstrate that ChainHOI significantly outperforms previous methods, generating more realistic, and semantically consistent HOIs. Code is available \href{this https URL}{here}.

Title: Patient-specific radiomic feature selection with reconstructed healthy persona of knee MR images

Authors: Yaxi Chen, Simin Ni, Aleksandra Ivanova, Shaheer U. Saeed, Rikin Hargunani, Jie Huang, Chaozong Liu, Yipeng Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13131
Pdf URL: https://arxiv.org/pdf/2503.13131
Copy Paste: [[2503.13131]] Patient-specific radiomic feature selection with reconstructed healthy persona of knee MR images(https://arxiv.org/abs/2503.13131)
Keywords: diffusion, generative
Abstract: Classical radiomic features have been designed to describe image appearance and intensity patterns. These features are directly interpretable and readily understood by radiologists. Compared with end-to-end deep learning (DL) models, lower dimensional parametric models that use such radiomic features offer enhanced interpretability but lower comparative performance in clinical tasks. In this study, we propose an approach where a standard logistic regression model performance is substantially improved by learning to select radiomic features for individual patients, from a pool of candidate features. This approach has potentials to maintain the interpretability of such approaches while offering comparable performance to DL. We also propose to expand the feature pool by generating a patient-specific healthy persona via mask-inpainting using a denoising diffusion model trained on healthy subjects. Such a pathology-free baseline feature set allows further opportunity in novel feature discovery and improved condition classification. We demonstrate our method on multiple clinical tasks of classifying general abnormalities, anterior cruciate ligament tears, and meniscus tears. Experimental results demonstrate that our approach achieved comparable or even superior performance than state-of-the-art DL approaches while offering added interpretability by using radiomic features extracted from images and supplemented by generating healthy personas. Example clinical cases are discussed in-depth to demonstrate the intepretability-enabled utilities such as human-explainable feature discovery and patient-specific location/view selection. These findings highlight the potentials of the combination of subject-specific feature selection with generative models in augmenting radiomic analysis for more interpretable decision-making. The codes are available at: this https URL

Title: Language-guided Open-world Video Anomaly Detection

Authors: Zihao Liu, Xiaoyu Wu, Jianqin Wu, Xuxu Wang, Linlin Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13160
Pdf URL: https://arxiv.org/pdf/2503.13160
Copy Paste: [[2503.13160]] Language-guided Open-world Video Anomaly Detection(https://arxiv.org/abs/2503.13160)
Keywords: anomaly
Abstract: Video anomaly detection models aim to detect anomalies that deviate from what is expected. In open-world scenarios, the expected events may change as requirements change. For example, not wearing a mask is considered abnormal during a flu outbreak but normal otherwise. However, existing methods assume that the definition of anomalies is invariable, and thus are not applicable to the open world. To address this, we propose a novel open-world VAD paradigm with variable definitions, allowing guided detection through user-provided natural language at inference time. This paradigm necessitates establishing a robust mapping from video and textual definition to anomaly score. Therefore, we propose LaGoVAD (Language-guided Open-world VAD), a model that dynamically adapts anomaly definitions through two regularization strategies: diversifying the relative durations of anomalies via dynamic video synthesis, and enhancing feature robustness through contrastive learning with negative mining. Training such adaptable models requires diverse anomaly definitions, but existing datasets typically provide given labels without semantic descriptions. To bridge this gap, we collect PreVAD (Pre-training Video Anomaly Dataset), the largest and most diverse video anomaly dataset to date, featuring 35,279 annotated videos with multi-level category labels and descriptions that explicitly define anomalies. Zero-shot experiments on seven datasets demonstrate SOTA performance. Data and code will be released.

Title: DeGauss: Dynamic-Static Decomposition with Gaussian Splatting for Distractor-free 3D Reconstruction

Authors: Rui Wang, Quentin Lohmeyer, Mirko Meboldt, Siyu Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13176
Pdf URL: https://arxiv.org/pdf/2503.13176
Copy Paste: [[2503.13176]] DeGauss: Dynamic-Static Decomposition with Gaussian Splatting for Distractor-free 3D Reconstruction(https://arxiv.org/abs/2503.13176)
Keywords: self-supervised
Abstract: Reconstructing clean, distractor-free 3D scenes from real-world captures remains a significant challenge, particularly in highly dynamic and cluttered settings such as egocentric videos. To tackle this problem, we introduce DeGauss, a simple and robust self-supervised framework for dynamic scene reconstruction based on a decoupled dynamic-static Gaussian Splatting design. DeGauss models dynamic elements with foreground Gaussians and static content with background Gaussians, using a probabilistic mask to coordinate their composition and enable independent yet complementary optimization. DeGauss generalizes robustly across a wide range of real-world scenarios, from casual image collections to long, dynamic egocentric videos, without relying on complex heuristics or extensive supervision. Experiments on benchmarks including NeRF-on-the-go, ADT, AEA, Hot3D, and EPIC-Fields demonstrate that DeGauss consistently outperforms existing methods, establishing a strong baseline for generalizable, distractor-free 3D reconstructionin highly dynamic, interaction-rich environments.

Title: Triad: Empowering LMM-based Anomaly Detection with Vision Expert-guided Visual Tokenizer and Manufacturing Process

Authors: Yuanze Li, Shihao Yuan, Haolin Wang, Qizhang Li, Ming Liu, Chen Xu, Guangming Shi, Wangmeng Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13184
Pdf URL: https://arxiv.org/pdf/2503.13184
Copy Paste: [[2503.13184]] Triad: Empowering LMM-based Anomaly Detection with Vision Expert-guided Visual Tokenizer and Manufacturing Process(https://arxiv.org/abs/2503.13184)
Keywords: anomaly
Abstract: Although recent methods have tried to introduce large multimodal models (LMMs) into industrial anomaly detection (IAD), their generalization in the IAD field is far inferior to that for general purposes. We summarize the main reasons for this gap into two aspects. On one hand, general-purpose LMMs lack cognition of defects in the visual modality, thereby failing to sufficiently focus on defect areas. Therefore, we propose to modify the AnyRes structure of the LLaVA model, providing the potential anomalous areas identified by existing IAD models to the LMMs. On the other hand, existing methods mainly focus on identifying defects by learning defect patterns or comparing with normal samples, yet they fall short of understanding the causes of these defects. Considering that the generation of defects is closely related to the manufacturing process, we propose a manufacturing-driven IAD paradigm. An instruction-tuning dataset for IAD (InstructIAD) and a data organization approach for Chain-of-Thought with manufacturing (CoT-M) are designed to leverage the manufacturing process for IAD. Based on the above two modifications, we present Triad, a novel LMM-based method incorporating an expert-guided region-of-interest tokenizer and manufacturing process for industrial anomaly detection. Extensive experiments show that our Triad not only demonstrates competitive performance against current LMMs but also achieves further improved accuracy when equipped with manufacturing processes. Source code, training data, and pre-trained models will be publicly available at this https URL.

Title: Deep Learning Advancements in Anomaly Detection: A Comprehensive Survey

Authors: Haoqi Huang, Ping Wang, Jianhua Pei, Jiacheng Wang, Shahen Alexanian, Dusit Niyato
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13195
Pdf URL: https://arxiv.org/pdf/2503.13195
Copy Paste: [[2503.13195]] Deep Learning Advancements in Anomaly Detection: A Comprehensive Survey(https://arxiv.org/abs/2503.13195)
Keywords: anomaly
Abstract: The rapid expansion of data from diverse sources has made anomaly detection (AD) increasingly essential for identifying unexpected observations that may signal system failures, security breaches, or fraud. As datasets become more complex and high-dimensional, traditional detection methods struggle to effectively capture intricate patterns. Advances in deep learning have made AD methods more powerful and adaptable, improving their ability to handle high-dimensional and unstructured data. This survey provides a comprehensive review of over 180 recent studies, focusing on deep learning-based AD techniques. We categorize and analyze these methods into reconstruction-based and prediction-based approaches, highlighting their effectiveness in modeling complex data distributions. Additionally, we explore the integration of traditional and deep learning methods, highlighting how hybrid approaches combine the interpretability of traditional techniques with the flexibility of deep learning to enhance detection accuracy and model transparency. Finally, we identify open issues and propose future research directions to advance the field of AD. This review bridges gaps in existing literature and serves as a valuable resource for researchers and practitioners seeking to enhance AD techniques using deep learning.

Title: MedLoRD: A Medical Low-Resource Diffusion Model for High-Resolution 3D CT Image Synthesis

Authors: Marvin Seyfarth, Salman Ul Hassan Dar, Isabelle Ayx, Matthias Alexander Fink, Stefan O. Schoenberg, Hans-Ulrich Kauczor, Sandy Engelhardt
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.13211
Pdf URL: https://arxiv.org/pdf/2503.13211
Copy Paste: [[2503.13211]] MedLoRD: A Medical Low-Resource Diffusion Model for High-Resolution 3D CT Image Synthesis(https://arxiv.org/abs/2503.13211)
Keywords: diffusion, generative
Abstract: Advancements in AI for medical imaging offer significant potential. However, their applications are constrained by the limited availability of data and the reluctance of medical centers to share it due to patient privacy concerns. Generative models present a promising solution by creating synthetic data as a substitute for real patient data. However, medical images are typically high-dimensional, and current state-of-the-art methods are often impractical for computational resource-constrained healthcare environments. These models rely on data sub-sampling, raising doubts about their feasibility and real-world applicability. Furthermore, many of these models are evaluated on quantitative metrics that alone can be misleading in assessing the image quality and clinical meaningfulness of the generated images. To address this, we introduce MedLoRD, a generative diffusion model designed for computational resource-constrained environments. MedLoRD is capable of generating high-dimensional medical volumes with resolutions up to 512$\times$512$\times$256, utilizing GPUs with only 24GB VRAM, which are commonly found in standard desktop workstations. MedLoRD is evaluated across multiple modalities, including Coronary Computed Tomography Angiography and Lung Computed Tomography datasets. Extensive evaluations through radiological evaluation, relative regional volume analysis, adherence to conditional masks, and downstream tasks show that MedLoRD generates high-fidelity images closely adhering to segmentation mask conditions, surpassing the capabilities of current state-of-the-art generative models for medical image synthesis in computational resource-constrained environments.

Title: HoloGest: Decoupled Diffusion and Motion Priors for Generating Holisticly Expressive Co-speech Gestures

Authors: Yongkang Cheng, Shaoli Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13229
Pdf URL: https://arxiv.org/pdf/2503.13229
Copy Paste: [[2503.13229]] HoloGest: Decoupled Diffusion and Motion Priors for Generating Holisticly Expressive Co-speech Gestures(https://arxiv.org/abs/2503.13229)
Keywords: diffusion
Abstract: Animating virtual characters with holistic co-speech gestures is a challenging but critical task. Previous systems have primarily focused on the weak correlation between audio and gestures, leading to physically unnatural outcomes that degrade the user experience. To address this problem, we introduce HoleGest, a novel neural network framework based on decoupled diffusion and motion priors for the automatic generation of high-quality, expressive co-speech gestures. Our system leverages large-scale human motion datasets to learn a robust prior with low audio dependency and high motion reliance, enabling stable global motion and detailed finger movements. To improve the generation efficiency of diffusion-based models, we integrate implicit joint constraints with explicit geometric and conditional constraints, capturing complex motion distributions between large strides. This integration significantly enhances generation speed while maintaining high-quality motion. Furthermore, we design a shared embedding space for gesture-transcription text alignment, enabling the generation of semantically correct gesture actions. Extensive experiments and user feedback demonstrate the effectiveness and potential applications of our model, with our method achieving a level of realism close to the ground truth, providing an immersive user experience. Our code, model, and demo are are available at this https URL.

Title: FlexWorld: Progressively Expanding 3D Scenes for Flexiable-View Synthesis

Authors: Luxi Chen, Zihan Zhou, Min Zhao, Yikai Wang, Ge Zhang, Wenhao Huang, Hao Sun, Ji-Rong Wen, Chongxuan Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13265
Pdf URL: https://arxiv.org/pdf/2503.13265
Copy Paste: [[2503.13265]] FlexWorld: Progressively Expanding 3D Scenes for Flexiable-View Synthesis(https://arxiv.org/abs/2503.13265)
Keywords: diffusion
Abstract: Generating flexible-view 3D scenes, including 360° rotation and zooming, from single images is challenging due to a lack of 3D data. To this end, we introduce FlexWorld, a novel framework consisting of two key components: (1) a strong video-to-video (V2V) diffusion model to generate high-quality novel view images from incomplete input rendered from a coarse scene, and (2) a progressive expansion process to construct a complete 3D scene. In particular, leveraging an advanced pre-trained video model and accurate depth-estimated training pairs, our V2V model can generate novel views under large camera pose variations. Building upon it, FlexWorld progressively generates new 3D content and integrates it into the global scene through geometry-aware scene fusion. Extensive experiments demonstrate the effectiveness of FlexWorld in generating high-quality novel view videos and flexible-view 3D scenes from single images, achieving superior visual quality under multiple popular metrics and datasets compared to existing state-of-the-art methods. Qualitatively, we highlight that FlexWorld can generate high-fidelity scenes with flexible views like 360° rotations and zooming. Project page: this https URL.

Title: Graph Generative Models Evaluation with Masked Autoencoder

Authors: Chengen Wang, Murat Kantarcioglu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13271
Pdf URL: https://arxiv.org/pdf/2503.13271
Copy Paste: [[2503.13271]] Graph Generative Models Evaluation with Masked Autoencoder(https://arxiv.org/abs/2503.13271)
Keywords: generative
Abstract: In recent years, numerous graph generative models (GGMs) have been proposed. However, evaluating these models remains a considerable challenge, primarily due to the difficulty in extracting meaningful graph features that accurately represent real-world graphs. The traditional evaluation techniques, which rely on graph statistical properties like node degree distribution, clustering coefficients, or Laplacian spectrum, overlook node features and lack scalability. There are newly proposed deep learning-based methods employing graph random neural networks or contrastive learning to extract graph features, demonstrating superior performance compared to traditional statistical methods, but their experimental results also demonstrate that these methods do not always working well across different metrics. Although there are overlaps among these metrics, they are generally not interchangeable, each evaluating generative models from a different perspective. In this paper, we propose a novel method that leverages graph masked autoencoders to effectively extract graph features for GGM evaluations. We conduct extensive experiments on graphs and empirically demonstrate that our method can be more reliable and effective than previously proposed methods across a number of GGM evaluation metrics, such as "Fréchet Distance (FD)" and "MMD Linear". However, no single method stands out consistently across all metrics and datasets. Therefore, this study also aims to raise awareness of the significance and challenges associated with GGM evaluation techniques, especially in light of recent advances in generative models.

Title: Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors

Authors: Katja Schwarz, Norman Mueller, Peter Kontschieder
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13272
Pdf URL: https://arxiv.org/pdf/2503.13272
Copy Paste: [[2503.13272]] Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors(https://arxiv.org/abs/2503.13272)
Keywords: diffusion, generative
Abstract: Synthesizing consistent and photorealistic 3D scenes is an open problem in computer vision. Video diffusion models generate impressive videos but cannot directly synthesize 3D representations, i.e., lack 3D consistency in the generated sequences. In addition, directly training generative 3D models is challenging due to a lack of 3D training data at scale. In this work, we present Generative Gaussian Splatting (GGS) -- a novel approach that integrates a 3D representation with a pre-trained latent video diffusion model. Specifically, our model synthesizes a feature field parameterized via 3D Gaussian primitives. The feature field is then either rendered to feature maps and decoded into multi-view images, or directly upsampled into a 3D radiance field. We evaluate our approach on two common benchmark datasets for scene synthesis, RealEstate10K and ScanNet+, and find that our proposed GGS model significantly improves both the 3D consistency of the generated multi-view images, and the quality of the generated 3D scenes over all relevant baselines. Compared to a similar model without 3D representation, GGS improves FID on the generated 3D scenes by ~20% on both RealEstate10K and ScanNet+. Project page: this https URL

Title: MagicDistillation: Weak-to-Strong Video Distillation for Large-Scale Portrait Few-Step Synthesis

Authors: Shitong Shao, Hongwei Yi, Hanzhong Guo, Tian Ye, Daquan Zhou, Michael Lingelbach, Zhiqiang Xu, Zeke Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13319
Pdf URL: https://arxiv.org/pdf/2503.13319
Copy Paste: [[2503.13319]] MagicDistillation: Weak-to-Strong Video Distillation for Large-Scale Portrait Few-Step Synthesis(https://arxiv.org/abs/2503.13319)
Keywords: diffusion
Abstract: Fine-tuning open-source large-scale VDMs for the portrait video synthesis task can result in significant improvements across multiple dimensions, such as visual quality and natural facial motion dynamics. Despite their advancements, how to achieve step distillation and reduce the substantial computational overhead of large-scale VDMs remains unexplored. To fill this gap, this paper proposes Weak-to-Strong Video Distillation (W2SVD) to mitigate both the issue of insufficient training memory and the problem of training collapse observed in vanilla DMD during the training process. Specifically, we first leverage LoRA to fine-tune the fake diffusion transformer (DiT) to address the out-of-memory issue. Then, we employ the W2S distribution matching to adjust the real DiT's parameter, subtly shifting it toward the fake DiT's parameter. This adjustment is achieved by utilizing the weak weight of the low-rank branch, effectively alleviate the conundrum where the video synthesized by the few-step generator deviates from the real data distribution, leading to inaccuracies in the KL divergence approximation. Additionally, we minimize the distance between the fake data distribution and the ground truth distribution to further enhance the visual quality of the synthesized videos. As experimentally demonstrated on HunyuanVideo, W2SVD surpasses the standard Euler, LCM, DMD and even the 28-step standard sampling in FID/FVD and VBench in 1/4-step video synthesis. The project page is in this https URL.

Title: Edit Transfer: Learning Image Editing via Vision In-Context Relations

Authors: Lan Chen, Qi Mao, Yuchao Gu, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13327
Pdf URL: https://arxiv.org/pdf/2503.13327
Copy Paste: [[2503.13327]] Edit Transfer: Learning Image Editing via Vision In-Context Relations(https://arxiv.org/abs/2503.13327)
Keywords: in-context
Abstract: We introduce a new setting, Edit Transfer, where a model learns a transformation from just a single source-target example and applies it to a new query image. While text-based methods excel at semantic manipulations through textual prompts, they often struggle with precise geometric details (e.g., poses and viewpoint changes). Reference-based editing, on the other hand, typically focuses on style or appearance and fails at non-rigid transformations. By explicitly learning the editing transformation from a source-target pair, Edit Transfer mitigates the limitations of both text-only and appearance-centric references. Drawing inspiration from in-context learning in large language models, we propose a visual relation in-context learning paradigm, building upon a DiT-based text-to-image model. We arrange the edited example and the query image into a unified four-panel composite, then apply lightweight LoRA fine-tuning to capture complex spatial transformations from minimal examples. Despite using only 42 training samples, Edit Transfer substantially outperforms state-of-the-art TIE and RIE methods on diverse non-rigid scenarios, demonstrating the effectiveness of few-shot visual relation learning.

Title: One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation

Authors: Daniil Selikhanovych, David Li, Aleksei Leonov, Nikita Gushchin, Sergei Kushneriuk, Alexander Filippov, Evgeny Burnaev, Iaroslav Koshelev, Alexander Korotin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13358
Pdf URL: https://arxiv.org/pdf/2503.13358
Copy Paste: [[2503.13358]] One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation(https://arxiv.org/abs/2503.13358)
Keywords: diffusion
Abstract: Diffusion models for super-resolution (SR) produce high-quality visual results but require expensive computational costs. Despite the development of several methods to accelerate diffusion-based SR models, some (e.g., SinSR) fail to produce realistic perceptual details, while others (e.g., OSEDiff) may hallucinate non-existent structures. To overcome these issues, we present RSD, a new distillation method for ResShift, one of the top diffusion-based SR models. Our method is based on training the student network to produce such images that a new fake ResShift model trained on them will coincide with the teacher model. RSD achieves single-step restoration and outperforms the teacher by a large margin. We show that our distillation method can surpass the other distillation-based method for ResShift - SinSR - making it on par with state-of-the-art diffusion-based SR distillation methods. Compared to SR methods based on pre-trained text-to-image models, RSD produces competitive perceptual quality, provides images with better alignment to degraded input images, and requires fewer parameters and GPU memory. We provide experimental results on various real-world and synthetic datasets, including RealSR, RealSet65, DRealSR, ImageNet, and DIV2K.

Title: SyncDiff: Diffusion-based Talking Head Synthesis with Bottlenecked Temporal Visual Prior for Improved Synchronization

Authors: Xulin Fan, Heting Gao, Ziyi Chen, Peng Chang, Mei Han, Mark Hasegawa-Johnson
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13371
Pdf URL: https://arxiv.org/pdf/2503.13371
Copy Paste: [[2503.13371]] SyncDiff: Diffusion-based Talking Head Synthesis with Bottlenecked Temporal Visual Prior for Improved Synchronization(https://arxiv.org/abs/2503.13371)
Keywords: diffusion
Abstract: Talking head synthesis, also known as speech-to-lip synthesis, reconstructs the facial motions that align with the given audio tracks. The synthesized videos are evaluated on mainly two aspects, lip-speech synchronization and image fidelity. Recent studies demonstrate that GAN-based and diffusion-based models achieve state-of-the-art (SOTA) performance on this task, with diffusion-based models achieving superior image fidelity but experiencing lower synchronization compared to their GAN-based counterparts. To this end, we propose SyncDiff, a simple yet effective approach to improve diffusion-based models using a temporal pose frame with information bottleneck and facial-informative audio features extracted from AVHuBERT, as conditioning input into the diffusion process. We evaluate SyncDiff on two canonical talking head datasets, LRS2 and LRS3 for direct comparison with other SOTA models. Experiments on LRS2/LRS3 datasets show that SyncDiff achieves a synchronization score 27.7%/62.3% relatively higher than previous diffusion-based methods, while preserving their high-fidelity characteristics.

Title: Infinite Mobility: Scalable High-Fidelity Synthesis of Articulated Objects via Procedural Generation

Authors: Xinyu Lian, Zichao Yu, Ruiming Liang, Yitong Wang, Li Ray Luo, Kaixu Chen, Yuanzhen Zhou, Qihong Tang, Xudong Xu, Zhaoyang Lyu, Bo Dai, Jiangmiao Pang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13424
Pdf URL: https://arxiv.org/pdf/2503.13424
Copy Paste: [[2503.13424]] Infinite Mobility: Scalable High-Fidelity Synthesis of Articulated Objects via Procedural Generation(https://arxiv.org/abs/2503.13424)
Keywords: generative
Abstract: Large-scale articulated objects with high quality are desperately needed for multiple tasks related to embodied AI. Most existing methods for creating articulated objects are either data-driven or simulation based, which are limited by the scale and quality of the training data or the fidelity and heavy labour of the simulation. In this paper, we propose Infinite Mobility, a novel method for synthesizing high-fidelity articulated objects through procedural generation. User study and quantitative evaluation demonstrate that our method can produce results that excel current state-of-the-art methods and are comparable to human-annotated datasets in both physics property and mesh quality. Furthermore, we show that our synthetic data can be used as training data for generative models, enabling next-step scaling up. Code is available at this https URL

Title: Measuring In-Context Computation Complexity via Hidden State Prediction

Authors: Vincent Herrmann, Róbert Csordás, Jürgen Schmidhuber
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.13431
Pdf URL: https://arxiv.org/pdf/2503.13431
Copy Paste: [[2503.13431]] Measuring In-Context Computation Complexity via Hidden State Prediction(https://arxiv.org/abs/2503.13431)
Keywords: in-context
Abstract: Detecting when a neural sequence model does "interesting" computation is an open problem. The next token prediction loss is a poor indicator: Low loss can stem from trivially predictable sequences that are uninteresting, while high loss may reflect unpredictable but also irrelevant information that can be ignored by the model. We propose a better metric: measuring the model's ability to predict its own future hidden states. We show empirically that this metric -- in contrast to the next token prediction loss -- correlates with the intuitive interestingness of the task. To measure predictability, we introduce the architecture-agnostic "prediction of hidden states" (PHi) layer that serves as an information bottleneck on the main pathway of the network (e.g., the residual stream in Transformers). We propose a novel learned predictive prior that enables us to measure the novel information gained in each computation step, which serves as our metric. We show empirically that our metric predicts the description length of formal languages learned in-context, the complexity of mathematical reasoning problems, and the correctness of self-generated reasoning chains.

Title: BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing

Authors: Yaowei Li, Lingen Li, Zhaoyang Zhang, Xiaoyu Li, Guangzhi Wang, Hongxiang Li, Xiaodong Cun, Ying Shan, Yuexian Zou
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2503.13434
Pdf URL: https://arxiv.org/pdf/2503.13434
Copy Paste: [[2503.13434]] BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing(https://arxiv.org/abs/2503.13434)
Keywords: diffusion, self-supervised
Abstract: Element-level visual manipulation is essential in digital content creation, but current diffusion-based methods lack the precision and flexibility of traditional tools. In this work, we introduce BlobCtrl, a framework that unifies element-level generation and editing using a probabilistic blob-based representation. By employing blobs as visual primitives, our approach effectively decouples and represents spatial location, semantic content, and identity information, enabling precise element-level manipulation. Our key contributions include: 1) a dual-branch diffusion architecture with hierarchical feature fusion for seamless foreground-background integration; 2) a self-supervised training paradigm with tailored data augmentation and score functions; and 3) controllable dropout strategies to balance fidelity and diversity. To support further research, we introduce BlobData for large-scale training and BlobBench for systematic evaluation. Experiments show that BlobCtrl excels in various element-level manipulation tasks while maintaining computational efficiency, offering a practical solution for precise and flexible visual content creation. Project page: this https URL

Title: Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images

Authors: Tianhao Wu, Chuanxia Zheng, Frank Guan, Andrea Vedaldi, Tat-Jen Cham
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.13439
Pdf URL: https://arxiv.org/pdf/2503.13439
Copy Paste: [[2503.13439]] Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images(https://arxiv.org/abs/2503.13439)
Keywords: generative
Abstract: Most image-based 3D object reconstructors assume that objects are fully visible, ignoring occlusions that commonly occur in real-world scenarios. In this paper, we introduce Amodal3R, a conditional 3D generative model designed to reconstruct 3D objects from partial observations. We start from a "foundation" 3D generative model and extend it to recover plausible 3D geometry and appearance from occluded objects. We introduce a mask-weighted multi-head cross-attention mechanism followed by an occlusion-aware attention layer that explicitly leverages occlusion priors to guide the reconstruction process. We demonstrate that, by training solely on synthetic data, Amodal3R learns to recover full 3D objects even in the presence of occlusions in real scenes. It substantially outperforms existing methods that independently perform 2D amodal completion followed by 3D reconstruction, thereby establishing a new benchmark for occlusion-aware 3D reconstruction.