2025-03-17

Title: Evaluating Local and Cloud-Based Large Language Models for Simulating Consumer Choices in Energy Stated Preference Surveys

Authors: Han Wang, Jacek Pawlak, Aruna Sivakumar
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2503.10652
Pdf URL: https://arxiv.org/pdf/2503.10652
Copy Paste: [[2503.10652]] Evaluating Local and Cloud-Based Large Language Models for Simulating Consumer Choices in Energy Stated Preference Surveys(https://arxiv.org/abs/2503.10652)
Keywords: in-context
Abstract: Survey research is essential in energy demand studies for capturing consumer preferences and informing policy decisions. Stated preference (SP) surveys, in particular, analyse how individuals make trade-offs in hypothetical scenarios. However, traditional survey methods are costly, time-consuming, and affected by biases and respondent fatigue. Large language models (LLMs) have emerged as a potential tool to address these challenges by generating human-like textual responses. This study investigates the ability of LLMs to simulate consumer choices in energy-related SP surveys. A series of test scenarios evaluated the simulation performance of LLMs at both individual and aggregated levels, considering factors in the prompt, in-context learning (ICL), chain-of-thought (CoT) reasoning, the comparison between local and cloud-based LLMs, integration with traditional choice models, and potential biases. Results indicate that while LLMs achieve an average accuracy of up to 48%, surpassing random guessing, their performance remains insufficient for practical application. Local and cloud-based LLMs perform similarly in simulation accuracy but exhibit differences in adherence to prompt requirements and susceptibility to social desirability biases. Findings suggest that previous SP choices are the most effective input factor, while longer prompts with varied factor formats may reduce accuracy. Furthermore, the traditional mixed logit choice model outperforms LLMs and provides insights for refining LLM prompts. Despite their limitations, LLMs provide scalability and efficiency advantages, requiring minimal historical data compared to traditional survey methods. Future research should refine prompt structures, further investigate CoT reasoning, and explore fine-tuning techniques to improve LLM-based energy survey simulations.

Title: Video Anomaly Detection with Structured Keywords

Authors: Thomas Foltz
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10653
Pdf URL: https://arxiv.org/pdf/2503.10653
Copy Paste: [[2503.10653]] Video Anomaly Detection with Structured Keywords(https://arxiv.org/abs/2503.10653)
Keywords: anomaly
Abstract: This paper focuses on detecting anomalies in surveillance video using keywords by leveraging foundational models' feature representation generalization capabilities. We present a novel, lightweight pipeline for anomaly classification using keyword weights. Our pipeline employs a two-stage process: induction followed by deduction. In induction, descriptions are generated from normal and anomalous frames to identify and assign weights to relevant keywords. In deduction, inference frame descriptions are converted into keyword encodings using induction-derived weights for input into our neural network for anomaly classification. We achieved comparable performance on the three benchmarks UCSD Ped2, Shanghai Tech, and CUHK Avenue, with ROC AUC scores of 0.865, 0.745, and 0.742, respectively. These results are achieved without temporal context, making such a system viable for real-time applications. Our model improves implementation setup, interpretability, and inference speed for surveillance devices on the edge, introducing a performance trade-off against other video anomaly detection systems. As the generalization capabilities of open-source foundational models improve, our model demonstrates that the exclusive use of text for feature representations is a promising direction for efficient real-time interpretable video anomaly detection.

Title: Text-to-3D Generation using Jensen-Shannon Score Distillation

Authors: Khoi Do, Binh-Son Hua
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10660
Pdf URL: https://arxiv.org/pdf/2503.10660
Copy Paste: [[2503.10660]] Text-to-3D Generation using Jensen-Shannon Score Distillation(https://arxiv.org/abs/2503.10660)
Keywords: diffusion, generative
Abstract: Score distillation sampling is an effective technique to generate 3D models from text prompts, utilizing pre-trained large-scale text-to-image diffusion models as guidance. However, the produced 3D assets tend to be over-saturating, over-smoothing, with limited diversity. These issues are results from a reverse Kullback-Leibler (KL) divergence objective, which makes the optimization unstable and results in mode-seeking behavior. In this paper, we derive a bounded score distillation objective based on Jensen-Shannon divergence (JSD), which stabilizes the optimization process and produces high-quality 3D generation. JSD can match well generated and target distribution, therefore mitigating mode seeking. We provide a practical implementation of JSD by utilizing the theory of generative adversarial networks to define an approximate objective function for the generator, assuming the discriminator is well trained. By assuming the discriminator following a log-odds classifier, we propose a minority sampling algorithm to estimate the gradients of our proposed objective, providing a practical implementation for JSD. We conduct both theoretical and empirical studies to validate our method. Experimental results on T3Bench demonstrate that our method can produce high-quality and diversified 3D assets.

Title: A Survey on Knowledge-Oriented Retrieval-Augmented Generation

Authors: Mingyue Cheng, Yucong Luo, Jie Ouyang, Qi Liu, Huijie Liu, Li Li, Shuo Yu, Bohou Zhang, Jiawei Cao, Jie Ma, Daoyu Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10677
Pdf URL: https://arxiv.org/pdf/2503.10677
Copy Paste: [[2503.10677]] A Survey on Knowledge-Oriented Retrieval-Augmented Generation(https://arxiv.org/abs/2503.10677)
Keywords: generative
Abstract: Retrieval-Augmented Generation (RAG) has gained significant attention in recent years for its potential to enhance natural language understanding and generation by combining large-scale retrieval systems with generative models. RAG leverages external knowledge sources, such as documents, databases, or structured data, to improve model performance and generate more accurate and contextually relevant outputs. This survey aims to provide a comprehensive overview of RAG by examining its fundamental components, including retrieval mechanisms, generation processes, and the integration between the two. We discuss the key characteristics of RAG, such as its ability to augment generative models with dynamic external knowledge, and the challenges associated with aligning retrieved information with generative objectives. We also present a taxonomy that categorizes RAG methods, ranging from basic retrieval-augmented approaches to more advanced models incorporating multi-modal data and reasoning capabilities. Additionally, we review the evaluation benchmarks and datasets commonly used to assess RAG systems, along with a detailed exploration of its applications in fields such as question answering, summarization, and information retrieval. Finally, we highlight emerging research directions and opportunities for improving RAG systems, such as enhanced retrieval efficiency, model interpretability, and domain-specific adaptations. This paper concludes by outlining the prospects for RAG in addressing real-world challenges and its potential to drive further advancements in natural language processing.

Title: VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion

Authors: Lehan Yang, Jincen Song, Tianlong Wang, Daiqing Qi, Weili Shi, Yuheng Liu, Sheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10678
Pdf URL: https://arxiv.org/pdf/2503.10678
Copy Paste: [[2503.10678]] VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion(https://arxiv.org/abs/2503.10678)
Keywords: diffusion
Abstract: We propose a new task, video referring matting, which obtains the alpha matte of a specified instance by inputting a referring caption. We treat the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models to generate alpha mattes that are temporally coherent and closely related to the corresponding semantic instances. Moreover, we propose a new Latent-Constructive loss to further distinguish different instances, enabling more controllable interactive matting. Additionally, we introduce a large-scale video referring matting dataset with 10,000 videos. To the best of our knowledge, this is the first dataset that concurrently contains captions, videos, and instance-level alpha mattes. Extensive experiments demonstrate the effectiveness of our method. The dataset and code are available at this https URL.

Title: End-to-end Learning of Sparse Interventions on Activations to Steer Generation

Authors: Pau Rodriguez, Michal Klein, Eleonora Gualdoni, Arno Blaas, Luca Zappella, Marco Cuturi, Xavier Suau
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10679
Pdf URL: https://arxiv.org/pdf/2503.10679
Copy Paste: [[2503.10679]] End-to-end Learning of Sparse Interventions on Activations to Steer Generation(https://arxiv.org/abs/2503.10679)
Keywords: diffusion, generative
Abstract: The growing use of generative models in daily life calls for efficient mechanisms to control their generation, to e.g., produce safe content or provide users with tools to explore style changes. Ideally, such mechanisms should be cheap, both at train and inference time, while preserving output quality. Recent research has shown that such mechanisms can be obtained by intervening exclusively on model activations, with the goal of correcting distributional differences between activations seen when using prompts from a source vs. a target set (e.g., toxic and non-toxic sentences). While cheap, these fast methods are inherently crude: their maps are tuned locally, not accounting for their impact on downstream layers, resulting in interventions that cause unintended shifts when used out-of-sample. We propose in this work linear end-to-end activation steering (LinEAS), an approach trained with a global loss that accounts simultaneously for all layerwise distributional shifts. In addition to being more robust, the loss used to train LinEAS can be regularized with sparsifying norms, which can automatically carry out neuron and layer selection. Empirically, LinEAS only requires a handful of samples to be effective, and beats similar baselines on toxicity mitigation, while performing on par with far more involved finetuning approaches. We show that LinEAS interventions can be composed, study the impact of sparsity on their performance, and showcase applications in text-to-image diffusions.

Title: Understanding the Quality-Diversity Trade-off in Diffusion Language Models

Authors: Zak Buzzard
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10683
Pdf URL: https://arxiv.org/pdf/2503.10683
Copy Paste: [[2503.10683]] Understanding the Quality-Diversity Trade-off in Diffusion Language Models(https://arxiv.org/abs/2503.10683)
Keywords: diffusion
Abstract: Diffusion models have seen immense success in modelling continuous data across a range of domains such as vision and audio. Despite the challenges of adapting diffusion models to discrete data, recent work explores their application to text generation by working in the continuous embedding space. However, these models lack a natural means to control the inherent trade-off between quality and diversity as afforded by the temperature hyperparameter in autoregressive models, hindering understanding of model performance and restricting generation quality. This work proposes the use of classifier-free guidance and stochastic clamping for manipulating the quality-diversity trade-off on sequence-to-sequence tasks, demonstrating that these techniques may be used to improve the performance of a diffusion language model.

Title: Open-World Skill Discovery from Unsegmented Demonstrations

Authors: Jingwen Deng, Zihao Wang, Shaofei Cai, Anji Liu, Yitao Liang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10684
Pdf URL: https://arxiv.org/pdf/2503.10684
Copy Paste: [[2503.10684]] Open-World Skill Discovery from Unsegmented Demonstrations(https://arxiv.org/abs/2503.10684)
Keywords: self-supervised
Abstract: Learning skills in open-world environments is essential for developing agents capable of handling a variety of tasks by combining basic skills. Online demonstration videos are typically long but unsegmented, making them difficult to segment and label with skill identifiers. Unlike existing methods that rely on sequence sampling or human labeling, we have developed a self-supervised learning-based approach to segment these long videos into a series of semantic-aware and skill-consistent segments. Drawing inspiration from human cognitive event segmentation theory, we introduce Skill Boundary Detection (SBD), an annotation-free temporal video segmentation algorithm. SBD detects skill boundaries in a video by leveraging prediction errors from a pretrained unconditional action-prediction model. This approach is based on the assumption that a significant increase in prediction error indicates a shift in the skill being executed. We evaluated our method in Minecraft, a rich open-world simulator with extensive gameplay videos available online. Our SBD-generated segments improved the average performance of conditioned policies by 63.7% and 52.1% on short-term atomic skill tasks, and their corresponding hierarchical agents by 11.3% and 20.8% on long-horizon tasks. Our method can leverage the diverse YouTube videos to train instruction-following agents. The project page can be found in this https URL.

Title: VFM-UDA++: Improving Network Architectures and Data Strategies for Unsupervised Domain Adaptive Semantic Segmentation

Authors: Brunó B. Englert, Gijs Dubbelman
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.10685
Pdf URL: https://arxiv.org/pdf/2503.10685
Copy Paste: [[2503.10685]] VFM-UDA++: Improving Network Architectures and Data Strategies for Unsupervised Domain Adaptive Semantic Segmentation(https://arxiv.org/abs/2503.10685)
Keywords: foundation model
Abstract: Unsupervised Domain Adaptation (UDA) has shown remarkably strong generalization from a labeled source domain to an unlabeled target domain while requiring relatively little data. At the same time, large-scale pretraining without labels of so-called Vision Foundation Models (VFMs), has also significantly improved downstream generalization. This motivates us to research how UDA can best utilize the benefits of VFMs. The earlier work of VFM-UDA showed that beyond state-of-the-art (SotA) results can be obtained by replacing non-VFM with VFM encoders in SotA UDA methods. In this work, we take it one step further and improve on the UDA architecture and data strategy themselves. We observe that VFM-UDA, the current SotA UDA method, does not use multi-scale inductive biases or feature distillation losses, while it is known that these can improve generalization. We address both limitations in VFM-UDA++ and obtain beyond SotA generalization on standard UDA benchmarks of up to +5.3 mIoU. Inspired by work on VFM fine-tuning, such as Rein, we also explore the benefits of adding more easy-to-generate synthetic source data with easy-to-obtain unlabeled target data and realize a +6.6 mIoU over the current SotA. The improvements of VFM-UDA++ are most significant for smaller models, however, we show that for larger models, the obtained generalization is only 2.8 mIoU from that of fully-supervised learning with all target labels. Based on these strong results, we provide essential insights to help researchers and practitioners advance UDA.

Title: Context-guided Responsible Data Augmentation with Diffusion Models

Authors: Khawar Islam, Naveed Akhtar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10687
Pdf URL: https://arxiv.org/pdf/2503.10687
Copy Paste: [[2503.10687]] Context-guided Responsible Data Augmentation with Diffusion Models(https://arxiv.org/abs/2503.10687)
Keywords: diffusion, generative
Abstract: Generative diffusion models offer a natural choice for data augmentation when training complex vision models. However, ensuring reliability of their generative content as augmentation samples remains an open challenge. Despite a number of techniques utilizing generative images to strengthen model training, it remains unclear how to utilize the combination of natural and generative images as a rich supervisory signal for effective model induction. In this regard, we propose a text-to-image (T2I) data augmentation method, named DiffCoRe-Mix, that computes a set of generative counterparts for a training sample with an explicitly constrained diffusion model that leverages sample-based context and negative prompting for a reliable augmentation sample generation. To preserve key semantic axes, we also filter out undesired generative samples in our augmentation process. To that end, we propose a hard-cosine filtration in the embedding space of CLIP. Our approach systematically mixes the natural and generative images at pixel and patch levels. We extensively evaluate our technique on ImageNet-1K,Tiny ImageNet-200, CIFAR-100, Flowers102, CUB-Birds, Stanford Cars, and Caltech datasets, demonstrating a notable increase in performance across the board, achieving up to $\sim 3\%$ absolute gain for top-1 accuracy over the state-of-the-art methods, while showing comparable computational overhead. Our code is publicly available at this https URL

Title: Zero-Shot Subject-Centric Generation for Creative Application Using Entropy Fusion

Authors: Kaifeng Zou, Xiaoyi Feng, Peng Wang, Tao Huang, Zizhou Huang, Zhang Haihang, Yuntao Zou, Dagang Li
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2503.10697
Pdf URL: https://arxiv.org/pdf/2503.10697
Copy Paste: [[2503.10697]] Zero-Shot Subject-Centric Generation for Creative Application Using Entropy Fusion(https://arxiv.org/abs/2503.10697)
Keywords: generative
Abstract: Generative models are widely used in visual content creation. However, current text-to-image models often face challenges in practical applications-such as textile pattern design and meme generation-due to the presence of unwanted elements that are difficult to separate with existing methods. Meanwhile, subject-reference generation has emerged as a key research trend, highlighting the need for techniques that can produce clean, high-quality subject images while effectively removing extraneous components. To address this challenge, we introduce a framework for reliable subject-centric image generation. In this work, we propose an entropy-based feature-weighted fusion method to merge the informative cross-attention features obtained from each sampling step of the pretrained text-to-image model FLUX, enabling a precise mask prediction and subject-centric generation. Additionally, we have developed an agent framework based on Large Language Models (LLMs) that translates users' casual inputs into more descriptive prompts, leading to highly detailed image generation. Simultaneously, the agents extract primary elements of prompts to guide the entropy-based feature fusion, ensuring focused primary element generation without extraneous components. Experimental results and user studies demonstrate our methods generates high-quality subject-centric images, outperform existing methods or other possible pipelines, highlighting the effectiveness of our approach.

Title: TA-V2A: Textually Assisted Video-to-Audio Generation

Authors: Yuhuan You, Xihong Wu, Tianshu Qu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.10700
Pdf URL: https://arxiv.org/pdf/2503.10700
Copy Paste: [[2503.10700]] TA-V2A: Textually Assisted Video-to-Audio Generation(https://arxiv.org/abs/2503.10700)
Keywords: diffusion
Abstract: As artificial intelligence-generated content (AIGC) continues to evolve, video-to-audio (V2A) generation has emerged as a key area with promising applications in multimedia editing, augmented reality, and automated content creation. While Transformer and Diffusion models have advanced audio generation, a significant challenge persists in extracting precise semantic information from videos, as current models often lose sequential context by relying solely on frame-based features. To address this, we present TA-V2A, a method that integrates language, audio, and video features to improve semantic representation in latent space. By incorporating large language models for enhanced video comprehension, our approach leverages text guidance to enrich semantic expression. Our diffusion model-based system utilizes automated text modulation to enhance inference quality and efficiency, providing personalized control through text-guided interfaces. This integration enhances semantic expression while ensuring temporal alignment, leading to more accurate and coherent video-to-audio generation.

Title: Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework

Authors: Jing Wang, Fengzhuo Zhang, Xiaoli Li, Vincent Y. F. Tan, Tianyu Pang, Chao Du, Aixin Sun, Zhuoran Yang
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.10704
Pdf URL: https://arxiv.org/pdf/2503.10704
Copy Paste: [[2503.10704]] Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework(https://arxiv.org/abs/2503.10704)
Keywords: diffusion
Abstract: A variety of Auto-Regressive Video Diffusion Models (ARVDM) have achieved remarkable successes in generating realistic long-form videos. However, theoretical analyses of these models remain scant. In this work, we develop theoretical underpinnings for these models and use our insights to improve the performance of existing models. We first develop Meta-ARVDM, a unified framework of ARVDMs that subsumes most existing methods. Using Meta-ARVDM, we analyze the KL-divergence between the videos generated by Meta-ARVDM and the true videos. Our analysis uncovers two important phenomena inherent to ARVDM -- error accumulation and memory bottleneck. By deriving an information-theoretic impossibility result, we show that the memory bottleneck phenomenon cannot be avoided. To mitigate the memory bottleneck, we design various network structures to explicitly use more past frames. We also achieve a significantly improved trade-off between the mitigation of the memory bottleneck and the inference efficiency by compressing the frames. Experimental results on DMLab and Minecraft validate the efficacy of our methods. Our experiments also demonstrate a Pareto-frontier between the error accumulation and memory bottleneck across different methods.

Title: Team NYCU at Defactify4: Robust Detection and Source Identification of AI-Generated Images Using CNN and CLIP-Based Models

Authors: Tsan-Tsung Yang, I-Wei Chen, Kuan-Ting Chen, Shang-Hsuan Chiang, Wen-Chih Peng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10718
Pdf URL: https://arxiv.org/pdf/2503.10718
Copy Paste: [[2503.10718]] Team NYCU at Defactify4: Robust Detection and Source Identification of AI-Generated Images Using CNN and CLIP-Based Models(https://arxiv.org/abs/2503.10718)
Keywords: generative
Abstract: With the rapid advancement of generative AI, AI-generated images have become increasingly realistic, raising concerns about creativity, misinformation, and content authenticity. Detecting such images and identifying their source models has become a critical challenge in ensuring the integrity of digital media. This paper tackles the detection of AI-generated images and identifying their source models using CNN and CLIP-ViT classifiers. For the CNN-based classifier, we leverage EfficientNet-B0 as the backbone and feed with RGB channels, frequency features, and reconstruction errors, while for CLIP-ViT, we adopt a pretrained CLIP image encoder to extract image features and SVM to perform classification. Evaluated on the Defactify 4 dataset, our methods demonstrate strong performance in both tasks, with CLIP-ViT showing superior robustness to image perturbations. Compared to baselines like AEROBLADE and OCC-CLIP, our approach achieves competitive results. Notably, our method ranked Top-3 overall in the Defactify 4 competition, highlighting its effectiveness and generalizability. All of our implementations can be found in this https URL

Title: Numerical and statistical analysis of NeuralODE with Runge-Kutta time integration

Authors: Emily C. Ehrhardt, Hanno Gottschalk, Tobias J. Riedlinger
Subjects: cs.LG, math.CA, math.NA, math.PR
Abstract URL: https://arxiv.org/abs/2503.10729
Pdf URL: https://arxiv.org/pdf/2503.10729
Copy Paste: [[2503.10729]] Numerical and statistical analysis of NeuralODE with Runge-Kutta time integration(https://arxiv.org/abs/2503.10729)
Keywords: generative
Abstract: NeuralODE is one example for generative machine learning based on the push forward of a simple source measure with a bijective mapping, which in the case of NeuralODE is given by the flow of a ordinary differential equation. Using Liouville's formula, the log-density of the push forward measure is easy to compute and thus NeuralODE can be trained based on the maximum Likelihood method such that the Kulback-Leibler divergence between the push forward through the flow map and the target measure generating the data becomes small. In this work, we give a detailed account on the consistency of Maximum Likelihood based empirical risk minimization for a generic class of target measures. In contrast to prior work, we do not only consider the statistical learning theory, but also give a detailed numerical analysis of the NeuralODE algorithm based on the 2nd order Runge-Kutta (RK) time integration. Using the universal approximation theory for deep ReQU networks, the stability and convergence rated for the RK scheme as well as metric entropy and concentration inequalities, we are able to prove that NeuralODE is a probably approximately correct (PAC) learning algorithm.

Title: Visual Polarization Measurement Using Counterfactual Image Generation

Authors: Mohammad Mosaffa, Omid Rafieian, Hema Yoganarasimhan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10738
Pdf URL: https://arxiv.org/pdf/2503.10738
Copy Paste: [[2503.10738]] Visual Polarization Measurement Using Counterfactual Image Generation(https://arxiv.org/abs/2503.10738)
Keywords: generative
Abstract: Political polarization is a significant issue in American politics, influencing public discourse, policy, and consumer behavior. While studies on polarization in news media have extensively focused on verbal content, non-verbal elements, particularly visual content, have received less attention due to the complexity and high dimensionality of image data. Traditional descriptive approaches often rely on feature extraction from images, leading to biased polarization estimates due to information loss. In this paper, we introduce the Polarization Measurement using Counterfactual Image Generation (PMCIG) method, which combines economic theory with generative models and multi-modal deep learning to fully utilize the richness of image data and provide a theoretically grounded measure of polarization in visual content. Applying this framework to a decade-long dataset featuring 30 prominent politicians across 20 major news outlets, we identify significant polarization in visual content, with notable variations across outlets and politicians. At the news outlet level, we observe significant heterogeneity in visual slant. Outlets such as Daily Mail, Fox News, and Newsmax tend to favor Republican politicians in their visual content, while The Washington Post, USA Today, and The New York Times exhibit a slant in favor of Democratic politicians. At the politician level, our results reveal substantial variation in polarized coverage, with Donald Trump and Barack Obama among the most polarizing figures, while Joe Manchin and Susan Collins are among the least. Finally, we conduct a series of validation tests demonstrating the consistency of our proposed measures with external measures of media slant that rely on non-image-based sources.

Title: Panopticon: Advancing Any-Sensor Foundation Models for Earth Observation

Authors: Leonard Waldmann, Ando Shah, Yi Wang, Nils Lehmann, Adam J. Stewart, Zhitong Xiong, Xiao Xiang Zhu, Stefan Bauer, John Chuang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.10845
Pdf URL: https://arxiv.org/pdf/2503.10845
Copy Paste: [[2503.10845]] Panopticon: Advancing Any-Sensor Foundation Models for Earth Observation(https://arxiv.org/abs/2503.10845)
Keywords: foundation model
Abstract: Earth observation (EO) data features diverse sensing platforms with varying spectral bands, spatial resolutions, and sensing modalities. While most prior work has constrained inputs to fixed sensors, a new class of any-sensor foundation models able to process arbitrary sensors has recently emerged. Contributing to this line of work, we propose Panopticon, an any-sensor foundation model built on the DINOv2 framework. We extend DINOv2 by (1) treating images of the same geolocation across sensors as natural augmentations, (2) subsampling channels to diversify spectral input, and (3) adding a cross attention over channels as a flexible patch embedding mechanism. By encoding the wavelength and modes of optical and synthetic aperture radar sensors, respectively, Panopticon can effectively process any combination of arbitrary channels. In extensive evaluations, we achieve state-of-the-art performance on GEO-Bench, especially on the widely-used Sentinel-1 and Sentinel-2 sensors, while out-competing other any-sensor models, as well as domain adapted fixed-sensor models on unique sensor configurations. Panopticon enables immediate generalization to both existing and future satellite platforms, advancing sensor-agnostic EO.

Title: RI3D: Few-Shot Gaussian Splatting With Repair and Inpainting Diffusion Priors

Authors: Avinash Paliwal, Xilong Zhou, Wei Ye, Jinhui Xiong, Rakesh Ranjan, Nima Khademi Kalantari
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2503.10860
Pdf URL: https://arxiv.org/pdf/2503.10860
Copy Paste: [[2503.10860]] RI3D: Few-Shot Gaussian Splatting With Repair and Inpainting Diffusion Priors(https://arxiv.org/abs/2503.10860)
Keywords: diffusion
Abstract: In this paper, we propose RI3D, a novel 3DGS-based approach that harnesses the power of diffusion models to reconstruct high-quality novel views given a sparse set of input images. Our key contribution is separating the view synthesis process into two tasks of reconstructing visible regions and hallucinating missing regions, and introducing two personalized diffusion models, each tailored to one of these tasks. Specifically, one model ('repair') takes a rendered image as input and predicts the corresponding high-quality image, which in turn is used as a pseudo ground truth image to constrain the optimization. The other model ('inpainting') primarily focuses on hallucinating details in unobserved areas. To integrate these models effectively, we introduce a two-stage optimization strategy: the first stage reconstructs visible areas using the repair model, and the second stage reconstructs missing regions with the inpainting model while ensuring coherence through further optimization. Moreover, we augment the optimization with a novel Gaussian initialization method that obtains per-image depth by combining 3D-consistent and smooth depth with highly detailed relative depth. We demonstrate that by separating the process into two tasks and addressing them with the repair and inpainting models, we produce results with detailed textures in both visible and missing regions that outperform state-of-the-art approaches on a diverse set of scenes with extremely sparse inputs.

Title: Memory-Efficient 3D High-Resolution Medical Image Synthesis Using CRF-Guided GANs

Authors: Mahshid Shiri, Alessandro Bruno, Daniele Loiacono
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10899
Pdf URL: https://arxiv.org/pdf/2503.10899
Copy Paste: [[2503.10899]] Memory-Efficient 3D High-Resolution Medical Image Synthesis Using CRF-Guided GANs(https://arxiv.org/abs/2503.10899)
Keywords: generative
Abstract: Generative Adversarial Networks (GANs) have many potential medical imaging applications. Due to the limited memory of Graphical Processing Units (GPUs), most current 3D GAN models are trained on low-resolution medical images, these models cannot scale to high-resolution or are susceptible to patchy artifacts. In this work, we propose an end-to-end novel GAN architecture that uses Conditional Random field (CRF) to model dependencies so that it can generate consistent 3D medical Images without exploiting memory. To achieve this purpose, the generator is divided into two parts during training, the first part produces an intermediate representation and CRF is applied to this intermediate representation to capture correlations. The second part of the generator produces a random sub-volume of image using a subset of the intermediate representation. This structure has two advantages: first, the correlations are modeled by using the features that the generator is trying to optimize. Second, the generator can generate full high-resolution images during inference. Experiments on Lung CTs and Brain MRIs show that our architecture outperforms state-of-the-art while it has lower memory usage and less complexity.

Title: OuroMamba: A Data-Free Quantization Framework for Vision Mamba Models

Authors: Akshat Ramachandran, Mingyu Lee, Huan Xu, Souvik Kundu, Tushar Krishna
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10959
Pdf URL: https://arxiv.org/pdf/2503.10959
Copy Paste: [[2503.10959]] OuroMamba: A Data-Free Quantization Framework for Vision Mamba Models(https://arxiv.org/abs/2503.10959)
Keywords: generative
Abstract: We present OuroMamba, the first data-free post-training quantization (DFQ) method for vision Mamba-based models (VMMs). We identify two key challenges in enabling DFQ for VMMs, (1) VMM's recurrent state transitions restricts capturing of long-range interactions and leads to semantically weak synthetic data, (2) VMM activations exhibit dynamic outlier variations across time-steps, rendering existing static PTQ techniques ineffective. To address these challenges, OuroMamba presents a two-stage framework: (1) OuroMamba-Gen to generate semantically rich and meaningful synthetic data. It applies contrastive learning on patch level VMM features generated through neighborhood interactions in the latent state space, (2) OuroMamba-Quant to employ mixed-precision quantization with lightweight dynamic outlier detection during inference. In specific, we present a thresholding based outlier channel selection strategy for activations that gets updated every time-step. Extensive experiments across vision and generative tasks show that our data-free OuroMamba surpasses existing data-driven PTQ techniques, achieving state-of-the-art performance across diverse quantization settings. Additionally, we implement efficient GPU kernels to achieve practical latency speedup of up to 2.36x. Code will be released soon.

Title: EmoDiffusion: Enhancing Emotional 3D Facial Animation with Latent Diffusion Models

Authors: Yixuan Zhang, Qing Chang, Yuxi Wang, Guang Chen, Zhaoxiang Zhang, Junran Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11028
Pdf URL: https://arxiv.org/pdf/2503.11028
Copy Paste: [[2503.11028]] EmoDiffusion: Enhancing Emotional 3D Facial Animation with Latent Diffusion Models(https://arxiv.org/abs/2503.11028)
Keywords: diffusion
Abstract: Speech-driven 3D facial animation seeks to produce lifelike facial expressions that are synchronized with the speech content and its emotional nuances, finding applications in various multimedia fields. However, previous methods often overlook emotional facial expressions or fail to disentangle them effectively from the speech content. To address these challenges, we present EmoDiffusion, a novel approach that disentangles different emotions in speech to generate rich 3D emotional facial expressions. Specifically, our method employs two Variational Autoencoders (VAEs) to separately generate the upper face region and mouth region, thereby learning a more refined representation of the facial sequence. Unlike traditional methods that use diffusion models to connect facial expression sequences with audio inputs, we perform the diffusion process in the latent space. Furthermore, we introduce an Emotion Adapter to evaluate upper face movements accurately. Given the paucity of 3D emotional talking face data in the animation industry, we capture facial expressions under the guidance of animation experts using LiveLinkFace on an iPhone. This effort results in the creation of an innovative 3D blendshape emotional talking face dataset (3D-BEF) used to train our network. Extensive experiments and perceptual evaluations validate the effectiveness of our approach, confirming its superiority in generating realistic and emotionally rich facial animations.

Title: ACMo: Attribute Controllable Motion Generation

Authors: Mingjie Wei, Xuemei Xie, Guangming Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11038
Pdf URL: https://arxiv.org/pdf/2503.11038
Copy Paste: [[2503.11038]] ACMo: Attribute Controllable Motion Generation(https://arxiv.org/abs/2503.11038)
Keywords: diffusion
Abstract: Attributes such as style, fine-grained text, and trajectory are specific conditions for describing motion. However, existing methods often lack precise user control over motion attributes and suffer from limited generalizability to unseen motions. This work introduces an Attribute Controllable Motion generation architecture, to address these challenges via decouple any conditions and control them separately. Firstly, we explored the Attribute Diffusion Model to imporve text-to-motion performance via decouple text and motion learning, as the controllable model relies heavily on the pre-trained model. Then, we introduce Motion Adpater to quickly finetune previously unseen motion patterns. Its motion prompts inputs achieve multimodal text-to-motion generation that captures user-specified styles. Finally, we propose a LLM Planner to bridge the gap between unseen attributes and dataset-specific texts via local knowledage for user-friendly interaction. Our approach introduces the capability for motion prompts for stylize generation, enabling fine-grained and user-friendly attribute control while providing performance comparable to state-of-the-art methods. Project page: this https URL

Title: InverseBench: Benchmarking Plug-and-Play Diffusion Priors for Inverse Problems in Physical Sciences

Authors: Hongkai Zheng, Wenda Chu, Bingliang Zhang, Zihui Wu, Austin Wang, Berthy T. Feng, Caifeng Zou, Yu Sun, Nikola Kovachki, Zachary E. Ross, Katherine L. Bouman, Yisong Yue
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11043
Pdf URL: https://arxiv.org/pdf/2503.11043
Copy Paste: [[2503.11043]] InverseBench: Benchmarking Plug-and-Play Diffusion Priors for Inverse Problems in Physical Sciences(https://arxiv.org/abs/2503.11043)
Keywords: diffusion
Abstract: Plug-and-play diffusion priors (PnPDP) have emerged as a promising research direction for solving inverse problems. However, current studies primarily focus on natural image restoration, leaving the performance of these algorithms in scientific inverse problems largely unexplored. To address this gap, we introduce \textsc{InverseBench}, a framework that evaluates diffusion models across five distinct scientific inverse problems. These problems present unique structural challenges that differ from existing benchmarks, arising from critical scientific applications such as optical tomography, medical imaging, black hole imaging, seismology, and fluid dynamics. With \textsc{InverseBench}, we benchmark 14 inverse problem algorithms that use plug-and-play diffusion priors against strong, domain-specific baselines, offering valuable new insights into the strengths and weaknesses of existing algorithms. To facilitate further research and development, we open-source the codebase, along with datasets and pre-trained models, at this https URL.

Title: PSF-4D: A Progressive Sampling Framework for View Consistent 4D Editing

Authors: Hasan Iqbal, Nazmul Karim, Umar Khalid, Azib Farooq, Zichun Zhong, Jing Hua, Chen Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11044
Pdf URL: https://arxiv.org/pdf/2503.11044
Copy Paste: [[2503.11044]] PSF-4D: A Progressive Sampling Framework for View Consistent 4D Editing(https://arxiv.org/abs/2503.11044)
Keywords: diffusion, generative
Abstract: Instruction-guided generative models, especially those using text-to-image (T2I) and text-to-video (T2V) diffusion frameworks, have advanced the field of content editing in recent years. To extend these capabilities to 4D scene, we introduce a progressive sampling framework for 4D editing (PSF-4D) that ensures temporal and multi-view consistency by intuitively controlling the noise initialization during forward diffusion. For temporal coherence, we design a correlated Gaussian noise structure that links frames over time, allowing each frame to depend meaningfully on prior frames. Additionally, to ensure spatial consistency across views, we implement a cross-view noise model, which uses shared and independent noise components to balance commonalities and distinct details among different views. To further enhance spatial coherence, PSF-4D incorporates view-consistent iterative refinement, embedding view-aware information into the denoising process to ensure aligned edits across frames and views. Our approach enables high-quality 4D editing without relying on external models, addressing key challenges in previous methods. Through extensive evaluation on multiple benchmarks and multiple editing aspects (e.g., style transfer, multi-attribute editing, object removal, local editing, etc.), we show the effectiveness of our proposed method. Experimental results demonstrate that our proposed method outperforms state-of-the-art 4D editing methods in diverse benchmarks.

Title: Measuring Similarity in Causal Graphs: A Framework for Semantic and Structural Analysis

Authors: Ning-Yuan Georgia Liu, Flower Yang, Mohammad S. Jalali
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11046
Pdf URL: https://arxiv.org/pdf/2503.11046
Copy Paste: [[2503.11046]] Measuring Similarity in Causal Graphs: A Framework for Semantic and Structural Analysis(https://arxiv.org/abs/2503.11046)
Keywords: generative
Abstract: Causal graphs are commonly used to understand and model complex systems. Researchers often construct these graphs from different perspectives, leading to significant variations for the same problem. Comparing causal graphs is, therefore, essential for evaluating assumptions, integrating insights, and resolving disagreements. The rise of AI tools has further amplified this need, as they are increasingly used to generate hypothesized causal graphs by synthesizing information from various sources such as prior research and community inputs, providing the potential for automating and scaling causal modeling for complex systems. Similar to humans, these tools also produce inconsistent results across platforms, versions, and iterations. Despite its importance, research on causal graph comparison remains scarce. Existing methods often focus solely on structural similarities, assuming identical variable names, and fail to capture nuanced semantic relationships, which is essential for causal graph comparison. We address these gaps by investigating methods for comparing causal graphs from both semantic and structural perspectives. First, we reviewed over 40 existing metrics and, based on predefined criteria, selected nine for evaluation from two threads of machine learning: four semantic similarity metrics and five learning graph kernels. We discuss the usability of these metrics in simple examples to illustrate their strengths and limitations. We then generated a synthetic dataset of 2,000 causal graphs using generative AI based on a reference diagram. Our findings reveal that each metric captures a different aspect of similarity, highlighting the need to use multiple metrics.

Title: Towards Privacy-preserved Pre-training of Remote Sensing Foundation Models with Federated Mutual-guidance Learning

Authors: Jieyi Tan, Chengwei Zhang, Bo Dang, Yansheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11051
Pdf URL: https://arxiv.org/pdf/2503.11051
Copy Paste: [[2503.11051]] Towards Privacy-preserved Pre-training of Remote Sensing Foundation Models with Federated Mutual-guidance Learning(https://arxiv.org/abs/2503.11051)
Keywords: foundation model
Abstract: Traditional Remote Sensing Foundation models (RSFMs) are pre-trained with a data-centralized paradigm, through self-supervision on large-scale curated remote sensing data. For each institution, however, pre-training RSFMs with limited data in a standalone manner may lead to suboptimal performance, while aggregating remote sensing data from multiple institutions for centralized pre-training raises privacy concerns. Seeking for collaboration is a promising solution to resolve this dilemma, where multiple institutions can collaboratively train RSFMs without sharing private data. In this paper, we propose a novel privacy-preserved pre-training framework (FedSense), which enables multiple institutions to collaboratively train RSFMs without sharing private data. However, it is a non-trivial task hindered by a vicious cycle, which results from model drift by remote sensing data heterogeneity and high communication overhead. To break this vicious cycle, we introduce Federated Mutual-guidance Learning. Specifically, we propose a Server-to-Clients Guidance (SCG) mechanism to guide clients updates towards global-flatness optimal solutions. Additionally, we propose a Clients-to-Server Guidance (CSG) mechanism to inject local knowledge into the server by low-bit communication. Extensive experiments on four downstream tasks demonstrate the effectiveness of our FedSense in both full-precision and communication-reduced scenarios, showcasing remarkable communication efficiency and performance gains.

Title: Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization

Authors: Kyle Sargent, Kyle Hsu, Justin Johnson, Li Fei-Fei, Jiajun Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11056
Pdf URL: https://arxiv.org/pdf/2503.11056
Copy Paste: [[2503.11056]] Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization(https://arxiv.org/abs/2503.11056)
Keywords: diffusion, generative
Abstract: Since the advent of popular visual generation frameworks like VQGAN and latent diffusion models, state-of-the-art image generation systems have generally been two-stage systems that first tokenize or compress visual data into a lower-dimensional latent space before learning a generative model. Tokenizer training typically follows a standard recipe in which images are compressed and reconstructed subject to a combination of MSE, perceptual, and adversarial losses. Diffusion autoencoders have been proposed in prior work as a way to learn end-to-end perceptually-oriented image compression, but have not yet shown state-of-the-art performance on the competitive task of ImageNet-1K reconstruction. We propose FlowMo, a transformer-based diffusion autoencoder that achieves a new state-of-the-art for image tokenization at multiple compression rates without using convolutions, adversarial losses, spatially-aligned two-dimensional latent codes, or distilling from other tokenizers. Our key insight is that FlowMo training should be broken into a mode-matching pre-training stage and a mode-seeking post-training stage. In addition, we conduct extensive analyses and explore the training of generative models atop the FlowMo tokenizer. Our code and models will be available at this http URL .

Title: Generative Modelling for Mathematical Discovery

Authors: Jordan S. Ellenberg, Cristofero S. Fraser-Taliente, Thomas R. Harvey, Karan Srivastava, Andrew V. Sutherland
Subjects: cs.LG, math.CO
Abstract URL: https://arxiv.org/abs/2503.11061
Pdf URL: https://arxiv.org/pdf/2503.11061
Copy Paste: [[2503.11061]] Generative Modelling for Mathematical Discovery(https://arxiv.org/abs/2503.11061)
Keywords: generative
Abstract: We present a new implementation of the LLM-driven genetic algorithm {\it funsearch}, whose aim is to generate examples of interest to mathematicians and which has already had some success in problems in extremal combinatorics. Our implementation is designed to be useful in practice for working mathematicians; it does not require expertise in machine learning or access to high-performance computing resources. Applying {\it funsearch} to a new problem involves modifying a small segment of Python code and selecting a large language model (LLM) from one of many third-party providers. We benchmarked our implementation on three different problems, obtaining metrics that may inform applications of {\it funsearch} to new problems. Our results demonstrate that {\it funsearch} successfully learns in a variety of combinatorial and number-theoretic settings, and in some contexts learns principles that generalize beyond the problem originally trained on.

Title: Falcon: A Remote Sensing Vision-Language Foundation Model

Authors: Kelu Yao, Nuo Xu, Rong Yang, Yingying Xu, Zhuoyan Gao, Titinunt Kitrungrotsakul, Yi Ren, Pu Zhang, Jin Wang, Ning Wei, Chao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11070
Pdf URL: https://arxiv.org/pdf/2503.11070
Copy Paste: [[2503.11070]] Falcon: A Remote Sensing Vision-Language Foundation Model(https://arxiv.org/abs/2503.11070)
Keywords: foundation model
Abstract: This paper introduces a holistic vision-language foundation model tailored for remote sensing, named Falcon. Falcon offers a unified, prompt-based paradigm that effectively executes comprehensive and complex remote sensing tasks. Falcon demonstrates powerful understanding and reasoning abilities at the image, region, and pixel levels. Specifically, given simple natural language instructions and remote sensing images, Falcon can produce impressive results in text form across 14 distinct tasks, i.e., image classification, object detection, segmentation, image captioning, and etc. To facilitate Falcon's training and empower its representation capacity to encode rich spatial and semantic information, we developed Falcon_SFT, a large-scale, multi-task, instruction-tuning dataset in the field of remote sensing. The Falcon_SFT dataset consists of approximately 78 million high-quality data samples, covering 5.6 million multi-spatial resolution and multi-view remote sensing images with diverse instructions. It features hierarchical annotations and undergoes manual sampling verification to ensure high data quality and reliability. Extensive comparative experiments are conducted, which verify that Falcon achieves remarkable performance over 67 datasets and 14 tasks, despite having only 0.7B parameters. We release the complete dataset, code, and model weights at this https URL, hoping to help further develop the open-source community.

Title: Harnessing Frequency Spectrum Insights for Image Copyright Protection Against Diffusion Models

Authors: Zhenguang Liu, Chao Shuai, Shaojing Fan, Ziping Dong, Jinwu Hu, Zhongjie Ba, Kui Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11071
Pdf URL: https://arxiv.org/pdf/2503.11071
Copy Paste: [[2503.11071]] Harnessing Frequency Spectrum Insights for Image Copyright Protection Against Diffusion Models(https://arxiv.org/abs/2503.11071)
Keywords: diffusion
Abstract: Diffusion models have achieved remarkable success in novel view synthesis, but their reliance on large, diverse, and often untraceable Web datasets has raised pressing concerns about image copyright protection. Current methods fall short in reliably identifying unauthorized image use, as they struggle to generalize across varied generation tasks and fail when the training dataset includes images from multiple sources with few identifiable (watermarked or poisoned) samples. In this paper, we present novel evidence that diffusion-generated images faithfully preserve the statistical properties of their training data, particularly reflected in their spectral features. Leveraging this insight, we introduce \emph{CoprGuard}, a robust frequency domain watermarking framework to safeguard against unauthorized image usage in diffusion model training and fine-tuning. CoprGuard demonstrates remarkable effectiveness against a wide range of models, from naive diffusion models to sophisticated text-to-image models, and is robust even when watermarked images comprise a mere 1\% of the training dataset. This robust and versatile approach empowers content owners to protect their intellectual property in the era of AI-driven image generation.

Title: Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models

Authors: Hongyang Wei, Shuaizheng Liu, Chun Yuan, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11073
Pdf URL: https://arxiv.org/pdf/2503.11073
Copy Paste: [[2503.11073]] Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models(https://arxiv.org/abs/2503.11073)
Keywords: diffusion, generative
Abstract: By leveraging the generative priors from pre-trained text-to-image diffusion models, significant progress has been made in real-world image super-resolution (Real-ISR). However, these methods tend to generate inaccurate and unnatural reconstructions in complex and/or heavily degraded scenes, primarily due to their limited perception and understanding capability of the input low-quality image. To address these limitations, we propose, for the first time to our knowledge, to adapt the pre-trained autoregressive multimodal model such as Lumina-mGPT into a robust Real-ISR model, namely PURE, which Perceives and Understands the input low-quality image, then REstores its high-quality counterpart. Specifically, we implement instruction tuning on Lumina-mGPT to perceive the image degradation level and the relationships between previously generated image tokens and the next token, understand the image content by generating image semantic descriptions, and consequently restore the image by generating high-quality image tokens autoregressively with the collected information. In addition, we reveal that the image token entropy reflects the image structure and present a entropy-based Top-k sampling strategy to optimize the local structure of the image during inference. Experimental results demonstrate that PURE preserves image content while generating realistic details, especially in complex scenes with multiple objects, showcasing the potential of autoregressive multimodal generative models for robust Real-ISR. The model and code will be available at this https URL.

Title: Understanding Flatness in Generative Models: Its Role and Benefits

Authors: Taehwan Lee, Kyeongkook Seo, Jaejun Yoo, Sung Whan Yoon
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11078
Pdf URL: https://arxiv.org/pdf/2503.11078
Copy Paste: [[2503.11078]] Understanding Flatness in Generative Models: Its Role and Benefits(https://arxiv.org/abs/2503.11078)
Keywords: diffusion, generative
Abstract: Flat minima, known to enhance generalization and robustness in supervised learning, remain largely unexplored in generative models. In this work, we systematically investigate the role of loss surface flatness in generative models, both theoretically and empirically, with a particular focus on diffusion models. We establish a theoretical claim that flatter minima improve robustness against perturbations in target prior distributions, leading to benefits such as reduced exposure bias -- where errors in noise estimation accumulate over iterations -- and significantly improved resilience to model quantization, preserving generative performance even under strong quantization constraints. We further observe that Sharpness-Aware Minimization (SAM), which explicitly controls the degree of flatness, effectively enhances flatness in diffusion models, whereas other well-known methods such as Stochastic Weight Averaging (SWA) and Exponential Moving Average (EMA), which promote flatness indirectly via ensembling, are less effective. Through extensive experiments on CIFAR-10, LSUN Tower, and FFHQ, we demonstrate that flat minima in diffusion models indeed improves not only generative performance but also robustness.

Title: A Survey of Cross-domain Graph Learning: Progress and Future Directions

Authors: Haihong Zhao, Chenyi Zi, Aochuan Chen, Jia Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11086
Pdf URL: https://arxiv.org/pdf/2503.11086
Copy Paste: [[2503.11086]] A Survey of Cross-domain Graph Learning: Progress and Future Directions(https://arxiv.org/abs/2503.11086)
Keywords: foundation model
Abstract: Graph learning plays a vital role in mining and analyzing complex relationships involved in graph data, which is widely used in many real-world applications like transaction networks and communication networks. Foundation models in CV and NLP have shown powerful cross-domain capabilities that are also significant in graph domains. However, existing graph learning approaches struggle with cross-domain tasks. Inspired by successes in CV and NLP, cross-domain graph learning has once again become a focal point of attention to realizing true graph foundation models. In this survey, we present a comprehensive review and analysis of existing works on cross-domain graph learning. Concretely, we first propose a new taxonomy, categorizing existing approaches based on the learned cross-domain information: structure, feature, and structure-feature mixture. Next, we systematically survey representative methods in these categories. Finally, we discuss the remaining limitations of existing studies and highlight promising avenues for future research. Relevant papers are summarized and will be consistently updated at: this https URL.

Title: Multi-View Industrial Anomaly Detection with Epipolar Constrained Cross-View Fusion

Authors: Yifan Liu, Xun Xu, Shijie Li, Jingyi Liao, Xulei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11088
Pdf URL: https://arxiv.org/pdf/2503.11088
Copy Paste: [[2503.11088]] Multi-View Industrial Anomaly Detection with Epipolar Constrained Cross-View Fusion(https://arxiv.org/abs/2503.11088)
Keywords: anomaly
Abstract: Multi-camera systems provide richer contextual information for industrial anomaly detection. However, traditional methods process each view independently, disregarding the complementary information across viewpoints. Existing multi-view anomaly detection approaches typically employ data-driven cross-view attention for feature fusion but fail to leverage the unique geometric properties of multi-camera setups. In this work, we introduce an epipolar geometry-constrained attention module to guide cross-view fusion, ensuring more effective information aggregation. To further enhance the potential of cross-view attention, we propose a pretraining strategy inspired by memory bank-based anomaly detection. This approach encourages normal feature representations to form multiple local clusters and incorporate multi-view aware negative sample synthesis to regularize pretraining. We demonstrate that our epipolar guided multi-view anomaly detection framework outperforms existing methods on the state-of-the-art multi-view anomaly detection dataset.

Title: Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space

Authors: Weichen Zhan, Zile Zhou, Zhiheng Zheng, Chen Gao, Jinqiang Cui, Yong Li, Xinlei Chen, Xiao-Ping Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11094
Pdf URL: https://arxiv.org/pdf/2503.11094
Copy Paste: [[2503.11094]] Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space(https://arxiv.org/abs/2503.11094)
Keywords: foundation model
Abstract: Spatial reasoning is a fundamental capability of embodied agents and has garnered widespread attention in the field of multimodal large language models (MLLMs). In this work, we propose a novel benchmark, Open3DVQA, to comprehensively evaluate the spatial reasoning capacities of current state-of-the-art (SOTA) foundation models in open 3D space. Open3DVQA consists of 9k VQA samples, collected using an efficient semi-automated tool in a high-fidelity urban simulator. We evaluate several SOTA MLLMs across various aspects of spatial reasoning, such as relative and absolute spatial relationships, situational reasoning, and object-centric spatial attributes. Our results reveal that: 1) MLLMs perform better at answering questions regarding relative spatial relationships than absolute spatial relationships, 2) MLLMs demonstrate similar spatial reasoning abilities for both egocentric and allocentric perspectives, and 3) Fine-tuning large models significantly improves their performance across different spatial reasoning tasks. We believe that our open-source data collection tools and in-depth analyses will inspire further research on MLLM spatial reasoning capabilities. The benchmark is available at this https URL.

Title: A Novel Decomposed Feature-Oriented Framework for Open-Set Semantic Segmentation on LiDAR Data

Authors: Wenbang Deng, Xieyuanli Chen, Qinghua Yu, Yunze He, Junhao Xiao, Huimin Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11097
Pdf URL: https://arxiv.org/pdf/2503.11097
Copy Paste: [[2503.11097]] A Novel Decomposed Feature-Oriented Framework for Open-Set Semantic Segmentation on LiDAR Data(https://arxiv.org/abs/2503.11097)
Keywords: anomaly
Abstract: Semantic segmentation is a key technique that enables mobile robots to understand and navigate surrounding environments autonomously. However, most existing works focus on segmenting known objects, overlooking the identification of unknown classes, which is common in real-world applications. In this paper, we propose a feature-oriented framework for open-set semantic segmentation on LiDAR data, capable of identifying unknown objects while retaining the ability to classify known ones. We design a decomposed dual-decoder network to simultaneously perform closed-set semantic segmentation and generate distinctive features for unknown objects. The network is trained with multi-objective loss functions to capture the characteristics of known and unknown objects. Using the extracted features, we introduce an anomaly detection mechanism to identify unknown objects. By integrating the results of close-set semantic segmentation and anomaly detection, we achieve effective feature-driven LiDAR open-set semantic segmentation. Evaluations on both SemanticKITTI and nuScenes datasets demonstrate that our proposed framework significantly outperforms state-of-the-art methods. The source code will be made publicly available at this https URL.

Title: A Survey on Self-supervised Contrastive Learning for Multimodal Text-Image Analysis

Authors: Asifullah Khan, Laiba Asmatullah, Anza Malik, Shahzaib Khan, Hamna Asif
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11101
Pdf URL: https://arxiv.org/pdf/2503.11101
Copy Paste: [[2503.11101]] A Survey on Self-supervised Contrastive Learning for Multimodal Text-Image Analysis(https://arxiv.org/abs/2503.11101)
Keywords: self-supervised
Abstract: Self-supervised learning is a machine learning approach that generates implicit labels by learning underlined patterns and extracting discriminative features from unlabeled data without manual labelling. Contrastive learning introduces the concept of "positive" and "negative" samples, where positive pairs (e.g., variation of the same image/object) are brought together in the embedding space, and negative pairs (e.g., views from different images/objects) are pushed farther away. This methodology has shown significant improvements in image understanding and image text analysis without much reliance on labeled data. In this paper, we comprehensively discuss the terminologies, recent developments and applications of contrastive learning with respect to text-image models. Specifically, we provide an overview of the approaches of contrastive learning in text-image models in recent years. Secondly, we categorize the approaches based on different model structures. Thirdly, we further introduce and discuss the latest advances of the techniques used in the process such as pretext tasks for both images and text, architectural structures, and key trends. Lastly, we discuss the recent state-of-art applications of self-supervised contrastive learning Text-Image based models.

Title: Quantifying Interpretability in CLIP Models with Concept Consistency

Authors: Avinash Madasu, Vasudev Lal, Phillip Howard
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11103
Pdf URL: https://arxiv.org/pdf/2503.11103
Copy Paste: [[2503.11103]] Quantifying Interpretability in CLIP Models with Concept Consistency(https://arxiv.org/abs/2503.11103)
Keywords: in-context
Abstract: CLIP is one of the most popular foundational models and is heavily used for many vision-language tasks. However, little is known about the inner workings of CLIP. While recent work has proposed decomposition-based interpretability methods for identifying textual descriptions of attention heads in CLIP, the implications of conceptual consistency in these text labels on interpretability and model performance has not been explored. To bridge this gap, we study the conceptual consistency of text descriptions for attention heads in CLIP-like models. We conduct extensive experiments on six different models from OpenAI and OpenCLIP which vary by size, type of pre-training data and patch size. We propose Concept Consistency Score (CCS), a novel interpretability metric that measures how consistently individual attention heads in CLIP models align with specific concepts. To assign concept labels to heads, we use in-context learning with ChatGPT, guided by a few manually-curated examples, and validate these labels using an LLM-as-a-judge approach. Our soft-pruning experiments reveal that high CCS heads are critical for preserving model performance, as pruning them leads to a significantly larger performance drop than pruning random or low CCS heads. Notably, we find that high CCS heads capture essential concepts and play a key role in out-of-domain detection, concept-specific reasoning, and video-language understanding. These results position CCS as a powerful interpretability metric for analyzing CLIP-like models.

Title: DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation

Authors: Hongbin Lin, Zilu Guo, Yifan Zhang, Shuaicheng Niu, Yafeng Li, Ruimao Zhang, Shuguang Cui, Zhen Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11122
Pdf URL: https://arxiv.org/pdf/2503.11122
Copy Paste: [[2503.11122]] DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation(https://arxiv.org/abs/2503.11122)
Keywords: diffusion
Abstract: In autonomous driving, vision-centric 3D detection aims to identify 3D objects from images. However, high data collection costs and diverse real-world scenarios limit the scale of training data. Once distribution shifts occur between training and test data, existing methods often suffer from performance degradation, known as Out-of-Distribution (OOD) problems. To address this, controllable Text-to-Image (T2I) diffusion offers a potential solution for training data enhancement, which is required to generate diverse OOD scenarios with precise 3D object geometry. Nevertheless, existing controllable T2I approaches are restricted by the limited scale of training data or struggle to preserve all annotated 3D objects. In this paper, we present DriveGEN, a method designed to improve the robustness of 3D detectors in Driving via Training-Free Controllable Text-to-Image Diffusion Generation. Without extra diffusion model training, DriveGEN consistently preserves objects with precise 3D geometry across diverse OOD generations, consisting of 2 stages: 1) Self-Prototype Extraction: We empirically find that self-attention features are semantic-aware but require accurate region selection for 3D objects. Thus, we extract precise object features via layouts to capture 3D object geometry, termed self-prototypes. 2) Prototype-Guided Diffusion: To preserve objects across various OOD scenarios, we perform semantic-aware feature alignment and shallow feature alignment during denoising. Extensive experiments demonstrate the effectiveness of DriveGEN in improving 3D detection. The code is available at this https URL.

Title: SpaceSeg: A High-Precision Intelligent Perception Segmentation Method for Multi-Spacecraft On-Orbit Targets

Authors: Hao Liu, Pengyu Guo, Siyuan Yang, Zeqing Jiang, Qinglei Hu, Dongyu Li
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.11133
Pdf URL: https://arxiv.org/pdf/2503.11133
Copy Paste: [[2503.11133]] SpaceSeg: A High-Precision Intelligent Perception Segmentation Method for Multi-Spacecraft On-Orbit Targets(https://arxiv.org/abs/2503.11133)
Keywords: foundation model
Abstract: With the continuous advancement of human exploration into deep space, intelligent perception and high-precision segmentation technology for on-orbit multi-spacecraft targets have become critical factors for ensuring the success of modern space missions. However, the complex deep space environment, diverse imaging conditions, and high variability in spacecraft morphology pose significant challenges to traditional segmentation methods. This paper proposes SpaceSeg, an innovative vision foundation model-based segmentation framework with four core technical innovations: First, the Multi-Scale Hierarchical Attention Refinement Decoder (MSHARD) achieves high-precision feature decoding through cross-resolution feature fusion via hierarchical attention. Second, the Multi-spacecraft Connected Component Analysis (MS-CCA) effectively resolves topological structure confusion in dense targets. Third, the Spatial Domain Adaptation Transform framework (SDAT) eliminates cross-domain disparities and resist spatial sensor perturbations through composite enhancement strategies. Finally, a custom Multi-Spacecraft Segmentation Task Loss Function is created to significantly improve segmentation robustness in deep space scenarios. To support algorithm validation, we construct the first multi-scale on-orbit multi-spacecraft semantic segmentation dataset SpaceES, which covers four types of spatial backgrounds and 17 typical spacecraft targets. In testing, SpaceSeg achieves state-of-the-art performance with 89.87$\%$ mIoU and 99.98$\%$ mAcc, surpassing existing best methods by 5.71 percentage points. The dataset and code are open-sourced at this https URL to provide critical technical support for next-generation space situational awareness systems.

Title: GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior

Authors: Zichen Tang, Yuan Yao, Miaomiao Cui, Liefeng Bo, Hongyu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11143
Pdf URL: https://arxiv.org/pdf/2503.11143
Copy Paste: [[2503.11143]] GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior(https://arxiv.org/abs/2503.11143)
Keywords: diffusion
Abstract: Text-guided 3D human generation has advanced with the development of efficient 3D representations and 2D-lifting methods like Score Distillation Sampling (SDS). However, current methods suffer from prolonged training times and often produce results that lack fine facial and garment details. In this paper, we propose GaussianIP, an effective two-stage framework for generating identity-preserving realistic 3D humans from text and image prompts. Our core insight is to leverage human-centric knowledge to facilitate the generation process. In stage 1, we propose a novel Adaptive Human Distillation Sampling (AHDS) method to rapidly generate a 3D human that maintains high identity consistency with the image prompt and achieves a realistic appearance. Compared to traditional SDS methods, AHDS better aligns with the human-centric generation process, enhancing visual quality with notably fewer training steps. To further improve the visual quality of the face and clothes regions, we design a View-Consistent Refinement (VCR) strategy in stage 2. Specifically, it produces detail-enhanced results of the multi-view images from stage 1 iteratively, ensuring the 3D texture consistency across views via mutual attention and distance-guided attention fusion. Then a polished version of the 3D human can be achieved by directly perform reconstruction with the refined images. Extensive experiments demonstrate that GaussianIP outperforms existing methods in both visual quality and training efficiency, particularly in generating identity-preserving results. Our code is available at: this https URL.

Title: Neurons: Emulating the Human Visual Cortex Improves Fidelity and Interpretability in fMRI-to-Video Reconstruction

Authors: Haonan Wang, Qixiang Zhang, Lehan Wang, Xuanqi Huang, Xiaomeng Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11167
Pdf URL: https://arxiv.org/pdf/2503.11167
Copy Paste: [[2503.11167]] Neurons: Emulating the Human Visual Cortex Improves Fidelity and Interpretability in fMRI-to-Video Reconstruction(https://arxiv.org/abs/2503.11167)
Keywords: diffusion
Abstract: Decoding visual stimuli from neural activity is essential for understanding the human brain. While fMRI methods have successfully reconstructed static images, fMRI-to-video reconstruction faces challenges due to the need for capturing spatiotemporal dynamics like motion and scene transitions. Recent approaches have improved semantic and perceptual alignment but struggle to integrate coarse fMRI data with detailed visual features. Inspired by the hierarchical organization of the visual system, we propose NEURONS, a novel framework that decouples learning into four correlated sub-tasks: key object segmentation, concept recognition, scene description, and blurry video reconstruction. This approach simulates the visual cortex's functional specialization, allowing the model to capture diverse video content. In the inference stage, NEURONS generates robust conditioning signals for a pre-trained text-to-video diffusion model to reconstruct the videos. Extensive experiments demonstrate that NEURONS outperforms state-of-the-art baselines, achieving solid improvements in video consistency (26.6%) and semantic-level accuracy (19.1%). Notably, NEURONS shows a strong functional correlation with the visual cortex, highlighting its potential for brain-computer interfaces and clinical applications. Code and model weights will be available at: this https URL.

Title: Multi-Stage Generative Upscaler: Reconstructing Football Broadcast Images via Diffusion Models

Authors: Luca Martini, Daniele Zolezzi, Saverio Iacono, Gianni Viardo Vercelli
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11181
Pdf URL: https://arxiv.org/pdf/2503.11181
Copy Paste: [[2503.11181]] Multi-Stage Generative Upscaler: Reconstructing Football Broadcast Images via Diffusion Models(https://arxiv.org/abs/2503.11181)
Keywords: diffusion, generative
Abstract: The reconstruction of low-resolution football broadcast images presents a significant challenge in sports broadcasting, where detailed visuals are essential for analysis and audience engagement. This study introduces a multi-stage generative upscaling framework leveraging Diffusion Models to enhance degraded images, transforming inputs as small as $64 \times 64$ pixels into high-fidelity $1024 \times 1024$ outputs. By integrating an image-to-image pipeline, ControlNet conditioning, and LoRA fine-tuning, our approach surpasses traditional upscaling methods in restoring intricate textures and domain-specific elements such as player details and jersey logos. The custom LoRA is trained on a custom football dataset, ensuring adaptability to sports broadcast needs. Experimental results demonstrate substantial improvements over conventional models, with ControlNet refining fine details and LoRA enhancing task-specific elements. These findings highlight the potential of diffusion-based image reconstruction in sports media, paving the way for future applications in automated video enhancement and real-time sports analytics.

Title: Palette of Language Models: A Solver for Controlled Text Generation

Authors: Zhe Yang, Yi Huang, Yaqin Chen, Xiaoting Wu, Junlan Feng, Chao Deng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11182
Pdf URL: https://arxiv.org/pdf/2503.11182
Copy Paste: [[2503.11182]] Palette of Language Models: A Solver for Controlled Text Generation(https://arxiv.org/abs/2503.11182)
Keywords: generative
Abstract: Recent advancements in large language models have revolutionized text generation with their remarkable capabilities. These models can produce controlled texts that closely adhere to specific requirements when prompted appropriately. However, designing an optimal prompt to control multiple attributes simultaneously can be challenging. A common approach is to linearly combine single-attribute models, but this strategy often overlooks attribute overlaps and can lead to conflicts. Therefore, we propose a novel combination strategy inspired by the Law of Total Probability and Conditional Mutual Information Minimization on generative language models. This method has been adapted for single-attribute control scenario and is termed the Palette of Language Models due to its theoretical linkage between attribute strength and generation style, akin to blending colors on an artist's palette. Moreover, positive correlation and attribute enhancement are advanced as theoretical properties to guide a rational combination strategy design. We conduct experiments on both single control and multiple control settings, and achieve surpassing results.

Title: Toward Generalized Image Quality Assessment: Relaxing the Perfect Reference Quality Assumption

Authors: Du Chen, Tianhe Wu, Kede Ma, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11221
Pdf URL: https://arxiv.org/pdf/2503.11221
Copy Paste: [[2503.11221]] Toward Generalized Image Quality Assessment: Relaxing the Perfect Reference Quality Assumption(https://arxiv.org/abs/2503.11221)
Keywords: diffusion, generative
Abstract: Full-reference image quality assessment (FR-IQA) generally assumes that reference images are of perfect quality. However, this assumption is flawed due to the sensor and optical limitations of modern imaging systems. Moreover, recent generative enhancement methods are capable of producing images of higher quality than their original. All of these challenge the effectiveness and applicability of current FR-IQA models. To relax the assumption of perfect reference image quality, we build a large-scale IQA database, namely DiffIQA, containing approximately 180,000 images generated by a diffusion-based image enhancer with adjustable hyper-parameters. Each image is annotated by human subjects as either worse, similar, or better quality compared to its reference. Building on this, we present a generalized FR-IQA model, namely Adaptive Fidelity-Naturalness Evaluator (A-FINE), to accurately assess and adaptively combine the fidelity and naturalness of a test image. A-FINE aligns well with standard FR-IQA when the reference image is much more natural than the test image. We demonstrate by extensive experiments that A-FINE surpasses standard FR-IQA models on well-established IQA datasets and our newly created DiffIQA. To further validate A-FINE, we additionally construct a super-resolution IQA benchmark (SRIQA-Bench), encompassing test images derived from ten state-of-the-art SR methods with reliable human quality annotations. Tests on SRIQA-Bench re-affirm the advantages of A-FINE. The code and dataset are available at this https URL.

Title: Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards

Authors: Zijing Hu, Fengda Zhang, Long Chen, Kun Kuang, Jiahui Li, Kaifeng Gao, Jun Xiao, Xin Wang, Wenwu Zhu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11240
Pdf URL: https://arxiv.org/pdf/2503.11240
Copy Paste: [[2503.11240]] Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards(https://arxiv.org/abs/2503.11240)
Keywords: diffusion
Abstract: Diffusion models have achieved remarkable success in text-to-image generation. However, their practical applications are hindered by the misalignment between generated images and corresponding text prompts. To tackle this issue, reinforcement learning (RL) has been considered for diffusion model fine-tuning. Yet, RL's effectiveness is limited by the challenge of sparse reward, where feedback is only available at the end of the generation process. This makes it difficult to identify which actions during the denoising process contribute positively to the final generated image, potentially leading to ineffective or unnecessary denoising policies. To this end, this paper presents a novel RL-based framework that addresses the sparse reward problem when training diffusion models. Our framework, named $\text{B}^2\text{-DiffuRL}$, employs two strategies: \textbf{B}ackward progressive training and \textbf{B}ranch-based sampling. For one thing, backward progressive training focuses initially on the final timesteps of denoising process and gradually extends the training interval to earlier timesteps, easing the learning difficulty from sparse rewards. For another, we perform branch-based sampling for each training interval. By comparing the samples within the same branch, we can identify how much the policies of the current training interval contribute to the final image, which helps to learn effective policies instead of unnecessary ones. $\text{B}^2\text{-DiffuRL}$ is compatible with existing optimization algorithms. Extensive experiments demonstrate the effectiveness of $\text{B}^2\text{-DiffuRL}$ in improving prompt-image alignment and maintaining diversity in generated images. The code for this work is available.

Title: Spherical Tree-Sliced Wasserstein Distance

Authors: Hoang V. Tran, Thanh T. Chu, Khoi N.M. Nguyen, Trang Pham, Tam Le, Tan M. Nguyen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11249
Pdf URL: https://arxiv.org/pdf/2503.11249
Copy Paste: [[2503.11249]] Spherical Tree-Sliced Wasserstein Distance(https://arxiv.org/abs/2503.11249)
Keywords: self-supervised
Abstract: Sliced Optimal Transport (OT) simplifies the OT problem in high-dimensional spaces by projecting supports of input measures onto one-dimensional lines and then exploiting the closed-form expression of the univariate OT to reduce the computational burden of OT. Recently, the Tree-Sliced method has been introduced to replace these lines with more intricate structures, known as tree systems. This approach enhances the ability to capture topological information of integration domains in Sliced OT while maintaining low computational cost. Inspired by this approach, in this paper, we present an adaptation of tree systems on OT problems for measures supported on a sphere. As a counterpart to the Radon transform variant on tree systems, we propose a novel spherical Radon transform with a new integration domain called spherical trees. By leveraging this transform and exploiting the spherical tree structures, we derive closed-form expressions for OT problems on the sphere. Consequently, we obtain an efficient metric for measures on the sphere, named Spherical Tree-Sliced Wasserstein (STSW) distance. We provide an extensive theoretical analysis to demonstrate the topology of spherical trees and the well-definedness and injectivity of our Radon transform variant, which leads to an orthogonally invariant distance between spherical measures. Finally, we conduct a wide range of numerical experiments, including gradient flows and self-supervised learning, to assess the performance of our proposed metric, comparing it to recent benchmarks.

Title: Federated Koopman-Reservoir Learning for Large-Scale Multivariate Time-Series Anomaly Detection

Authors: Long Tan Le, Tung-Anh Nguyen, Han Shu, Suranga Seneviratne, Choong Seon Hong, Nguyen H. Tran
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2503.11255
Pdf URL: https://arxiv.org/pdf/2503.11255
Copy Paste: [[2503.11255]] Federated Koopman-Reservoir Learning for Large-Scale Multivariate Time-Series Anomaly Detection(https://arxiv.org/abs/2503.11255)
Keywords: anomaly
Abstract: The proliferation of edge devices has dramatically increased the generation of multivariate time-series (MVTS) data, essential for applications from healthcare to smart cities. Such data streams, however, are vulnerable to anomalies that signal crucial problems like system failures or security incidents. Traditional MVTS anomaly detection methods, encompassing statistical and centralized machine learning approaches, struggle with the heterogeneity, variability, and privacy concerns of large-scale, distributed environments. In response, we introduce FedKO, a novel unsupervised Federated Learning framework that leverages the linear predictive capabilities of Koopman operator theory along with the dynamic adaptability of Reservoir Computing. This enables effective spatiotemporal processing and privacy preservation for MVTS data. FedKO is formulated as a bi-level optimization problem, utilizing a specific federated algorithm to explore a shared Reservoir-Koopman model across diverse datasets. Such a model is then deployable on edge devices for efficient detection of anomalies in local MVTS streams. Experimental results across various datasets showcase FedKO's superior performance against state-of-the-art methods in MVTS anomaly detection. Moreover, FedKO reduces up to 8x communication size and 2x memory usage, making it highly suitable for large-scale systems.

Title: Noise Synthesis for Low-Light Image Denoising with Diffusion Models

Authors: Liying Lu, Raphaël Achddou, Sabine Süsstrunk
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.11262
Pdf URL: https://arxiv.org/pdf/2503.11262
Copy Paste: [[2503.11262]] Noise Synthesis for Low-Light Image Denoising with Diffusion Models(https://arxiv.org/abs/2503.11262)
Keywords: diffusion
Abstract: Low-light photography produces images with low signal-to-noise ratios due to limited photons. In such conditions, common approximations like the Gaussian noise model fall short, and many denoising techniques fail to remove noise effectively. Although deep-learning methods perform well, they require large datasets of paired images that are impractical to acquire. As a remedy, synthesizing realistic low-light noise has gained significant attention. In this paper, we investigate the ability of diffusion models to capture the complex distribution of low-light noise. We show that a naive application of conventional diffusion models is inadequate for this task and propose three key adaptations that enable high-precision noise generation without calibration or post-processing: a two-branch architecture to better model signal-dependent and signal-independent noise, the incorporation of positional information to capture fixed-pattern noise, and a tailored diffusion noise schedule. Consequently, our model enables the generation of large datasets for training low-light denoising networks, leading to state-of-the-art performance. Through comprehensive analysis, including statistical evaluation and noise decomposition, we provide deeper insights into the characteristics of the generated data.

Title: CyclePose -- Leveraging Cycle-Consistency for Annotation-Free Nuclei Segmentation in Fluorescence Microscopy

Authors: Jonas Utz, Stefan Vocht, Anne Tjorven Buessen, Dennis Possart, Fabian Wagner, Mareike Thies, Mingxuan Gu, Stefan Uderhardt, Katharina Breininger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11266
Pdf URL: https://arxiv.org/pdf/2503.11266
Copy Paste: [[2503.11266]] CyclePose -- Leveraging Cycle-Consistency for Annotation-Free Nuclei Segmentation in Fluorescence Microscopy(https://arxiv.org/abs/2503.11266)
Keywords: diffusion, generative
Abstract: In recent years, numerous neural network architectures specifically designed for the instance segmentation of nuclei in microscopic images have been released. These models embed nuclei-specific priors to outperform generic architectures like U-Nets; however, they require large annotated datasets, which are often not available. Generative models (GANs, diffusion models) have been used to compensate for this by synthesizing training data. These two-stage approaches are computationally expensive, as first a generative model and then a segmentation model has to be trained. We propose CyclePose, a hybrid framework integrating synthetic data generation and segmentation training. CyclePose builds on a CycleGAN architecture, which allows unpaired translation between microscopy images and segmentation masks. We embed a segmentation model into CycleGAN and leverage a cycle consistency loss for self-supervision. Without annotated data, CyclePose outperforms other weakly or unsupervised methods on two public datasets. Code is available at this https URL

Title: OPTIMUS: Predicting Multivariate Outcomes in Alzheimer's Disease Using Multi-modal Data amidst Missing Values

Authors: Christelle Schneuwly Diaz, Duy-Thanh Vu, Julien Bodelet, Duy-Cat Can, Guillaume Blanc, Haiting Jiang, Lin Yao, Guiseppe Pantaleo, ADNI, Oliver Y. Chén
Subjects: cs.LG, q-bio.NC
Abstract URL: https://arxiv.org/abs/2503.11282
Pdf URL: https://arxiv.org/pdf/2503.11282
Copy Paste: [[2503.11282]] OPTIMUS: Predicting Multivariate Outcomes in Alzheimer's Disease Using Multi-modal Data amidst Missing Values(https://arxiv.org/abs/2503.11282)
Keywords: generative
Abstract: Alzheimer's disease, a neurodegenerative disorder, is associated with neural, genetic, and proteomic factors while affecting multiple cognitive and behavioral faculties. Traditional AD prediction largely focuses on univariate disease outcomes, such as disease stages and severity. Multimodal data encode broader disease information than a single modality and may, therefore, improve disease prediction; but they often contain missing values. Recent "deeper" machine learning approaches show promise in improving prediction accuracy, yet the biological relevance of these models needs to be further charted. Integrating missing data analysis, predictive modeling, multimodal data analysis, and explainable AI, we propose OPTIMUS, a predictive, modular, and explainable machine learning framework, to unveil the many-to-many predictive pathways between multimodal input data and multivariate disease outcomes amidst missing values. OPTIMUS first applies modality-specific imputation to uncover data from each modality while optimizing overall prediction accuracy. It then maps multimodal biomarkers to multivariate outcomes using machine-learning and extracts biomarkers respectively predictive of each outcome. Finally, OPTIMUS incorporates XAI to explain the identified multimodal biomarkers. Using data from 346 cognitively normal subjects, 608 persons with mild cognitive impairment, and 251 AD patients, OPTIMUS identifies neural and transcriptomic signatures that jointly but differentially predict multivariate outcomes related to executive function, language, memory, and visuospatial function. Our work demonstrates the potential of building a predictive and biologically explainable machine-learning framework to uncover multimodal biomarkers that capture disease profiles across varying cognitive landscapes. The results improve our understanding of the complex many-to-many pathways in AD.

Title: BriLLM: Brain-inspired Large Language Model

Authors: Hai Zhao, Hongqiu Wu, Dongjie Yang, Anni Zou, Jiale Hong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11299
Pdf URL: https://arxiv.org/pdf/2503.11299
Copy Paste: [[2503.11299]] BriLLM: Brain-inspired Large Language Model(https://arxiv.org/abs/2503.11299)
Keywords: generative
Abstract: This paper reports the first brain-inspired large language model (BriLLM). This is a non-Transformer, non-GPT, non-traditional machine learning input-output controlled generative language model. The model is based on the Signal Fully-connected flowing (SiFu) definition on the directed graph in terms of the neural network, and has the interpretability of all nodes on the graph of the whole model, instead of the traditional machine learning model that only has limited interpretability at the input and output ends. In the language model scenario, the token is defined as a node in the graph. A randomly shaped or user-defined signal flow flows between nodes on the principle of "least resistance" along paths. The next token or node to be predicted or generated is the target of the signal flow. As a language model, BriLLM theoretically supports infinitely long $n$-gram models when the model size is independent of the input and predicted length of the model. The model's working signal flow provides the possibility of recall activation and innate multi-modal support similar to the cognitive patterns of the human brain. At present, we released the first BriLLM version in Chinese, with 4000 tokens, 32-dimensional node width, 16-token long sequence prediction ability, and language model prediction performance comparable to GPT-1. More computing power will help us explore the infinite possibilities depicted above.

Title: Leveraging Diffusion Knowledge for Generative Image Compression with Fractal Frequency-Aware Band Learning

Authors: Lingyu Zhu, Xiangrui Zeng, Bolin Chen, Peilin Chen, Yung-Hui Li, Shiqi Wang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.11321
Pdf URL: https://arxiv.org/pdf/2503.11321
Copy Paste: [[2503.11321]] Leveraging Diffusion Knowledge for Generative Image Compression with Fractal Frequency-Aware Band Learning(https://arxiv.org/abs/2503.11321)
Keywords: diffusion, generative
Abstract: By optimizing the rate-distortion-realism trade-off, generative image compression approaches produce detailed, realistic images instead of the only sharp-looking reconstructions produced by rate-distortion-optimized models. In this paper, we propose a novel deep learning-based generative image compression method injected with diffusion knowledge, obtaining the capacity to recover more realistic textures in practical scenarios. Efforts are made from three perspectives to navigate the rate-distortion-realism trade-off in the generative image compression task. First, recognizing the strong connection between image texture and frequency-domain characteristics, we design a Fractal Frequency-Aware Band Image Compression (FFAB-IC) network to effectively capture the directional frequency components inherent in natural images. This network integrates commonly used fractal band feature operations within a neural non-linear mapping design, enhancing its ability to retain essential given information and filter out unnecessary details. Then, to improve the visual quality of image reconstruction under limited bandwidth, we integrate diffusion knowledge into the encoder and implement diffusion iterations into the decoder process, thus effectively recovering lost texture details. Finally, to fully leverage the spatial and frequency intensity information, we incorporate frequency- and content-aware regularization terms to regularize the training of the generative image compression network. Extensive experiments in quantitative and qualitative evaluations demonstrate the superiority of the proposed method, advancing the boundaries of achievable distortion-realism pairs, i.e., our method achieves better distortions at high realism and better realism at low distortion than ever before.

Title: Self-Supervised Pretraining for Fine-Grained Plankton Recognition

Authors: Joona Kareinen, Tuomas Eerola, Kaisa Kraft, Lasse Lensu, Sanna Suikkanen, Heikki Kälviäinen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11341
Pdf URL: https://arxiv.org/pdf/2503.11341
Copy Paste: [[2503.11341]] Self-Supervised Pretraining for Fine-Grained Plankton Recognition(https://arxiv.org/abs/2503.11341)
Keywords: self-supervised
Abstract: Plankton recognition is an important computer vision problem due to plankton's essential role in ocean food webs and carbon capture, highlighting the need for species-level monitoring. However, this task is challenging due to its fine-grained nature and dataset shifts caused by different imaging instruments and varying species distributions. As new plankton image datasets are collected at an increasing pace, there is a need for general plankton recognition models that require minimal expert effort for data labeling. In this work, we study large-scale self-supervised pretraining for fine-grained plankton recognition. We first employ masked autoencoding and a large volume of diverse plankton image data to pretrain a general-purpose plankton image encoder. Then we utilize fine-tuning to obtain accurate plankton recognition models for new datasets with a very limited number of labeled training images. Our experiments show that self-supervised pretraining with diverse plankton data clearly increases plankton recognition accuracy compared to standard ImageNet pretraining when the amount of training data is limited. Moreover, the accuracy can be further improved when unlabeled target data is available and utilized during the pretraining.

Title: AIstorian lets AI be a historian: A KG-powered multi-agent system for accurate biography generation

Authors: Fengyu Li (1), Yilin Li (1), Junhao Zhu (1), Lu Chen (1), Yanfei Zhang (1), Jia Zhou (1), Hui Zu (1), Jingwen Zhao (2), Yunjun Gao (1) ((1) Zhejiang University, (2) Poisson Lab, Huawei)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11346
Pdf URL: https://arxiv.org/pdf/2503.11346
Copy Paste: [[2503.11346]] AIstorian lets AI be a historian: A KG-powered multi-agent system for accurate biography generation(https://arxiv.org/abs/2503.11346)
Keywords: in-context
Abstract: Huawei has always been committed to exploring the AI application in historical research. Biography generation, as a specialized form of abstractive summarization, plays a crucial role in historical research but faces unique challenges that existing large language models (LLMs) struggle to address. These challenges include maintaining stylistic adherence to historical writing conventions, ensuring factual fidelity, and handling fragmented information across multiple documents. We present AIstorian, a novel end-to-end agentic system featured with a knowledge graph (KG)-powered retrieval-augmented generation (RAG) and anti-hallucination multi-agents. Specifically, AIstorian introduces an in-context learning based chunking strategy and a KG-based index for accurate and efficient reference retrieval. Meanwhile, AIstorian orchestrates multi-agents to conduct on-the-fly hallucination detection and error-type-aware correction. Additionally, to teach LLMs a certain language style, we finetune LLMs based on a two-step training approach combining data augmentation-enhanced supervised fine-tuning with stylistic preference optimization. Extensive experiments on a real-life historical Jinshi dataset demonstrate that AIstorian achieves a 3.8x improvement in factual accuracy and a 47.6% reduction in hallucination rate compared to existing baselines. The data and code are available at: this https URL.

Title: PARIC: Probabilistic Attention Regularization for Language Guided Image Classification from Pre-trained Vison Language Models

Authors: Mayank Nautiyal, Stela Arranz Gheorghe, Kristiana Stefa, Li Ju, Ida-Maria Sintorn, Prashant Singh
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.11360
Pdf URL: https://arxiv.org/pdf/2503.11360
Copy Paste: [[2503.11360]] PARIC: Probabilistic Attention Regularization for Language Guided Image Classification from Pre-trained Vison Language Models(https://arxiv.org/abs/2503.11360)
Keywords: foundation model
Abstract: Language-guided attention frameworks have significantly enhanced both interpretability and performance in image classification; however, the reliance on deterministic embeddings from pre-trained vision-language foundation models to generate reference attention maps frequently overlooks the intrinsic multivaluedness and ill-posed characteristics of cross-modal mappings. To address these limitations, we introduce PARIC, a probabilistic framework for guiding visual attention via language specifications. Our approach enables pre-trained vision-language models to generate probabilistic reference attention maps, which align textual and visual modalities more effectively while incorporating uncertainty estimates, as compared to their deterministic counterparts. Experiments on benchmark test problems demonstrate that PARIC enhances prediction accuracy, mitigates bias, ensures consistent predictions, and improves robustness across various datasets.

Title: PBR3DGen: A VLM-guided Mesh Generation with High-quality PBR Texture

Authors: Xiaokang Wei, Bowen Zhang, Xianghui Yang, Yuxuan Wang, Chunchao Guo, Xi Zhao, Yan Luximon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11368
Pdf URL: https://arxiv.org/pdf/2503.11368
Copy Paste: [[2503.11368]] PBR3DGen: A VLM-guided Mesh Generation with High-quality PBR Texture(https://arxiv.org/abs/2503.11368)
Keywords: diffusion
Abstract: Generating high-quality physically based rendering (PBR) materials is important to achieve realistic rendering in the downstream tasks, yet it remains challenging due to the intertwined effects of materials and lighting. While existing methods have made breakthroughs by incorporating material decomposition in the 3D generation pipeline, they tend to bake highlights into albedo and ignore spatially varying properties of metallicity and roughness. In this work, we present PBR3DGen, a two-stage mesh generation method with high-quality PBR materials that integrates the novel multi-view PBR material estimation model and a 3D PBR mesh reconstruction model. Specifically, PBR3DGen leverages vision language models (VLM) to guide multi-view diffusion, precisely capturing the spatial distribution and inherent attributes of reflective-metalness material. Additionally, we incorporate view-dependent illumination-aware conditions as pixel-aware priors to enhance spatially varying material properties. Furthermore, our reconstruction model reconstructs high-quality mesh with PBR materials. Experimental results demonstrate that PBR3DGen significantly outperforms existing methods, achieving new state-of-the-art results for PBR estimation and mesh generation. More results and visualization can be found on our project page: this https URL.

Title: Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding

Authors: David Gastager, Ghazal Ghazaei, Constantin Patsch
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11392
Pdf URL: https://arxiv.org/pdf/2503.11392
Copy Paste: [[2503.11392]] Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding(https://arxiv.org/abs/2503.11392)
Keywords: generative
Abstract: Automated surgical workflow analysis is crucial for education, research, and clinical decision-making, but the lack of annotated datasets hinders the development of accurate and comprehensive workflow analysis solutions. We introduce a novel approach for addressing the sparsity and heterogeneity of annotated training data inspired by the human learning procedure of watching experts and understanding their explanations. Our method leverages a video-language model trained on alignment, denoising, and generative tasks to learn short-term spatio-temporal and multimodal representations. A task-specific temporal model is then used to capture relationships across entire videos. To achieve comprehensive video-language understanding in the surgical domain, we introduce a data collection and filtering strategy to construct a large-scale pretraining dataset from educational YouTube videos. We then utilize parameter-efficient fine-tuning by projecting downstream task annotations from publicly available surgical datasets into the language domain. Extensive experiments in two surgical domains demonstrate the effectiveness of our approach, with performance improvements of up to 7% in phase segmentation tasks, 8% in zero-shot phase segmentation, and comparable capabilities to fully-supervised models in few-shot settings. Harnessing our model's capabilities for long-range temporal localization and text generation, we present the first comprehensive solution for dense video captioning (DVC) of surgical videos, addressing this task despite the absence of existing DVC datasets in the surgical domain. We introduce a novel approach to surgical workflow understanding that leverages video-language pretraining, large-scale video pretraining, and optimized fine-tuning. Our method improves performance over state-of-the-art techniques and enables new downstream tasks for surgical video understanding.

Title: Towards A Correct Usage of Cryptography in Semantic Watermarks for Diffusion Models

Authors: Jonas Thietke, Andreas Müller, Denis Lukovnikov, Asja Fischer, Erwin Quiring
Subjects: cs.CR, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.11404
Pdf URL: https://arxiv.org/pdf/2503.11404
Copy Paste: [[2503.11404]] Towards A Correct Usage of Cryptography in Semantic Watermarks for Diffusion Models(https://arxiv.org/abs/2503.11404)
Keywords: diffusion
Abstract: Semantic watermarking methods enable the direct integration of watermarks into the generation process of latent diffusion models by only modifying the initial latent noise. One line of approaches building on Gaussian Shading relies on cryptographic primitives to steer the sampling process of the latent noise. However, we identify several issues in the usage of cryptographic techniques in Gaussian Shading, particularly in its proof of lossless performance and key management, causing ambiguity in follow-up works, too. In this work, we therefore revisit the cryptographic primitives for semantic watermarking. We introduce a novel, general proof of lossless performance based on IND\$-CPA security for semantic watermarks. We then discuss the configuration of the cryptographic primitives in semantic watermarks with respect to security, efficiency, and generation quality.

Title: A Neural Network Architecture Based on Attention Gate Mechanism for 3D Magnetotelluric Forward Modeling

Authors: Xin Zhong, Weiwei Ling, Kejia Pan, Pinxia Wu, Jiajing Zhang, Zhiliang Zhan, Wenbo Xiao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11408
Pdf URL: https://arxiv.org/pdf/2503.11408
Copy Paste: [[2503.11408]] A Neural Network Architecture Based on Attention Gate Mechanism for 3D Magnetotelluric Forward Modeling(https://arxiv.org/abs/2503.11408)
Keywords: anomaly
Abstract: Traditional three-dimensional magnetotelluric (MT) numerical forward modeling methods, such as the finite element method (FEM) and finite volume method (FVM), suffer from high computational costs and low efficiency due to limitations in mesh refinement and computational resources. We propose a novel neural network architecture named MTAGU-Net, which integrates an attention gating mechanism for 3D MT forward modeling. Specifically, a dual-path attention gating module is designed based on forward response data images and embedded in the skip connections between the encoder and decoder. This module enables the fusion of critical anomaly information from shallow feature maps during the decoding of deep feature maps, significantly enhancing the network's capability to extract features from anomalous regions. Furthermore, we introduce a synthetic model generation method utilizing 3D Gaussian random field (GRF), which accurately replicates the electrical structures of real-world geological scenarios with high fidelity. Numerical experiments demonstrate that MTAGU-Net outperforms conventional 3D U-Net in terms of convergence stability and prediction accuracy, with the structural similarity index (SSIM) of the forward response data consistently exceeding 0.98. Moreover, the network can accurately predict forward response data on previously unseen datasets models, demonstrating its strong generalization ability and validating the feasibility and effectiveness of this method in practical applications.

Title: Empowering Time Series Analysis with Synthetic Data: A Survey and Outlook in the Era of Foundation Models

Authors: Xu Liu, Taha Aksu, Juncheng Liu, Qingsong Wen, Yuxuan Liang, Caiming Xiong, Silvio Savarese, Doyen Sahoo, Junnan Li, Chenghao Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11411
Pdf URL: https://arxiv.org/pdf/2503.11411
Copy Paste: [[2503.11411]] Empowering Time Series Analysis with Synthetic Data: A Survey and Outlook in the Era of Foundation Models(https://arxiv.org/abs/2503.11411)
Keywords: foundation model
Abstract: Time series analysis is crucial for understanding dynamics of complex systems. Recent advances in foundation models have led to task-agnostic Time Series Foundation Models (TSFMs) and Large Language Model-based Time Series Models (TSLLMs), enabling generalized learning and integrating contextual information. However, their success depends on large, diverse, and high-quality datasets, which are challenging to build due to regulatory, diversity, quality, and quantity constraints. Synthetic data emerge as a viable solution, addressing these challenges by offering scalable, unbiased, and high-quality alternatives. This survey provides a comprehensive review of synthetic data for TSFMs and TSLLMs, analyzing data generation strategies, their role in model pretraining, fine-tuning, and evaluation, and identifying future research directions.

Title: MTV-Inpaint: Multi-Task Long Video Inpainting

Authors: Shiyuan Yang, Zheng Gu, Liang Hou, Xin Tao, Pengfei Wan, Xiaodong Chen, Jing Liao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11412
Pdf URL: https://arxiv.org/pdf/2503.11412
Copy Paste: [[2503.11412]] MTV-Inpaint: Multi-Task Long Video Inpainting(https://arxiv.org/abs/2503.11412)
Keywords: diffusion
Abstract: Video inpainting involves modifying local regions within a video, ensuring spatial and temporal consistency. Most existing methods focus primarily on scene completion (i.e., filling missing regions) and lack the capability to insert new objects into a scene in a controllable manner. Fortunately, recent advancements in text-to-video (T2V) diffusion models pave the way for text-guided video inpainting. However, directly adapting T2V models for inpainting remains limited in unifying completion and insertion tasks, lacks input controllability, and struggles with long videos, thereby restricting their applicability and flexibility. To address these challenges, we propose MTV-Inpaint, a unified multi-task video inpainting framework capable of handling both traditional scene completion and novel object insertion tasks. To unify these distinct tasks, we design a dual-branch spatial attention mechanism in the T2V diffusion U-Net, enabling seamless integration of scene completion and object insertion within a single framework. In addition to textual guidance, MTV-Inpaint supports multimodal control by integrating various image inpainting models through our proposed image-to-video (I2V) inpainting mode. Additionally, we propose a two-stage pipeline that combines keyframe inpainting with in-between frame propagation, enabling MTV-Inpaint to effectively handle long videos with hundreds of frames. Extensive experiments demonstrate that MTV-Inpaint achieves state-of-the-art performance in both scene completion and object insertion tasks. Furthermore, it demonstrates versatility in derived applications such as multi-modal inpainting, object editing, removal, image object brush, and the ability to handle long videos. Project page: this https URL.

Title: From Generative AI to Innovative AI: An Evolutionary Roadmap

Authors: Seyed Mahmoud Sajjadi Mohammadabadi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.11419
Pdf URL: https://arxiv.org/pdf/2503.11419
Copy Paste: [[2503.11419]] From Generative AI to Innovative AI: An Evolutionary Roadmap(https://arxiv.org/abs/2503.11419)
Keywords: generative
Abstract: This paper explores the critical transition from Generative Artificial Intelligence (GenAI) to Innovative Artificial Intelligence (InAI). While recent advancements in GenAI have enabled systems to produce high-quality content across various domains, these models often lack the capacity for true innovation. In this context, innovation is defined as the ability to generate novel and useful outputs that go beyond mere replication of learned data. The paper examines this shift and proposes a roadmap for developing AI systems that can generate content and engage in autonomous problem-solving and creative ideation. The work provides both theoretical insights and practical strategies for advancing AI to a stage where it can genuinely innovate, contributing meaningfully to science, technology, and the arts.

Title: TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation

Authors: Hongxiang Zhao, Xingchen Liu, Mutian Xu, Yiming Hao, Weikai Chen, Xiaoguang Han
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.11423
Pdf URL: https://arxiv.org/pdf/2503.11423
Copy Paste: [[2503.11423]] TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation(https://arxiv.org/abs/2503.11423)
Keywords: diffusion
Abstract: We address key limitations in existing datasets and models for task-oriented hand-object interaction video generation, a critical approach of generating video demonstrations for robotic imitation learning. Current datasets, such as Ego4D, often suffer from inconsistent view perspectives and misaligned interactions, leading to reduced video quality and limiting their applicability for precise imitation learning tasks. Towards this end, we introduce TASTE-Rob -- a pioneering large-scale dataset of 100,856 ego-centric hand-object interaction videos. Each video is meticulously aligned with language instructions and recorded from a consistent camera viewpoint to ensure interaction clarity. By fine-tuning a Video Diffusion Model (VDM) on TASTE-Rob, we achieve realistic object interactions, though we observed occasional inconsistencies in hand grasping postures. To enhance realism, we introduce a three-stage pose-refinement pipeline that improves hand posture accuracy in generated videos. Our curated dataset, coupled with the specialized pose-refinement framework, provides notable performance gains in generating high-quality, task-oriented hand-object interaction videos, resulting in achieving superior generalizable robotic manipulation. The TASTE-Rob dataset will be made publicly available upon publication to foster further advancements in the field.

Title: Text Compression for Efficient Language Generation

Authors: David Gu, Peter Belcak, Roger Wattenhofer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.11426
Pdf URL: https://arxiv.org/pdf/2503.11426
Copy Paste: [[2503.11426]] Text Compression for Efficient Language Generation(https://arxiv.org/abs/2503.11426)
Keywords: generative
Abstract: We challenge the prevailing assumption that LLMs must rely fully on sub-word tokens for high-quality text generation. To this end, we propose the "Generative Pretrained Thoughtformer" (GPTHF), a hierarchical transformer language model capable of text generation by compressing text into sentence embeddings and employing a sentence attention mechanism. GPTHF retains GPT's architecture, modifying only token interactions via dynamic sparse attention masks. Our experiments show that GPTHF achieves an up to an order of magnitude improvement in FLOPs efficiency and a threefold increase in runtime speed compared to equally-sized GPT models in the low-size regime. This is achieved through a unique generation method that caches and reuses sentence embeddings, allowing significant portions of the input to bypass large parts of the network.

Title: Remote Photoplethysmography in Real-World and Extreme Lighting Scenarios

Authors: Hang Shao, Lei Luo, Jianjun Qian, Mengkai Yan, Shuo Chen, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11465
Pdf URL: https://arxiv.org/pdf/2503.11465
Copy Paste: [[2503.11465]] Remote Photoplethysmography in Real-World and Extreme Lighting Scenarios(https://arxiv.org/abs/2503.11465)
Keywords: self-supervised
Abstract: Physiological activities can be manifested by the sensitive changes in facial imaging. While they are barely observable to our eyes, computer vision manners can, and the derived remote photoplethysmography (rPPG) has shown considerable promise. However, existing studies mainly rely on spatial skin recognition and temporal rhythmic interactions, so they focus on identifying explicit features under ideal light conditions, but perform poorly in-the-wild with intricate obstacles and extreme illumination exposure. In this paper, we propose an end-to-end video transformer model for rPPG. It strives to eliminate complex and unknown external time-varying interferences, whether they are sufficient to occupy subtle biosignal amplitudes or exist as periodic perturbations that hinder network training. In the specific implementation, we utilize global interference sharing, subject background reference, and self-supervised disentanglement to eliminate interference, and further guide learning based on spatiotemporal filtering, reconstruction guidance, and frequency domain and biological prior constraints to achieve effective rPPG. To the best of our knowledge, this is the first robust rPPG model for real outdoor scenarios based on natural face videos, and is lightweight to deploy. Extensive experiments show the competitiveness and performance of our model in rPPG prediction across datasets and scenes.

Title: T2I-FineEval: Fine-Grained Compositional Metric for Text-to-Image Evaluation

Authors: Seyed Mohammad Hadi Hosseini, Amir Mohammad Izadi, Ali Abdollahi, Armin Saghafian, Mahdieh Soleymani Baghshah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11481
Pdf URL: https://arxiv.org/pdf/2503.11481
Copy Paste: [[2503.11481]] T2I-FineEval: Fine-Grained Compositional Metric for Text-to-Image Evaluation(https://arxiv.org/abs/2503.11481)
Keywords: generative
Abstract: Although recent text-to-image generative models have achieved impressive performance, they still often struggle with capturing the compositional complexities of prompts including attribute binding, and spatial relationships between different entities. This misalignment is not revealed by common evaluation metrics such as CLIPScore. Recent works have proposed evaluation metrics that utilize Visual Question Answering (VQA) by decomposing prompts into questions about the generated image for more robust compositional evaluation. Although these methods align better with human evaluations, they still fail to fully cover the compositionality within the image. To address this, we propose a novel metric that breaks down images into components, and texts into fine-grained questions about the generated image for evaluation. Our method outperforms previous state-of-the-art metrics, demonstrating its effectiveness in evaluating text-to-image generative models. Code is available at this https URL T2I-FineEval.

Title: Unicorn: A Universal and Collaborative Reinforcement Learning Approach Towards Generalizable Network-Wide Traffic Signal Control

Authors: Yifeng Zhang, Yilin Liu, Ping Gong, Peizhuo Li, Mingfeng Fan, Guillaume Sartoretti
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.11488
Pdf URL: https://arxiv.org/pdf/2503.11488
Copy Paste: [[2503.11488]] Unicorn: A Universal and Collaborative Reinforcement Learning Approach Towards Generalizable Network-Wide Traffic Signal Control(https://arxiv.org/abs/2503.11488)
Keywords: self-supervised
Abstract: Adaptive traffic signal control (ATSC) is crucial in reducing congestion, maximizing throughput, and improving mobility in rapidly growing urban areas. Recent advancements in parameter-sharing multi-agent reinforcement learning (MARL) have greatly enhanced the scalable and adaptive optimization of complex, dynamic flows in large-scale homogeneous networks. However, the inherent heterogeneity of real-world traffic networks, with their varied intersection topologies and interaction dynamics, poses substantial challenges to achieving scalable and effective ATSC across different traffic scenarios. To address these challenges, we present Unicorn, a universal and collaborative MARL framework designed for efficient and adaptable network-wide ATSC. Specifically, we first propose a unified approach to map the states and actions of intersections with varying topologies into a common structure based on traffic movements. Next, we design a Universal Traffic Representation (UTR) module with a decoder-only network for general feature extraction, enhancing the model's adaptability to diverse traffic scenarios. Additionally, we incorporate an Intersection Specifics Representation (ISR) module, designed to identify key latent vectors that represent the unique intersection's topology and traffic dynamics through variational inference techniques. To further refine these latent representations, we employ a contrastive learning approach in a self-supervised manner, which enables better differentiation of intersection-specific features. Moreover, we integrate the state-action dependencies of neighboring agents into policy optimization, which effectively captures dynamic agent interactions and facilitates efficient regional collaboration. Our results show that Unicorn outperforms other methods across various evaluation metrics, highlighting its potential in complex, dynamic traffic networks.

Title: TikZero: Zero-Shot Text-Guided Graphics Program Synthesis

Authors: Jonas Belouadi, Eddy Ilg, Margret Keuper, Hideki Tanaka, Masao Utiyama, Raj Dabre, Steffen Eger, Simone Paolo Ponzetto
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.11509
Pdf URL: https://arxiv.org/pdf/2503.11509
Copy Paste: [[2503.11509]] TikZero: Zero-Shot Text-Guided Graphics Program Synthesis(https://arxiv.org/abs/2503.11509)
Keywords: generative
Abstract: With the rise of generative AI, synthesizing figures from text captions becomes a compelling application. However, achieving high geometric precision and editability requires representing figures as graphics programs in languages like TikZ, and aligned training data (i.e., graphics programs with captions) remains scarce. Meanwhile, large amounts of unaligned graphics programs and captioned raster images are more readily available. We reconcile these disparate data sources by presenting TikZero, which decouples graphics program generation from text understanding by using image representations as an intermediary bridge. It enables independent training on graphics programs and captioned images and allows for zero-shot text-guided graphics program synthesis during inference. We show that our method substantially outperforms baselines that can only operate with caption-aligned graphics programs. Furthermore, when leveraging caption-aligned graphics programs as a complementary training signal, TikZero matches or exceeds the performance of much larger models, including commercial systems like GPT-4o. Our code, datasets, and select models are publicly available.

Title: Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models

Authors: Hao Cheng, Erjia Xiao, Yichi Wang, Kaidi Xu, Mengshu Sun, Jindong Gu, Renjing Xu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.11519
Pdf URL: https://arxiv.org/pdf/2503.11519
Copy Paste: [[2503.11519]] Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models(https://arxiv.org/abs/2503.11519)
Keywords: generative
Abstract: Current Cross-Modality Generation Models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of vision modality inputs in real-world scenarios, Cross-vision, encompassing Vision-Language Perception (VLP) and Image-to-Image (I2I), tasks have attracted significant attention. Large Vision Language Models (LVLMs) and I2I GMs are employed to handle VLP and I2I tasks, respectively. Previous research indicates that printing typographic words into input images significantly induces LVLMs and I2I GMs to generate disruptive outputs semantically related to those words. Additionally, visual prompts, as a more sophisticated form of typography, are also revealed to pose security risks to various applications of VLP tasks when injected into images. In this paper, we comprehensively investigate the performance impact induced by Typographic Visual Prompt Injection (TVPI) in various LVLMs and I2I GMs. To better observe performance modifications and characteristics of this threat, we also introduce the TVPI Dataset. Through extensive explorations, we deepen the understanding of the underlying causes of the TVPI threat in various GMs and offer valuable insights into its potential origins.

Title: Bottom-up Iterative Anomalous Diffusion Detector (BI-ADD)

Authors: Junwoo Park, Nataliya Sokolovska, Clément Cabriel, Ignacio Izeddin, Judith Miné-Hattab
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.11529
Pdf URL: https://arxiv.org/pdf/2503.11529
Copy Paste: [[2503.11529]] Bottom-up Iterative Anomalous Diffusion Detector (BI-ADD)(https://arxiv.org/abs/2503.11529)
Keywords: diffusion
Abstract: In recent years, the segmentation of short molecular trajectories with varying diffusive properties has drawn particular attention of researchers, since it allows studying the dynamics of a particle. In the past decade, machine learning methods have shown highly promising results, also in changepoint detection and segmentation tasks. Here, we introduce a novel iterative method to identify the changepoints in a molecular trajectory, i.e., frames, where the diffusive behavior of a particle changes. A trajectory in our case follows a fractional Brownian motion and we estimate the diffusive properties of the trajectories. The proposed BI-ADD combines unsupervised and supervised learning methods to detect the changepoints. Our approach can be used for the analysis of molecular trajectories at the individual level and also be extended to multiple particle tracking, which is an important challenge in fundamental biology. We validated BI-ADD in various scenarios within the framework of the AnDi2 Challenge 2024 dedicated to single particle tracking. Our method is implemented in Python and is publicly available for research purposes.

Title: AugGen: Synthetic Augmentation Can Improve Discriminative Models

Authors: Parsa Rahimi, Damien Teney, Sebastien Marcel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11544
Pdf URL: https://arxiv.org/pdf/2503.11544
Copy Paste: [[2503.11544]] AugGen: Synthetic Augmentation Can Improve Discriminative Models(https://arxiv.org/abs/2503.11544)
Keywords: generative
Abstract: The increasing dependence on large-scale datasets in machine learning introduces significant privacy and ethical challenges. Synthetic data generation offers a promising solution; however, most current methods rely on external datasets or pre-trained models, which add complexity and escalate resource demands. In this work, we introduce a novel self-contained synthetic augmentation technique that strategically samples from a conditional generative model trained exclusively on the target dataset. This approach eliminates the need for auxiliary data sources. Applied to face recognition datasets, our method achieves 1--12\% performance improvements on the IJB-C and IJB-B benchmarks. It outperforms models trained solely on real data and exceeds the performance of state-of-the-art synthetic data generation baselines. Notably, these enhancements often surpass those achieved through architectural improvements, underscoring the significant impact of synthetic augmentation in data-scarce environments. These findings demonstrate that carefully integrated synthetic data not only addresses privacy and resource constraints but also substantially boosts model performance. Project page this https URL

Title: Advancing 3D Gaussian Splatting Editing with Complementary and Consensus Information

Authors: Xuanqi Zhang, Jieun Lee, Chris Joslin, Wonsook Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11601
Pdf URL: https://arxiv.org/pdf/2503.11601
Copy Paste: [[2503.11601]] Advancing 3D Gaussian Splatting Editing with Complementary and Consensus Information(https://arxiv.org/abs/2503.11601)
Keywords: diffusion
Abstract: We present a novel framework for enhancing the visual fidelity and consistency of text-guided 3D Gaussian Splatting (3DGS) editing. Existing editing approaches face two critical challenges: inconsistent geometric reconstructions across multiple viewpoints, particularly in challenging camera positions, and ineffective utilization of depth information during image manipulation, resulting in over-texture artifacts and degraded object boundaries. To address these limitations, we introduce: 1) A complementary information mutual learning network that enhances depth map estimation from 3DGS, enabling precise depth-conditioned 3D editing while preserving geometric structures. 2) A wavelet consensus attention mechanism that effectively aligns latent codes during the diffusion denoising process, ensuring multi-view consistency in the edited results. Through extensive experimentation, our method demonstrates superior performance in rendering quality and view consistency compared to state-of-the-art approaches. The results validate our framework as an effective solution for text-guided editing of 3D scenes.

Title: From Denoising Score Matching to Langevin Sampling: A Fine-Grained Error Analysis in the Gaussian Setting

Authors: Samuel Hurault, Matthieu Terris, Thomas Moreau, Gabriel Peyré
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2503.11615
Pdf URL: https://arxiv.org/pdf/2503.11615
Copy Paste: [[2503.11615]] From Denoising Score Matching to Langevin Sampling: A Fine-Grained Error Analysis in the Gaussian Setting(https://arxiv.org/abs/2503.11615)
Keywords: diffusion, generative
Abstract: Sampling from an unknown distribution, accessible only through discrete samples, is a fundamental problem at the core of generative AI. The current state-of-the-art methods follow a two-step process: first estimating the score function (the gradient of a smoothed log-distribution) and then applying a gradient-based sampling algorithm. The resulting distribution's correctness can be impacted by several factors: the generalization error due to a finite number of initial samples, the error in score matching, and the diffusion error introduced by the sampling algorithm. In this paper, we analyze the sampling process in a simple yet representative setting-sampling from Gaussian distributions using a Langevin diffusion sampler. We provide a sharp analysis of the Wasserstein sampling error that arises from the multiple sources of error throughout the pipeline. This allows us to rigorously track how the anisotropy of the data distribution (encoded by its power spectrum) interacts with key parameters of the end-to-end sampling method, including the noise amplitude, the step sizes in both score matching and diffusion, and the number of initial samples. Notably, we show that the Wasserstein sampling error can be expressed as a kernel-type norm of the data power spectrum, where the specific kernel depends on the method parameters. This result provides a foundation for further analysis of the tradeoffs involved in optimizing sampling accuracy, such as adapting the noise amplitude to the choice of step sizes.

Title: ReCamMaster: Camera-Controlled Generative Rendering from A Single Video

Authors: Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, Di Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.11647
Pdf URL: https://arxiv.org/pdf/2503.11647
Copy Paste: [[2503.11647]] ReCamMaster: Camera-Controlled Generative Rendering from A Single Video(https://arxiv.org/abs/2503.11647)
Keywords: generative
Abstract: Camera control has been actively studied in text or image conditioned video generation tasks. However, altering camera trajectories of a given video remains under-explored, despite its importance in the field of video creation. It is non-trivial due to the extra constraints of maintaining multiple-frame appearance and dynamic synchronization. To address this, we present ReCamMaster, a camera-controlled generative video re-rendering framework that reproduces the dynamic scene of an input video at novel camera trajectories. The core innovation lies in harnessing the generative capabilities of pre-trained text-to-video models through a simple yet powerful video conditioning mechanism -- its capability often overlooked in current research. To overcome the scarcity of qualified training data, we construct a comprehensive multi-camera synchronized video dataset using Unreal Engine 5, which is carefully curated to follow real-world filming characteristics, covering diverse scenes and camera movements. It helps the model generalize to in-the-wild videos. Lastly, we further improve the robustness to diverse inputs through a meticulously designed training strategy. Extensive experiments tell that our method substantially outperforms existing state-of-the-art approaches and strong baselines. Our method also finds promising applications in video stabilization, super-resolution, and outpainting. Project page: this https URL