2025-04-01

Title: A Novel Chaos-Based Cryptographic Scrambling Technique to Secure Medical Images

Authors: Chandra Sekhar Sanaboina
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.22683
Pdf URL: https://arxiv.org/pdf/2503.22683
Copy Paste: [[2503.22683]] A Novel Chaos-Based Cryptographic Scrambling Technique to Secure Medical Images(https://arxiv.org/abs/2503.22683)
Keywords: diffusion
Abstract: These days, a tremendous quantity of digital visual data is sent over many networks and stored in many different formats. This visual information is usually very confidential and financially rewarding. Maintaining safe transmission of data is crucial, as is the use of approaches to offer security features like privacy, integrity, or authentication that are tailored to certain types of data. Protecting sensitive medical images stored in electronic health records is the focus of this article, which proposes a technique of encryption and decryption. In order to safeguard image-based programs, encryption methods are applied. Privacy, integrity, and authenticity are only a few of the security elements investigated by the proposed system, which encrypts medical pictures using chaos maps. In all stages of the protocol, the suggested chaos-based data scrambling method is employed to mitigate the shortcomings of traditional confusion and diffusion designs. Bifurcation charts, Lyapunov exponents, tests for mean squared error and peak-to-average signal-to-noise ratio, and histogram analysis are only some of the tools we use to investigate the suggested system's chaotic behavior.
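
The abstract does not say which chaos map or scrambling rule the paper uses, so the following is only a minimal sketch under stated assumptions: a logistic map (an assumption, chosen because it is the textbook chaotic map) generates a keystream that is XORed with the pixel bytes, and mean squared error and a standard peak signal-to-noise ratio are computed between the original and scrambled data. The function names and the key values `x0` and `r` are hypothetical, and this toy scheme is not a secure cipher.

```python
import math


def logistic_keystream(x0: float, r: float, n: int, burn_in: int = 100) -> bytes:
    """Return n bytes derived from the logistic map x <- r * x * (1 - x)."""
    x = x0
    for _ in range(burn_in):  # discard the initial transient iterations
        x = r * x * (1.0 - x)
    stream = bytearray()
    for _ in range(n):
        x = r * x * (1.0 - x)
        stream.append(int(x * 256) % 256)  # quantize the state to one byte
    return bytes(stream)


def xor_scramble(pixels: bytes, x0: float, r: float) -> bytes:
    """XOR pixel bytes with the keystream; applying it twice restores the input."""
    keystream = logistic_keystream(x0, r, len(pixels))
    return bytes(p ^ k for p, k in zip(pixels, keystream))


def mse(a: bytes, b: bytes) -> float:
    """Mean squared error between two equally long byte sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a: bytes, b: bytes, peak: int = 255) -> float:
    """Peak signal-to-noise ratio in dB; infinite when the inputs are identical."""
    err = mse(a, b)
    return float("inf") if err == 0 else 10.0 * math.log10(peak ** 2 / err)


if __name__ == "__main__":
    image = bytes(range(256))       # stand-in for a flattened 8-bit grayscale image
    key_x0, key_r = 0.3141, 3.99    # hypothetical secret key (initial state, map parameter)
    scrambled = xor_scramble(image, key_x0, key_r)
    restored = xor_scramble(scrambled, key_x0, key_r)
    print("round trip ok:", restored == image)
    print("MSE original vs scrambled:", mse(image, scrambled))
    print("PSNR original vs scrambled (dB):", psnr(image, scrambled))
```

A low PSNR between the original and the scrambled image, together with a lossless round trip, is the kind of evidence such evaluations typically look for.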

Title: A Spatial-temporal Deep Probabilistic Diffusion Model for Reliable Hail Nowcasting with Radar Echo Extrapolation

Authors: Haonan Shi, Long Tian, Jie Tao, Yufei Li, Liming Wang, Xiyang Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.22724
Pdf URL: https://arxiv.org/pdf/2503.22724
Copy Paste: [[2503.22724]] A Spatial-temporal Deep Probabilistic Diffusion Model for Reliable Hail Nowcasting with Radar Echo Extrapolation(https://arxiv.org/abs/2503.22724)
Keywords: diffusion, generative
Abstract: Hail is a considerable contributor to meteorological disasters, and there is a great need to mitigate its socioeconomic effects through precise forecasts that have high resolution, long lead times, and local details over large landscapes. Existing medium-range weather forecasting methods primarily rely on changes in upper air currents and cloud layers to predict precipitation events, such as heavy rainfall, which are unsuitable for hail nowcasting since it is mainly caused by low-altitude local strong convection associated with terrains. Additionally, radar captures the status of low cloud layers, such as water vapor, droplets, and ice crystals, providing rich signals suitable for hail nowcasting. To this end, we introduce a Spatial-Temporal gEnerAtive Model called SteamCast for hail nowcasting with radar echo extrapolation. It is a deep probabilistic diffusion model based on spatial-temporal representations including radar echoes as well as their position/time embeddings, which we trained on a historical reanalysis archive from Yan'an Meteorological Bureau in China, where crops such as apples suffer greatly from hail damage. Considering the short-term nature of hail, SteamCast provides 30-minute nowcasts at 6-minute intervals for a single radar reflectivity variable, across 9 different vertical angles, on a latitude-longitude grid with approximately 1 km * 1 km resolution per pixel in Yan'an City, China. By successfully fusing the spatial-temporal features of radar echoes, SteamCast delivers competitive, and in some cases superior, results compared to other deep learning-based models such as PredRNN and VMRNN.

Title: Reasoning Beyond Limits: Advances and Open Problems for LLMs

Authors: Mohamed Amine Ferrag, Norbert Tihanyi, Merouane Debbah
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.22732
Pdf URL: https://arxiv.org/pdf/2503.22732
Copy Paste: [[2503.22732]] Reasoning Beyond Limits: Advances and Open Problems for LLMs(https://arxiv.org/abs/2503.22732)
Keywords: generative
Abstract: Recent generative reasoning breakthroughs have transformed how large language models (LLMs) tackle complex problems by dynamically retrieving and refining information while generating coherent, multi-step thought processes. Techniques such as inference-time scaling, reinforcement learning, supervised fine-tuning, and distillation have been successfully applied to models like DeepSeek-R1, OpenAI's o1 & o3, GPT-4o, Qwen-32B, and various Llama variants, resulting in enhanced reasoning capabilities. In this paper, we provide a comprehensive analysis of the top 27 LLM models released between 2023 and 2025 (including models such as Mistral AI Small 3 24B, DeepSeek-R1, Search-o1, QwQ-32B, and phi-4). Then, we present an extensive overview of training methodologies that spans general training approaches, mixture-of-experts (MoE) and architectural innovations, retrieval-augmented generation (RAG), chain-of-thought and self-improvement techniques, as well as test-time compute scaling, distillation, and reinforcement learning (RL) methods. Finally, we discuss the key challenges in advancing LLM capabilities, including improving multi-step reasoning without human supervision, overcoming limitations in chained tasks, balancing structured prompts with flexibility, and enhancing long-context retrieval and external tool integration.

Title: Cyborg Data: Merging Human with AI Generated Training Data

Authors: Kai North, Christopher Ormerod
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.22736
Pdf URL: https://arxiv.org/pdf/2503.22736
Copy Paste: [[2503.22736]] Cyborg Data: Merging Human with AI Generated Training Data(https://arxiv.org/abs/2503.22736)
Keywords: generative
Abstract: Automated scoring (AS) systems used in large-scale assessment have traditionally used small statistical models that require a large quantity of hand-scored data to make accurate predictions, which can be time-consuming and costly. Generative Large Language Models are trained on many tasks and have shown impressive abilities to generalize to new tasks with little to no data. While these models require substantially more computational power to make predictions, they still require some fine-tuning to meet operational standards. Evidence suggests that these models can exceed human-human levels of agreement even when fine-tuned on small amounts of data. With this in mind, we propose a model distillation pipeline in which a large generative model, a Teacher, teaches a much smaller model, a Student. The Teacher, trained on a small subset of the training data, is used to provide scores on the remaining training data, which is then used to train the Student. We call the resulting dataset "Cyborg Data", as it combines human and machine-scored responses. Our findings show that Student models trained on "Cyborg Data" show performance comparable to training on the entire dataset, while only requiring 10% of the original hand-scored data.
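
The data-assembly step described above can be sketched as follows. This is an illustrative outline only, not the authors' pipeline: `build_cyborg_data` and `teacher_score` are hypothetical names, the Teacher is stood in for by an arbitrary callable, and the 10% split simply mirrors the figure quoted in the abstract.

```python
from typing import Callable, List, Tuple


def build_cyborg_data(
    responses: List[str],
    human_scores: List[int],
    teacher_score: Callable[[str], int],
    hand_scored_fraction: float = 0.1,
) -> List[Tuple[str, int, str]]:
    """Keep human scores for a small subset and Teacher scores for the rest.

    Returns (response, score, source) triples, where source is "human" or
    "teacher". In a real pipeline the Teacher would first be fine-tuned on the
    hand-scored subset; here it is just a callable supplied by the caller.
    """
    n_human = max(1, int(len(responses) * hand_scored_fraction))
    data = []
    for i, text in enumerate(responses):
        if i < n_human:
            data.append((text, human_scores[i], "human"))
        else:
            data.append((text, teacher_score(text), "teacher"))
    return data


if __name__ == "__main__":
    essays = [f"response {i}" for i in range(10)]
    hand_scores = [i % 4 for i in range(10)]

    def toy_teacher(text: str) -> int:
        # Placeholder for a fine-tuned generative model: always predicts 2.
        return 2

    for row in build_cyborg_data(essays, hand_scores, toy_teacher):
        print(row)
```

The resulting mixed dataset is what a smaller Student model would then be trained on.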

Title: ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning

Authors: Zhaorun Chen, Mintong Kang, Bo Li
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2503.22738
Pdf URL: https://arxiv.org/pdf/2503.22738
Copy Paste: [[2503.22738]] ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning(https://arxiv.org/abs/2503.22738)
Keywords: foundation model
Abstract: Autonomous agents powered by foundation models have seen widespread adoption across various real-world applications. However, they remain highly vulnerable to malicious instructions and attacks, which can result in severe consequences such as privacy breaches and financial losses. More critically, existing guardrails for LLMs are not applicable due to the complex and dynamic nature of agents. To tackle these challenges, we propose ShieldAgent, the first guardrail agent designed to enforce explicit safety policy compliance for the action trajectory of other protected agents through logical reasoning. Specifically, ShieldAgent first constructs a safety policy model by extracting verifiable rules from policy documents and structuring them into a set of action-based probabilistic rule circuits. Given the action trajectory of the protected agent, ShieldAgent retrieves relevant rule circuits and generates a shielding plan, leveraging its comprehensive tool library and executable code for formal verification. In addition, given the lack of guardrail benchmarks for agents, we introduce ShieldAgent-Bench, a dataset with 3K safety-related pairs of agent instructions and action trajectories, collected via SOTA attacks across 6 web environments and 7 risk categories. Experiments show that ShieldAgent achieves SOTA on ShieldAgent-Bench and three existing benchmarks, outperforming prior methods by 11.3% on average with a high recall of 90.1%. Additionally, ShieldAgent reduces API queries by 64.7% and inference time by 58.2%, demonstrating its high precision and efficiency in safeguarding agents.

Title: Adaptive State-Space Mamba for Real-Time Sensor Data Anomaly Detection

Authors: Alice Zhang, Chao Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.22743
Pdf URL: https://arxiv.org/pdf/2503.22743
Copy Paste: [[2503.22743]] Adaptive State-Space Mamba for Real-Time Sensor Data Anomaly Detection(https://arxiv.org/abs/2503.22743)
Keywords: anomaly
Abstract: State-space modeling has emerged as a powerful paradigm for sequence analysis in various tasks such as natural language processing, time-series forecasting, and signal processing. In this work, we propose an \emph{Adaptive State-Space Mamba} (\textbf{ASSM}) framework for real-time sensor data anomaly detection. While state-space models have been previously employed for image processing applications (e.g., style transfer \cite{wang2024stylemamba}), our approach leverages the core idea of sequential hidden states to tackle a significantly different domain: detecting anomalies on streaming sensor data. In particular, we introduce an adaptive gating mechanism that dynamically modulates the hidden state update based on contextual and learned statistical cues. This design ensures that our model remains computationally efficient and scalable, even under rapid data arrival rates. Extensive experiments on real-world and synthetic sensor datasets demonstrate that our method achieves superior detection performance compared to existing baselines. Our approach is easily extensible to other time-series tasks that demand rapid and reliable detection capabilities.

Title: LeForecast: Enterprise Hybrid Forecast by Time Series Intelligence

Authors: Zheng Tan, Yiwen Nie, Wenfa Wu, Guanyu Zhang, Yanze Liu, Xinyuan Tian, Kailin Gao, Mengya Liu, Qijiang Cheng, Haipeng Jiang, Yingzheng Ma, Wei Zheng, Yuci Zhu, Yuanyuan Sun, Xiangyu Lei, Xiyu Guan, Wanqing Huang, Shouming Liu, Xiangquan Meng, Pengzhan Qu, Chao Yang, Jiaxuan Fan, Yuan He, Hongsheng Qi, Yangzhou Du
Subjects: cs.LG, cs.AI, cs.ET
Abstract URL: https://arxiv.org/abs/2503.22747
Pdf URL: https://arxiv.org/pdf/2503.22747
Copy Paste: [[2503.22747]] LeForecast: Enterprise Hybrid Forecast by Time Series Intelligence(https://arxiv.org/abs/2503.22747)
Keywords: foundation model
Abstract: Demand is spiking in industrial fields for multidisciplinary forecasting, where a broad spectrum of sectors needs planning and forecasts to streamline intelligent business management, such as demand forecasting, product planning, inventory optimization, etc. Specifically, these tasks expect intelligent approaches to learn from sequentially collected historical data and then foresee the most likely trend, i.e., time series forecasting. The challenge lies in interpreting complex business contexts and in the efficiency and generalisation of modelling. With aspirations of pre-trained foundational models for such purposes, given the remarkable success of large foundation models across legions of tasks, we disseminate LeForecast, an enterprise intelligence platform tailored for time series tasks. It integrates advanced interpretations of time series data and multi-source information, and a three-pillar modelling engine combining a large foundation model (Le-TSFM), multimodal model and hybrid model to derive insights, predict or infer futures, and then drive optimisation across multiple sectors in enterprise operations. The framework is composed of a model pool, a model profiling module, and two different fusion approaches regarding original model architectures. Experimental results verify the efficiency of our trail fusion concepts: router-based fusion network and coordination of large and small models, resulting in high costs for redundant development and maintenance of models. This work reviews deployment of LeForecast and its performance in three industrial use cases. Our comprehensive experiments indicate that LeForecast is a profound and practical platform for efficient and competitive performance. And we do hope that this work can enlighten the research and grounding of time series techniques in accelerating enterprise.

Title: Ignite Forecasting with SPARK: An Efficient Generative Framework for Refining LLMs in Temporal Knowledge Graph Forecasting

Authors: Gongzhu Yin, Hongli Zhang, Yi Luo, Yuchen Yang, Kun Lu, Chao Meng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.22748
Pdf URL: https://arxiv.org/pdf/2503.22748
Copy Paste: [[2503.22748]] Ignite Forecasting with SPARK: An Efficient Generative Framework for Refining LLMs in Temporal Knowledge Graph Forecasting(https://arxiv.org/abs/2503.22748)
Keywords: generative
Abstract: Temporal Knowledge Graph (TKG) forecasting is crucial for predicting future events using historical data. With the surge of Large Language Models (LLMs), recent studies have begun exploring their integration into TKG forecasting and achieved some success. However, they still face limitations such as limited input length, inefficient output generation, and resource-intensive refinement, which undermine their performance and practical applicability. To address these limitations, we introduce SPARK, a Sequence-level Proxy-Adapting framework for Refining LLMs in TKG forecasting. Inspired by inference-time algorithms adopted in controlling generation, SPARK offers a cost-effective, plug-and-play solution through two key innovations: (1) Beam Sequence-Level Generation, which reframes TKG forecasting as a top-K sequence-level generation task, using beam search for efficiently generating next-entity distribution in a single forward pass. (2) TKG Adapter for Refinement, which employs traditional TKG models as trainable proxy adapters to leverage global graph information and refine LLM outputs, overcoming both the input length and the resource-intensive fine-tuning problems. Experiments across diverse datasets validate SPARK's forecasting performance, robust generalization capabilities, and high efficiency. We release source codes at this https URL.

Title: Patronus: Bringing Transparency to Diffusion Models with Prototypes

Authors: Nina Weng, Aasa Feragen, Siavash Bigdeli
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.22782
Pdf URL: https://arxiv.org/pdf/2503.22782
Copy Paste: [[2503.22782]] Patronus: Bringing Transparency to Diffusion Models with Prototypes(https://arxiv.org/abs/2503.22782)
Keywords: diffusion, generative
Abstract: Diffusion-based generative models, such as Denoising Diffusion Probabilistic Models (DDPMs), have achieved remarkable success in image generation, but their step-by-step denoising process remains opaque, leaving critical aspects of the generation mechanism unexplained. To address this, we introduce Patronus, an interpretable diffusion model inspired by ProtoPNet. Patronus integrates a prototypical network into DDPMs, enabling the extraction of prototypes and conditioning of the generation process on their prototype activation vector. This design enhances interpretability by showing the learned prototypes and how they influence the generation process. Additionally, the model supports downstream tasks like image manipulation, enabling more transparent and controlled modifications. Moreover, Patronus could reveal shortcut learning in the generation process by detecting unwanted correlations between learned prototypes. Notably, Patronus operates entirely without any annotations or text prompts. This work opens new avenues for understanding and controlling diffusion models through prototype-based interpretability. Our code is available at this https URL.

Title: DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers

Authors: Hanling Zhang, Rundong Su, Zhihang Yuan, Pengtao Chen, Mingzhu Shen, Yibo Fan, Shengen Yan, Guohao Dai, Yu Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.22796
Pdf URL: https://arxiv.org/pdf/2503.22796
Copy Paste: [[2503.22796]] DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers(https://arxiv.org/abs/2503.22796)
Keywords: diffusion
Abstract: Text-to-image generation models, especially Multimodal Diffusion Transformers (MMDiT), have shown remarkable progress in generating high-quality images. However, these models often face significant computational bottlenecks, particularly in attention mechanisms, which hinder their scalability and efficiency. In this paper, we introduce DiTFastAttnV2, a post-training compression method designed to accelerate attention in MMDiT. Through an in-depth analysis of MMDiT's attention patterns, we identify key differences from prior DiT-based methods and propose head-wise arrow attention and caching mechanisms to dynamically adjust attention heads, effectively bridging this gap. We also design an Efficient Fused Kernel for further acceleration. By leveraging local metric methods and optimization techniques, our approach significantly reduces the search time for optimal compression schemes to just minutes while maintaining generation quality. Furthermore, with the customized kernel, DiTFastAttnV2 achieves a 68% reduction in attention FLOPs and 1.5x end-to-end speedup on 2K image generation without compromising visual fidelity.

Title: Zero-shot Domain Generalization of Foundational Models for 3D Medical Image Segmentation: An Experimental Study

Authors: Soumitri Chattopadhyay, Basar Demir, Marc Niethammer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.22862
Pdf URL: https://arxiv.org/pdf/2503.22862
Copy Paste: [[2503.22862]] Zero-shot Domain Generalization of Foundational Models for 3D Medical Image Segmentation: An Experimental Study(https://arxiv.org/abs/2503.22862)
Keywords: foundation model
Abstract: Domain shift, caused by variations in imaging modalities and acquisition protocols, limits model generalization in medical image segmentation. While foundation models (FMs) trained on diverse large-scale data hold promise for zero-shot generalization, their application to volumetric medical data remains underexplored. In this study, we examine their ability towards domain generalization (DG), by conducting a comprehensive experimental study encompassing 6 medical segmentation FMs and 12 public datasets spanning multiple modalities and anatomies. Our findings reveal the potential of promptable FMs in bridging the domain gap via smart prompting techniques. Additionally, by probing into multiple facets of zero-shot DG, we offer valuable insights into the viability of FMs for DG and identify promising avenues for future research.

Title: SIGHT: Single-Image Conditioned Generation of Hand Trajectories for Hand-Object Interaction

Authors: Alexey Gavryushin, Florian Redhardt, Gaia Di Lorenzo, Luc Van Gool, Marc Pollefeys, Kaichun Mo, Xi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.22869
Pdf URL: https://arxiv.org/pdf/2503.22869
Copy Paste: [[2503.22869]] SIGHT: Single-Image Conditioned Generation of Hand Trajectories for Hand-Object Interaction(https://arxiv.org/abs/2503.22869)
Keywords: diffusion
Abstract: We introduce a novel task of generating realistic and diverse 3D hand trajectories given a single image of an object, which could be involved in a hand-object interaction scene or pictured by itself. When humans grasp an object, appropriate trajectories naturally form in our minds to use it for specific tasks. Hand-object interaction trajectory priors can greatly benefit applications in robotics, embodied AI, augmented reality and related fields. However, synthesizing realistic and appropriate hand trajectories given a single object or hand-object interaction image is a highly ambiguous task, requiring to correctly identify the object of interest and possibly even the correct interaction among many possible alternatives. To tackle this challenging problem, we propose the SIGHT-Fusion system, consisting of a curated pipeline for extracting visual features of hand-object interaction details from egocentric videos involving object manipulation, and a diffusion-based conditional motion generation model processing the extracted features. We train our method given video data with corresponding hand trajectory annotations, without supervision in the form of action labels. For the evaluation, we establish benchmarks utilizing the first-person FPHAB and HOI4D datasets, testing our method against various baselines and using multiple metrics. We also introduce task simulators for executing the generated hand trajectories and reporting task success rates as an additional metric. Experiments show that our method generates more appropriate and realistic hand trajectories than baselines and presents promising generalization capability on unseen objects. The accuracy of the generated hand trajectories is confirmed in a physics simulation setting, showcasing the authenticity of the created sequences and their applicability in downstream uses.

Title: Task Tokens: A Flexible Approach to Adapting Behavior Foundation Models

Authors: Ron Vainshtein, Zohar Rimon, Shie Mannor, Chen Tessler
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.22886
Pdf URL: https://arxiv.org/pdf/2503.22886
Copy Paste: [[2503.22886]] Task Tokens: A Flexible Approach to Adapting Behavior Foundation Models(https://arxiv.org/abs/2503.22886)
Keywords: foundation model
Abstract: Recent advancements in imitation learning have led to transformer-based behavior foundation models (BFMs) that enable multi-modal, human-like control for humanoid agents. While excelling at zero-shot generation of robust behaviors, BFMs often require meticulous prompt engineering for specific tasks, potentially yielding suboptimal results. We introduce "Task Tokens", a method to effectively tailor BFMs to specific tasks while preserving their flexibility. Our approach leverages the transformer architecture of BFMs to learn a new task-specific encoder through reinforcement learning, keeping the original BFM frozen. This allows incorporation of user-defined priors, balancing reward design and prompt engineering. By training a task encoder to map observations to tokens, used as additional BFM inputs, we guide performance improvement while maintaining the model's diverse control characteristics. We demonstrate Task Tokens' efficacy across various tasks, including out-of-distribution scenarios, and show their compatibility with other prompting modalities. Our results suggest that Task Tokens offer a promising approach for adapting BFMs to specific control tasks while retaining their generalization capabilities.

Title: Learning Library Cell Representations in Vector Space

Authors: Rongjian Liang, Yi-Chen Lu, Wen-Hao Liu, Haoxing Ren
Subjects: cs.LG, cs.AR
Abstract URL: https://arxiv.org/abs/2503.22900
Pdf URL: https://arxiv.org/pdf/2503.22900
Copy Paste: [[2503.22900]] Learning Library Cell Representations in Vector Space(https://arxiv.org/abs/2503.22900)
Keywords: self-supervised
Abstract: We propose Lib2Vec, a novel self-supervised framework to efficiently learn meaningful vector representations of library cells, enabling ML models to capture essential cell semantics. The framework comprises three key components: (1) an automated method for generating regularity tests to quantitatively evaluate how well cell representations reflect inter-cell relationships; (2) a self-supervised learning scheme that systematically extracts training data from Liberty files, removing the need for costly labeling; and (3) an attention-based model architecture that accommodates various pin counts and enables the creation of property-specific cell and arc embeddings. Experimental results demonstrate that Lib2Vec effectively captures functional and electrical similarities. Moreover, linear algebraic operations on cell vectors reveal meaningful relationships, such as vector(BUF) - vector(INV) + vector(NAND) ~ vector(AND), showcasing the framework's nuanced representation capabilities. Lib2Vec also enhances downstream circuit learning applications, especially when labeled data is scarce.
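
The analogy vector(BUF) - vector(INV) + vector(NAND) ~ vector(AND) quoted above can be illustrated with a toy nearest-neighbour lookup. The three-dimensional embeddings below are hand-made, hypothetical values chosen so the arithmetic works out; they are not Lib2Vec outputs, and cosine similarity is an assumed choice of metric.

```python
import math
from typing import Dict, List

# Hypothetical toy embeddings: [bias, inverting, and_like]. Not learned values.
CELL_VECTORS: Dict[str, List[float]] = {
    "BUF":  [1.0, 0.0, 0.0],
    "INV":  [1.0, 1.0, 0.0],
    "AND":  [1.0, 0.0, 1.0],
    "NAND": [1.0, 1.0, 1.0],
    "OR":   [1.0, 0.0, -1.0],
    "NOR":  [1.0, 1.0, -1.0],
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(a: str, b: str, c: str) -> str:
    """Return the cell whose vector is closest to vector(a) - vector(b) + vector(c).

    The three query cells are excluded from the candidates, following the
    usual convention for embedding analogy tests.
    """
    target = [x - y + z for x, y, z in zip(CELL_VECTORS[a], CELL_VECTORS[b], CELL_VECTORS[c])]
    candidates = {name: vec for name, vec in CELL_VECTORS.items() if name not in (a, b, c)}
    return max(candidates, key=lambda name: cosine(target, candidates[name]))


if __name__ == "__main__":
    print(analogy("BUF", "INV", "NAND"))
```

With these toy vectors the query resolves to AND, mirroring the relationship the abstract reports for learned cell vectors.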

Title: Resona: Improving Context Copying in Linear Recurrence Models with Retrieval

Authors: Xinyu Wang, Linrui Ma, Jerry Huang, Peng Lu, Prasanna Parthasarathi, Xiao-Wen Chang, Boxing Chen, Yufei Cui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.22913
Pdf URL: https://arxiv.org/pdf/2503.22913
Copy Paste: [[2503.22913]] Resona: Improving Context Copying in Linear Recurrence Models with Retrieval(https://arxiv.org/abs/2503.22913)
Keywords: in-context
Abstract: Recent shifts in the space of large language model (LLM) research have shown an increasing focus on novel architectures to compete with prototypical Transformer-based models that have long dominated this space. Linear recurrent models have proven to be a viable competitor due to their computational efficiency. However, such models still demonstrate a sizable gap compared to Transformers in terms of in-context learning among other tasks that require recalling information from a context. In this work, we introduce Resona, a simple and scalable framework for augmenting linear recurrent models with retrieval. Resona augments models with the ability to integrate retrieved information from the provided input context, enabling tailored behavior to diverse task requirements. Experiments on a variety of linear recurrent models demonstrate that Resona-augmented models observe significant performance gains on a variety of synthetic as well as real-world natural language tasks, highlighting its ability to act as a general purpose method to improve the in-context learning and language modeling abilities of linear recurrent LLMs.

Title: SuperEIO: Self-Supervised Event Feature Learning for Event Inertial Odometry

Authors: Peiyu Chen, Fuling Lin, Weipeng Guan, Peng Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.22963
Pdf URL: https://arxiv.org/pdf/2503.22963
Copy Paste: [[2503.22963]] SuperEIO: Self-Supervised Event Feature Learning for Event Inertial Odometry(https://arxiv.org/abs/2503.22963)
Keywords: self-supervised
Abstract: Event cameras asynchronously output low-latency event streams, promising for state estimation in high-speed motion and challenging lighting conditions. As opposed to frame-based cameras, the motion-dependent nature of event cameras presents persistent challenges in achieving robust event feature detection and matching. In recent years, learning-based approaches have demonstrated superior robustness over traditional handcrafted methods in feature detection and matching, particularly under aggressive motion and HDR scenarios. In this paper, we propose SuperEIO, a novel framework that leverages the learning-based event-only detection and IMU measurements to achieve event-inertial odometry. Our event-only feature detection employs a convolutional neural network under continuous event streams. Moreover, our system adopts the graph neural network to achieve event descriptor matching for loop closure. The proposed system utilizes TensorRT to accelerate the inference speed of deep networks, which ensures low-latency processing and robust real-time operation on resource-limited platforms. Besides, we evaluate our method extensively on multiple public datasets, demonstrating its superior accuracy and robustness compared to other state-of-the-art event-based methods. We have also open-sourced our pipeline to facilitate research in the field: this https URL.

Title: Multi-label classification for multi-temporal, multi-spatial coral reef condition monitoring using vision foundation model with adapter learning

Authors: Xinlei Shao, Hongruixuan Chen, Fan Zhao, Kirsty Magson, Jundong Chen, Peiran Li, Jiaqi Wang, Jun Sasaki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23012
Pdf URL: https://arxiv.org/pdf/2503.23012
Copy Paste: [[2503.23012]] Multi-label classification for multi-temporal, multi-spatial coral reef condition monitoring using vision foundation model with adapter learning(https://arxiv.org/abs/2503.23012)
Keywords: foundation model
Abstract: Coral reef ecosystems provide essential ecosystem services, but face significant threats from climate change and human activities. Although advances in deep learning have enabled automatic classification of coral reef conditions, conventional deep models struggle to achieve high performance when processing complex underwater ecological images. Vision foundation models, known for their high accuracy and cross-domain generalizability, offer promising solutions. However, fine-tuning these models requires substantial computational resources and results in high carbon emissions. To address these challenges, adapter learning methods such as Low-Rank Adaptation (LoRA) have emerged as a solution. This study introduces an approach integrating the DINOv2 vision foundation model with the LoRA fine-tuning method. The approach leverages multi-temporal field images collected through underwater surveys at 15 dive sites at Koh Tao, Thailand, with all images labeled according to universal standards used in citizen science-based conservation programs. The experimental results demonstrate that the DINOv2-LoRA model achieved superior accuracy, with a match ratio of 64.77%, compared to 60.34% achieved by the best conventional model. Furthermore, incorporating LoRA reduced the trainable parameters from 1,100M to 5.91M. Transfer learning experiments conducted under different temporal and spatial settings highlight the exceptional generalizability of DINOv2-LoRA across different seasons and sites. This study is the first to explore the efficient adaptation of foundation models for multi-label classification of coral reef conditions under multi-temporal and multi-spatial settings. The proposed method advances the classification of coral reef conditions and provides a tool for monitoring, conserving, and managing coral reef ecosystems.
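
To put the reported parameter reduction in perspective, the sketch below computes the trainable fraction implied by the abstract's figures (5.91M out of 1,100M) and shows the standard LoRA count for a single weight matrix, where a rank-r update adds r * (d_in + d_out) trainable parameters. The layer shape and rank in the example are hypothetical illustration values, not the configuration used in the study.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by a rank-`rank` LoRA update to one d_out x d_in weight.

    LoRA learns two factors, A (rank x d_in) and B (d_out x rank), so the
    count is rank * d_in + d_out * rank.
    """
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the dense weight matrix itself (bias ignored)."""
    return d_in * d_out


if __name__ == "__main__":
    # Figures quoted in the abstract (in millions of parameters).
    full_finetune_m = 1100.0
    lora_m = 5.91
    print(f"trainable fraction: {lora_m / full_finetune_m:.4%}")

    # Hypothetical single layer, for illustration only.
    d_in, d_out, rank = 1536, 1536, 8
    print("dense layer params:", full_params(d_in, d_out))
    print("LoRA params for that layer:", lora_trainable_params(d_in, d_out, rank))
```

The quoted figures correspond to training roughly half a percent of the parameters that full fine-tuning would update.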

Title: MeshCraft: Exploring Efficient and Controllable Mesh Generation with Flow-based DiTs

Authors: Xianglong He, Junyi Chen, Di Huang, Zexiang Liu, Xiaoshui Huang, Wanli Ouyang, Chun Yuan, Yangguang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23022
Pdf URL: https://arxiv.org/pdf/2503.23022
Copy Paste: [[2503.23022]] MeshCraft: Exploring Efficient and Controllable Mesh Generation with Flow-based DiTs(https://arxiv.org/abs/2503.23022)
Keywords: diffusion
Abstract: In the domain of 3D content creation, achieving optimal mesh topology through AI models has long been a pursuit for 3D artists. Previous methods, such as MeshGPT, have explored the generation of ready-to-use 3D objects via mesh auto-regressive techniques. While these methods produce visually impressive results, their reliance on token-by-token predictions in the auto-regressive process leads to several significant limitations. These include extremely slow generation speeds and an uncontrollable number of mesh faces. In this paper, we introduce MeshCraft, a novel framework for efficient and controllable mesh generation, which leverages continuous spatial diffusion to generate discrete triangle faces. Specifically, MeshCraft consists of two core components: 1) a transformer-based VAE that encodes raw meshes into continuous face-level tokens and decodes them back to the original meshes, and 2) a flow-based diffusion transformer conditioned on the number of faces, enabling the generation of high-quality 3D meshes with a predefined number of faces. By utilizing the diffusion model for the simultaneous generation of the entire mesh topology, MeshCraft achieves high-fidelity mesh generation at significantly faster speeds compared to auto-regressive methods. Specifically, MeshCraft can generate an 800-face mesh in just 3.2 seconds (35× faster than existing baselines). Extensive experiments demonstrate that MeshCraft outperforms state-of-the-art techniques in both qualitative and quantitative evaluations on the ShapeNet dataset and demonstrates superior performance on the Objaverse dataset. Moreover, it integrates seamlessly with existing conditional guidance strategies, showcasing its potential to relieve artists from the time-consuming manual work involved in mesh creation.

Title: Unsupervised Anomaly Detection in Multivariate Time Series across Heterogeneous Domains

Authors: Vincent Jacob, Yanlei Diao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.23060
Pdf URL: https://arxiv.org/pdf/2503.23060
Copy Paste: [[2503.23060]] Unsupervised Anomaly Detection in Multivariate Time Series across Heterogeneous Domains(https://arxiv.org/abs/2503.23060)
Keywords: anomaly
Abstract: The widespread adoption of digital services, along with the scale and complexity at which they operate, has made incidents in IT operations increasingly likely, diverse, and impactful. This has led to the rapid development of a central aspect of "Artificial Intelligence for IT Operations" (AIOps), focusing on detecting anomalies in vast amounts of multivariate time series data generated by service entities. In this paper, we begin by introducing a unifying framework for benchmarking unsupervised anomaly detection (AD) methods, and highlight the problem of shifts in normal behaviors that can occur in practical AIOps scenarios. To tackle anomaly detection under domain shift, we then cast the problem in the framework of domain generalization and propose a novel approach, Domain-Invariant VAE for Anomaly Detection (DIVAD), to learn domain-invariant representations for unsupervised anomaly detection. Our evaluation results using the Exathlon benchmark show that the two main DIVAD variants significantly outperform the best unsupervised AD method in maximum performance, with 20% and 15% improvements in maximum peak F1-scores, respectively. Evaluation using the Application Server Dataset further demonstrates the broader applicability of our domain generalization methods.

Title: Efficient Adaptation For Remote Sensing Visual Grounding

Authors: Hasan Moughnieh, Mohamad Chalhoub, Hasan Nasrallah, Cristiano Nattero, Paolo Campanella, Ali J. Ghandour
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.23083
Pdf URL: https://arxiv.org/pdf/2503.23083
Copy Paste: [[2503.23083]] Efficient Adaptation For Remote Sensing Visual Grounding(https://arxiv.org/abs/2503.23083)
Keywords: foundation model
Abstract: Foundation models have revolutionized artificial intelligence (AI), offering remarkable capabilities across multi-modal domains. Their ability to precisely locate objects in complex aerial and satellite images, using rich contextual information and detailed object descriptions, is essential for remote sensing (RS). These models can associate textual descriptions with object positions through the Visual Grounding (VG) task, but due to domain-specific challenges, their direct application to RS produces sub-optimal results. To address this, we applied Parameter Efficient Fine Tuning (PEFT) techniques to adapt these models for RS-specific VG tasks. Specifically, we evaluated LoRA placement across different modules in Grounding DINO and used BitFit and adapters to fine-tune the OFA foundation model pre-trained on general-purpose VG datasets. This approach achieved performance comparable to or surpassing current State Of The Art (SOTA) models while significantly reducing computational costs. This study highlights the potential of PEFT techniques to advance efficient and precise multi-modal analysis in RS, offering a practical and cost-effective alternative to full model training.

Title: The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction

Authors: Yihuai Hong, Dian Zhou, Meng Cao, Lei Yu, Zhijing Jin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23084
Pdf URL: https://arxiv.org/pdf/2503.23084
Copy Paste: [[2503.23084]] The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction(https://arxiv.org/abs/2503.23084)
Keywords: generative
Abstract: Large language models (LLMs) excel on a variety of reasoning benchmarks, but previous studies suggest they sometimes struggle to generalize to unseen questions, potentially due to over-reliance on memorized training examples. However, the precise conditions under which LLMs switch between reasoning and memorization during text generation remain unclear. In this work, we provide a mechanistic understanding of LLMs' reasoning-memorization dynamics by identifying a set of linear features in the model's residual stream that govern the balance between genuine reasoning and memory recall. These features not only distinguish reasoning tasks from memory-intensive ones but can also be manipulated to causally influence model performance on reasoning tasks. Additionally, we show that intervening in these reasoning features helps the model more accurately activate the most relevant problem-solving capabilities during answer generation. Our findings offer new insights into the underlying mechanisms of reasoning and memory in LLMs and pave the way for the development of more robust and interpretable generative AI systems.

Title: Evaluating Compositional Scene Understanding in Multimodal Generative Models

Authors: Shuhao Fu, Andrew Jun Lee, Anna Wang, Ida Momennejad, Trevor Bihl, Hongjing Lu, Taylor W. Webb
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23125
Pdf URL: https://arxiv.org/pdf/2503.23125
Copy Paste: [[2503.23125]] Evaluating Compositional Scene Understanding in Multimodal Generative Models(https://arxiv.org/abs/2503.23125)
Keywords: generative
Abstract: The visual world is fundamentally compositional. Visual scenes are defined by the composition of objects and their relations. Hence, it is essential for computer vision systems to reflect and exploit this compositionality to achieve robust and generalizable scene understanding. While major strides have been made toward the development of general-purpose, multimodal generative models, including both text-to-image models and multimodal vision-language models, it remains unclear whether these systems are capable of accurately generating and interpreting scenes involving the composition of multiple objects and relations. In this work, we present an evaluation of the compositional visual processing capabilities in the current generation of text-to-image (DALL-E 3) and multimodal vision-language models (GPT-4V, GPT-4o, Claude Sonnet 3.5, QWEN2-VL-72B, and InternVL2.5-38B), and compare the performance of these systems to human participants. The results suggest that these systems display some ability to solve compositional and relational tasks, showing notable improvements over the previous generation of multimodal models, but with performance nevertheless well below the level of human participants, particularly for more complex scenes involving many ($>5$) objects and multiple relations. These results highlight the need for further progress toward compositional understanding of visual scenes.

Title: A GAN-Enhanced Deep Learning Framework for Rooftop Detection from Historical Aerial Imagery

Authors: Pengyu Chen, Sicheng Wang, Cuizhen Wang, Senrong Wang, Beiao Huang, Lu Huang, Zhe Zang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23200
Pdf URL: https://arxiv.org/pdf/2503.23200
Copy Paste: [[2503.23200]] A GAN-Enhanced Deep Learning Framework for Rooftop Detection from Historical Aerial Imagery(https://arxiv.org/abs/2503.23200)
Keywords: generative
Abstract: Accurate rooftop detection from historical aerial imagery is vital for examining long-term urban development and human settlement patterns. However, black-and-white analog photographs pose significant challenges for modern object detection frameworks due to their limited spatial resolution, lack of color information, and archival degradation. To address these limitations, this study introduces a two-stage image enhancement pipeline based on Generative Adversarial Networks (GANs): image colorization using DeOldify, followed by super-resolution enhancement with Real-ESRGAN. The enhanced images were then used to train and evaluate rooftop detection models, including Faster R-CNN, DETReg, and YOLOv11n. Results show that combining colorization with super-resolution substantially improves detection performance, with YOLOv11n achieving a mean Average Precision (mAP) exceeding 85%. This reflects an improvement of approximately 40% over original black-and-white images and 20% over images enhanced through colorization alone. The proposed method effectively bridges the gap between archival imagery and contemporary deep learning techniques, enabling more reliable extraction of building footprints from historical aerial photographs.

Title: Enhancing Knowledge Graph Completion with Entity Neighborhood and Relation Context

Authors: Jianfang Chen, Kai Zhang, Aoran Gan, Shiwei Tong, Shuanghong Shen, Qi Liu
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2503.23205
Pdf URL: https://arxiv.org/pdf/2503.23205
Copy Paste: [[2503.23205]] Enhancing Knowledge Graph Completion with Entity Neighborhood and Relation Context(https://arxiv.org/abs/2503.23205)
Keywords: generative
Abstract: Knowledge Graph Completion (KGC) aims to infer missing information in Knowledge Graphs (KGs) to address their inherent incompleteness. Traditional structure-based KGC methods, while effective, face significant computational demands and scalability challenges due to the need for dense embedding learning and scoring all entities in the KG for each prediction. Recent text-based approaches using language models like T5 and BERT have mitigated these issues by converting KG triples into text for reasoning. However, they often fail to fully utilize contextual information, focusing mainly on the neighborhood of the entity and neglecting the context of the relation. To address this issue, we propose KGC-ERC, a framework that integrates both types of context to enrich the input of generative language models and enhance their reasoning capabilities. Additionally, we introduce a sampling strategy to effectively select relevant context within input token constraints, which optimizes the utilization of contextual information and potentially improves model performance. Experiments on the Wikidata5M, Wiki27K, and FB15K-237-N datasets show that KGC-ERC outperforms or matches state-of-the-art baselines in predictive performance and scalability.

Title: RECALL-MM: A Multimodal Dataset of Consumer Product Recalls for Risk Analysis using Computational Methods and Large Language Models

Authors: Diana Bolanos, Mohammadmehdi Ataei, Daniele Grandi, Kosa Goucher-Lambert
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.23213
Pdf URL: https://arxiv.org/pdf/2503.23213
Copy Paste: [[2503.23213]] RECALL-MM: A Multimodal Dataset of Consumer Product Recalls for Risk Analysis using Computational Methods and Large Language Models(https://arxiv.org/abs/2503.23213)
Keywords: generative
Abstract: Product recalls provide valuable insights into potential risks and hazards within the engineering design process, yet their full potential remains underutilized. In this study, we curate data from the United States Consumer Product Safety Commission (CPSC) recalls database to develop a multimodal dataset, RECALL-MM, that informs data-driven risk assessment using historical information, and augment it using generative methods. Patterns in the dataset highlight specific areas where improved safety measures could have significant impact. We extend our analysis by demonstrating interactive clustering maps that embed all recalls into a shared latent space based on recall descriptions and product names. Leveraging these data-driven tools, we explore three case studies to demonstrate the dataset's utility in identifying product risks and guiding safer design decisions. The first two case studies illustrate how designers can visualize patterns across recalled products and situate new product ideas within the broader recall landscape to proactively anticipate hazards. In the third case study, we extend our approach by employing a large language model (LLM) to predict potential hazards based solely on product images. This demonstrates the model's ability to leverage visual context to identify risk factors, revealing strong alignment with historical recall data across many hazard categories. However, the analysis also highlights areas where hazard prediction remains challenging, underscoring the importance of risk awareness throughout the design process. Collectively, this work aims to bridge the gap between historical recall data and future product safety, presenting a scalable, data-driven approach to safer engineering design.

Title: Large Self-Supervised Models Bridge the Gap in Domain Adaptive Object Detection

Authors: Marc-Antoine Lavoie, Anas Mahmoud, Steven L. Waslander
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23220
Pdf URL: https://arxiv.org/pdf/2503.23220
Copy Paste: [[2503.23220]] Large Self-Supervised Models Bridge the Gap in Domain Adaptive Object Detection(https://arxiv.org/abs/2503.23220)
Keywords: self-supervised
Abstract: The current state-of-the-art methods in domain adaptive object detection (DAOD) use Mean Teacher self-labelling, where a teacher model, directly derived as an exponential moving average of the student model, is used to generate labels on the target domain which are then used to improve both models in a positive loop. This couples learning and generating labels on the target domain, and other recent works also leverage the generated labels to add additional domain alignment losses. We believe this coupling is brittle and excessively constrained: there is no guarantee that a student trained only on source data can generate accurate target domain labels and initiate the positive feedback loop, and much better target domain labels can likely be generated by using a large pretrained network that has been exposed to much more data. Vision foundation models are exactly such models, and they have shown impressive task generalization capabilities even when frozen. We want to leverage these models for DAOD and introduce DINO Teacher, which consists of two components. First, we train a new labeller on source data only using a large frozen DINOv2 backbone and show it generates more accurate labels than Mean Teacher. Next, we align the student's source and target image patch features with those from a DINO encoder, driving source and target representations closer to the generalizable DINO representation. We obtain state-of-the-art performance on multiple DAOD datasets. Code available at this https URL
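
The Mean Teacher update mentioned in this abstract, where the teacher is an exponential moving average (EMA) of the student, can be sketched as follows. This is a minimal illustration in plain Python and not the authors' code: the function name `ema_update`, the dictionary-of-floats stand-in for model parameters, and the decay values are all assumptions chosen for the example.

```python
from typing import Dict


def ema_update(teacher: Dict[str, float],
               student: Dict[str, float],
               decay: float = 0.999) -> Dict[str, float]:
    """Return new teacher weights as an EMA of the student weights.

    teacher_new = decay * teacher_old + (1 - decay) * student
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    # Blend each named parameter; the teacher is never trained directly.
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Toy usage: hypothetical scalar "weights" standing in for real tensors.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 3.0, "b": 2.0}
teacher = ema_update(teacher, student, decay=0.5)
print(teacher)  # {'w': 2.0, 'b': 1.0}
```

Because the teacher is only ever a smoothed copy of the student, its label quality on the target domain is bounded by what the student has learned, which is the coupling the abstract describes as brittle.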

Title: Synthetic Art Generation and DeepFake Detection A Study on Jamini Roy Inspired Dataset

Authors: Kushal Agrawal, Romi Banerjee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23226
Pdf URL: https://arxiv.org/pdf/2503.23226
Copy Paste: [[2503.23226]] Synthetic Art Generation and DeepFake Detection A Study on Jamini Roy Inspired Dataset(https://arxiv.org/abs/2503.23226)
Keywords: diffusion, generative
Abstract: The intersection of generative AI and art is a fascinating area that brings both exciting opportunities and significant challenges, especially when it comes to identifying synthetic artworks. This study takes a unique approach by examining diffusion-based generative models in the context of Indian art, specifically focusing on the distinctive style of Jamini Roy. To explore this, we fine-tuned Stable Diffusion 3 and used techniques like ControlNet and IPAdapter to generate realistic images. This allowed us to create a new dataset that includes both real and AI-generated artworks, which is essential for a detailed analysis of what these models can produce. We employed various qualitative and quantitative methods, such as Fourier domain assessments and autocorrelation metrics, to uncover subtle differences between synthetic images and authentic pieces. A key takeaway from recent research is that existing methods for detecting deepfakes face considerable challenges, especially when the deepfakes are of high quality and tailored to specific cultural contexts. This highlights a critical gap in current detection technologies. This work not only sheds light on the increasing complexity of generative models but also sets a crucial foundation for future research aimed at effective detection of synthetic art.

Title: Evaluating how LLM annotations represent diverse views on contentious topics

Authors: Megan A. Brown, Shubham Atreja, Libby Hemphill, Patrick Y. Wu
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2503.23243
Pdf URL: https://arxiv.org/pdf/2503.23243
Copy Paste: [[2503.23243]] Evaluating how LLM annotations represent diverse views on contentious topics(https://arxiv.org/abs/2503.23243)
Keywords: generative
Abstract: Researchers have proposed the use of generative large language models (LLMs) to label data for both research and applied settings. This literature emphasizes the improved performance of LLMs relative to other natural language models, noting that LLMs typically outperform other models on standard metrics such as accuracy, precision, recall, and F1 score. However, previous literature has also highlighted the bias embedded in language models, particularly around contentious topics such as potentially toxic content. This bias could result in labels applied by LLMs that disproportionately align with majority groups over a more diverse set of viewpoints. In this paper, we evaluate how LLMs represent diverse viewpoints on these contentious tasks. Across four annotation tasks on four datasets, we show that LLMs do not show substantial disagreement with annotators on the basis of demographics. Instead, the model, prompt, and disagreement between human annotators on the labeling task are far more predictive of LLM agreement. Our findings suggest that when using LLMs to annotate data, under-representing the views of particular groups is not a substantial concern. We conclude with a discussion of the implications for researchers and practitioners.

Title: Learning Predictive Visuomotor Coordination

Authors: Wenqi Jia, Bolin Lai, Miao Liu, Danfei Xu, James M. Rehg
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.23300
Pdf URL: https://arxiv.org/pdf/2503.23300
Copy Paste: [[2503.23300]] Learning Predictive Visuomotor Coordination(https://arxiv.org/abs/2503.23300)
Keywords: diffusion
Abstract: Understanding and predicting human visuomotor coordination is crucial for applications in robotics, human-computer interaction, and assistive technologies. This work introduces a forecasting-based task for visuomotor modeling, where the goal is to predict head pose, gaze, and upper-body motion from egocentric visual and kinematic observations. We propose a Visuomotor Coordination Representation (VCR) that learns structured temporal dependencies across these multimodal signals. We extend a diffusion-based motion modeling framework that integrates egocentric vision and kinematic sequences, enabling temporally coherent and accurate visuomotor predictions. Our approach is evaluated on the large-scale EgoExo4D dataset, demonstrating strong generalization across diverse real-world activities. Our results highlight the importance of multimodal integration in understanding visuomotor coordination, contributing to research in visuomotor learning and human behavior modeling.

Title: HiPART: Hierarchical Pose AutoRegressive Transformer for Occluded 3D Human Pose Estimation

Authors: Hongwei Zheng, Han Li, Wenrui Dai, Ziyang Zheng, Chenglin Li, Junni Zou, Hongkai Xiong
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.23331
Pdf URL: https://arxiv.org/pdf/2503.23331
Copy Paste: [[2503.23331]] HiPART: Hierarchical Pose AutoRegressive Transformer for Occluded 3D Human Pose Estimation(https://arxiv.org/abs/2503.23331)
Keywords: generative
Abstract: Existing 2D-to-3D human pose estimation (HPE) methods struggle with the occlusion issue by enriching information like temporal and visual cues in the lifting stage. In this paper, we argue that these methods ignore the limitation of the sparse skeleton 2D input representation, which fundamentally restricts the 2D-to-3D lifting and worsens the occlusion issue. To address this limitation, we propose a novel two-stage generative densification method, named Hierarchical Pose AutoRegressive Transformer (HiPART), to generate hierarchical 2D dense poses from the original sparse 2D pose. Specifically, we first develop a multi-scale skeleton tokenization module to quantize the highly dense 2D pose into hierarchical tokens and propose a Skeleton-aware Alignment to strengthen token connections. We then develop a Hierarchical AutoRegressive Modeling scheme for hierarchical 2D pose generation. With generated hierarchical poses as inputs for 2D-to-3D lifting, the proposed method shows strong robustness in occluded scenarios and achieves state-of-the-art performance on single-frame-based 3D HPE. Moreover, it outperforms numerous multi-frame methods while reducing parameter and computational complexity and can also complement them to further enhance performance and robustness.

Title: TraceMark-LDM: Authenticatable Watermarking for Latent Diffusion Models via Binary-Guided Rearrangement

Authors: Wenhao Luo, Zhangyi Shen, Ye Yao, Feng Ding, Guopu Zhu, Weizhi Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23332
Pdf URL: https://arxiv.org/pdf/2503.23332
Copy Paste: [[2503.23332]] TraceMark-LDM: Authenticatable Watermarking for Latent Diffusion Models via Binary-Guided Rearrangement(https://arxiv.org/abs/2503.23332)
Keywords: diffusion
Abstract: Image generation algorithms are increasingly integral to diverse aspects of human society, driven by their practical applications. However, insufficient oversight in Artificial Intelligence generated content (AIGC) can facilitate the spread of malicious content and increase the risk of copyright infringement. Among the diverse range of image generation models, the Latent Diffusion Model (LDM) is currently the most widely used, dominating the majority of the Text-to-Image model market. Currently, most attribution methods for LDMs rely on directly embedding watermarks into the generated images or their intermediate noise, a practice that compromises both the quality and the robustness of the generated content. To address these limitations, we introduce TraceMark-LDM, a novel algorithm that integrates watermarking to attribute generated images while guaranteeing non-destructive performance. Unlike current methods, TraceMark-LDM leverages watermarks as guidance to rearrange random variables sampled from a Gaussian distribution. To mitigate potential deviations caused by inversion errors, the small absolute elements are grouped and rearranged. Additionally, we fine-tune the LDM encoder to enhance the robustness of the watermark. Experimental results show that images synthesized using TraceMark-LDM exhibit superior quality and attribution accuracy compared to state-of-the-art (SOTA) techniques. Notably, TraceMark-LDM demonstrates exceptional robustness against various common attack methods, consistently outperforming SOTA methods.

Title: Object Isolated Attention for Consistent Story Visualization

Authors: Xiangyang Luo, Junhao Cheng, Yifan Xie, Xin Zhang, Tao Feng, Zhou Liu, Fei Ma, Fei Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23353
Pdf URL: https://arxiv.org/pdf/2503.23353
Copy Paste: [[2503.23353]] Object Isolated Attention for Consistent Story Visualization(https://arxiv.org/abs/2503.23353)
Keywords: diffusion
Abstract: Open-ended story visualization is a challenging task that involves generating coherent image sequences from a given storyline. One of the main difficulties is maintaining character consistency while creating natural and contextually fitting scenes--an area where many existing methods struggle. In this paper, we propose an enhanced Transformer module that uses separate self attention and cross attention mechanisms, leveraging prior knowledge from pre-trained diffusion models to ensure logical scene creation. The isolated self attention mechanism improves character consistency by refining attention maps to reduce focus on irrelevant areas and highlight key features of the same character. Meanwhile, the isolated cross attention mechanism independently processes each character's features, avoiding feature fusion and further strengthening consistency. Notably, our method is training-free, allowing the continuous generation of new characters and storylines without re-tuning. Both qualitative and quantitative evaluations show that our approach outperforms current methods, demonstrating its effectiveness.

Title: DSPFusion: Image Fusion via Degradation and Semantic Dual-Prior Guidance

Authors: Linfeng Tang, Chunyu Li, Guoqing Wang, Yixuan Yuan, Jiayi Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23355
Pdf URL: https://arxiv.org/pdf/2503.23355
Copy Paste: [[2503.23355]] DSPFusion: Image Fusion via Degradation and Semantic Dual-Prior Guidance(https://arxiv.org/abs/2503.23355)
Keywords: diffusion
Abstract: Existing fusion methods are tailored for high-quality images but struggle with degraded images captured under harsh circumstances, thus limiting the practical potential of image fusion. This work presents a Degradation and Semantic Prior dual-guided framework for degraded image Fusion (DSPFusion), utilizing degradation priors and high-quality scene semantic priors restored via diffusion models to guide both information recovery and fusion in a unified model. Specifically, it first individually extracts modality-specific degradation priors, while jointly capturing comprehensive low-quality semantic priors. Subsequently, a diffusion model is developed to iteratively restore high-quality semantic priors in a compact latent space, enabling our method to be over $20 \times$ faster than mainstream diffusion model-based image fusion schemes. Finally, the degradation priors and high-quality semantic priors are employed to guide information enhancement and aggregation via the dual-prior guidance and prior-guided fusion modules. Extensive experiments demonstrate that DSPFusion mitigates most typical degradations while integrating complementary context with minimal computational cost, greatly broadening the application scope of image fusion.

Title: Towards Physically Plausible Video Generation via VLM Planning

Authors: Xindi Yang, Baolu Li, Yiming Zhang, Zhenfei Yin, Lei Bai, Liqian Ma, Zhiyong Wang, Jianfei Cai, Tien-Tsin Wong, Huchuan Lu, Xu Jia
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23368
Pdf URL: https://arxiv.org/pdf/2503.23368
Copy Paste: [[2503.23368]] Towards Physically Plausible Video Generation via VLM Planning(https://arxiv.org/abs/2503.23368)
Keywords: diffusion
Abstract: Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing the attention of the community to their potential as world simulators. However, despite their capabilities, VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics, resulting in incorrect dynamics and event sequences. To address this limitation, we propose a novel two-stage image-to-video generation framework that explicitly incorporates physics. In the first stage, we employ a Vision Language Model (VLM) as a coarse-grained motion planner, integrating chain-of-thought and physics-aware reasoning to predict rough motion trajectories/changes that approximate real-world physical dynamics while ensuring inter-frame consistency. In the second stage, we use the predicted motion trajectories/changes to guide the video generation of a VDM. As the predicted motion trajectories/changes are rough, noise is added during inference to provide freedom to the VDM in generating motion with finer details. Extensive experimental results demonstrate that our framework can produce physically plausible motion, and comparative evaluations highlight the notable superiority of our approach over existing methods. More video results are available on our Project Page: this https URL.

Title: Map Feature Perception Metric for Map Generation Quality Assessment and Loss Optimization

Authors: Chenxing Sun, Jing Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23370
Pdf URL: https://arxiv.org/pdf/2503.23370
Copy Paste: [[2503.23370]] Map Feature Perception Metric for Map Generation Quality Assessment and Loss Optimization(https://arxiv.org/abs/2503.23370)
Keywords: generative
Abstract: In intelligent cartographic generation tasks empowered by generative models, the authenticity of synthesized maps constitutes a critical determinant. Concurrently, the selection of appropriate evaluation metrics to quantify map authenticity emerges as a pivotal research challenge. Current methodologies predominantly adopt computer vision-based image assessment metrics to compute discrepancies between generated and reference maps. However, conventional visual similarity metrics, including L1, L2, SSIM, and FID, primarily operate at pixel-level comparisons, inadequately capturing cartographic global features and spatial correlations, consequently inducing semantic-structural artifacts in generated outputs. This study introduces a novel Map Feature Perception (MFP) Metric designed to evaluate global characteristics and spatial congruence between synthesized and target maps. Diverging from pixel-wise metrics, our approach extracts elemental-level deep features that comprehensively encode cartographic structural integrity and topological relationships. Experimental validation demonstrates MFP's superior capability in evaluating cartographic semantic features, with classification-enhanced implementations outperforming conventional loss functions across diverse generative frameworks. When employed as optimization objectives, our metric achieves performance gains ranging from 2% to 50% across multiple benchmarks compared to traditional L1, L2, and SSIM baselines. This investigation concludes that explicit consideration of cartographic global attributes and spatial coherence substantially enhances generative model optimization, thereby significantly improving the geographical plausibility of synthesized maps.

Title: JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

Authors: Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Hao Fei, Tat-Seng Chua
Subjects: cs.CV, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.23377
Pdf URL: https://arxiv.org/pdf/2503.23377
Copy Paste: [[2503.23377]] JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization(https://arxiv.org/abs/2503.23377)
Keywords: diffusion
Abstract: This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Built upon the powerful Diffusion Transformer (DiT) architecture, JavisDiT is able to generate high-quality audio and video content simultaneously from open-ended user prompts. To ensure optimal synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors, guiding the synchronization between the visual and auditory components. Furthermore, we propose a new benchmark, JavisBench, consisting of 10,140 high-quality text-captioned sounding videos spanning diverse scenes and complex real-world scenarios. Further, we specifically devise a robust metric for evaluating the synchronization between generated audio-video pairs in real-world complex content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods by ensuring both high-quality generation and precise synchronization, setting a new standard for JAVG tasks. Our code, model, and dataset will be made publicly available at this https URL.

Title: A Large Scale Analysis of Gender Biases in Text-to-Image Generative Models

Authors: Leander Girrbach, Stephan Alaniz, Genevieve Smith, Zeynep Akata
Subjects: cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2503.23398
Pdf URL: https://arxiv.org/pdf/2503.23398
Copy Paste: [[2503.23398]] A Large Scale Analysis of Gender Biases in Text-to-Image Generative Models(https://arxiv.org/abs/2503.23398)
Keywords: generative
Abstract: With the increasing use of image generation technology, understanding its social biases, including gender bias, is essential. This paper presents the first large-scale study on gender bias in text-to-image (T2I) models, focusing on everyday situations. While previous research has examined biases in occupations, we extend this analysis to gender associations in daily activities, objects, and contexts. We create a dataset of 3,217 gender-neutral prompts and generate 200 images per prompt from five leading T2I models. We automatically detect the perceived gender of people in the generated images and filter out images with no person or multiple people of different genders, leaving 2,293,295 images. To enable a broad analysis of gender bias in T2I models, we group prompts into semantically similar concepts and calculate the proportion of male- and female-gendered images for each prompt. Our analysis shows that T2I models reinforce traditional gender roles, reflect common gender stereotypes in household roles, and underrepresent women in finance-related activities. Women are predominantly portrayed in care- and human-centered scenarios, and men in technical or physical labor scenarios.
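
The per-prompt proportion described above can be sketched as follows. This is a minimal illustration only: the function name `gender_proportions`, the `(prompt, label)` input format, and the sample prompts are assumptions for this example, not the authors' pipeline or data.

```python
from collections import Counter, defaultdict

def gender_proportions(detections):
    """Return per-prompt male/female proportions.

    `detections` is an iterable of (prompt, label) pairs, where label is
    "male" or "female" for an image that passed filtering (hypothetical format).
    """
    counts = defaultdict(Counter)
    for prompt, label in detections:
        counts[prompt][label] += 1

    proportions = {}
    for prompt, c in counts.items():
        total = c["male"] + c["female"]
        if total == 0:
            continue  # no usable images for this prompt
        proportions[prompt] = {
            "male": c["male"] / total,
            "female": c["female"] / total,
        }
    return proportions

# Invented sample detections, for illustration only.
sample = [
    ("a person cooking", "female"),
    ("a person cooking", "female"),
    ("a person cooking", "male"),
    ("a person fixing a car", "male"),
]
result = gender_proportions(sample)
print(result)
```

Averaging these per-prompt proportions over prompts grouped into one concept would then give a concept-level figure; how the paper aggregates them is not stated in the abstract.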

Title: Diffusion Meets Few-shot Class Incremental Learning

Authors: Junsu Kim, Yunhoe Ku, Dongyoon Han, Seungryul Baek
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23402
Pdf URL: https://arxiv.org/pdf/2503.23402
Copy Paste: [[2503.23402]] Diffusion Meets Few-shot Class Incremental Learning(https://arxiv.org/abs/2503.23402)
Keywords: diffusion, generative
Abstract: Few-shot class-incremental learning (FSCIL) is challenging because training data is extremely limited while the model must both reduce catastrophic forgetting and learn new information. We propose Diffusion-FSCIL, a novel approach that employs a text-to-image diffusion model as a frozen backbone. Our conjecture is that FSCIL can be tackled using a large generative model's capabilities benefiting from 1) generation ability via large-scale pre-training; 2) multi-scale representation; 3) representational flexibility through the text encoder. To maximize the representation capability, we propose to extract multiple complementary diffusion features to play roles as latent replay with slight support from feature distillation for preventing generative biases. Our framework realizes efficiency through 1) using a frozen backbone; 2) minimal trainable components; 3) batch processing of multiple feature extractions. Extensive experiments on CUB-200, miniImageNet, and CIFAR-100 show that Diffusion-FSCIL surpasses state-of-the-art methods, preserving performance on previously learned classes and adapting effectively to new ones.

Title: GMapLatent: Geometric Mapping in Latent Space

Authors: Wei Zeng, Xuebin Chang, Jianghao Su, Xiang Gu, Jian Sun, Zongben Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23407
Pdf URL: https://arxiv.org/pdf/2503.23407
Copy Paste: [[2503.23407]] GMapLatent: Geometric Mapping in Latent Space(https://arxiv.org/abs/2503.23407)
Keywords: generative
Abstract: Cross-domain generative models based on encoder-decoder AI architectures have attracted much attention in generating realistic images, where domain alignment is crucial for generation accuracy. Domain alignment methods usually deal directly with the initial distribution; however, mismatched or mixed clusters can lead to mode collapse and mixture problems in the decoder, compromising model generalization capabilities. In this work, we propose a cross-domain alignment and generation model that introduces a canonical latent space representation based on geometric mapping to align the cross-domain latent spaces in a rigorous and precise manner, thus avoiding mode collapse and mixture in the encoder-decoder generation architectures. We name this model GMapLatent. The core of the method is to seamlessly align latent spaces with strict cluster correspondence constraints using the canonical parameterizations of cluster-decorated latent spaces. We first (1) transform the latent space to a canonical parameter domain by composing barycenter translation, optimal transport merging and constrained harmonic mapping, and then (2) compute geometric registration with cluster constraints over the canonical parameter domains. This process realizes a bijective (one-to-one and onto) mapping between newly transformed latent spaces and generates a precise alignment of cluster pairs. Cross-domain generation is then achieved through the aligned latent spaces embedded in the encoder-decoder pipeline. Experiments on gray-scale and color images validate the efficiency, efficacy and applicability of GMapLatent, and demonstrate that the proposed model has superior performance over existing models.

Title: Towards Trustworthy GUI Agents: A Survey

Authors: Yucheng Shi, Wenhao Yu, Wenlin Yao, Wenhu Chen, Ninghao Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.23434
Pdf URL: https://arxiv.org/pdf/2503.23434
Copy Paste: [[2503.23434]] Towards Trustworthy GUI Agents: A Survey(https://arxiv.org/abs/2503.23434)
Keywords: foundation model
Abstract: GUI agents, powered by large foundation models, can interact with digital interfaces, enabling various applications in web automation, mobile navigation, and software testing. However, their increasing autonomy has raised critical concerns about their security, privacy, and safety. This survey examines the trustworthiness of GUI agents in five critical dimensions: security vulnerabilities, reliability in dynamic environments, transparency and explainability, ethical considerations, and evaluation methodologies. We also identify major challenges such as vulnerability to adversarial attacks, cascading failure modes in sequential decision-making, and a lack of realistic evaluation benchmarks. These issues not only hinder real-world deployment but also call for comprehensive mitigation strategies beyond task success. As GUI agents become more widespread, establishing robust safety standards and responsible development practices is essential. This survey provides a foundation for advancing trustworthy GUI agents through systematic understanding and future research.

Title: AU-TTT: Vision Test-Time Training model for Facial Action Unit Detection

Authors: Bohao Xing, Kaishen Yuan, Zitong Yu, Xin Liu, Heikki Kälviäinen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23450
Pdf URL: https://arxiv.org/pdf/2503.23450
Copy Paste: [[2503.23450]] AU-TTT: Vision Test-Time Training model for Facial Action Unit Detection(https://arxiv.org/abs/2503.23450)
Keywords: self-supervised
Abstract: Facial Action Unit (AU) detection is a cornerstone of objective facial expression analysis and a critical focus in affective computing. Despite its importance, AU detection faces significant challenges, such as the high cost of AU annotation and the limited availability of datasets. These constraints often lead to overfitting in existing methods, resulting in substantial performance degradation when applied across diverse datasets. Addressing these issues is essential for improving the reliability and generalizability of AU detection methods. Moreover, many current approaches leverage Transformers for their effectiveness in long-context modeling, but they are hindered by the quadratic complexity of self-attention. Recently, Test-Time Training (TTT) layers have emerged as a promising solution for long-sequence modeling. Additionally, TTT applies self-supervised learning for iterative updates during both training and inference, offering a potential pathway to mitigate the generalization challenges inherent in AU detection tasks. In this paper, we propose a novel vision backbone tailored for AU detection, incorporating bidirectional TTT blocks, named AU-TTT. Our approach introduces TTT Linear to the AU detection task and optimizes image scanning mechanisms for enhanced performance. Additionally, we design an AU-specific Region of Interest (RoI) scanning mechanism to capture fine-grained facial features critical for AU detection. Experimental results demonstrate that our method achieves competitive performance in both within-domain and cross-domain scenarios.

Title: Beyond Academic Benchmarks: Critical Analysis and Best Practices for Visual Industrial Anomaly Detection

Authors: Aimira Baitieva, Yacine Bouaouni, Alexandre Briot, Dick Ameln, Souhaiel Khalfaoui, Samet Akcay
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23451
Pdf URL: https://arxiv.org/pdf/2503.23451
Copy Paste: [[2503.23451]] Beyond Academic Benchmarks: Critical Analysis and Best Practices for Visual Industrial Anomaly Detection(https://arxiv.org/abs/2503.23451)
Keywords: anomaly
Abstract: Anomaly detection (AD) is essential for automating visual inspection in manufacturing. This field of computer vision is rapidly evolving, with increasing attention towards real-world applications. Meanwhile, popular datasets are typically produced in controlled lab environments with artificially created defects, unable to capture the diversity of real production conditions. New methods often fail in production settings, showing significant performance degradation or requiring impractical computational resources. This disconnect between academic results and industrial viability threatens to misdirect visual anomaly detection research. This paper makes three key contributions: (1) we demonstrate the importance of real-world datasets and establish benchmarks using actual production data, (2) we provide a fair comparison of existing SOTA methods across diverse tasks by utilizing metrics that are valuable for practical applications, and (3) we present a comprehensive analysis of recent advancements in this field by discussing important challenges and new perspectives for bridging the academia-industry gap. The code is publicly available at this https URL

Title: TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes

Authors: Nikai Du, Zhennan Chen, Zhizhou Chen, Shan Gao, Xi Chen, Zhengkai Jiang, Jian Yang, Ying Tai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23461
Pdf URL: https://arxiv.org/pdf/2503.23461
Copy Paste: [[2503.23461]] TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes(https://arxiv.org/abs/2503.23461)
Keywords: generative
Abstract: This paper explores the task of Complex Visual Text Generation (CVTG), which centers on generating intricate textual content distributed across diverse regions within visual images. In CVTG, image generation models often render distorted and blurred visual text or omit some of it. To tackle these challenges, we propose TextCrafter, a novel multi-visual text rendering method. TextCrafter employs a progressive strategy to decompose complex visual text into distinct components while ensuring robust alignment between textual content and its visual carrier. Additionally, it incorporates a token focus enhancement mechanism to amplify the prominence of visual text during the generation process. TextCrafter effectively addresses key challenges in CVTG tasks, such as text confusion, omissions, and blurriness. Moreover, we present a new benchmark dataset, CVTG-2K, tailored to rigorously evaluate the performance of generative models on CVTG tasks. Extensive experiments demonstrate that our method surpasses state-of-the-art approaches.

Title: Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model

Authors: Jannik Endres, Oliver Hahn, Charles Corbière, Simone Schaub-Meyer, Stefan Roth, Alexandre Alahi
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.23502
Pdf URL: https://arxiv.org/pdf/2503.23502
Copy Paste: [[2503.23502]] Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model(https://arxiv.org/abs/2503.23502)
Keywords: foundation model
Abstract: Omnidirectional depth perception is essential for mobile robotics applications that require scene understanding across a full 360° field of view. Camera-based setups offer a cost-effective option by using stereo depth estimation to generate dense, high-resolution depth maps without relying on expensive active sensing. However, existing omnidirectional stereo matching approaches achieve only limited depth accuracy across diverse environments, depth ranges, and lighting conditions, due to the scarcity of real-world data. We present DFI-OmniStereo, a novel omnidirectional stereo matching method that leverages a large-scale pre-trained foundation model for relative monocular depth estimation within an iterative optimization-based stereo matching architecture. We introduce a dedicated two-stage training strategy to utilize the relative monocular depth features for our omnidirectional stereo matching before scale-invariant fine-tuning. DFI-OmniStereo achieves state-of-the-art results on the real-world Helvipad dataset, reducing disparity MAE by approximately 16% compared to the previous best omnidirectional stereo method.

Title: Federated Self-Supervised Learning for One-Shot Cross-Modal and Cross-Imaging Technique Segmentation

Authors: Siladittya Manna, Suresh Das, Sayantari Ghosh, Saumik Bhattacharya
Subjects: cs.CV, cs.LG, eess.IV, physics.med-ph
Abstract URL: https://arxiv.org/abs/2503.23507
Pdf URL: https://arxiv.org/pdf/2503.23507
Copy Paste: [[2503.23507]] Federated Self-Supervised Learning for One-Shot Cross-Modal and Cross-Imaging Technique Segmentation(https://arxiv.org/abs/2503.23507)
Keywords: self-supervised
Abstract: Decentralized federated learning enables learning of data representations from multiple sources without compromising the privacy of the clients. In applications like medical image segmentation, where obtaining a large annotated dataset from a single source is a distressing problem, federated self-supervised learning can provide some solace. In this work, we push the limits further by exploring a federated self-supervised one-shot segmentation task representing a more data-scarce scenario. We adopt a pre-existing self-supervised few-shot segmentation framework CoWPro and adapt it to the federated learning scenario. To the best of our knowledge, this work is the first to attempt a self-supervised few-shot segmentation task in the federated learning domain. Moreover, we consider the clients to be constituted of data from different modalities and imaging techniques like MR or CT, which makes the problem even harder. Additionally, we reinforce and improve the baseline CoWPro method using a fused dice loss which shows considerable improvement in performance over the baseline CoWPro. Finally, we evaluate this novel framework on a completely unseen held-out part of the local client dataset. We observe that the proposed framework can achieve performance on par with or better than the FedAvg version of the CoWPro framework on the held-out validation dataset.

Title: Enhancing Creative Generation on Stable Diffusion-based Models

Authors: Jiyeon Han, Dahee Kwon, Gayoung Lee, Junho Kim, Jaesik Choi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23538
Pdf URL: https://arxiv.org/pdf/2503.23538
Copy Paste: [[2503.23538]] Enhancing Creative Generation on Stable Diffusion-based Models(https://arxiv.org/abs/2503.23538)
Keywords: diffusion, generative
Abstract: Recent text-to-image generative models, particularly Stable Diffusion and its distilled variants, have achieved impressive fidelity and strong text-image alignment. However, their creative capability remains constrained, as including 'creative' in prompts seldom yields the desired results. This paper introduces C3 (Creative Concept Catalyst), a training-free approach designed to enhance creativity in Stable Diffusion-based models. C3 selectively amplifies features during the denoising process to foster more creative outputs. We offer practical guidelines for choosing amplification factors based on two main aspects of creativity. C3 is the first study to enhance creativity in diffusion models without extensive computational costs. We demonstrate its effectiveness across various Stable Diffusion-based models.

Title: DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution

Authors: Zheng-Peng Duan, Jiawei Zhang, Xin Jin, Ziheng Zhang, Zheng Xiong, Dongqing Zou, Jimmy Ren, Chun-Le Guo, Chongyi Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23580
Pdf URL: https://arxiv.org/pdf/2503.23580
Copy Paste: [[2503.23580]] DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution(https://arxiv.org/abs/2503.23580)
Keywords: diffusion, generative
Abstract: Large-scale pre-trained diffusion models are becoming increasingly popular in solving the Real-World Image Super-Resolution (Real-ISR) problem because of their rich generative priors. The recent development of diffusion transformer (DiT) has witnessed overwhelming performance over the traditional UNet-based architecture in image generation, which also raises the question: Can we adopt the advanced DiT-based diffusion model for Real-ISR? To this end, we propose our DiT4SR, one of the pioneering works to tame the large-scale DiT model for Real-ISR. Instead of directly injecting embeddings extracted from low-resolution (LR) images like ControlNet, we integrate the LR embeddings into the original attention mechanism of DiT, allowing for the bidirectional flow of information between the LR latent and the generated latent. The sufficient interaction of these two streams allows the LR stream to evolve with the diffusion process, producing progressively refined guidance that better aligns with the generated latent at each diffusion step. Additionally, the LR guidance is injected into the generated latent via a cross-stream convolution layer, compensating for DiT's limited ability to capture local information. These simple but effective designs endow the DiT model with superior performance in Real-ISR, which is demonstrated by extensive experiments. Project Page: this https URL.

Title: Make Autoregressive Great Again: Diffusion-Free Graph Generation with Next-Scale Prediction

Authors: Samuel Belkadi, Steve Hong, Marian Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.23612
Pdf URL: https://arxiv.org/pdf/2503.23612
Copy Paste: [[2503.23612]] Make Autoregressive Great Again: Diffusion-Free Graph Generation with Next-Scale Prediction(https://arxiv.org/abs/2503.23612)
Keywords: diffusion, generative
Abstract: Autoregressive models are popular generative models due to their speed and properties. However, they require an explicit sequence order, which contradicts the unordered nature of graphs. In contrast, diffusion models maintain permutation invariance and enable one-shot generation but require up to thousands of denoising steps and additional features, leading to high computational costs. Inspired by recent breakthroughs in image generation, especially the success of visual autoregressive methods, we propose MAG, a novel diffusion-free graph generation framework based on next-scale prediction. By leveraging a hierarchy of latent representations, the model progressively generates scales of the entire graph without the need for explicit node ordering. Extensive experiments on both generic and molecular graph datasets demonstrate that MAG delivers competitive performance compared to state-of-the-art methods, achieving a speedup of up to three orders of magnitude during inference.

Title: Graph-Eq: Discovering Mathematical Equations using Graph Generative Models

Authors: Nisal Ranasinghe, Damith Senanayake, Saman Halgamuge
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23617
Pdf URL: https://arxiv.org/pdf/2503.23617
Copy Paste: [[2503.23617]] Graph-Eq: Discovering Mathematical Equations using Graph Generative Models(https://arxiv.org/abs/2503.23617)
Keywords: generative
Abstract: The ability to discover meaningful, accurate, and concise mathematical equations that describe datasets is valuable across various domains. Equations offer explicit relationships between variables, enabling deeper insights into underlying data patterns. Most existing equation discovery methods rely on genetic programming, which iteratively searches the equation space but is often slow and prone to overfitting. By representing equations as directed acyclic graphs, we use graph neural networks to learn the underlying semantics of equations, and generate new, previously unseen equations. Although graph generative models have been shown to be successful in discovering new types of graphs in many fields, their application in discovering equations remains largely unexplored. In this work, we propose Graph-Eq, a deep graph generative model designed for efficient equation discovery. Graph-Eq uses a conditional variational autoencoder (CVAE) to learn a rich latent representation of the equation space by training it on a large corpus of equations in an unsupervised manner. Instead of directly searching the equation space, we employ Bayesian optimization to efficiently explore this learned latent space. We show that the encoder-decoder architecture of Graph-Eq is able to accurately reconstruct input equations. Moreover, we show that the learned latent representation can be sampled and decoded into valid equations, including new equations not seen in the training data. Finally, we assess Graph-Eq's ability to discover equations that best fit a dataset by exploring the latent space using Bayesian optimization. Latent space exploration is done on 20 datasets with known ground-truth equations, and Graph-Eq is shown to successfully discover the ground-truth equation in the majority of datasets.
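
To make the idea of an equation as a directed acyclic graph concrete, here is a small sketch that encodes x*x + 2*sin(x) as a DAG and evaluates it. The node encoding, the `OPS` table, and the `evaluate` helper are hypothetical choices for this illustration; the abstract does not specify the paper's actual graph representation, and no neural network or Bayesian optimization is involved here.

```python
import math

# Hypothetical encoding: node id -> (operation, payload).
# Leaves carry a variable name or a constant; inner nodes list their child ids.
graph = {
    "x": ("var", "x"),
    "two": ("const", 2.0),
    "sq": ("mul", ["x", "x"]),        # x * x
    "sin": ("sin", ["x"]),            # sin(x)
    "scaled": ("mul", ["two", "sin"]),
    "out": ("add", ["sq", "scaled"]), # x*x + 2*sin(x)
}

OPS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "sin": math.sin,
}

def evaluate(graph, node, env, cache=None):
    """Evaluate `node` given variable bindings `env`, memoizing shared nodes."""
    cache = {} if cache is None else cache
    if node in cache:
        return cache[node]
    op, payload = graph[node]
    if op == "var":
        value = env[payload]
    elif op == "const":
        value = payload
    else:
        value = OPS[op](*(evaluate(graph, child, env, cache) for child in payload))
    cache[node] = value
    return value

print(evaluate(graph, "out", {"x": 1.0}))
```

The node "x" feeds three edges, so the structure is a DAG rather than a tree; sharing subexpressions this way is what distinguishes the graph view from a plain expression tree.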

Title: Leveraging Vision-Language Foundation Models to Reveal Hidden Image-Attribute Relationships in Medical Imaging

Authors: Amar Kumar, Anita Kriz, Barak Pertzov, Tal Arbel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23618
Pdf URL: https://arxiv.org/pdf/2503.23618
Copy Paste: [[2503.23618]] Leveraging Vision-Language Foundation Models to Reveal Hidden Image-Attribute Relationships in Medical Imaging(https://arxiv.org/abs/2503.23618)
Keywords: foundation model
Abstract: Vision-language foundation models (VLMs) have shown impressive performance in guiding image generation through text, with emerging applications in medical imaging. In this work, we are the first to investigate the question: 'Can fine-tuned foundation models help identify critical, and possibly unknown, data properties?' By evaluating our proposed method on a chest X-ray dataset, we show that these models can generate high-resolution, precisely edited images compared to methods that rely on Structural Causal Models (SCMs) according to numerous metrics. For the first time, we demonstrate that fine-tuned VLMs can reveal hidden data relationships that were previously obscured due to available metadata granularity and model capacity limitations. Our experiments demonstrate the potential of these models to reveal underlying dataset properties while also exposing the limitations of fine-tuned VLMs for accurate image editing and susceptibility to biases and spurious correlations.

Title: Language-Guided Trajectory Traversal in Disentangled Stable Diffusion Latent Space for Factorized Medical Image Generation

Authors: Zahra TehraniNasab, Amar Kumar, Tal Arbel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23623
Pdf URL: https://arxiv.org/pdf/2503.23623
Copy Paste: [[2503.23623]] Language-Guided Trajectory Traversal in Disentangled Stable Diffusion Latent Space for Factorized Medical Image Generation(https://arxiv.org/abs/2503.23623)
Keywords: diffusion, foundation model, generative
Abstract: Text-to-image diffusion models have demonstrated a remarkable ability to generate photorealistic images from natural language prompts. These high-resolution, language-guided synthesized images are essential for the explainability of disease or exploring causal relationships. However, their potential for disentangling and controlling latent factors of variation in specialized domains like medical imaging remains under-explored. In this work, we present the first investigation of the power of pre-trained vision-language foundation models, once fine-tuned on medical image datasets, to perform latent disentanglement for factorized medical image generation and interpolation. Through extensive experiments on chest X-ray and skin datasets, we illustrate that fine-tuned, language-guided Stable Diffusion inherently learns to factorize key attributes for image generation, such as the patient's anatomical structures or disease diagnostic features. We devise a framework to identify, isolate, and manipulate key attributes through latent space trajectory traversal of generative models, facilitating precise control over medical image synthesis.

Title: Expanding-and-Shrinking Binary Neural Networks

Authors: Xulong Shi, Caiyi Sun, Zhi Qi, Liu Hao, Xiaodong Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23709
Pdf URL: https://arxiv.org/pdf/2503.23709
Copy Paste: [[2503.23709]] Expanding-and-Shrinking Binary Neural Networks(https://arxiv.org/abs/2503.23709)
Keywords: diffusion, generative
Abstract: While binary neural networks (BNNs) offer significant benefits in terms of speed, memory and energy, they encounter substantial accuracy degradation in challenging tasks compared to their real-valued counterparts. Due to the binarization of weights and activations, the possible values of each entry in the feature maps generated by BNNs are strongly constrained. To tackle this limitation, we propose the expanding-and-shrinking operation, which enhances binary feature maps with a negligible increase in computational complexity, thereby strengthening the representation capacity. Extensive experiments conducted on multiple benchmarks reveal that our approach generalizes well across diverse applications ranging from image classification and object detection to generative diffusion models, while also achieving remarkable improvement over various leading binarization algorithms based on different architectures including both CNNs and Transformers.

Title: Effective Cloud Removal for Remote Sensing Images by an Improved Mean-Reverting Denoising Model with Elucidated Design Space

Authors: Yi Liu, Wengen Li, Jihong Guan, Shuigeng Zhou, Yichao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23717
Pdf URL: https://arxiv.org/pdf/2503.23717
Copy Paste: [[2503.23717]] Effective Cloud Removal for Remote Sensing Images by an Improved Mean-Reverting Denoising Model with Elucidated Design Space(https://arxiv.org/abs/2503.23717)
Keywords: diffusion, generative
Abstract: Cloud removal (CR) remains a challenging task in remote sensing image processing. Although diffusion models (DM) exhibit strong generative capabilities, their direct applications to CR are suboptimal, as they generate cloudless images from random noise, ignoring inherent information in cloudy inputs. To overcome this drawback, we develop a new CR model EMRDM based on mean-reverting diffusion models (MRDMs) to establish a direct diffusion process between cloudy and cloudless images. Compared to current MRDMs, EMRDM offers a modular framework with updatable modules and an elucidated design space, based on a reformulated forward process and a new ordinary differential equation (ODE)-based backward process. Leveraging our framework, we redesign key MRDM modules to boost CR performance, including restructuring the denoiser via a preconditioning technique, reorganizing the training process, and improving the sampling process by introducing deterministic and stochastic samplers. To achieve multi-temporal CR, we further develop a denoising network for simultaneously denoising sequential images. Experiments on mono-temporal and multi-temporal datasets demonstrate the superior performance of EMRDM. Our code is available at this https URL.

Title: KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language

Authors: Yoonshik Kim, Jaeyoon Jung
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.23730
Pdf URL: https://arxiv.org/pdf/2503.23730
Copy Paste: [[2503.23730]] KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language(https://arxiv.org/abs/2503.23730)
Keywords: generative
Abstract: The recent emergence of Large Vision-Language Models (VLMs) has resulted in a variety of different benchmarks for evaluating such models. Despite this, we observe that most existing evaluation methods suffer from the fact that they either require the model to choose from pre-determined responses, sacrificing open-endedness, or evaluate responses using a judge model, resulting in subjective and unreliable evaluation. In addition, we observe a lack of benchmarks for VLMs in the Korean language, which are necessary as a separate metric from more common English language benchmarks, as the performance of generative language models can differ significantly based on the language being used. Therefore, we present KOFFVQA, a general-purpose free-form visual question answering benchmark in the Korean language for the evaluation of VLMs. Our benchmark consists of 275 carefully crafted questions each paired with an image and grading criteria covering 10 different aspects of VLM performance. The grading criteria eliminate the problem of unreliability by allowing the judge model to grade each response based on a pre-determined set of rules. By defining the evaluation criteria in an objective manner, even a small open-source model can be used to evaluate models on our benchmark reliably. In addition to evaluating a large number of existing VLMs on our benchmark, we also experimentally verify that our method of using pre-existing grading criteria for evaluation is much more reliable than existing methods. Our evaluation code is available at this https URL

Title: Time-Series Forecasting via Topological Information Supervised Framework with Efficient Topological Feature Learning

Authors: ZiXin Lin, Nur Fariha Syaqina Zulkepli
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.23757
Pdf URL: https://arxiv.org/pdf/2503.23757
Copy Paste: [[2503.23757]] Time-Series Forecasting via Topological Information Supervised Framework with Efficient Topological Feature Learning(https://arxiv.org/abs/2503.23757)
Keywords: generative
Abstract: Topological Data Analysis (TDA) has emerged as a powerful tool for extracting meaningful features from complex data structures, driving significant advancements in fields such as neuroscience, biology, machine learning, and financial modeling. Despite its success, the integration of TDA with time-series prediction remains underexplored due to three primary challenges: the limited utilization of temporal dependencies within topological features, computational bottlenecks associated with persistent homology, and the deterministic nature of TDA pipelines restricting generalized feature learning. This study addresses these challenges by proposing the Topological Information Supervised (TIS) Prediction framework, which leverages neural networks and Conditional Generative Adversarial Networks (CGANs) to generate synthetic topological features, preserving their distribution while significantly reducing computational time. We propose a novel training strategy that integrates topological consistency loss to improve the predictive accuracy of deep learning models. Specifically, we introduce two state-of-the-art models, TIS-BiGRU and TIS-Informer, designed to capture short-term and long-term temporal dependencies, respectively. Comparative experimental results demonstrate the superior performance of TIS models over conventional predictors, validating the effectiveness of integrating topological information. This work not only advances TDA-based time-series prediction but also opens new avenues for utilizing topological features in deep learning architectures.

Title: Accelerating High-Efficiency Organic Photovoltaic Discovery via Pretrained Graph Neural Networks and Generative Reinforcement Learning

Authors: Jiangjie Qiu, Hou Hei Lam, Xiuyuan Hu, Wentao Li, Siwei Fu, Fankun Zeng, Hao Zhang, Xiaonan Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.23766
Pdf URL: https://arxiv.org/pdf/2503.23766
Copy Paste: [[2503.23766]] Accelerating High-Efficiency Organic Photovoltaic Discovery via Pretrained Graph Neural Networks and Generative Reinforcement Learning(https://arxiv.org/abs/2503.23766)
Keywords: generative
Abstract: Organic photovoltaic (OPV) materials offer a promising avenue toward cost-effective solar energy utilization. However, optimizing donor-acceptor (D-A) combinations to achieve high power conversion efficiency (PCE) remains a significant challenge. In this work, we propose a framework that integrates large-scale pretraining of graph neural networks (GNNs) with a GPT-2 (Generative Pretrained Transformer 2)-based reinforcement learning (RL) strategy to design OPV molecules with potentially high PCE. This approach produces candidate molecules with predicted efficiencies approaching 21%, although further experimental validation is required. Moreover, we conducted a preliminary fragment-level analysis to identify structural motifs recognized by the RL model that may contribute to enhanced PCE, thus providing design guidelines for the broader research community. To facilitate continued discovery, we are building the largest open-source OPV dataset to date, expected to include nearly 3,000 donor-acceptor pairs. Finally, we discuss plans to collaborate with experimental teams on synthesizing and characterizing AI-designed molecules, which will provide new data to refine and improve our predictive and generative models.

Title: WinoWhat: A Parallel Corpus of Paraphrased WinoGrande Sentences with Common Sense Categorization

Authors: Ine Gevers, Victor De Marez, Luna De Bruyne, Walter Daelemans
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23779
Pdf URL: https://arxiv.org/pdf/2503.23779
Copy Paste: [[2503.23779]] WinoWhat: A Parallel Corpus of Paraphrased WinoGrande Sentences with Common Sense Categorization(https://arxiv.org/abs/2503.23779)
Keywords: generative
Abstract: In this study, we take a closer look at how Winograd schema challenges can be used to evaluate common sense reasoning in LLMs. Specifically, we evaluate generative models of different sizes on the popular WinoGrande benchmark. We release WinoWhat, a new corpus, in which each instance of the WinoGrande validation set is paraphrased. Additionally, we evaluate the performance on the challenge across five common sense knowledge categories, giving more fine-grained insights on what types of knowledge are more challenging for LLMs. Surprisingly, all models perform significantly worse on WinoWhat, implying that LLM reasoning capabilities are overestimated on WinoGrande. To verify whether this is an effect of benchmark memorization, we match benchmark instances to LLM training data and create two test suites. We observe that memorization has a minimal effect on model performance on WinoGrande.

Title: MGD-SAM2: Multi-view Guided Detail-enhanced Segment Anything Model 2 for High-Resolution Class-agnostic Segmentation

Authors: Haoran Shen, Peixian Zhuang, Jiahao Kou, Yuxin Zeng, Haoying Xu, Jiangyun Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23786
Pdf URL: https://arxiv.org/pdf/2503.23786
Copy Paste: [[2503.23786]] MGD-SAM2: Multi-view Guided Detail-enhanced Segment Anything Model 2 for High-Resolution Class-agnostic Segmentation(https://arxiv.org/abs/2503.23786)
Keywords: foundation model
Abstract: Segment Anything Models (SAMs), as vision foundation models, have demonstrated remarkable performance across various image analysis tasks. Despite their strong generalization capabilities, SAMs encounter challenges in fine-grained detail segmentation for high-resolution class-independent segmentation (HRCS), due to the limitations in the direct processing of high-resolution inputs and low-resolution mask predictions, and the reliance on accurate manual prompts. To address these limitations, we propose MGD-SAM2 which integrates SAM2 with multi-view feature interaction between a global image and local patches to achieve precise segmentation. MGD-SAM2 incorporates the pre-trained SAM2 with four novel modules: the Multi-view Perception Adapter (MPAdapter), the Multi-view Complementary Enhancement Module (MCEM), the Hierarchical Multi-view Interaction Module (HMIM), and the Detail Refinement Module (DRM). Specifically, we first introduce MPAdapter to adapt the SAM2 encoder for enhanced extraction of local details and global semantics in HRCS images. Then, MCEM and HMIM are proposed to further exploit local texture and global context by aggregating multi-view features within and across multi-scales. Finally, DRM is designed to generate gradually restored high-resolution mask predictions, compensating for the loss of fine-grained details resulting from directly upsampling the low-resolution prediction maps. Experimental results demonstrate the superior performance and strong generalization of our model on multiple high-resolution and normal-resolution datasets. Code will be available at this https URL.

Title: On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices

Authors: Bosung Kim, Kyuhwan Lee, Isu Jeong, Jungmin Cheon, Yeojin Lee, Seulki Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23796
Pdf URL: https://arxiv.org/pdf/2503.23796
Copy Paste: [[2503.23796]] On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices(https://arxiv.org/abs/2503.23796)
Keywords: diffusion, generative
Abstract: We present On-device Sora, the first model training-free solution for diffusion-based on-device text-to-video generation that operates efficiently on smartphone-grade devices. To address the challenges of diffusion-based text-to-video generation on computation- and memory-limited mobile devices, the proposed On-device Sora applies three novel techniques to pre-trained video generative models. First, Linear Proportional Leap (LPL) reduces the excessive denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) minimizes intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension. Third, Concurrent Inference with Dynamic Loading (CI-DL) dynamically partitions large models into smaller blocks and loads them into memory for concurrent model inference, effectively addressing the challenges of limited device memory. We implement On-device Sora on the iPhone 15 Pro, and the experimental evaluations show that it is capable of generating high-quality videos on the device, comparable to those produced by high-end GPUs. These results show that On-device Sora enables efficient and high-quality video generation on resource-constrained mobile devices. We envision the proposed On-device Sora as a significant first step toward democratizing state-of-the-art generative technologies, enabling video generation on commodity mobile and embedded devices without resource-intensive re-training for model optimization (compression). The code implementation is available at a GitHub repository (this https URL).

Title: An extension of linear self-attention for in-context learning

Authors: Katsuyuki Hagiwara
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.23814
Pdf URL: https://arxiv.org/pdf/2503.23814
Copy Paste: [[2503.23814]] An extension of linear self-attention for in-context learning(https://arxiv.org/abs/2503.23814)
Keywords: in-context
Abstract: In-context learning is a remarkable property of transformers and has been the focus of recent research. An attention mechanism is a key component in transformers, in which an attention matrix encodes relationships between words in a sentence and is used as weights for words in a sentence. This mechanism is effective for capturing language representations. However, it is questionable whether naive self-attention is suitable for in-context learning in general tasks, since the computation implemented by self-attention is somewhat restrictive in terms of matrix multiplication. In fact, we may need appropriate input form designs when considering heuristic implementations of computational algorithms. In this paper, in the case of linear self-attention, we extend it by introducing a bias matrix in addition to a weight matrix for an input. Despite the simple extension, the extended linear self-attention can output any constant matrix, the input matrix, and products of two or three matrices in the input. Note that the second property implies that it can act as a skip connection. Therefore, flexible matrix manipulations can be implemented by connecting the extended linear self-attention components. As an example of implementation using the extended linear self-attention, we show a heuristic construction of a batch-type gradient descent of ridge regression under a reasonable input form.
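
For orientation, the target algorithm named at the end of this abstract, batch gradient descent for ridge regression, can be sketched in plain Python as follows. This is only a reference version of the algorithm itself, not the paper's attention-based construction; the function name, the objective scaling (0.5 factors), and the step size are assumptions made for illustration.

```python
def ridge_gradient_descent(X, y, lam, lr, steps):
    """Full-batch gradient descent on 0.5*||Xw - y||^2 + 0.5*lam*||w||^2.

    X is a list of feature rows, y a list of targets. The objective
    scaling and the hyperparameters are illustrative assumptions.
    """
    n_features = len(X[0])
    w = [0.0] * n_features
    for _ in range(steps):
        # Residuals r_i = x_i . w - y_i over the whole batch.
        residuals = [
            sum(xj * wj for xj, wj in zip(row, w)) - yi
            for row, yi in zip(X, y)
        ]
        # Gradient of the objective: X^T r + lam * w.
        grad = [
            sum(row[j] * r for row, r in zip(X, residuals)) + lam * w[j]
            for j in range(n_features)
        ]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# One-feature example: the closed-form ridge solution is
# sum(x*y) / (sum(x*x) + lam) = 28 / 15.
w = ridge_gradient_descent([[1.0], [2.0], [3.0]], [2.0, 4.0, 6.0],
                           lam=1.0, lr=0.01, steps=500)
```

Each update uses only matrix products of the inputs and the current weights, which is the kind of computation the abstract says the extended linear self-attention can express.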

Title: Conformal uncertainty quantification to evaluate predictive fairness of foundation AI model for skin lesion classes across patient demographics

Authors: Swarnava Bhattacharyya, Umapada Pal, Tapabrata Chakraborti
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.23819
Pdf URL: https://arxiv.org/pdf/2503.23819
Copy Paste: [[2503.23819]] Conformal uncertainty quantification to evaluate predictive fairness of foundation AI model for skin lesion classes across patient demographics(https://arxiv.org/abs/2503.23819)
Keywords: foundation model
Abstract: Deep-learning-based diagnostic AI systems that operate on medical images are starting to provide performance similar to that of human experts. However, these data-hungry, complex systems are inherently black boxes and therefore slow to be adopted for high-risk applications like healthcare. This lack of transparency is exacerbated in the case of recent large foundation models, which are trained in a self-supervised manner on millions of data points to provide robust generalisation across a range of downstream tasks, but whose embeddings are generated through a process that is not interpretable, and hence not easily trustable for clinical applications. To address this timely issue, we deploy conformal analysis to quantify the predictive uncertainty of a vision transformer (ViT) based foundation model across patient demographics with respect to sex, age and ethnicity for the task of skin lesion classification using several public benchmark datasets. The significant advantage of this method is that conformal analysis is method independent and it not only provides a coverage guarantee at population level but also provides an uncertainty score for each individual. We used a model-agnostic dynamic F1-score-based sampling during model training, which helped to stabilize the class imbalance, and we investigate the effects on uncertainty quantification (UQ) with or without this bias mitigation step. Thus we show how this can be used as a fairness metric to evaluate the robustness of the feature embeddings of the foundation model (Google DermFoundation) and thus advance the trustworthiness and fairness of clinical AI.
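
As a rough illustration of the conformal analysis mentioned above, the following sketch shows standard split conformal prediction for classification. The abstract does not state which conformal variant or nonconformity score the authors use, so the score (one minus the probability of a class), the function names, and the class labels here are assumptions.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Finite-sample quantile of calibration nonconformity scores.

    Each calibration score is assumed to be 1 - p(true class).
    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or
    infinity when the calibration set is too small for that level.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(calibration_scores)[k - 1]


def prediction_set(class_probs, threshold):
    """All labels whose nonconformity score 1 - p is within the threshold."""
    return {label for label, p in class_probs.items() if 1.0 - p <= threshold}


# Hypothetical calibration scores and class probabilities.
qhat = conformal_threshold([0.7, 0.1, 0.5, 0.3, 0.2, 0.6, 0.4], alpha=0.25)
labels = prediction_set({"nevus": 0.55, "melanoma": 0.42, "other": 0.03}, qhat)
```

The size of the prediction set acts as a per-individual uncertainty score; comparing set sizes or empirical coverage across sex, age, or ethnicity subgroups is one way such a procedure could serve as a fairness check.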

Title: Expanding RL with Verifiable Rewards Across Diverse Domains

Authors: Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, Zhaopeng Tu, Min Zhang, Dong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.23829
Pdf URL: https://arxiv.org/pdf/2503.23829
Copy Paste: [[2503.23829]] Expanding RL with Verifiable Rewards Across Diverse Domains(https://arxiv.org/abs/2503.23829)
Keywords: generative
Abstract: Reinforcement learning (RL) with verifiable rewards (RLVR) has shown promising results in mathematical reasoning and coding tasks where well-structured reference answers are available. However, its applicability to broader domains remains underexplored. In this work, we study the extension of RLVR to more diverse domains such as medicine, chemistry, psychology, and economics. We observe high agreement in binary judgments across different large language models (LLMs) when objective reference answers exist, which challenges the necessity of large-scale annotation for training domain-specific reward models. To address the limitations of binary rewards when handling unstructured reference answers, we further incorporate model-based soft scoring into RLVR to improve its flexibility. Our experiments show that a distilled generative reward model can serve as an effective cross-domain verifier, providing reliable reward signals for RL without requiring domain-specific annotations. By fine-tuning a base 7B model using various RL algorithms against our reward model, we obtain policies that outperform state-of-the-art open-source aligned LLMs such as Qwen2.5-72B-Instruct and DeepSeek-R1-Distill-Qwen-32B by a large margin, across domains in free-form answer settings. This also strengthens RLVR's robustness and scalability, highlighting its potential for real-world applications with noisy or weak labels.

Title: FlexiMo: A Flexible Remote Sensing Foundation Model

Authors: Xuyang Li, Chenyu Li, Pedram Ghamisi, Danfeng Hong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23844
Pdf URL: https://arxiv.org/pdf/2503.23844
Copy Paste: [[2503.23844]] FlexiMo: A Flexible Remote Sensing Foundation Model(https://arxiv.org/abs/2503.23844)
Keywords: foundation model
Abstract: The rapid expansion of multi-source satellite imagery drives innovation in Earth observation, opening unprecedented opportunities for Remote Sensing Foundation Models to harness diverse data. However, many existing models remain constrained by fixed spatial resolutions and patch sizes, limiting their ability to fully exploit the heterogeneous spatial characteristics inherent in satellite imagery. To address these challenges, we propose FlexiMo, a flexible remote sensing foundation model that endows the pre-trained model with the flexibility to adapt to arbitrary spatial resolutions. Central to FlexiMo is a spatial resolution-aware module that employs a parameter-free alignment embedding mechanism to dynamically recalibrate patch embeddings based on the input image's resolution and dimensions. This design not only preserves critical token characteristics and ensures multi-scale feature fidelity but also enables efficient feature extraction without requiring modifications to the underlying network architecture. In addition, FlexiMo incorporates a lightweight channel adaptation module that leverages prior spectral information from sensors. This mechanism allows the model to process images with varying numbers of channels while maintaining the data's intrinsic physical properties. Extensive experiments on diverse multimodal, multi-resolution, and multi-scale datasets demonstrate that FlexiMo significantly enhances model generalization and robustness. In particular, our method achieves outstanding performance across a range of downstream tasks, including scene classification, land cover classification, urban building segmentation, and cloud detection. By enabling parameter-efficient and physically consistent adaptation, FlexiMo paves the way for more adaptable and effective foundation models in real-world remote sensing applications.

Title: Communication-Efficient and Personalized Federated Foundation Model Fine-Tuning via Tri-Matrix Adaptation

Authors: Yongle Li, Bo Liu, Sheng Huang, ZHeng ZHang, Xiaotong Yuan, Richang Hong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.23869
Pdf URL: https://arxiv.org/pdf/2503.23869
Copy Paste: [[2503.23869]] Communication-Efficient and Personalized Federated Foundation Model Fine-Tuning via Tri-Matrix Adaptation(https://arxiv.org/abs/2503.23869)
Keywords: foundation model
Abstract: In federated learning, fine-tuning pre-trained foundation models poses significant challenges, particularly regarding high communication cost and suboptimal model performance due to data heterogeneity between the clients. To address these issues, this paper introduces communication-efficient federated LoRA adaption (CE-LoRA), a method that employs a tri-factorization low-rank adaptation approach with personalized model parameter aggregation. We first present a novel LoRA parameter factorization by introducing a small-size dense matrix, which can significantly reduce the communication cost while achieving empirical performance comparable to transferring the low-rank parameter matrices used by existing methods. Without violating data privacy, the server considers the client similarity in both training dataset and model parameter space, and learns personalized weights for model aggregation. Our experiments on various LLM and VLM fine-tuning tasks demonstrate that CE-LoRA not only significantly reduces communication overhead but also improves performance under non-independent and identically distributed (non-IID) data conditions. In addition, CE-LoRA improves data privacy protection, effectively mitigating gradient-based data reconstruction attacks.
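
To make the communication argument concrete, the sketch below counts the parameters a client would upload per adapted weight matrix under two schemes. It assumes, as an illustration only, that standard LoRA uploads both low-rank factors (shapes d_out x r and r x d_in) and that the tri-matrix scheme uploads only the small r x r dense matrix; the abstract does not spell out the exact protocol, and the function names and layer sizes are hypothetical.

```python
def lora_upload_params(d_out, d_in, rank):
    """Parameters uploaded when both LoRA factors are sent.

    The two factors have shapes (d_out, rank) and (rank, d_in).
    """
    return rank * (d_out + d_in)


def tri_matrix_upload_params(rank):
    """Parameters uploaded when only a (rank, rank) dense matrix is sent.

    Assumption: the outer factors stay on the client.
    """
    return rank * rank


# Hypothetical layer size: a 4096 x 4096 projection adapted at rank 8.
full = lora_upload_params(4096, 4096, 8)
small = tri_matrix_upload_params(8)
ratio = full // small
```

Under these assumptions the upload size no longer depends on the layer width, only on the chosen rank.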

Title: ExScene: Free-View 3D Scene Reconstruction with Gaussian Splatting from a Single Image

Authors: Tianyi Gong, Boyan Li, Yifei Zhong, Fangxin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23881
Pdf URL: https://arxiv.org/pdf/2503.23881
Copy Paste: [[2503.23881]] ExScene: Free-View 3D Scene Reconstruction with Gaussian Splatting from a Single Image(https://arxiv.org/abs/2503.23881)
Keywords: diffusion
Abstract: The increasing demand for augmented and virtual reality applications has highlighted the importance of crafting immersive 3D scenes from a simple single-view image. However, due to the partial priors provided by single-view input, existing methods are often limited to reconstructing low-consistency 3D scenes with narrow fields of view. These limitations make them less capable of generalizing to reconstruct immersive scenes. To address this problem, we propose ExScene, a two-stage pipeline to reconstruct an immersive 3D scene from any given single-view image. ExScene designs a novel multimodal diffusion model to generate a high-fidelity and globally consistent panoramic image. We then develop a panoramic depth estimation approach to calculate geometric information from the panorama, and we combine the geometric information with the high-fidelity panoramic image to train an initial 3D Gaussian Splatting (3DGS) model. Following this, we introduce a GS refinement technique with 2D stable video diffusion priors. We add camera trajectory consistency and color-geometric priors into the denoising process of diffusion to improve color and spatial consistency across image sequences. These refined sequences are then used to fine-tune the initial 3DGS model, leading to better reconstruction quality. Experimental results demonstrate that our ExScene achieves consistent and immersive scene reconstruction using only single-view input, significantly surpassing state-of-the-art baselines.

Title: MuseFace: Text-driven Face Editing via Diffusion-based Mask Generation Approach

Authors: Xin Zhang, Siting Huang, Xiangyang Luo, Yifan Xie, Weijiang Yu, Heng Chang, Fei Ma, Fei Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23888
Pdf URL: https://arxiv.org/pdf/2503.23888
Copy Paste: [[2503.23888]] MuseFace: Text-driven Face Editing via Diffusion-based Mask Generation Approach(https://arxiv.org/abs/2503.23888)
Keywords: diffusion
Abstract: Face editing modifies the appearance of a face and plays a key role in the customization and enhancement of personal images. Although many works have achieved remarkable success in text-driven face editing, they still face significant challenges, as none of them simultaneously fulfills the characteristics of diversity, controllability and flexibility. To address this challenge, we propose MuseFace, a text-driven face editing framework, which relies solely on a text prompt to enable face editing. Specifically, MuseFace integrates a Text-to-Mask diffusion model and a semantic-aware face editing model, capable of directly generating fine-grained semantic masks from text and performing face editing. The Text-to-Mask diffusion model provides diversity and flexibility to the framework, while the semantic-aware face editing model ensures controllability of the framework. Our framework can create fine-grained semantic masks, making precise face editing possible, and significantly enhancing the controllability and flexibility of face editing models. Extensive experiments demonstrate that MuseFace achieves superior high-fidelity performance.

Title: DiffScale: Continuous Downscaling and Bias Correction of Subseasonal Wind Speed Forecasts using Diffusion Models

Authors: Maximilian Springenberg, Noelia Otero, Yuxin Xue, Jackie Ma
Subjects: cs.LG, cs.AI, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.23893
Pdf URL: https://arxiv.org/pdf/2503.23893
Copy Paste: [[2503.23893]] DiffScale: Continuous Downscaling and Bias Correction of Subseasonal Wind Speed Forecasts using Diffusion Models(https://arxiv.org/abs/2503.23893)
Keywords: diffusion, generative
Abstract: Renewable resources are strongly dependent on local and large-scale weather situations. Skillful subseasonal to seasonal (S2S) forecasts -- beyond two weeks and up to two months -- can offer significant socioeconomic advantages to the energy sector. This study aims to enhance wind speed predictions using a diffusion model with classifier-free guidance to downscale S2S forecasts of surface wind speed. We propose DiffScale, a diffusion model that super-resolves spatial information for continuous downscaling factors and lead times. Leveraging weather priors as guidance for the generative process of diffusion models, we adopt the perspective of conditional probabilities on sampling super-resolved S2S forecasts. We aim to directly estimate the density associated with the target S2S forecasts at different spatial resolutions and lead times without auto-regression or sequence prediction, resulting in an efficient and flexible model. Synthetic experiments were designed to super-resolve wind speed S2S forecasts from the European Centre for Medium-Range Weather Forecasts (ECMWF) from a coarse resolution to the finer resolution of ERA5 reanalysis data, which serves as a high-resolution target. The innovative aspect of DiffScale lies in its flexibility to downscale by arbitrary scaling factors, enabling it to generalize across various grid resolutions and lead times, without retraining the model, while correcting model errors, making it a versatile tool for improving S2S wind speed forecasts. We achieve a significant improvement in prediction quality, outperforming baselines up to week 3.
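
The classifier-free guidance named in this abstract is commonly implemented by blending a conditional and an unconditional noise prediction at each denoising step. The sketch below uses one widespread convention, eps = eps_uncond + s * (eps_cond - eps_uncond); whether DiffScale uses this exact form or scale convention is not stated in the abstract, so the function name, flat-list representation, and example values are illustrative assumptions.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and conditional noise predictions elementwise.

    guidance_scale = 0 returns the unconditional prediction, 1 returns the
    conditional one, and values above 1 extrapolate toward the condition.
    Predictions are flat lists of floats here for simplicity.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical two-pixel predictions with a guidance scale of 2.
guided = classifier_free_guidance([0.0, 0.0], [1.0, 2.0], guidance_scale=2.0)
```

In a downscaling setting the condition would plausibly be the coarse forecast and any weather priors, with the unconditional branch obtained by dropping that conditioning.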

Title: Training-Free Text-Guided Image Editing with Visual Autoregressive Model

Authors: Yufei Wang, Lanqing Guo, Zhihao Li, Jiaxing Huang, Pichao Wang, Bihan Wen, Jian Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23897
Pdf URL: https://arxiv.org/pdf/2503.23897
Copy Paste: [[2503.23897]] Training-Free Text-Guided Image Editing with Visual Autoregressive Model(https://arxiv.org/abs/2503.23897)
Keywords: diffusion
Abstract: Text-guided image editing is an essential task that enables users to modify images through natural language descriptions. Recent advances in diffusion models and rectified flows have significantly improved editing quality, primarily relying on inversion techniques to extract structured noise from input images. However, inaccuracies in inversion can propagate errors, leading to unintended modifications and compromising fidelity. Moreover, even with perfect inversion, the entanglement between textual prompts and image features often results in global changes when only local edits are intended. To address these challenges, we propose a novel text-guided image editing framework based on VAR (Visual AutoRegressive modeling), which eliminates the need for explicit inversion while ensuring precise and controlled modifications. Our method introduces a caching mechanism that stores token indices and probability distributions from the original image, capturing the relationship between the source prompt and the image. Using this cache, we design an adaptive fine-grained masking strategy that dynamically identifies and constrains modifications to relevant regions, preventing unintended changes. A token reassembling approach further refines the editing process, enhancing diversity, fidelity, and control. Our framework operates in a training-free manner and achieves high-fidelity editing with faster inference speeds, processing a 1K resolution image in as little as 1.2 seconds. Extensive experiments demonstrate that our method achieves performance comparable to, or even surpassing, existing diffusion- and rectified flow-based approaches in both quantitative metrics and visual quality. The code will be released.

Title: HumanAesExpert: Advancing a Multi-Modality Foundation Model for Human Image Aesthetic Assessment

Authors: Zhichao Liao, Xiaokun Liu, Wenyu Qin, Qingyu Li, Qiulin Wang, Pengfei Wan, Di Zhang, Long Zeng, Pingfa Feng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23907
Pdf URL: https://arxiv.org/pdf/2503.23907
Copy Paste: [[2503.23907]] HumanAesExpert: Advancing a Multi-Modality Foundation Model for Human Image Aesthetic Assessment(https://arxiv.org/abs/2503.23907)
Keywords: foundation model
Abstract: Image Aesthetic Assessment (IAA) is a long-standing and challenging research task. However, its subset, Human Image Aesthetic Assessment (HIAA), has been scarcely explored, even though HIAA is widely used in social media, AI workflows, and related domains. To bridge this research gap, our work pioneers a holistic implementation framework tailored for HIAA. Specifically, we introduce HumanBeauty, the first dataset purpose-built for HIAA, which comprises 108K high-quality human images with manual annotations. To achieve comprehensive and fine-grained HIAA, 50K human images are manually collected through a rigorous curation process and annotated leveraging our trailblazing 12-dimensional aesthetic standard, while the remaining 58K with overall aesthetic labels are systematically filtered from public datasets. Based on the HumanBeauty database, we propose HumanAesExpert, a powerful Vision Language Model for aesthetic evaluation of human images. We innovatively design an Expert head to incorporate human knowledge of aesthetic sub-dimensions while jointly utilizing the Language Modeling (LM) and Regression heads. This approach empowers our model to achieve superior proficiency in both overall and fine-grained HIAA. Furthermore, we introduce a MetaVoter, which aggregates scores from all three heads, to effectively balance the capabilities of each head, thereby realizing improved assessment precision. Extensive experiments demonstrate that our HumanAesExpert models deliver significantly better performance in HIAA than other state-of-the-art models. Our datasets, models, and code are publicly released to advance the HIAA community. Project webpage: this https URL

Title: Green MLOps to Green GenOps: An Empirical Study of Energy Consumption in Discriminative and Generative AI Operations

Authors: Adrián Sánchez-Mompó, Ioannis Mavromatis, Peizheng Li, Konstantinos Katsaros, Aftab Khan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23934
Pdf URL: https://arxiv.org/pdf/2503.23934
Copy Paste: [[2503.23934]] Green MLOps to Green GenOps: An Empirical Study of Energy Consumption in Discriminative and Generative AI Operations(https://arxiv.org/abs/2503.23934)
Keywords: generative
Abstract: This study presents an empirical investigation into the energy consumption of Discriminative and Generative AI models within real-world MLOps pipelines. For Discriminative models, we examine various architectures and hyperparameters during training and inference and identify energy-efficient practices. For Generative AI, Large Language Models (LLMs) are assessed, focusing primarily on energy consumption across different model sizes and varying service requests. Our study employs software-based power measurements, ensuring ease of replication across diverse configurations, models, and datasets. We analyse multiple models and hardware setups to uncover correlations among various metrics, identifying key contributors to energy consumption. The results indicate that for Discriminative models, optimising architectures, hyperparameters, and hardware can significantly reduce energy consumption without sacrificing performance. For LLMs, energy efficiency depends on balancing model size, reasoning complexity, and request-handling capacity, as larger models do not necessarily consume more energy when utilisation remains low. This analysis provides practical guidelines for designing green and sustainable ML operations, emphasising energy consumption and carbon footprint reductions while maintaining performance. This paper can serve as a benchmark for accurately estimating total energy use across different types of AI models.

Title: JointTuner: Appearance-Motion Adaptive Joint Training for Customized Video Generation

Authors: Fangda Chen, Shanshan Zhao, Chuanfu Xu, Long Lan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.23951
Pdf URL: https://arxiv.org/pdf/2503.23951
Copy Paste: [[2503.23951]] JointTuner: Appearance-Motion Adaptive Joint Training for Customized Video Generation(https://arxiv.org/abs/2503.23951)
Keywords: diffusion
Abstract: Recent text-to-video advancements have enabled coherent video synthesis from prompts and expanded to fine-grained control over appearance and motion. However, existing methods either suffer from concept interference due to feature domain mismatch caused by naive decoupled optimizations or exhibit appearance contamination induced by spatial feature leakage resulting from the entanglement of motion and appearance in reference video reconstructions. In this paper, we propose JointTuner, a novel adaptive joint training framework, to alleviate these issues. Specifically, we develop Adaptive LoRA, which incorporates a context-aware gating mechanism, and integrate the gated LoRA components into the spatial and temporal Transformers within the diffusion model. These components enable simultaneous optimization of appearance and motion, eliminating concept interference. In addition, we introduce the Appearance-independent Temporal Loss, which decouples motion patterns from intrinsic appearance in reference video reconstructions through an appearance-agnostic noise prediction task. The key innovation lies in adding frame-wise offset noise to the ground-truth Gaussian noise, perturbing its distribution, thereby disrupting spatial attributes associated with frames while preserving temporal coherence. Furthermore, we construct a benchmark comprising 90 appearance-motion customized combinations and 10 multi-type automatic metrics across four dimensions, facilitating a more comprehensive evaluation for this customization task. Extensive experiments demonstrate the superior performance of our method compared to current advanced approaches.

Title: SALT: A Flexible Semi-Automatic Labeling Tool for General LiDAR Point Clouds with Cross-Scene Adaptability and 4D Consistency

Authors: Yanbo Wang, Yongtao Chen, Chuan Cao, Tianchen Deng, Wentao Zhao, Jingchuan Wang, Weidong Chen
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.23980
Pdf URL: https://arxiv.org/pdf/2503.23980
Copy Paste: [[2503.23980]] SALT: A Flexible Semi-Automatic Labeling Tool for General LiDAR Point Clouds with Cross-Scene Adaptability and 4D Consistency(https://arxiv.org/abs/2503.23980)
Keywords: foundation model
Abstract: We propose a flexible Semi-Automatic Labeling Tool (SALT) for general LiDAR point clouds with cross-scene adaptability and 4D consistency. Unlike recent approaches that rely on camera distillation, SALT operates directly on raw LiDAR data, automatically generating pre-segmentation results. To achieve this, we propose a novel zero-shot learning paradigm, termed data alignment, which transforms LiDAR data into pseudo-images by aligning with the training distribution of vision foundation models. Additionally, we design a 4D-consistent prompting strategy and 4D non-maximum suppression module to enhance SAM2, ensuring high-quality, temporally consistent presegmentation. SALT surpasses the latest zero-shot methods by 18.4% PQ on SemanticKITTI and achieves nearly 40-50% of human annotator performance on our newly collected low-resolution LiDAR data and on combined data from three LiDAR types, significantly boosting annotation efficiency. We anticipate that SALT's open-sourcing will catalyze substantial expansion of current LiDAR datasets and lay the groundwork for the future development of LiDAR foundation models. Code is available at this https URL.

Title: Federated Structured Sparse PCA for Anomaly Detection in IoT Networks

Authors: Chenyi Huang, Xinrong Li, Xianchao Xiu
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2503.23981
Pdf URL: https://arxiv.org/pdf/2503.23981
Copy Paste: [[2503.23981]] Federated Structured Sparse PCA for Anomaly Detection in IoT Networks(https://arxiv.org/abs/2503.23981)
Keywords: anomaly
Abstract: Although federated learning has gained prominence as a privacy-preserving framework tailored for distributed Internet of Things (IoT) environments, current federated principal component analysis (PCA) methods lack integration of sparsity, a critical feature for robust anomaly detection. To address this limitation, we propose a novel federated structured sparse PCA (FedSSP) approach for anomaly detection in IoT networks. The proposed model uniquely integrates double sparsity regularization: (1) row-wise sparsity governed by $\ell_{2,p}$-norm with $p\in[0,1)$ to eliminate redundant feature dimensions, and (2) element-wise sparsity via $\ell_{q}$-norm with $q\in[0,1)$ to suppress noise-sensitive components. To efficiently solve this non-convex optimization problem in a distributed setting, we devise a proximal alternating minimization (PAM) algorithm with rigorous theoretical proofs establishing its convergence guarantees. Experiments on real datasets validate that incorporating structured sparsity enhances both model interpretability and detection accuracy.
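
Illustrative note: the abstract names the two regularizers but gives no formulas. The sketch below assumes the common convention in which the $\ell_{2,p}$ penalty is the sum over rows of the Euclidean row norm raised to the power p, and the $\ell_{q}$ penalty is the sum of absolute entries raised to the power q, with p = 0 or q = 0 counting nonzero rows or entries. The function names are hypothetical and this is not the authors' implementation.

```python
import math

def l2p_row_penalty(W, p):
    """Row-wise penalty: sum of (Euclidean row norm)**p; p == 0 counts nonzero rows."""
    total = 0.0
    for row in W:
        norm = math.sqrt(sum(x * x for x in row))
        if p == 0:
            total += 1.0 if norm > 0 else 0.0
        else:
            total += norm ** p
    return total

def lq_element_penalty(W, q):
    """Element-wise penalty: sum of |entry|**q; q == 0 counts nonzero entries."""
    total = 0.0
    for row in W:
        for x in row:
            if q == 0:
                total += 1.0 if x != 0 else 0.0
            else:
                total += abs(x) ** q
    return total

# Toy 3x2 matrix (assumed data): the second row is entirely zero.
W = [[3.0, 4.0], [0.0, 0.0], [0.0, 1.0]]
print(l2p_row_penalty(W, 0.5), lq_element_penalty(W, 0.5))
```

Under this convention an all-zero row contributes nothing to the row-wise term, which is why that term favors dropping whole feature dimensions, while the element-wise term acts on individual entries.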

Title: DenseFormer: Learning Dense Depth Map from Sparse Depth and Image via Conditional Diffusion Model

Authors: Ming Yuan, Sichao Wang, Chuang Zhang, Lei He, Qing Xu, Jianqiang Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.23993
Pdf URL: https://arxiv.org/pdf/2503.23993
Copy Paste: [[2503.23993]] DenseFormer: Learning Dense Depth Map from Sparse Depth and Image via Conditional Diffusion Model(https://arxiv.org/abs/2503.23993)
Keywords: diffusion
Abstract: The depth completion task is a critical problem in autonomous driving, involving the generation of dense depth maps from sparse depth maps and RGB images. Most existing methods employ a spatial propagation network to iteratively refine the depth map after obtaining an initial dense depth. In this paper, we propose DenseFormer, a novel method that integrates the diffusion model into the depth completion task. By incorporating the denoising mechanism of the diffusion model, DenseFormer generates the dense depth map by progressively refining an initial random depth distribution through multiple iterations. We propose a feature extraction module that leverages a feature pyramid structure, along with multi-layer deformable attention, to effectively extract and integrate features from sparse depth maps and RGB images, which serve as the guiding condition for the diffusion process. Additionally, this paper presents a depth refinement module that applies multi-step iterative refinement across various ranges to the dense depth results generated by the diffusion process. The module utilizes image features enriched with multi-scale information and sparse depth input to further enhance the accuracy of the predicted depth map. Extensive experiments on the KITTI outdoor scene dataset demonstrate that DenseFormer outperforms classical depth completion methods.

Title: AMMSM: Adaptive Motion Magnification and Sparse Mamba for Micro-Expression Recognition

Authors: Xuxiong Liu, Tengteng Dong, Fei Wang, Weijie Feng, Xiao Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.24057
Pdf URL: https://arxiv.org/pdf/2503.24057
Copy Paste: [[2503.24057]] AMMSM: Adaptive Motion Magnification and Sparse Mamba for Micro-Expression Recognition(https://arxiv.org/abs/2503.24057)
Keywords: self-supervised
Abstract: Micro-expressions are typically regarded as unconscious manifestations of a person's genuine emotions. However, their short duration and subtle signals pose significant challenges for downstream recognition. We propose a multi-task learning framework named the Adaptive Motion Magnification and Sparse Mamba (AMMSM) to address this. This framework aims to enhance the accurate capture of micro-expressions through self-supervised subtle motion magnification, while the sparse spatial selection Mamba architecture combines sparse activation with the advanced Visual Mamba model to model key motion regions and their valuable representations more effectively. Additionally, we employ evolutionary search to optimize the magnification factor and the sparsity ratios of spatial selection, followed by fine-tuning to improve performance further. Extensive experiments on two standard datasets demonstrate that the proposed AMMSM achieves state-of-the-art (SOTA) accuracy and robustness.

Title: A Plasticity-Aware Method for Continual Self-Supervised Learning in Remote Sensing

Authors: Lars Möllenbrok, Behnood Rasti, Begüm Demir
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.24088
Pdf URL: https://arxiv.org/pdf/2503.24088
Copy Paste: [[2503.24088]] A Plasticity-Aware Method for Continual Self-Supervised Learning in Remote Sensing(https://arxiv.org/abs/2503.24088)
Keywords: self-supervised
Abstract: Continual self-supervised learning (CSSL) methods have gained increasing attention in remote sensing (RS) due to their capability to learn new tasks sequentially from continuous streams of unlabeled data. Existing CSSL methods, while learning new tasks, focus on preventing catastrophic forgetting. To this end, most of them use regularization strategies to retain knowledge of previous tasks. This reduces the model's ability to adapt to the data of new tasks (i.e., learning plasticity), which can degrade performance. To address this problem, in this paper, we propose a novel CSSL method that aims to learn tasks sequentially, while achieving high learning plasticity. To this end, the proposed method uses a knowledge distillation strategy with an integrated decoupling mechanism. The decoupling is achieved by first dividing the feature dimensions into task-common and task-specific parts. Then, the task-common features are forced to be correlated to ensure memory stability while the task-specific features are forced to be de-correlated facilitating the learning of new features. Experimental results show the effectiveness of the proposed method compared to CaSSLe, which is a widely used CSSL framework, with improvements of up to 1.12% in average accuracy and 2.33% in intransigence in a task-incremental scenario, and 1.24% in average accuracy and 2.01% in intransigence in a class-incremental scenario.

Title: PolypSegTrack: Unified Foundation Model for Colonoscopy Video Analysis

Authors: Anwesa Choudhuri, Zhongpai Gao, Meng Zheng, Benjamin Planche, Terrence Chen, Ziyan Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.24108
Pdf URL: https://arxiv.org/pdf/2503.24108
Copy Paste: [[2503.24108]] PolypSegTrack: Unified Foundation Model for Colonoscopy Video Analysis(https://arxiv.org/abs/2503.24108)
Keywords: foundation model
Abstract: Early detection, accurate segmentation, classification and tracking of polyps during colonoscopy are critical for preventing colorectal cancer. Many existing deep-learning-based methods for analyzing colonoscopic videos either require task-specific fine-tuning, lack tracking capabilities, or rely on domain-specific pre-training. In this paper, we introduce PolypSegTrack, a novel foundation model that jointly addresses polyp detection, segmentation, classification and unsupervised tracking in colonoscopic videos. Our approach leverages a novel conditional mask loss, enabling flexible training across datasets with either pixel-level segmentation masks or bounding box annotations, allowing us to bypass task-specific fine-tuning. Our unsupervised tracking module reliably associates polyp instances across frames using object queries, without relying on any heuristics. We leverage a robust vision foundation model backbone that is pre-trained without supervision on natural images, thereby removing the need for domain-specific pre-training. Extensive experiments on multiple polyp benchmarks demonstrate that our method significantly outperforms existing state-of-the-art approaches in detection, segmentation, classification, and tracking.

Title: It's a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data

Authors: Dominik Schnaus, Nikita Araslanov, Daniel Cremers
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.24129
Pdf URL: https://arxiv.org/pdf/2503.24129
Copy Paste: [[2503.24129]] It's a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data(https://arxiv.org/abs/2503.24129)
Keywords: foundation model
Abstract: The platonic representation hypothesis suggests that vision and language embeddings become more homogeneous as model and dataset sizes increase. In particular, pairwise distances within each modality become more similar. This suggests that as foundation models mature, it may become possible to match vision and language embeddings in a fully unsupervised fashion, i.e., without parallel data. We present the first feasibility study, and investigate the conformity of existing vision and language foundation models in the context of unsupervised, or "blind", matching. First, we formulate unsupervised matching as a quadratic assignment problem and introduce a novel heuristic that outperforms previous solvers. We also develop a technique to find optimal matching problems, for which a non-trivial match is very likely. Second, we conduct an extensive study deploying a range of vision and language models on four datasets. Our analysis reveals that for many problem instances, vision and language representations can indeed be matched without supervision. This finding opens up the exciting possibility of embedding semantic knowledge into other modalities virtually annotation-free. As a proof of concept, we showcase an unsupervised classifier, which achieves non-trivial classification accuracy without any image-text annotation.
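
Illustrative note: a minimal sketch of what matching two sets of embeddings through their pairwise distances can look like as a quadratic assignment problem. It assumes a squared-difference cost between the two distance matrices and solves it by exhaustive search over permutations, which is feasible only for a handful of items. It is not the heuristic solver the abstract refers to, and the function names and toy matrices are hypothetical.

```python
from itertools import permutations

def qap_cost(A, B, perm):
    """Squared mismatch between A[i][j] and B[perm[i]][perm[j]] over all pairs."""
    n = len(A)
    return sum(
        (A[i][j] - B[perm[i]][perm[j]]) ** 2
        for i in range(n)
        for j in range(n)
    )

def brute_force_match(A, B):
    """Return the permutation (as a tuple) minimizing the QAP cost; O(n!) search."""
    n = len(A)
    return min(permutations(range(n)), key=lambda perm: qap_cost(A, B, perm))

# Toy pairwise-distance matrices (assumed data): B is A with its items relabeled.
A = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]
B = [[0, 2, 1], [2, 0, 4], [1, 4, 0]]
match = brute_force_match(A, B)
print(match, qap_cost(A, B, match))
```

Here item i of the first set is matched to item match[i] of the second using only within-set distances, with no paired examples. The factorial cost of exhaustive search is what makes heuristic solvers necessary at realistic sizes.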

Title: Learning a Canonical Basis of Human Preferences from Binary Ratings

Authors: Kailas Vodrahalli, Wei Wei, James Zou
Subjects: cs.LG, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2503.24150
Pdf URL: https://arxiv.org/pdf/2503.24150
Copy Paste: [[2503.24150]] Learning a Canonical Basis of Human Preferences from Binary Ratings(https://arxiv.org/abs/2503.24150)
Keywords: generative
Abstract: Recent advances in generative AI have been driven by alignment techniques such as reinforcement learning from human feedback (RLHF). RLHF and related techniques typically involve constructing a dataset of binary or ranked choice human preferences and subsequently fine-tuning models to align with these preferences. This paper shifts the focus to understanding the preferences encoded in such datasets and identifying common human preferences. We find that a small subset of 21 preference categories (selected from a set of nearly 5,000 distinct preferences) captures >89% of preference variation across individuals. This small set of preferences is analogous to a canonical basis of human preferences, similar to established findings that characterize human variation in psychology or facial recognition studies. Through both synthetic and empirical evaluations, we confirm that our low-rank, canonical set of human preferences generalizes across the entire dataset and within specific topics. We further demonstrate our preference basis' utility in model evaluation, where our preference categories offer deeper insights into model alignment, and in model training, where we show that fine-tuning on preference-defined subsets successfully aligns the model accordingly.

Title: Foundation Models For Seismic Data Processing: An Extensive Review

Authors: Fabian Fuchs, Mario Ruben Fernandez, Norman Ettrich, Janis Keuper
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.24166
Pdf URL: https://arxiv.org/pdf/2503.24166
Copy Paste: [[2503.24166]] Foundation Models For Seismic Data Processing: An Extensive Review(https://arxiv.org/abs/2503.24166)
Keywords: foundation model
Abstract: Seismic processing plays a crucial role in transforming raw data into high-quality subsurface images, pivotal for various geoscience applications. Despite its importance, traditional seismic processing techniques face challenges such as noisy and damaged data and the reliance on manual, time-consuming workflows. The emergence of deep learning approaches has introduced effective and user-friendly alternatives, yet many of these deep learning approaches rely on synthetic datasets and specialized neural networks. Recently, foundation models have gained traction in the seismic domain, due to their success in natural imaging. This paper investigates the application of foundation models in seismic processing on three tasks: demultiple, interpolation, and denoising. It evaluates the impact of different model characteristics, such as pre-training technique and neural network architecture, on performance and efficiency. Rather than proposing a single seismic foundation model, this paper critically examines various natural image foundation models and suggests some promising candidates for future exploration.

Title: Implicit In-Context Learning: Evidence from Artificial Language Experiments

Authors: Xiaomeng Ma, Qihui Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.24190
Pdf URL: https://arxiv.org/pdf/2503.24190
Copy Paste: [[2503.24190]] Implicit In-Context Learning: Evidence from Artificial Language Experiments(https://arxiv.org/abs/2503.24190)
Keywords: in-context
Abstract: Humans acquire language through implicit learning, absorbing complex patterns without explicit awareness. While LLMs demonstrate impressive linguistic capabilities, it remains unclear whether they exhibit human-like pattern recognition during in-context learning at the inference level. We adapted three classic artificial language learning experiments spanning morphology, morphosyntax, and syntax to systematically evaluate implicit learning at the inference level in two state-of-the-art OpenAI models: gpt-4o and o3-mini. Our results reveal linguistic domain-specific alignment between models and human behaviors: o3-mini aligns better in morphology, while both models align in syntax.

Title: DiET-GS: Diffusion Prior and Event Stream-Assisted Motion Deblurring 3D Gaussian Splatting

Authors: Seungjun Lee, Gim Hee Lee
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2503.24210
Pdf URL: https://arxiv.org/pdf/2503.24210
Copy Paste: [[2503.24210]] DiET-GS: Diffusion Prior and Event Stream-Assisted Motion Deblurring 3D Gaussian Splatting(https://arxiv.org/abs/2503.24210)
Keywords: diffusion
Abstract: Reconstructing sharp 3D representations from blurry multi-view images is a long-standing problem in computer vision. Recent works attempt to enhance high-quality novel view synthesis from motion blur by leveraging event-based cameras, benefiting from high dynamic range and microsecond temporal resolution. However, they often reach sub-optimal visual quality, either restoring inaccurate color or losing fine-grained details. In this paper, we present DiET-GS, a diffusion prior and event stream-assisted motion deblurring 3DGS. Our framework effectively leverages both blur-free event streams and a diffusion prior in a two-stage training strategy. Specifically, we introduce a novel framework to constrain 3DGS with the event double integral, achieving both accurate color and well-defined details. Additionally, we propose a simple technique to leverage the diffusion prior to further enhance the edge details. Qualitative and quantitative results on both synthetic and real-world data demonstrate that our DiET-GS is capable of producing significantly better quality of novel views compared to the existing baselines. Our project page is this https URL

Title: Pre-training with 3D Synthetic Data: Learning 3D Point Cloud Instance Segmentation from 3D Synthetic Scenes

Authors: Daichi Otsuka, Shinichi Mae, Ryosuke Yamada, Hirokatsu Kataoka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.24229
Pdf URL: https://arxiv.org/pdf/2503.24229
Copy Paste: [[2503.24229]] Pre-training with 3D Synthetic Data: Learning 3D Point Cloud Instance Segmentation from 3D Synthetic Scenes(https://arxiv.org/abs/2503.24229)
Keywords: generative
Abstract: In recent years, the research community has witnessed growing use of 3D point cloud data owing to its high applicability in various real-world applications. As a modality, 3D point clouds make it possible to account for actual size and to support spatial understanding. The applied fields include mechanical control of robots, vehicles, or other real-world systems. Along this line, we would like to improve 3D point cloud instance segmentation, which has emerged as a particularly promising approach for these applications. However, the creation of 3D point cloud datasets entails enormous costs compared to 2D image datasets. To train a 3D point cloud instance segmentation model, it is necessary not only to assign categories but also to provide detailed annotations for each point in the large-scale 3D space. Meanwhile, the recent increase in generative models for the 3D domain has spurred proposals for using a generative model to create 3D point cloud data. In this work, we propose pre-training with 3D synthetic data to train a 3D point cloud instance segmentation model, based on a generative model for 3D scenes represented by point cloud data. We directly generate 3D point cloud data with Point-E and insert the generated data into a 3D scene. Although more accurate 3D generation models are available as of 2025, even Point-E, an early 3D generative model, can effectively support pre-training with 3D synthetic data. In the experimental section, a comparison of our pre-training method with baseline methods indicates improved performance, demonstrating the efficacy of 3D generative models for 3D point cloud instance segmentation.

Title: Enhancing Large Language Models (LLMs) for Telecommunications using Knowledge Graphs and Retrieval-Augmented Generation

Authors: Dun Yuan, Hao Zhou, Di Wu, Xue Liu, Hao Chen, Yan Xin, Jianzhong (Charlie) Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.24245
Pdf URL: https://arxiv.org/pdf/2503.24245
Copy Paste: [[2503.24245]] Enhancing Large Language Models (LLMs) for Telecommunications using Knowledge Graphs and Retrieval-Augmented Generation(https://arxiv.org/abs/2503.24245)
Keywords: generative
Abstract: Large language models (LLMs) have made significant progress in general-purpose natural language processing tasks. However, LLMs are still facing challenges when applied to domain-specific areas like telecommunications, which demands specialized expertise and adaptability to evolving standards. This paper presents a novel framework that combines knowledge graph (KG) and retrieval-augmented generation (RAG) techniques to enhance LLM performance in the telecom domain. The framework leverages a KG to capture structured, domain-specific information about network protocols, standards, and other telecom-related entities, comprehensively representing their relationships. By integrating KG with RAG, LLMs can dynamically access and utilize the most relevant and up-to-date knowledge during response generation. This hybrid approach bridges the gap between structured knowledge representation and the generative capabilities of LLMs, significantly enhancing accuracy, adaptability, and domain-specific comprehension. Our results demonstrate the effectiveness of the KG-RAG framework in addressing complex technical queries with precision. The proposed KG-RAG model attained an accuracy of 88% for question answering tasks on a frequently used telecom-specific dataset, compared to 82% for the RAG-only and 48% for the LLM-only approaches.

Title: Beyond a Single Mode: GAN Ensembles for Diverse Medical Data Generation

Authors: Lorenzo Tronchin, Tommy Löfstedt, Paolo Soda, Valerio Guarrasi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.24258
Pdf URL: https://arxiv.org/pdf/2503.24258
Copy Paste: [[2503.24258]] Beyond a Single Mode: GAN Ensembles for Diverse Medical Data Generation(https://arxiv.org/abs/2503.24258)
Keywords: generative
Abstract: The advancement of generative AI, particularly in medical imaging, confronts the trilemma of ensuring high fidelity, diversity, and efficiency in synthetic data generation. While Generative Adversarial Networks (GANs) have shown promise across various applications, they still face challenges like mode collapse and insufficient coverage of real data distributions. This work explores the use of GAN ensembles to overcome these limitations, specifically in the context of medical imaging. By solving a multi-objective optimisation problem that balances fidelity and diversity, we propose a method for selecting an optimal ensemble of GANs tailored for medical data. The selected ensemble is capable of generating diverse synthetic medical images that are representative of true data distributions and computationally efficient. Each model in the ensemble brings a unique contribution, ensuring minimal redundancy. We conducted a comprehensive evaluation using three distinct medical datasets, testing 22 different GAN architectures with various loss functions and regularisation techniques. By sampling models at different training epochs, we crafted 110 unique configurations. The results highlight the capability of GAN ensembles to enhance the quality and utility of synthetic medical images, thereby improving the efficacy of downstream tasks such as diagnostic modelling.

Title: FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics

Authors: Yixuan Li, Yu Tian, Yipo Huang, Wei Lu, Shiqi Wang, Weisi Lin, Anderson Rocha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.24267
Pdf URL: https://arxiv.org/pdf/2503.24267
Copy Paste: [[2503.24267]] FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics(https://arxiv.org/abs/2503.24267)
Keywords: generative
Abstract: The rapid and unrestrained advancement of generative artificial intelligence (AI) presents a double-edged sword: while enabling unprecedented creativity, it also facilitates the generation of highly convincing deceptive content, undermining societal trust. As image generation techniques become increasingly sophisticated, detecting synthetic images is no longer just a binary task: it necessitates interpretable, context-aware methodologies that enhance trustworthiness and transparency. However, existing detection models primarily focus on classification, offering limited explanatory insights into image authenticity. In this work, we propose FakeScope, an expert large multimodal model (LMM) tailored for AI-generated image forensics, which not only identifies AI-synthetic images with high accuracy but also provides rich, interpretable, and query-driven forensic insights. We first construct the FakeChain dataset, which contains linguistic authenticity reasoning based on visual trace evidence, developed through a novel human-machine collaborative framework. Building upon it, we further present FakeInstruct, the largest multimodal instruction tuning dataset containing 2 million visual instructions tailored to enhance forensic awareness in LMMs. FakeScope achieves state-of-the-art performance in both closed-ended and open-ended forensic scenarios. It can distinguish synthetic images with high accuracy while offering coherent and insightful explanations, free-form discussions on fine-grained forgery attributes, and actionable enhancement strategies. Notably, despite being trained exclusively on qualitative hard labels, FakeScope demonstrates remarkable zero-shot quantitative capability on detection, enabled by our proposed token-based probability estimation strategy. Furthermore, FakeScope exhibits strong generalization and in-the-wild ability, ensuring its applicability in real-world scenarios.

Title: Visual Acoustic Fields

Authors: Yuelei Li, Hyunjin Kim, Fangneng Zhan, Ri-Zhao Qiu, Mazeyu Ji, Xiaojun Shan, Xueyan Zou, Paul Liang, Hanspeter Pfister, Xiaolong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.24270
Pdf URL: https://arxiv.org/pdf/2503.24270
Copy Paste: [[2503.24270]] Visual Acoustic Fields(https://arxiv.org/abs/2503.24270)
Keywords: diffusion
Abstract: Objects produce different sounds when hit, and humans can intuitively infer how an object might sound based on its appearance and material properties. Inspired by this intuition, we propose Visual Acoustic Fields, a framework that bridges hitting sounds and visual signals within a 3D space using 3D Gaussian Splatting (3DGS). Our approach features two key modules: sound generation and sound localization. The sound generation module leverages a conditional diffusion model, which takes multiscale features rendered from a feature-augmented 3DGS to generate realistic hitting sounds. Meanwhile, the sound localization module enables querying the 3D scene, represented by the feature-augmented 3DGS, to localize hitting positions based on the sound sources. To support this framework, we introduce a novel pipeline for collecting scene-level visual-sound sample pairs, achieving alignment between captured images, impact locations, and corresponding sounds. To the best of our knowledge, this is the first dataset to connect visual and acoustic signals in a 3D context. Extensive experiments on our dataset demonstrate the effectiveness of Visual Acoustic Fields in generating plausible impact sounds and accurately localizing impact sources. Our project page is at this https URL.

Title: Learning Velocity and Acceleration: Self-Supervised Motion Consistency for Pedestrian Trajectory Prediction

Authors: Yizhou Huang, Yihua Cheng, Kezhi Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.24272
Pdf URL: https://arxiv.org/pdf/2503.24272
Copy Paste: [[2503.24272]] Learning Velocity and Acceleration: Self-Supervised Motion Consistency for Pedestrian Trajectory Prediction(https://arxiv.org/abs/2503.24272)
Keywords: self-supervised
Abstract: Understanding human motion is crucial for accurate pedestrian trajectory prediction. Conventional methods typically rely on supervised learning, where ground-truth labels are directly optimized against predicted trajectories. This amplifies the limitations caused by long-tailed data distributions, making it difficult for the model to capture abnormal behaviors. In this work, we propose a self-supervised pedestrian trajectory prediction framework that explicitly models position, velocity, and acceleration. We leverage velocity and acceleration information to enhance position prediction through feature injection and a self-supervised motion consistency mechanism. Our model hierarchically injects velocity features into the position stream. Acceleration features are injected into the velocity stream. This enables the model to predict position, velocity, and acceleration jointly. From the predicted position, we compute corresponding pseudo velocity and acceleration, allowing the model to learn from data-generated pseudo labels and thus achieve self-supervised learning. We further design a motion consistency evaluation strategy grounded in physical principles; it selects the most reasonable predicted motion trend by comparing it with historical dynamics and uses this trend to guide and constrain trajectory generation. We conduct experiments on the ETH-UCY and Stanford Drone datasets, demonstrating that our method achieves state-of-the-art performance on both datasets.
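
Illustrative note: the abstract says pseudo velocity and acceleration are computed from predicted positions but does not say how. The sketch below assumes simple forward finite differences over a uniformly sampled 2D trajectory. The function names, the time step `dt`, and the sample trajectory are hypothetical, and this is not the authors' implementation.

```python
def finite_difference(seq, dt):
    """Forward differences between consecutive points, divided by the time step."""
    return [
        tuple((b - a) / dt for a, b in zip(p0, p1))
        for p0, p1 in zip(seq, seq[1:])
    ]

def pseudo_motion(positions, dt=1.0):
    """Derive pseudo velocity and pseudo acceleration labels from positions."""
    velocity = finite_difference(positions, dt)      # T - 1 entries
    acceleration = finite_difference(velocity, dt)   # T - 2 entries
    return velocity, acceleration

# Toy predicted trajectory (assumed data): a pedestrian speeding up along x.
predicted = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (6.0, 0.0)]
velocity, acceleration = pseudo_motion(predicted, dt=1.0)
print(velocity)
print(acceleration)
```

Labels derived this way need no extra annotation, and they can be compared against the velocity and acceleration streams that the model predicts directly.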

Title: Style Quantization for Data-Efficient GAN Training

Authors: Jian Wang, Xin Lan, Jizhe Zhou, Yuxin Tian, Jiancheng Lv
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.24282
Pdf URL: https://arxiv.org/pdf/2503.24282
Copy Paste: [[2503.24282]] Style Quantization for Data-Efficient GAN Training(https://arxiv.org/abs/2503.24282)
Keywords: foundation model
Abstract: In limited-data settings, GANs often struggle to navigate and effectively exploit the input latent space. Consequently, images generated from adjacent variables in a sparse input latent space may exhibit significant discrepancies in realism, leading to suboptimal consistency regularization (CR) outcomes. To address this, we propose SQ-GAN, a novel approach that enhances CR by introducing a style space quantization scheme. This method transforms the sparse, continuous input latent space into a compact, structured discrete proxy space, allowing each element to correspond to a specific real data point, thereby improving CR performance. Instead of direct quantization, we first map the input latent variables into a less entangled "style" space and apply quantization using a learnable codebook. This enables each quantized code to control distinct factors of variation. Additionally, we optimize the optimal transport distance to align the codebook codes with features extracted from the training data by a foundation model, embedding external knowledge into the codebook and establishing a semantically rich vocabulary that properly describes the training dataset. Extensive experiments demonstrate significant improvements in both discriminator robustness and generation quality with our method.

Title: Can Test-Time Scaling Improve World Foundation Model?

Authors: Wenyan Cong, Hanqing Zhu, Peihao Wang, Bangya Liu, Dejia Xu, Kevin Wang, David Z. Pan, Yan Wang, Zhiwen Fan, Zhangyang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.24320
Pdf URL: https://arxiv.org/pdf/2503.24320
Copy Paste: [[2503.24320]] Can Test-Time Scaling Improve World Foundation Model?(https://arxiv.org/abs/2503.24320)
Keywords: foundation model
Abstract: World foundation models (WFMs), which simulate the physical world by predicting future states from current observations and inputs, have become central to many applications in physical intelligence, including autonomous driving and robotics. However, these models require substantial computational resources for pretraining and are further constrained by available data during post-training. As such, scaling computation at test time emerges as both a critical and practical alternative to traditional model enlargement or re-training. In this work, we introduce SWIFT, a test-time scaling framework tailored for WFMs. SWIFT integrates our extensible WFM evaluation toolkit with process-level inference strategies, including fast tokenization, probability-based Top-K pruning, and efficient beam search. Empirical results on the COSMOS model demonstrate that test-time scaling exists even in a compute-optimal way. Our findings reveal that test-time scaling laws hold for WFMs and that SWIFT provides a scalable and effective pathway for improving WFM inference without retraining or increasing model size. The code is available at this https URL.

Title: NoProp: Training Neural Networks without Back-propagation or Forward-propagation

Authors: Qinyu Li, Yee Whye Teh, Razvan Pascanu
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.24322
Pdf URL: https://arxiv.org/pdf/2503.24322
Copy Paste: [[2503.24322]] NoProp: Training Neural Networks without Back-propagation or Forward-propagation(https://arxiv.org/abs/2503.24322)
Keywords: diffusion
Abstract: The canonical deep learning approach for learning requires computing a gradient term at each layer by back-propagating the error signal from the output towards each learnable parameter. Given the stacked structure of neural networks, where each layer builds on the representation of the layer below, this approach leads to hierarchical representations. More abstract features live on the top layers of the model, while features on lower layers are expected to be less abstract. In contrast to this, we introduce a new learning method named NoProp, which does not rely on either forward or backward propagation. Instead, NoProp takes inspiration from diffusion and flow matching methods, where each layer independently learns to denoise a noisy target. We believe this work takes a first step towards introducing a new family of gradient-free learning methods that do not learn hierarchical representations -- at least not in the usual sense. NoProp needs to fix the representation at each layer beforehand to a noised version of the target, learning a local denoising process that can then be exploited at inference. We demonstrate the effectiveness of our method on MNIST, CIFAR-10, and CIFAR-100 image classification benchmarks. Our results show that NoProp is a viable learning algorithm which achieves superior accuracy, is easier to use, and is computationally more efficient than other existing back-propagation-free methods. By departing from the traditional gradient-based learning paradigm, NoProp alters how credit assignment is done within the network, enabling more efficient distributed learning as well as potentially impacting other characteristics of the learning process.

Title: Self-Supervised Pretraining for Aerial Road Extraction

Authors: Rupert Polley, Sai Vignesh Abishek Deenadayalan, J. Marius Zöllner
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.24326
Pdf URL: https://arxiv.org/pdf/2503.24326
Copy Paste: [[2503.24326]] Self-Supervised Pretraining for Aerial Road Extraction(https://arxiv.org/abs/2503.24326)
Keywords: self-supervised
Abstract: Deep neural networks for aerial image segmentation require large amounts of labeled data, but high-quality aerial datasets with precise annotations are scarce and costly to produce. To address this limitation, we propose a self-supervised pretraining method that improves segmentation performance while reducing reliance on labeled data. Our approach uses inpainting-based pretraining, where the model learns to reconstruct missing regions in aerial images, capturing their inherent structure before being fine-tuned for road extraction. This method improves generalization, enhances robustness to domain shifts, and is invariant to model architecture and dataset choice. Experiments show that our pretraining significantly boosts segmentation accuracy, especially in low-data regimes, making it a scalable solution for aerial image analysis.
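
The inpainting pretext task described in the abstract can be sketched as hiding random patches and scoring a reconstruction only on the hidden pixels. This is a minimal illustration under stated assumptions, not the authors' pipeline: the square patch shape, the zero fill value, the mean-squared-error objective, and the names `mask_patches` and `masked_mse` are all hypothetical choices that the abstract does not specify.

```python
import random


def mask_patches(image, patch, num_patches, seed=0, fill=0):
    """Hide `num_patches` random square patches of side `patch`.

    Returns (masked_image, mask), where mask[r][c] is True for hidden pixels.
    Patches may overlap.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    masked = [row[:] for row in image]
    mask = [[False] * width for _ in range(height)]
    for _ in range(num_patches):
        top = rng.randrange(height - patch + 1)
        left = rng.randrange(width - patch + 1)
        for r in range(top, top + patch):
            for c in range(left, left + patch):
                masked[r][c] = fill
                mask[r][c] = True
    return masked, mask


def masked_mse(prediction, target, mask):
    """Mean squared error computed over hidden pixels only."""
    errors = [
        (prediction[r][c] - target[r][c]) ** 2
        for r in range(len(target))
        for c in range(len(target[0]))
        if mask[r][c]
    ]
    return sum(errors) / len(errors)


# Toy 8x8 "image" with strictly positive pixel values.
image = [[r * 8 + c + 1 for c in range(8)] for r in range(8)]
masked, mask = mask_patches(image, patch=3, num_patches=2)

# A perfect reconstruction has zero loss; the masked input itself does not.
print(masked_mse(image, image, mask))       # prints: 0.0
print(masked_mse(masked, image, mask) > 0)  # prints: True
```

In pretraining, a network would receive `masked` as input and be trained to reduce a loss like `masked_mse` against `image`; no labels are needed, which is what makes the task self-supervised.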

Title: PathOrchestra: A Comprehensive Foundation Model for Computational Pathology with Over 100 Diverse Clinical-Grade Tasks

Authors: Fang Yan, Jianfeng Wu, Jiawen Li, Wei Wang, Jiaxuan Lu, Wen Chen, Zizhao Gao, Jianan Li, Hong Yan, Jiabo Ma, Minda Chen, Yang Lu, Qing Chen, Yizhi Wang, Xitong Ling, Xuenian Wang, Zihan Wang, Qiang Huang, Shengyi Hua, Mianxin Liu, Lei Ma, Tian Shen, Xiaofan Zhang, Yonghong He, Hao Chen, Shaoting Zhang, Zhe Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.24345
Pdf URL: https://arxiv.org/pdf/2503.24345
Copy Paste: [[2503.24345]] PathOrchestra: A Comprehensive Foundation Model for Computational Pathology with Over 100 Diverse Clinical-Grade Tasks(https://arxiv.org/abs/2503.24345)
Keywords: self-supervised, foundation model
Abstract: The complexity and variability inherent in high-resolution pathological images present significant challenges in computational pathology. While pathology foundation models leveraging AI have catalyzed transformative advancements, their development demands large-scale datasets, considerable storage capacity, and substantial computational resources. Furthermore, ensuring their clinical applicability and generalizability requires rigorous validation across a broad spectrum of clinical tasks. Here, we present PathOrchestra, a versatile pathology foundation model trained via self-supervised learning on a dataset comprising 300K pathological slides from 20 tissue and organ types across multiple centers. The model was rigorously evaluated on 112 clinical tasks using a combination of 61 private and 51 public datasets. These tasks encompass digital slide preprocessing, pan-cancer classification, lesion identification, multi-cancer subtype classification, biomarker assessment, gene expression prediction, and the generation of structured reports. PathOrchestra demonstrated exceptional performance across 27,755 WSIs and 9,415,729 ROIs, achieving over 0.950 accuracy in 47 tasks, including pan-cancer classification across various organs, lymphoma subtype diagnosis, and bladder cancer screening. Notably, it is the first model to generate structured reports for high-incidence colorectal cancer and diagnostically complex lymphoma, areas that are infrequently addressed by foundational models but hold immense clinical potential. Overall, PathOrchestra exemplifies the feasibility and efficacy of a large-scale, self-supervised pathology foundation model, validated across a broad range of clinical-grade tasks. Its high accuracy and reduced reliance on extensive data annotation underline its potential for clinical integration, offering a pathway toward more efficient and high-quality medical services.

Title: ORAL: Prompting Your Large-Scale LoRAs via Conditional Recurrent Diffusion

Authors: Rana Muhammad Shahroz Khan, Dongwen Tang, Pingzhi Li, Kai Wang, Tianlong Chen
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.24354
Pdf URL: https://arxiv.org/pdf/2503.24354
Copy Paste: [[2503.24354]] ORAL: Prompting Your Large-Scale LoRAs via Conditional Recurrent Diffusion(https://arxiv.org/abs/2503.24354)
Keywords: diffusion, foundation model
Abstract: Parameter generation has emerged as a novel paradigm for neural network development, offering an alternative to traditional neural network training by synthesizing high-quality model weights directly. In the context of Low-Rank Adaptation (LoRA) for evolving (i.e., constantly updated) large language models (LLMs), this approach promises efficient adaptation without costly retraining. However, existing methods face critical limitations in simultaneously achieving scalability and controllability. In this paper, we introduce ORAL, a novel conditional recurrent diffusion framework that addresses these challenges. ORAL incorporates a novel conditioning mechanism that integrates model architecture and textual task specifications, enabling the generation of task-specific LoRA parameters that can seamlessly transfer across evolving foundation models. Our approach successfully scales to billions-of-parameter LLMs and maintains controllability. Through extensive experiments across seven language tasks, four vision tasks, and three multimodal tasks using five pre-trained LLMs, we demonstrate that ORAL generates high-quality LoRA parameters that achieve comparable or superior performance to vanilla trained counterparts.

Title: InstructRestore: Region-Customized Image Restoration with Human Instructions

Authors: Shuaizheng Liu, Jianqi Ma, Lingchen Sun, Xiangtao Kong, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.24357
Pdf URL: https://arxiv.org/pdf/2503.24357
Copy Paste: [[2503.24357]] InstructRestore: Region-Customized Image Restoration with Human Instructions(https://arxiv.org/abs/2503.24357)
Keywords: diffusion
Abstract: Despite the significant progress in diffusion prior-based image restoration, most existing methods apply uniform processing to the entire image, lacking the capability to perform region-customized image restoration according to user instructions. In this work, we propose a new framework, namely InstructRestore, to perform region-adjustable image restoration following human instructions. To achieve this, we first develop a data generation engine to produce training triplets, each consisting of a high-quality image, the target region description, and the corresponding region mask. With this engine and careful data screening, we construct a comprehensive dataset comprising 536,945 triplets to support the training and evaluation of this task. We then examine how to integrate the low-quality image features under the ControlNet architecture to adjust the degree of image detail enhancement. Consequently, we develop a ControlNet-like model to identify the target region and allocate different integration scales to the target and surrounding regions, enabling region-customized image restoration that aligns with user instructions. Experimental results demonstrate that our proposed InstructRestore approach enables effective human-instructed image restoration, such as images with bokeh effects and user-instructed local enhancement. Our work advances the investigation of interactive image restoration and enhancement techniques. Data, code, and models will be found at this https URL.

Title: Adapting Vision Foundation Models for Real-time Ultrasound Image Segmentation

Authors: Xiaoran Zhang, Eric Z. Chen, Lin Zhao, Xiao Chen, Yikang Liu, Boris Maihe, James S. Duncan, Terrence Chen, Shanhui Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.24368
Pdf URL: https://arxiv.org/pdf/2503.24368
Copy Paste: [[2503.24368]] Adapting Vision Foundation Models for Real-time Ultrasound Image Segmentation(https://arxiv.org/abs/2503.24368)
Keywords: foundation model
Abstract: We propose a novel approach that adapts hierarchical vision foundation models for real-time ultrasound image segmentation. Existing ultrasound segmentation methods often struggle with adaptability to new tasks, relying on costly manual annotations, while real-time approaches generally fail to match state-of-the-art performance. To overcome these limitations, we introduce an adaptive framework that leverages the vision foundation model Hiera to extract multi-scale features, interleaved with DINOv2 representations to enhance visual expressiveness. These enriched features are then decoded to produce precise and robust segmentation. We conduct extensive evaluations on six public datasets and one in-house dataset, covering both cardiac and thyroid ultrasound segmentation. Experiments show that our approach outperforms state-of-the-art methods across multiple datasets and excels with limited supervision, surpassing nnUNet by over 20% on average in the 1% and 10% data settings. Our method achieves ~77 FPS inference speed with TensorRT on a single GPU, enabling real-time clinical applications.

Title: Consistent Subject Generation via Contrastive Instantiated Concepts

Authors: Lee Hsin-Ying, Kelvin C.K. Chan, Ming-Hsuan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.24387
Pdf URL: https://arxiv.org/pdf/2503.24387
Copy Paste: [[2503.24387]] Consistent Subject Generation via Contrastive Instantiated Concepts(https://arxiv.org/abs/2503.24387)
Keywords: generative
Abstract: While text-to-image generative models can synthesize diverse and faithful content, subject variation across multiple creations limits the application in long content generation. Existing approaches require time-consuming tuning, references for all subjects, or access to other creations. We introduce Contrastive Concept Instantiation (CoCoIns) to effectively synthesize consistent subjects across multiple independent creations. The framework consists of a generative model and a mapping network, which transforms input latent codes into pseudo-words associated with certain instances of concepts. Users can generate consistent subjects with the same latent codes. To construct such associations, we propose a contrastive learning approach that trains the network to differentiate the combination of prompts and latent codes. Extensive evaluations of human faces with a single subject show that CoCoIns performs comparably to existing methods while maintaining higher flexibility. We also demonstrate the potential of extending CoCoIns to multiple subjects and other object categories.
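
The abstract does not state which contrastive objective CoCoIns uses. As a generic illustration of contrastive learning only, the sketch below implements the widely used InfoNCE loss over cosine similarities. Everything here is an assumption made for illustration: the function names, the temperature value, and the toy 2-D embeddings are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log softmax probability of the positive pair."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


anchor = [1.0, 0.0]
# The loss is small when the positive is close to the anchor ...
aligned = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0]])
# ... and large when a negative is closer than the positive.
misaligned = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1]])
print(aligned < misaligned)  # prints: True
```

Minimizing such a loss pulls matching pairs together and pushes mismatched pairs apart, which is the general mechanism by which a network learns to tell combinations apart.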