2025-03-04

Title: A Systematic Review of Open Datasets Used in Text-to-Image (T2I) Gen AI Model Safety

Authors: Rakeen Rouf, Trupti Bavalatti, Osama Ahmed, Dhaval Potdar, Faraz Jawed
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.00020
Pdf URL: https://arxiv.org/pdf/2503.00020
Copy Paste: [[2503.00020]] A Systematic Review of Open Datasets Used in Text-to-Image (T2I) Gen AI Model Safety(https://arxiv.org/abs/2503.00020)
Keywords: generative
Abstract: Novel research aimed at text-to-image (T2I) generative AI safety often relies on publicly available datasets for training and evaluation, making the quality and composition of these datasets crucial. This paper presents a comprehensive review of the key datasets used in the T2I research, detailing their collection methods, compositions, semantic and syntactic diversity of prompts and the quality, coverage, and distribution of harm types in the datasets. By highlighting the strengths and limitations of the datasets, this study enables researchers to find the most relevant datasets for a use case, critically assess the downstream impacts of their work given the dataset distribution, particularly regarding model safety and ethical considerations, and also identify the gaps in dataset coverage and quality that future research may address.

Title: I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue

Authors: Esam Ghaleb, Bulat Khaertdinov, Aslı Özyürek, Raquel Fernández
Subjects: cs.CV, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2503.00071
Pdf URL: https://arxiv.org/pdf/2503.00071
Copy Paste: [[2503.00071]] I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue(https://arxiv.org/abs/2503.00071)
Keywords: self-supervised
Abstract: In face-to-face interaction, we use multiple modalities, including speech and gestures, to communicate information and resolve references to objects. However, how representational co-speech gestures refer to objects remains understudied from a computational perspective. In this work, we address this gap by introducing a multimodal reference resolution task centred on representational gestures, while simultaneously tackling the challenge of learning robust gesture embeddings. We propose a self-supervised pre-training approach to gesture representation learning that grounds body movements in spoken language. Our experiments show that the learned embeddings align with expert annotations and have significant predictive power. Moreover, reference resolution accuracy further improves when (1) using multimodal gesture representations, even when speech is unavailable at inference time, and (2) leveraging dialogue history. Overall, our findings highlight the complementary roles of gesture and speech in reference resolution, offering a step towards more naturalistic models of human-machine interaction.

Title: SSL4EO-S12 v1.1: A Multimodal, Multiseasonal Dataset for Pretraining, Updated

Authors: Benedikt Blumenstiel, Nassim Ait Ali Braham, Conrad M Albrecht, Stefano Maurogiovanni, Paolo Fraccaro
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00168
Pdf URL: https://arxiv.org/pdf/2503.00168
Copy Paste: [[2503.00168]] SSL4EO-S12 v1.1: A Multimodal, Multiseasonal Dataset for Pretraining, Updated(https://arxiv.org/abs/2503.00168)
Keywords: self-supervised, foundation model
Abstract: This technical report presents SSL4EO-S12 v1.1, a multimodal, multitemporal Earth Observation dataset designed for pretraining large-scale foundation models. Building on the success of SSL4EO-S12 v1.0, the new version addresses the previous challenges of data misalignment and a limited data structure for low-barrier, analysis-ready EO processing. SSL4EO-S12 v1.1 covers the world's 10,000 largest cities and its surroundings within a 50 km radius across four seasons, resulting in a diverse collection of nearly one million patches. SSL4EO-S12 v1.1 packages the data in Zarr file format for cloud-efficient loading and representation of meta-information such as including cloud masks and geolocation. Released under the CC-BY-4.0 license, SSL4EO-S12 v1.1 facilitates open research and provides a robust foundation for future advancements in self-supervised learning and geospatial analysis. The dataset is available online through this https URL, and we provided additional resources at this https URL.

Title: PRISM: High-Resolution & Precise Counterfactual Medical Image Generation using Language-guided Stable Diffusion

Authors: Amar Kumar, Anita Kriz, Mohammad Havaei, Tal Arbel
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00196
Pdf URL: https://arxiv.org/pdf/2503.00196
Copy Paste: [[2503.00196]] PRISM: High-Resolution & Precise Counterfactual Medical Image Generation using Language-guided Stable Diffusion(https://arxiv.org/abs/2503.00196)
Keywords: diffusion, foundation model
Abstract: Developing reliable and generalizable deep learning systems for medical imaging faces significant obstacles due to spurious correlations, data imbalances, and limited text annotations in datasets. Addressing these challenges requires architectures robust to the unique complexities posed by medical imaging data. The rapid advancements in vision-language foundation models within the natural image domain prompt the question of how they can be adapted for medical imaging tasks. In this work, we present PRISM, a framework that leverages foundation models to generate high-resolution, language-guided medical image counterfactuals using Stable Diffusion. Our approach demonstrates unprecedented precision in selectively modifying spurious correlations (the medical devices) and disease features, enabling the removal and addition of specific attributes while preserving other image characteristics. Through extensive evaluation, we show how PRISM advances counterfactual generation and enables the development of more robust downstream classifiers for clinically deployable solutions. To facilitate broader adoption and research, we make our code publicly available at this https URL.

Title: Llamarine: Open-source Maritime Industry-specific Large Language Model

Authors: William Nguyen, An Phan, Konobu Kimura, Hitoshi Maeno, Mika Tanaka, Quynh Le, William Poucher, Christopher Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00203
Pdf URL: https://arxiv.org/pdf/2503.00203
Copy Paste: [[2503.00203]] Llamarine: Open-source Maritime Industry-specific Large Language Model(https://arxiv.org/abs/2503.00203)
Keywords: foundation model
Abstract: Large Language Models (LLMs) have demonstrated substantial potential in addressing complex reasoning tasks, yet their general-purpose nature often limits their effectiveness in specialized domains such as maritime navigation. To bridge this gap, we introduce Llamarine, the first open-source LLM designed specifically for maritime navigation. Llamarine 1.0 is developed through continued pretraining and fine-tuning on a high-quality corpus comprising maritime textbooks, research publications, and web text from Wikipedia. This domain-specific training enables the model to acquire expert-level knowledge in navigational principles, collision avoidance, route optimization, and regulatory compliance. Our key contributions include (a) the curation of a comprehensive maritime dataset from authoritative sources, ensuring depth and reliability in the model's knowledge base; (b) the development of a foundational model capable of reasoning about complex navigational challenges with greater accuracy than general-purpose LLMs; and (c) the establishment of a benchmark to evaluate performance in maritime-specific decision-making tasks. Experimental results demonstrate that Llamarine outperforms both general-purpose and commercial LLMs in critical navigation-related tasks, such as trajectory planning, risk assessment, and compliance with maritime regulations. By providing an open-source foundation model trained exclusively on high-quality maritime literature, Llamarine paves the way for AI-driven advancements in maritime safety, efficiency, and operational decision-making.

Title: AnalogGenie: A Generative Engine for Automatic Discovery of Analog Circuit Topologies

Authors: Jian Gao, Weidong Cao, Junyi Yang, Xuan Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.00205
Pdf URL: https://arxiv.org/pdf/2503.00205
Copy Paste: [[2503.00205]] AnalogGenie: A Generative Engine for Automatic Discovery of Analog Circuit Topologies(https://arxiv.org/abs/2503.00205)
Keywords: generative
Abstract: The massive and large-scale design of foundational semiconductor integrated circuits (ICs) is crucial to sustaining the advancement of many emerging and future technologies, such as generative AI, 5G/6G, and quantum computing. Excitingly, recent studies have shown the great capabilities of foundational models in expediting the design of digital ICs. Yet, applying generative AI techniques to accelerate the design of analog ICs remains a significant challenge due to critical domain-specific issues, such as the lack of a comprehensive dataset and effective representation methods for analog circuits. This paper proposes, $\textbf{AnalogGenie}$, a $\underline{\textbf{Gen}}$erat$\underline{\textbf{i}}$ve $\underline{\textbf{e}}$ngine for automatic design/discovery of $\underline{\textbf{Analog}}$ circuit topologies--the most challenging and creative task in the conventional manual design flow of analog ICs. AnalogGenie addresses two key gaps in the field: building a foundational comprehensive dataset of analog circuit topology and developing a scalable sequence-based graph representation universal to analog circuits. Experimental results show the remarkable generation performance of AnalogGenie in broadening the variety of analog ICs, increasing the number of devices within a single design, and discovering unseen circuit topologies far beyond any prior arts. Our work paves the way to transform the longstanding time-consuming manual design flow of analog ICs to an automatic and massive manner powered by generative AI. Our source code is available at this https URL.

Title: Foundation-Model-Boosted Multimodal Learning for fMRI-based Neuropathic Pain Drug Response Prediction

Authors: Wenrui Fan, L. M. Riza Rizky, Jiayang Zhang, Chen Chen, Haiping Lu, Kevin Teh, Dinesh Selvarajah, Shuo Zhou
Subjects: cs.LG, cs.AI, cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2503.00210
Pdf URL: https://arxiv.org/pdf/2503.00210
Copy Paste: [[2503.00210]] Foundation-Model-Boosted Multimodal Learning for fMRI-based Neuropathic Pain Drug Response Prediction(https://arxiv.org/abs/2503.00210)
Keywords: foundation model
Abstract: Neuropathic pain, affecting up to 10% of adults, remains difficult to treat due to limited therapeutic efficacy and tolerability. Although resting-state functional MRI (rs-fMRI) is a promising non-invasive measurement of brain biomarkers to predict drug response in therapeutic development, the complexity of fMRI demands machine learning models with substantial capacity. However, extreme data scarcity in neuropathic pain research limits the application of high-capacity models. To address the challenge of data scarcity, we propose FMM$_{TC}$, a Foundation-Model-boosted Multimodal learning framework for fMRI-based neuropathic pain drug response prediction, which leverages both internal multimodal information in pain-specific data and external knowledge from large pain-agnostic data. Specifically, to maximize the value of limited pain-specific data, FMM$_{TC}$ integrates complementary information from two rs-fMRI modalities: Time series and functional Connectivity. FMM$_{TC}$ is further boosted by an fMRI foundation model with its external knowledge from extensive pain-agnostic fMRI datasets enriching limited pain-specific information. Evaluations with an in-house dataset and a public dataset from OpenNeuro demonstrate FMM$_{TC}$'s superior representation ability, generalizability, and cross-dataset adaptability over existing unimodal fMRI models that only consider one of the rs-fMRI modalities. The ablation study validates the effectiveness of multimodal learning and foundation-model-powered external knowledge transfer in FMM$_{TC}$. An integrated gradient-based interpretation study explains how FMM$_{TC}$'s cross-dataset dynamic behaviors enhance its adaptability. In conclusion, FMM$_{TC}$ boosts clinical trials in neuropathic pain therapeutic development by accurately predicting drug responses to improve the participant stratification efficiency.

Title: Flow Matching for Medical Image Synthesis: Bridging the Gap Between Speed and Quality

Authors: Milad Yazdani, Yasamin Medghalchi, Pooria Ashrafian, Ilker Hacihaliloglu, Dena Shahriari
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.00266
Pdf URL: https://arxiv.org/pdf/2503.00266
Copy Paste: [[2503.00266]] Flow Matching for Medical Image Synthesis: Bridging the Gap Between Speed and Quality(https://arxiv.org/abs/2503.00266)
Keywords: diffusion, generative
Abstract: Deep learning models have emerged as a powerful tool for various medical applications. However, their success depends on large, high-quality datasets that are challenging to obtain due to privacy concerns and costly annotation. Generative models, such as diffusion models, offer a potential solution by synthesizing medical images, but their practical adoption is hindered by long inference times. In this paper, we propose the use of an optimal transport flow matching approach to accelerate image generation. By introducing a straighter mapping between the source and target distribution, our method significantly reduces inference time while preserving and further enhancing the quality of the outputs. Furthermore, this approach is highly adaptable, supporting various medical imaging modalities, conditioning mechanisms (such as class labels and masks), and different spatial dimensions, including 2D and 3D. Beyond image generation, it can also be applied to related tasks such as image enhancement. Our results demonstrate the efficiency and versatility of this framework, making it a promising advancement for medical imaging applications. Code with checkpoints and a synthetic dataset (beneficial for classification and segmentation) is now available on: this https URL.

Title: Learning to Animate Images from A Few Videos to Portray Delicate Human Actions

Authors: Haoxin Li, Yingchen Yu, Qilong Wu, Hanwang Zhang, Boyang Li, Song Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00276
Pdf URL: https://arxiv.org/pdf/2503.00276
Copy Paste: [[2503.00276]] Learning to Animate Images from A Few Videos to Portray Delicate Human Actions(https://arxiv.org/abs/2503.00276)
Keywords: generative
Abstract: Despite recent progress, video generative models still struggle to animate human actions from static images, particularly when handling uncommon actions whose training data are limited. In this paper, we investigate the task of learning to animate human actions from a small number of videos -- 16 or fewer -- which is highly valuable in real-world applications like video and movie production. Few-shot learning of generalizable motion patterns while ensuring smooth transitions from the initial reference image is exceedingly challenging. We propose FLASH (Few-shot Learning to Animate and Steer Humans), which improves motion generalization by aligning motion features and inter-frame correspondence relations between videos that share the same motion but have different appearances. This approach minimizes overfitting to visual appearances in the limited training data and enhances the generalization of learned motion patterns. Additionally, FLASH extends the decoder with additional layers to compensate lost details in the latent space, fostering smooth transitions from the initial reference image. Experiments demonstrate that FLASH effectively animates images with unseen human or scene appearances into specified actions while maintaining smooth transitions from the reference image.

Title: Remasking Discrete Diffusion Models with Inference-Time Scaling

Authors: Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, Volodymyr Kuleshov
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.00307
Pdf URL: https://arxiv.org/pdf/2503.00307
Copy Paste: [[2503.00307]] Remasking Discrete Diffusion Models with Inference-Time Scaling(https://arxiv.org/abs/2503.00307)
Keywords: diffusion
Abstract: Part of the success of diffusion models stems from their ability to perform iterative refinement, i.e., repeatedly correcting outputs during generation. However, modern masked discrete diffusion lacks this capability: when a token is generated, it cannot be updated again, even when it introduces an error. Here, we address this limitation by introducing the remasking diffusion model (ReMDM) sampler, a method that can be applied to pretrained masked diffusion models in a principled way and that is derived from a discrete diffusion model with a custom remasking backward process. Most interestingly, ReMDM endows discrete diffusion with a form of inference-time compute scaling. By increasing the number of sampling steps, ReMDM generates natural language outputs that approach the quality of autoregressive models, whereas when the computation budget is limited, ReMDM better maintains quality. ReMDM also improves sample quality of masked diffusion models for discretized images, and in scientific domains such as molecule design, ReMDM facilitates diffusion guidance and pushes the Pareto frontier of controllability relative to classical masking and uniform noise diffusion. We provide the code along with a blog post on the project page: this https URL.

Title: DeepONet Augmented by Randomized Neural Networks for Efficient Operator Learning in PDEs

Authors: Zhaoxi Jiang, Fei Wang
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2503.00317
Pdf URL: https://arxiv.org/pdf/2503.00317
Copy Paste: [[2503.00317]] DeepONet Augmented by Randomized Neural Networks for Efficient Operator Learning in PDEs(https://arxiv.org/abs/2503.00317)
Keywords: diffusion
Abstract: Deep operator networks (DeepONets) represent a powerful class of data-driven methods for operator learning, demonstrating strong approximation capabilities for a wide range of linear and nonlinear operators. They have shown promising performance in learning operators that govern partial differential equations (PDEs), including diffusion-reaction systems and Burgers' equations. However, the accuracy of DeepONets is often constrained by computational limitations and optimization challenges inherent in training deep neural networks. Furthermore, the computational cost associated with training these networks is typically very high. To address these challenges, we leverage randomized neural networks (RaNNs), in which the parameters of the hidden layers remain fixed following random initialization. RaNNs compute the output layer parameters using the least-squares method, significantly reducing training time and mitigating optimization errors. In this work, we integrate DeepONets with RaNNs to propose RaNN-DeepONets, a hybrid architecture designed to balance accuracy and efficiency. Furthermore, to mitigate the need for extensive data preparation, we introduce the concept of physics-informed RaNN-DeepONets. Instead of relying on data generated through other time-consuming numerical methods, we incorporate PDE information directly into the training process. We evaluate the proposed model on three benchmark PDE problems: diffusion-reaction dynamics, Burgers' equation, and the Darcy flow problem. Through these tests, we assess its ability to learn nonlinear operators with varying input types. When compared to the standard DeepONet framework, RaNN-DeepONets achieves comparable accuracy while reducing computational costs by orders of magnitude. These results highlight the potential of RaNN-DeepONets as an efficient alternative for operator learning in PDE-based systems.

Title: More of the Same: Persistent Representational Harms Under Increased Representation

Authors: Jennifer Mickel, Maria De-Arteaga, Leqi Liu, Kevin Tian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00333
Pdf URL: https://arxiv.org/pdf/2503.00333
Copy Paste: [[2503.00333]] More of the Same: Persistent Representational Harms Under Increased Representation(https://arxiv.org/abs/2503.00333)
Keywords: generative
Abstract: To recognize and mitigate the harms of generative AI systems, it is crucial to consider who is represented in the outputs of generative AI systems and how people are represented. A critical gap emerges when naively improving who is represented, as this does not imply bias mitigation efforts have been applied to address how people are represented. We critically examined this by investigating gender representation in occupation across state-of-the-art large language models. We first show evidence suggesting that over time there have been interventions to models altering the resulting gender distribution, and we find that women are more represented than men when models are prompted to generate biographies or personas. We then demonstrate that representational biases persist in how different genders are represented by examining statistically significant word differences across genders. This results in a proliferation of representational harms, stereotypes, and neoliberalism ideals that, despite existing interventions to increase female representation, reinforce existing systems of oppression.

Title: SHAZAM: Self-Supervised Change Monitoring for Hazard Detection and Mapping

Authors: Samuel Garske, Konrad Heidler, Bradley Evans, KC Wong, Xiao Xiang Zhu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.00348
Pdf URL: https://arxiv.org/pdf/2503.00348
Copy Paste: [[2503.00348]] SHAZAM: Self-Supervised Change Monitoring for Hazard Detection and Mapping(https://arxiv.org/abs/2503.00348)
Keywords: self-supervised, anomaly
Abstract: The increasing frequency of environmental hazards due to climate change underscores the urgent need for effective monitoring systems. Current approaches either rely on expensive labelled datasets, struggle with seasonal variations, or require multiple observations for confirmation (which delays detection). To address these challenges, this work presents SHAZAM - Self-Supervised Change Monitoring for Hazard Detection and Mapping. SHAZAM uses a lightweight conditional UNet to generate expected images of a region of interest (ROI) for any day of the year, allowing for the direct modelling of normal seasonal changes and the ability to distinguish potential hazards. A modified structural similarity measure compares the generated images with actual satellite observations to compute region-level anomaly scores and pixel-level hazard maps. Additionally, a theoretically grounded seasonal threshold eliminates the need for dataset-specific optimisation. Evaluated on four diverse datasets that contain bushfires (wildfires), burned regions, extreme and out-of-season snowfall, floods, droughts, algal blooms, and deforestation, SHAZAM achieved F1 score improvements of between 0.066 and 0.234 over existing methods. This was achieved primarily through more effective hazard detection (higher recall) while using only 473K parameters. SHAZAM demonstrated superior mapping capabilities through higher spatial resolution and improved ability to suppress background features while accentuating both immediate and gradual hazards. SHAZAM has been established as an effective and generalisable solution for hazard detection and mapping across different geographical regions and a diverse range of hazards. The Python code is available at: this https URL

Title: Solving Instance Detection from an Open-World Perspective

Authors: Qianqian Shen, Yunhan Zhao, Nahyun Kwon, Jeeeun Kim, Yanan Li, Shu Kong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00359
Pdf URL: https://arxiv.org/pdf/2503.00359
Copy Paste: [[2503.00359]] Solving Instance Detection from an Open-World Perspective(https://arxiv.org/abs/2503.00359)
Keywords: foundation model
Abstract: Instance detection (InsDet) aims to localize specific object instances within a novel scene imagery based on given visual references. Technically, it requires proposal detection to identify all possible object instances, followed by instance-level matching to pinpoint the ones of interest. Its open-world nature supports its wide-ranging applications from robotics to AR/VR, but also presents significant challenges: methods must generalize to unknown testing data distributions because (1) the testing scene imagery is unseen during training, and (2) there are domain gaps between visual references and detected proposals. Existing methods attempt to tackle these challenges by synthesizing diverse training examples or utilizing off-the-shelf foundation models (FMs). However, they only partially capitalize the available open-world information. In this paper, we approach InsDet from an Open-World perspective, introducing our method IDOW. We find that, while pretrained FMs yield high recall in instance detection, they are not specifically optimized for instance-level feature matching. To address this, we adapt pretrained FMs for improved instance-level matching using open-world data. Our approach incorporates metric learning along with novel data augmentations, which sample distractors as negative examples and synthesize novel-view instances to enrich the visual references. Extensive experiments demonstrate that our method significantly outperforms prior works, achieving >10 AP over previous results on two recently released challenging benchmark datasets in both conventional and novel instance detection settings.

Title: Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding

Authors: Wei Suo, Lijun Zhang, Mengyang Sun, Lin Yuanbo Wu, Peng Wang, Yanning Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00361
Pdf URL: https://arxiv.org/pdf/2503.00361
Copy Paste: [[2503.00361]] Octopus: Alleviating Hallucination via Dynamic Contrastive Decoding(https://arxiv.org/abs/2503.00361)
Keywords: generative
Abstract: Large Vision-Language Models (LVLMs) have obtained impressive performance in visual content understanding and multi-modal reasoning. Unfortunately, these large models suffer from serious hallucination problems and tend to generate fabricated responses. Recently, several Contrastive Decoding (CD) strategies have been proposed to alleviate hallucination by introducing disturbed inputs. Although great progress has been made, these CD strategies mostly apply a one-size-fits-all approach for all input conditions. In this paper, we revisit this process through extensive experiments. Related results show that hallucination causes are hybrid and each generative step faces a unique hallucination challenge. Leveraging these meaningful insights, we introduce a simple yet effective Octopus-like framework that enables the model to adaptively identify hallucination types and create a dynamic CD workflow. Our Octopus framework not only outperforms existing methods across four benchmarks but also demonstrates excellent deployability and expansibility. Code is available at this https URL.

Title: Approaching the Limits to EFL Writing Enhancement with AI-generated Text and Diverse Learners

Authors: David James Woo, Hengky Susanto, Chi Ho Yeung, Kai Guo, Yilin Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.00367
Pdf URL: https://arxiv.org/pdf/2503.00367
Copy Paste: [[2503.00367]] Approaching the Limits to EFL Writing Enhancement with AI-generated Text and Diverse Learners(https://arxiv.org/abs/2503.00367)
Keywords: generative
Abstract: Generative artificial intelligence (AI) chatbots, such as ChatGPT, are reshaping how English as a foreign language (EFL) students write since students can compose texts by integrating their own words with AI-generated text. This study investigated how 59 Hong Kong secondary school students with varying levels of academic achievement interacted with AI-generated text to compose a feature article, exploring whether any interaction patterns benefited the overall quality of the article. Through content analysis, multiple linear regression and cluster analysis, we found the overall number of words -- whether AI- or human-generated -- is the main predictor of writing quality. However, the impact varies by students' competence to write independently, for instance, by using their own words accurately and coherently to compose a text, and to follow specific interaction patterns with AI-generated text. Therefore, although composing texts with human words and AI-generated text may become prevalent in EFL writing classrooms, without educators' careful attention to EFL writing pedagogy and AI literacy, high-achieving students stand to benefit more from using AI-generated text than low-achieving students.

Title: MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention

Authors: Tianyi Wang, Jianan Fan, Dingxin Zhang, Dongnan Liu, Yong Xia, Heng Huang, Weidong Cai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00374
Pdf URL: https://arxiv.org/pdf/2503.00374
Copy Paste: [[2503.00374]] MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention(https://arxiv.org/abs/2503.00374)
Keywords: self-supervised
Abstract: Histopathology and transcriptomics are fundamental modalities in oncology, encapsulating the morphological and molecular aspects of the disease. Multi-modal self-supervised learning has demonstrated remarkable potential in learning pathological representations by integrating diverse data sources. Conventional multi-modal integration methods primarily emphasize modality alignment, while paying insufficient attention to retaining the modality-specific structures. However, unlike conventional scenarios where multi-modal inputs share highly overlapping features, histopathology and transcriptomics exhibit pronounced heterogeneity, offering orthogonal yet complementary insights. Histopathology provides morphological and spatial context, elucidating tissue architecture and cellular topology, whereas transcriptomics delineates molecular signatures through gene expression patterns. This inherent disparity introduces a major challenge in aligning them while maintaining modality-specific fidelity. To address these challenges, we present MIRROR, a novel multi-modal representation learning method designed to foster both modality alignment and retention. MIRROR employs dedicated encoders to extract comprehensive features for each modality, which is further complemented by a modality alignment module to achieve seamless integration between phenotype patterns and molecular profiles. Furthermore, a modality retention module safeguards unique attributes from each modality, while a style clustering module mitigates redundancy and enhances disease-relevant information by modeling and aligning consistent pathological signatures within a clustering space. Extensive evaluations on TCGA cohorts for cancer subtyping and survival analysis highlight MIRROR's superior performance, demonstrating its effectiveness in constructing comprehensive oncological feature representations and benefiting the cancer diagnosis.

Title: Taming Large Multimodal Agents for Ultra-low Bitrate Semantically Disentangled Image Compression

Authors: Juan Song, Lijie Yang, Mingtao Feng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00399
Pdf URL: https://arxiv.org/pdf/2503.00399
Copy Paste: [[2503.00399]] Taming Large Multimodal Agents for Ultra-low Bitrate Semantically Disentangled Image Compression(https://arxiv.org/abs/2503.00399)
Keywords: diffusion
Abstract: It remains a significant challenge to compress images at ultra-low bitrate while achieving both semantic consistency and high perceptual quality. We propose a novel image compression framework, Semantically Disentangled Image Compression (SEDIC) in this paper. Our proposed SEDIC leverages large multimodal models (LMMs) to disentangle the image into several essential semantic information, including an extremely compressed reference image, overall and object-level text descriptions, and the semantic masks. A multi-stage semantic decoder is designed to progressively restore the transmitted reference image object-by-object, ultimately producing high-quality and perceptually consistent reconstructions. In each decoding stage, a pre-trained controllable diffusion model is utilized to restore the object details on the reference image conditioned by the text descriptions and semantic masks. Experimental results demonstrate that SEDIC significantly outperforms state-of-the-art approaches, achieving superior perceptual quality and semantic consistency at ultra-low bitrates ($\le$ 0.05 bpp). Our code is available at this https URL.

Title: Auto-encoding Molecules: Graph-Matching Capabilities Matter

Authors: Magnus Cunow, Gerrit Großmann
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00426
Pdf URL: https://arxiv.org/pdf/2503.00426
Copy Paste: [[2503.00426]] Auto-encoding Molecules: Graph-Matching Capabilities Matter(https://arxiv.org/abs/2503.00426)
Keywords: generative
Abstract: Autoencoders are effective deep learning models that can function as generative models and learn latent representations for downstream tasks. The use of graph autoencoders - with both encoder and decoder implemented as message passing networks - is intriguing due to their ability to generate permutation-invariant graph representations. However, this approach faces difficulties because decoding a graph structure from a single vector is challenging, and comparing input and output graphs requires an effective permutation-invariant similarity measure. As a result, many studies rely on approximate methods. In this work, we explore the effect of graph matching precision on the training behavior and generation capabilities of a Variational Autoencoder (VAE). Our contribution is two-fold: (1) we propose a transformer-based message passing graph decoder as an alternative to a graph neural network decoder, that is more robust and expressive by leveraging global attention mechanisms. (2) We show that the precision of graph matching has significant impact on training behavior and is essential for effective de novo (molecular) graph generation. Code is available at this https URL

Title: Split Adaptation for Pre-trained Vision Transformers

Authors: Lixu Wang, Bingqi Shang, Yi Li, Payal Mohapatra, Wei Dong, Xiao Wang, Qi Zhu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.00441
Pdf URL: https://arxiv.org/pdf/2503.00441
Copy Paste: [[2503.00441]] Split Adaptation for Pre-trained Vision Transformers(https://arxiv.org/abs/2503.00441)
Keywords: foundation model
Abstract: Vision Transformers (ViTs), extensively pre-trained on large-scale datasets, have become essential to foundation models, allowing excellent performance on diverse downstream tasks with minimal adaptation. Consequently, there is growing interest in adapting pre-trained ViTs across various fields, including privacy-sensitive domains where clients are often reluctant to share their data. Existing adaptation methods typically require direct data access, rendering them infeasible under these constraints. A straightforward solution may be sending the pre-trained ViT to clients for local adaptation, which poses issues of model intellectual property protection and incurs heavy client computation overhead. To address these issues, we propose a novel split adaptation (SA) method that enables effective downstream adaptation while protecting data and models. SA, inspired by split learning (SL), segments the pre-trained ViT into a frontend and a backend, with only the frontend shared with the client for data representation extraction. But unlike regular SL, SA replaces frontend parameters with low-bit quantized values, preventing direct exposure of the model. SA allows the client to add bi-level noise to the frontend and the extracted data representations, ensuring data protection. Accordingly, SA incorporates data-level and model-level out-of-distribution enhancements to mitigate noise injection's impact on adaptation performance. Our SA focuses on the challenging few-shot adaptation and adopts patch retrieval augmentation for overfitting alleviation. Extensive experiments on multiple datasets validate SA's superiority over state-of-the-art methods and demonstrate its defense against advanced data reconstruction attacks while preventing model leakage with minimal computation cost on the client side. The source codes can be found at this https URL.

Title: G-OSR: A Comprehensive Benchmark for Graph Open-Set Recognition

Authors: Yicong Dong, Rundong He, Guangyao Chen, Wentao Zhang, Zhongyi Han, Jieming Shi, Yilong Yin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.00476
Pdf URL: https://arxiv.org/pdf/2503.00476
Copy Paste: [[2503.00476]] G-OSR: A Comprehensive Benchmark for Graph Open-Set Recognition(https://arxiv.org/abs/2503.00476)
Keywords: anomaly
Abstract: Graph Neural Networks (GNNs) have achieved significant success in machine learning, with wide applications in social networks, bioinformatics, knowledge graphs, and other fields. Most research assumes ideal closed-set environments. However, in real-world open-set environments, graph learning models face challenges in robustness and reliability due to unseen classes. This highlights the need for Graph Open-Set Recognition (GOSR) methods to address these issues and ensure effective GNN application in practical scenarios. Research in GOSR is in its early stages, with a lack of a comprehensive benchmark spanning diverse tasks and datasets to evaluate methods. Moreover, traditional methods, Graph Out-of-Distribution Detection (GOODD), GOSR, and Graph Anomaly Detection (GAD) have mostly evolved in isolation, with little exploration of their interconnections or potential applications to GOSR. To fill these gaps, we introduce \textbf{G-OSR}, a comprehensive benchmark for evaluating GOSR methods at both the node and graph levels, using datasets from multiple domains to ensure fair and standardized comparisons of effectiveness and efficiency across traditional, GOODD, GOSR, and GAD methods. The results offer critical insights into the generalizability and limitations of current GOSR methods and provide valuable resources for advancing research in this field through systematic analysis of diverse approaches.

Title: Towards High-fidelity 3D Talking Avatar with Personalized Dynamic Texture

Authors: Xuanchen Li, Jianyu Wang, Yuhao Cheng, Yikun Zeng, Xingyu Ren, Wenhan Zhu, Weiming Zhao, Yichao Yan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00495
Pdf URL: https://arxiv.org/pdf/2503.00495
Copy Paste: [[2503.00495]] Towards High-fidelity 3D Talking Avatar with Personalized Dynamic Texture(https://arxiv.org/abs/2503.00495)
Keywords: diffusion
Abstract: Significant progress has been made for speech-driven 3D face animation, but most works focus on learning the motion of mesh/geometry, ignoring the impact of dynamic texture. In this work, we reveal that dynamic texture plays a key role in rendering high-fidelity talking avatars, and introduce a high-resolution 4D dataset \textbf{TexTalk4D}, consisting of 100 minutes of audio-synced scan-level meshes with detailed 8K dynamic textures from 100 subjects. Based on the dataset, we explore the inherent correlation between motion and texture, and propose a diffusion-based framework \textbf{TexTalker} to simultaneously generate facial motions and dynamic textures from speech. Furthermore, we propose a novel pivot-based style injection strategy to capture the complicity of different texture and motion styles, which allows disentangled control. TexTalker, as the first method to generate audio-synced facial motion with dynamic texture, not only outperforms the prior arts in synthesising facial motions, but also produces realistic textures that are consistent with the underlying facial movements. Project page: this https URL.

Title: Periodic Materials Generation using Text-Guided Joint Diffusion Model

Authors: Kishalay Das, Subhojyoti Khastagir, Pawan Goyal, Seung-Cheol Lee, Satadeep Bhattacharjee, Niloy Ganguly
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2503.00522
Pdf URL: https://arxiv.org/pdf/2503.00522
Copy Paste: [[2503.00522]] Periodic Materials Generation using Text-Guided Joint Diffusion Model(https://arxiv.org/abs/2503.00522)
Keywords: diffusion, generative
Abstract: Equivariant diffusion models have emerged as the prevailing approach for generating novel crystal materials due to their ability to leverage the physical symmetries of periodic material structures. However, current models do not effectively learn the joint distribution of atom types, fractional coordinates, and lattice structure of the crystal material in a cohesive end-to-end diffusion framework. Also, none of these models work under realistic setups, where users specify the desired characteristics that the generated structures must match. In this work, we introduce TGDMat, a novel text-guided diffusion model designed for 3D periodic material generation. Our approach integrates global structural knowledge through textual descriptions at each denoising step while jointly generating atom coordinates, types, and lattice structure using a periodic-E(3)-equivariant graph neural network (GNN). Extensive experiments using popular datasets on benchmark tasks reveal that TGDMat outperforms existing baseline methods by a good margin. Notably, for the structure prediction task, with just one generated sample, TGDMat outperforms all baseline models, highlighting the importance of text-guided diffusion. Further, in the generation task, TGDMat surpasses all baselines and their text-fusion variants, showcasing the effectiveness of the joint diffusion paradigm. Additionally, incorporating textual knowledge reduces overall training and sampling computational overhead while enhancing generative performance when utilizing real-world textual prompts from experts.

Title: End-To-End Learning of Gaussian Mixture Priors for Diffusion Sampler

Authors: Denis Blessing, Xiaogang Jia, Gerhard Neumann
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2503.00524
Pdf URL: https://arxiv.org/pdf/2503.00524
Copy Paste: [[2503.00524]] End-To-End Learning of Gaussian Mixture Priors for Diffusion Sampler(https://arxiv.org/abs/2503.00524)
Keywords: diffusion
Abstract: Diffusion models optimized via variational inference (VI) have emerged as a promising tool for generating samples from unnormalized target densities. These models create samples by simulating a stochastic differential equation, starting from a simple, tractable prior, typically a Gaussian distribution. However, when the support of this prior differs greatly from that of the target distribution, diffusion models often struggle to explore effectively or suffer from large discretization errors. Moreover, learning the prior distribution can lead to mode-collapse, exacerbated by the mode-seeking nature of reverse Kullback-Leibler divergence commonly used in VI. To address these challenges, we propose end-to-end learnable Gaussian mixture priors (GMPs). GMPs offer improved control over exploration, adaptability to target support, and increased expressiveness to counteract mode collapse. We further leverage the structure of mixture models by proposing a strategy to iteratively refine the model by adding mixture components during training. Our experimental results demonstrate significant performance improvements across a diverse range of real-world and synthetic benchmark problems when using GMPs without requiring additional target evaluations.

Title: GaussianSeal: Rooting Adaptive Watermarks for 3D Gaussian Generation Model

Authors: Runyi Li, Xuanyu Zhang, Chuhan Tong, Zhipei Xu, Jian Zhang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.00531
Pdf URL: https://arxiv.org/pdf/2503.00531
Copy Paste: [[2503.00531]] GaussianSeal: Rooting Adaptive Watermarks for 3D Gaussian Generation Model(https://arxiv.org/abs/2503.00531)
Keywords: generative
Abstract: With the advancement of AIGC technologies, the modalities generated by models have expanded from images and videos to 3D objects, leading to an increasing number of works focused on 3D Gaussian Splatting (3DGS) generative models. Existing research on copyright protection for generative models has primarily concentrated on watermarking in image and text modalities, with little exploration into the copyright protection of 3D object generative models. In this paper, we propose the first bit watermarking framework for 3DGS generative models, named GaussianSeal, to enable the decoding of bits as copyright identifiers from the rendered outputs of generated 3DGS. By incorporating adaptive bit modulation modules into the generative model and embedding them into the network blocks in an adaptive way, we achieve high-precision bit decoding with minimal training overhead while maintaining the fidelity of the model's outputs. Experiments demonstrate that our method outperforms post-processing watermarking approaches for 3DGS objects, achieving superior performance of watermark decoding accuracy and preserving the quality of the generated results.

Title: What Makes a Good Diffusion Planner for Decision Making?

Authors: Haofei Lu, Dongqi Han, Yifei Shen, Dongsheng Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00535
Pdf URL: https://arxiv.org/pdf/2503.00535
Copy Paste: [[2503.00535]] What Makes a Good Diffusion Planner for Decision Making?(https://arxiv.org/abs/2503.00535)
Keywords: diffusion
Abstract: Diffusion models have recently shown significant potential in solving decision-making problems, particularly in generating behavior plans -- also known as diffusion planning. While numerous studies have demonstrated the impressive performance of diffusion planning, the mechanisms behind the key components of a good diffusion planner remain unclear and the design choices are highly inconsistent in existing studies. In this work, we address this issue through systematic empirical experiments on diffusion planning in an offline reinforcement learning (RL) setting, providing practical insights into the essential components of diffusion planning. We trained and evaluated over 6,000 diffusion models, identifying the critical components such as guided sampling, network architecture, action generation and planning strategy. We revealed that some design choices opposite to the common practice in previous work in diffusion planning actually lead to better performance, e.g., unconditional sampling with selection can be better than guided sampling and Transformer outperforms U-Net as denoising network. Based on these insights, we suggest a simple yet strong diffusion planning baseline that achieves state-of-the-art results on standard offline RL benchmarks.

Title: Streaming Video Question-Answering with In-context Video KV-Cache Retrieval

Authors: Shangzhe Di, Zhelun Yu, Guanghao Zhang, Haoyuan Li, Tao Zhong, Hao Cheng, Bolin Li, Wanggui He, Fangxun Shu, Hao Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00540
Pdf URL: https://arxiv.org/pdf/2503.00540
Copy Paste: [[2503.00540]] Streaming Video Question-Answering with In-context Video KV-Cache Retrieval(https://arxiv.org/abs/2503.00540)
Keywords: in-context
Abstract: We propose ReKV, a novel training-free approach that enables efficient streaming video question-answering (StreamingVQA), by seamlessly integrating with existing Video Large Language Models (Video-LLMs). Traditional VideoQA systems struggle with long videos, as they must process entire videos before responding to queries, and repeat this process for each new question. In contrast, our approach analyzes long videos in a streaming manner, allowing for prompt responses as soon as user queries are received. Building on a common Video-LLM, we first incorporate a sliding-window attention mechanism, ensuring that input frames attend to a limited number of preceding frames, thereby reducing computational overhead. To prevent information loss, we store processed video key-value caches (KV-Caches) in RAM and disk, reloading them into GPU memory as needed. Additionally, we introduce a retrieval method that leverages an external retriever or the parameters within Video-LLMs to retrieve only query-relevant KV-Caches, ensuring both efficiency and accuracy in question answering. ReKV enables the separation of video encoding and question-answering across different processes and GPUs, significantly enhancing the efficiency of StreamingVQA. Through comprehensive experimentation, we validate the efficacy and practicality of our approach, which significantly boosts efficiency and enhances applicability over existing VideoQA models.

Title: Brain Foundation Models: A Survey on Advancements in Neural Signal Processing and Brain Discovery

Authors: Xinliang Zhou, Chenyu Liu, Zhisheng Chen, Kun Wang, Yi Ding, Ziyu Jia, Qingsong Wen
Subjects: cs.LG, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/2503.00580
Pdf URL: https://arxiv.org/pdf/2503.00580
Copy Paste: [[2503.00580]] Brain Foundation Models: A Survey on Advancements in Neural Signal Processing and Brain Discovery(https://arxiv.org/abs/2503.00580)
Keywords: foundation model
Abstract: Brain foundation models (BFMs) have emerged as a transformative paradigm in computational neuroscience, offering a revolutionary framework for processing diverse neural signals across different brain-related tasks. These models leverage large-scale pre-training techniques, allowing them to generalize effectively across multiple scenarios, tasks, and modalities, thus overcoming the traditional limitations faced by conventional artificial intelligence (AI) approaches in understanding complex brain data. By tapping into the power of pretrained models, BFMs provide a means to process neural data in a more unified manner, enabling advanced analysis and discovery in the field of neuroscience. In this survey, we define BFMs for the first time, providing a clear and concise framework for constructing and utilizing these models in various applications. We also examine the key principles and methodologies for developing these models, shedding light on how they transform the landscape of neural signal processing. This survey presents a comprehensive review of the latest advancements in BFMs, covering the most recent methodological innovations, novel views of application areas, and challenges in the field. Notably, we highlight the future directions and key challenges that need to be addressed to fully realize the potential of BFMs. These challenges include improving the quality of brain data, optimizing model architecture for better generalization, increasing training efficiency, and enhancing the interpretability and robustness of BFMs in real-world applications.

Title: AesthetiQ: Enhancing Graphic Layout Design via Aesthetic-Aware Preference Alignment of Multi-modal Large Language Models

Authors: Sohan Patnaik, Rishabh Jain, Balaji Krishnamurthy, Mausoom Sarkar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00591
Pdf URL: https://arxiv.org/pdf/2503.00591
Copy Paste: [[2503.00591]] AesthetiQ: Enhancing Graphic Layout Design via Aesthetic-Aware Preference Alignment of Multi-modal Large Language Models(https://arxiv.org/abs/2503.00591)
Keywords: generative
Abstract: Visual layouts are essential in graphic design fields such as advertising, posters, and web interfaces. The application of generative models for content-aware layout generation has recently gained traction. However, these models fail to understand the contextual aesthetic requirements of layout design and do not align with human-like preferences, primarily treating it as a prediction task without considering the final rendered output. To overcome these problems, we offer Aesthetic-Aware Preference Alignment(AAPA), a novel technique to train a Multi-modal Large Language Model (MLLM) for layout prediction that uses MLLM's aesthetic preferences for Direct Preference Optimization over graphic layouts. We propose a data filtering protocol utilizing our layout-quality heuristics for AAPA to ensure training happens on high-quality layouts. Additionally, we introduce a novel evaluation metric that uses another MLLM to compute the win rate of the generated layout against the ground-truth layout based on aesthetics criteria. We also demonstrate the applicability of AAPA for MLLMs of varying scales (1B to 8B parameters) and LLM families (Qwen, Phi, InternLM). By conducting thorough qualitative and quantitative analyses, we verify the efficacy of our approach on two challenging benchmarks - Crello and Webui, showcasing 17%, and 16 improvement over current State-of-The-Art methods, thereby highlighting the potential of MLLMs in aesthetic-aware layout generation.

Title: SolidMark: Evaluating Image Memorization in Generative Models

Authors: Nicky Kriplani, Minh Pham, Gowthami Somepalli, Chinmay Hegde, Niv Cohen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.00592
Pdf URL: https://arxiv.org/pdf/2503.00592
Copy Paste: [[2503.00592]] SolidMark: Evaluating Image Memorization in Generative Models(https://arxiv.org/abs/2503.00592)
Keywords: diffusion, generative
Abstract: Recent works have shown that diffusion models are able to memorize training images and emit them at generation time. However, the metrics used to evaluate memorization and its mitigation techniques suffer from dataset-dependent biases and struggle to detect whether a given specific image has been memorized or not. This paper begins with a comprehensive exploration of issues surrounding memorization metrics in diffusion models. Then, to mitigate these issues, we introduce $\rm \style{font-variant: small-caps}{SolidMark}$, a novel evaluation method that provides a per-image memorization score. We then re-evaluate existing memorization mitigation techniques. We also show that $\rm \style{font-variant: small-caps}{SolidMark}$ is capable of evaluating fine-grained pixel-level memorization. Finally, we release a variety of models based on $\rm \style{font-variant: small-caps}{SolidMark}$ to facilitate further research for understanding memorization phenomena in generative models. All of our code is available at this https URL.

Title: Synergy Between Sufficient Changes and Sparse Mixing Procedure for Disentangled Representation Learning

Authors: Zijian Li, Shunxing Fan, Yujia Zheng, Ignavier Ng, Shaoan Xie, Guangyi Chen, Xinshuai Dong, Ruichu Cai, Kun Zhang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.00639
Pdf URL: https://arxiv.org/pdf/2503.00639
Copy Paste: [[2503.00639]] Synergy Between Sufficient Changes and Sparse Mixing Procedure for Disentangled Representation Learning(https://arxiv.org/abs/2503.00639)
Keywords: generative
Abstract: Disentangled representation learning aims to uncover latent variables underlying the observed data, and generally speaking, rather strong assumptions are needed to ensure identifiability. Some approaches rely on sufficient changes on the distribution of latent variables indicated by auxiliary variables such as domain indices, but acquiring enough domains is often challenging. Alternative approaches exploit structural sparsity assumptions on the mixing procedure, but such constraints are usually (partially) violated in practice. Interestingly, we find that these two seemingly unrelated assumptions can actually complement each other to achieve identifiability. Specifically, when conditioned on auxiliary variables, the sparse mixing procedure assumption provides structural constraints on the mapping from estimated to true latent variables and hence compensates for potentially insufficient distribution changes. Building on this insight, we propose an identifiability theory with less restrictive constraints regarding distribution changes and the sparse mixing procedure, enhancing applicability to real-world scenarios. Additionally, we develop an estimation framework incorporating a domain encoding network and a sparse mixing constraint and provide two implementations based on variational autoencoders and generative adversarial networks, respectively. Experiment results on synthetic and real-world datasets support our theoretical results.

Title: How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations

Authors: Siddhartha Gairola, Moritz Böhle, Francesco Locatello, Bernt Schiele
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00641
Pdf URL: https://arxiv.org/pdf/2503.00641
Copy Paste: [[2503.00641]] How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations(https://arxiv.org/abs/2503.00641)
Keywords: self-supervised
Abstract: Post-hoc importance attribution methods are a popular tool for "explaining" Deep Neural Networks (DNNs) and are inherently based on the assumption that the explanations can be applied independently of how the models were trained. Contrarily, in this work we bring forward empirical evidence that challenges this very notion. Surprisingly, we discover a strong dependency on and demonstrate that the training details of a pre-trained model's classification layer (less than 10 percent of model parameters) play a crucial role, much more than the pre-training scheme itself. This is of high practical relevance: (1) as techniques for pre-training models are becoming increasingly diverse, understanding the interplay between these techniques and attribution methods is critical; (2) it sheds light on an important yet overlooked assumption of post-hoc attribution methods which can drastically impact model explanations and how they are interpreted eventually. With this finding we also present simple yet effective adjustments to the classification layers, that can significantly enhance the quality of model explanations. We validate our findings across several visual pre-training frameworks (fully-supervised, self-supervised, contrastive vision-language training) and analyse how they impact explanations for a wide range of attribution methods on a diverse set of evaluation metrics.

Title: Discrete Codebook World Models for Continuous Control

Authors: Aidan Scannell, Mohammadreza Nakhaei, Kalle Kujanpää, Yi Zhao, Kevin Sebastian Luck, Arno Solin, Joni Pajarinen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.00653
Pdf URL: https://arxiv.org/pdf/2503.00653
Copy Paste: [[2503.00653]] Discrete Codebook World Models for Continuous Control(https://arxiv.org/abs/2503.00653)
Keywords: self-supervised
Abstract: In reinforcement learning (RL), world models serve as internal simulators, enabling agents to predict environment dynamics and future outcomes in order to make informed decisions. While previous approaches leveraging discrete latent spaces, such as DreamerV3, have demonstrated strong performance in discrete action settings and visual control tasks, their comparative performance in state-based continuous control remains underexplored. In contrast, methods with continuous latent spaces, such as TD-MPC2, have shown notable success in state-based continuous control benchmarks. In this paper, we demonstrate that modeling discrete latent states has benefits over continuous latent states and that discrete codebook encodings are more effective representations for continuous control, compared to alternative encodings, such as one-hot and label-based encodings. Based on these insights, we introduce DCWM: Discrete Codebook World Model, a self-supervised world model with a discrete and stochastic latent space, where latent states are codes from a codebook. We combine DCWM with decision-time planning to get our model-based RL algorithm, named DC-MPC: Discrete Codebook Model Predictive Control, which performs competitively against recent state-of-the-art algorithms, including TD-MPC2 and DreamerV3, on continuous control benchmarks. See our project website this http URL.

Title: Transformer Based Self-Context Aware Prediction for Few-Shot Anomaly Detection in Videos

Authors: Gargi V. Pillai, Ashish Verma, Debashis Sen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00670
Pdf URL: https://arxiv.org/pdf/2503.00670
Copy Paste: [[2503.00670]] Transformer Based Self-Context Aware Prediction for Few-Shot Anomaly Detection in Videos(https://arxiv.org/abs/2503.00670)
Keywords: anomaly
Abstract: Anomaly detection in videos is a challenging task as anomalies in different videos are of different kinds. Therefore, a promising way to approach video anomaly detection is by learning the non-anomalous nature of the video at hand. To this end, we propose a one-class few-shot learning driven transformer based approach for anomaly detection in videos that is self-context aware. Features from the first few consecutive non-anomalous frames in a video are used to train the transformer in predicting the non-anomalous feature of the subsequent frame. This takes place under the attention of a self-context learned from the input features themselves. After the learning, given a few previous frames, the video-specific transformer is used to infer if a frame is anomalous or not by comparing the feature predicted by it with the actual. The effectiveness of the proposed method with respect to the state-of-the-art is demonstrated through qualitative and quantitative results on different standard datasets. We also study the positive effect of the self-context used in our approach.

Title: Proteina: Scaling Flow-based Protein Structure Generative Models

Authors: Tomas Geffner, Kieran Didi, Zuobai Zhang, Danny Reidenbach, Zhonglin Cao, Jason Yim, Mario Geiger, Christian Dallago, Emine Kucukbenli, Arash Vahdat, Karsten Kreis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.00710
Pdf URL: https://arxiv.org/pdf/2503.00710
Copy Paste: [[2503.00710]] Proteina: Scaling Flow-based Protein Structure Generative Models(https://arxiv.org/abs/2503.00710)
Keywords: diffusion, generative
Abstract: Recently, diffusion- and flow-based generative models of protein structures have emerged as a powerful tool for de novo protein design. Here, we develop Proteina, a new large-scale flow-based protein backbone generator that utilizes hierarchical fold class labels for conditioning and relies on a tailored scalable transformer architecture with up to 5x as many parameters as previous models. To meaningfully quantify performance, we introduce a new set of metrics that directly measure the distributional similarity of generated proteins with reference sets, complementing existing metrics. We further explore scaling training data to millions of synthetic protein structures and explore improved training and sampling recipes adapted to protein backbone generation. This includes fine-tuning strategies like LoRA for protein backbones, new guidance methods like classifier-free guidance and autoguidance for protein backbones, and new adjusted training objectives. Proteina achieves state-of-the-art performance on de novo protein backbone design and produces diverse and designable proteins at unprecedented length, up to 800 residues. The hierarchical conditioning offers novel control, enabling high-level secondary-structure guidance as well as low-level fold-specific generation.

Title: OpenECG: Benchmarking ECG Foundation Models with Public 1.2 Million Records

Authors: Zhijiang Wan, Qianhao Yu, Jia Mao, Wenfeng Duan, Cheng Ding
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00711
Pdf URL: https://arxiv.org/pdf/2503.00711
Copy Paste: [[2503.00711]] OpenECG: Benchmarking ECG Foundation Models with Public 1.2 Million Records(https://arxiv.org/abs/2503.00711)
Keywords: self-supervised, foundation model, generative
Abstract: This study introduces OpenECG, a large-scale benchmark of 1.2 million 12-lead ECG recordings from nine centers, to evaluate ECG foundation models (ECG-FMs) trained on public datasets. We investigate three self-supervised learning methods (SimCLR, BYOL, MAE) with ResNet-50 and Vision Transformer architectures, assessing model generalization through leave-one-dataset-out experiments and data scaling analysis. Results show that pre-training on diverse datasets significantly improves generalization, with BYOL and MAE outperforming SimCLR, highlighting the efficacy of feature-consistency and generative learning over contrastive approaches. Data scaling experiments reveal that performance saturates at 60-70% of total data for BYOL and MAE, while SimCLR requires more data. These findings demonstrate that publicly available ECG data can match or surpass proprietary datasets in training robust ECG-FMs, paving the way for scalable, clinically meaningful AI-driven ECG analysis.

Title: Shazam: Unifying Multiple Foundation Models for Advanced Computational Pathology

Authors: Wenhui Lei, Anqi Li, Yusheng Tan, Hanyu Chen, Xiaofan Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00736
Pdf URL: https://arxiv.org/pdf/2503.00736
Copy Paste: [[2503.00736]] Shazam: Unifying Multiple Foundation Models for Advanced Computational Pathology(https://arxiv.org/abs/2503.00736)
Keywords: foundation model
Abstract: Foundation Models (FMs) in computational pathology (CPath) have significantly advanced the extraction of meaningful features from histopathology image datasets, achieving strong performance across various clinical tasks. Despite their impressive performance, these models often exhibit variability when applied to different tasks, prompting the need for a unified framework capable of consistently excelling across various applications. In this work, we propose Shazam, a novel framework designed to efficiently combine multiple CPath models. Unlike previous approaches that train a fixed-parameter FM, Shazam dynamically extracts and refines information from diverse FMs for each specific task. To ensure that each FM contributes effectively without dominance, a novel distillation strategy is applied, guiding the student model with features from all teacher models, which enhances its generalization ability. Experimental results on two pathology patch classification datasets demonstrate that Shazam outperforms existing CPath models and other fusion methods. Its lightweight, flexible design makes it a promising solution for improving CPath analysis in real-world settings. Code will be available at this https URL.

Title: FaceShot: Bring Any Character into Life

Authors: Junyao Gao, Yanan Sun, Fei Shen, Xin Jiang, Zhening Xing, Kai Chen, Cairong Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00740
Pdf URL: https://arxiv.org/pdf/2503.00740
Copy Paste: [[2503.00740]] FaceShot: Bring Any Character into Life(https://arxiv.org/abs/2503.00740)
Keywords: diffusion
Abstract: In this paper, we present FaceShot, a novel training-free portrait animation framework designed to bring any character into life from any driven video without fine-tuning or retraining. We achieve this by offering precise and robust reposed landmark sequences from an appearance-guided landmark matching module and a coordinate-based landmark retargeting module. Together, these components harness the robust semantic correspondences of latent diffusion models to produce facial motion sequence across a wide range of character types. After that, we input the landmark sequences into a pre-trained landmark-driven animation model to generate animated video. With this powerful generalization capability, FaceShot can significantly extend the application of portrait animation by breaking the limitation of realistic portrait landmark detection for any stylized character and driven video. Also, FaceShot is compatible with any landmark-driven animation model, significantly improving overall performance. Extensive experiments on our newly constructed character benchmark CharacBench confirm that FaceShot consistently surpasses state-of-the-art (SOTA) approaches across any character domain. More results are available at our project website this https URL.

Title: Confounder-Aware Medical Data Selection for Fine-Tuning Pretrained Vision Models

Authors: Anyang Ji, Qingbo Kang, Wei Xu, Changfan Wang, Kang Li, Qicheng Lao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00744
Pdf URL: https://arxiv.org/pdf/2503.00744
Copy Paste: [[2503.00744]] Confounder-Aware Medical Data Selection for Fine-Tuning Pretrained Vision Models(https://arxiv.org/abs/2503.00744)
Keywords: foundation model
Abstract: The emergence of large-scale pre-trained vision foundation models has greatly advanced the medical imaging field through the pre-training and fine-tuning paradigm. However, selecting appropriate medical data for downstream fine-tuning remains a significant challenge considering its annotation cost, privacy concerns, and the detrimental effects of confounding variables. In this work, we present a confounder-aware medical data selection approach for medical dataset curation aiming to select minimal representative data by strategically mitigating the undesirable impact of confounding variables while preserving the natural distribution of the dataset. Our approach first identifies confounding variables within data and then develops a distance-based data selection strategy for confounder-aware sampling with a constrained budget in the data size. We validate the superiority of our approach through extensive experiments across diverse medical imaging modalities, highlighting its effectiveness in addressing the substantial impact of confounding variables and enhancing the fine-tuning efficiency in the medical imaging domain, compared to other data selection approaches.

Title: Dynamic Gradient Sparsification Training for Few-Shot Fine-tuning of CT Lymph Node Segmentation Foundation Model

Authors: Zihao Luo, Zijun Gao, Wenjun Liao, Shichuan Zhang, Guotai Wang, Xiangde Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00748
Pdf URL: https://arxiv.org/pdf/2503.00748
Copy Paste: [[2503.00748]] Dynamic Gradient Sparsification Training for Few-Shot Fine-tuning of CT Lymph Node Segmentation Foundation Model(https://arxiv.org/abs/2503.00748)
Keywords: foundation model
Abstract: Accurate lymph node (LN) segmentation is critical in radiotherapy treatment and prognosis analysis, but is limited by the need for large annotated datasets. While deep learning-based segmentation foundation models show potential in developing high-performing models with fewer samples, their medical adaptation faces LN domain-specific prior deficiencies and inefficient few-shot fine-tuning for complex clinical practices, highlighting the necessity of an LN segmentation foundation model. In this work, we annotated 36,106 visible LNs from 3,346 publicly available head-and-neck CT scans to establish a robust LN segmentation model (nnUNetv2). Building on this, we propose Dynamic Gradient Sparsification Training (DGST), a few-shot fine-tuning approach that preserves foundational knowledge while dynamically updating the most critical parameters of the LN segmentation model with few annotations. We validate it on two publicly available LN segmentation datasets: SegRap2023 and LNQ2023. The results show that DGST outperforms existing few-shot fine-tuning methods, achieving satisfactory performance with limited labeled data. We release the dataset, models and all implementations to facilitate relevant research: this https URL.

Title: Edge Prompt Tuning for Graph Neural Networks

Authors: Xingbo Fu, Yinhan He, Jundong Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.00750
Pdf URL: https://arxiv.org/pdf/2503.00750
Copy Paste: [[2503.00750]] Edge Prompt Tuning for Graph Neural Networks(https://arxiv.org/abs/2503.00750)
Keywords: self-supervised
Abstract: Pre-training powerful Graph Neural Networks (GNNs) with unlabeled graph data in a self-supervised manner has emerged as a prominent technique in recent years. However, inevitable objective gaps often exist between pre-training and downstream tasks. To bridge this gap, graph prompt tuning techniques design and learn graph prompts by manipulating input graphs or reframing downstream tasks as pre-training tasks without fine-tuning the pre-trained GNN models. While recent graph prompt tuning methods have proven effective in adapting pre-trained GNN models for downstream tasks, they overlook the crucial role of edges in graph prompt design, which can significantly affect the quality of graph representations for downstream tasks. In this study, we propose EdgePrompt, a simple yet effective graph prompt tuning method from the perspective of edges. Unlike previous studies that design prompt vectors on node features, EdgePrompt manipulates input graphs by learning additional prompt vectors for edges and incorporates the edge prompts through message passing in the pre-trained GNN models to better embed graph structural information for downstream tasks. Our method is compatible with prevalent GNN architectures pre-trained under various pre-training strategies and is universal for different downstream tasks. We provide comprehensive theoretical analyses of our method regarding its capability of handling node classification and graph classification as downstream tasks. Extensive experiments on ten graph datasets under four pre-training strategies demonstrate the superiority of our proposed method against six baselines. Our code is available at this https URL.

Title: Wavelet-Driven Masked Image Modeling: A Path to Efficient Visual Representation

Authors: Wenzhao Xiang, Chang Liu, Hongyang Yu, Xilin Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00782
Pdf URL: https://arxiv.org/pdf/2503.00782
Copy Paste: [[2503.00782]] Wavelet-Driven Masked Image Modeling: A Path to Efficient Visual Representation(https://arxiv.org/abs/2503.00782)
Keywords: self-supervised
Abstract: Masked Image Modeling (MIM) has garnered significant attention in self-supervised learning, thanks to its impressive capacity to learn scalable visual representations tailored for downstream tasks. However, images inherently contain abundant redundant information, leading the pixel-based MIM reconstruction process to focus excessively on finer details such as textures, thus prolonging training times unnecessarily. Addressing this challenge requires a shift towards a compact representation of features during MIM reconstruction. Frequency domain analysis provides a promising avenue for achieving compact image feature representation. In contrast to the commonly used Fourier transform, wavelet transform not only offers frequency information but also preserves spatial characteristics and multi-level features of the image. Additionally, the multi-level decomposition process of wavelet transformation aligns well with the hierarchical architecture of modern neural networks. In this study, we leverage wavelet transform as a tool for efficient representation learning to expedite the training process of MIM. Specifically, we conduct multi-level decomposition of images using wavelet transform, utilizing wavelet coefficients from different levels to construct distinct reconstruction targets representing various frequencies and scales. These reconstruction targets are then integrated into the MIM process, with adjustable weights assigned to prioritize the most crucial information. Extensive experiments demonstrate that our method achieves comparable or superior performance across various downstream tasks while exhibiting higher training efficiency.

Title: MFM-DA: Instance-Aware Adaptor and Hierarchical Alignment for Efficient Domain Adaptation in Medical Foundation Models

Authors: Jia-Xuan Jiang, Wenhui Lei, Yifeng Wu, Hongtao Wu, Furong Li, Yining Xie, Xiaofan Zhang, Zhong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00802
Pdf URL: https://arxiv.org/pdf/2503.00802
Copy Paste: [[2503.00802]] MFM-DA: Instance-Aware Adaptor and Hierarchical Alignment for Efficient Domain Adaptation in Medical Foundation Models(https://arxiv.org/abs/2503.00802)
Keywords: diffusion, foundation model
Abstract: Medical Foundation Models (MFMs), trained on large-scale datasets, have demonstrated superior performance across various tasks. However, these models still struggle with domain gaps in practical applications. Specifically, even after fine-tuning on source-domain data, task-adapted foundation models often perform poorly in the target domain. To address this challenge, we propose a few-shot unsupervised domain adaptation (UDA) framework for MFMs, named MFM-DA, which only leverages a limited number of unlabeled target-domain images. Our approach begins by training a Denoising Diffusion Probabilistic Model (DDPM), which is then adapted to the target domain using a proposed dynamic instance-aware adaptor and a distribution direction loss, enabling the DDPM to translate source-domain images into the target domain style. The adapted images are subsequently processed through the MFM, where we introduce a designed channel-spatial alignment Low-Rank Adaptation (LoRA) to ensure effective feature alignment. Extensive experiments on optic cup and disc segmentation tasks demonstrate that MFM-DA outperforms state-of-the-art methods. Our work provides a practical solution to the domain gap issue in real-world MFM deployment. Code will be available at here.

Title: HiMo: High-Speed Objects Motion Compensation in Point Clouds

Authors: Qingwen Zhang, Ajinkya Khoche, Yi Yang, Li Ling, Sina Sharif Mansouri, Olov Andersson, Patric Jensfelt
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.00803
Pdf URL: https://arxiv.org/pdf/2503.00803
Copy Paste: [[2503.00803]] HiMo: High-Speed Objects Motion Compensation in Point Clouds(https://arxiv.org/abs/2503.00803)
Keywords: self-supervised
Abstract: LiDAR point clouds often contain motion-induced distortions, degrading the accuracy of object appearances in the captured data. In this paper, we first characterize the underlying reasons for the point cloud distortion and show that this is present in public datasets. We find that this distortion is more pronounced in high-speed environments such as highways, as well as in multi-LiDAR configurations, a common setup for heavy vehicles. Previous work has dealt with point cloud distortion from the ego-motion but fails to consider distortion from the motion of other objects. We therefore introduce a novel undistortion pipeline, HiMo, that leverages scene flow estimation for object motion compensation, correcting the depiction of dynamic objects. We further propose an extension of a state-of-the-art self-supervised scene flow method. Due to the lack of well-established motion distortion metrics in the literature, we also propose two metrics for compensation performance evaluation: compensation accuracy at a point level and shape similarity on objects. To demonstrate the efficacy of our method, we conduct extensive experiments on the Argoverse 2 dataset and a new real-world dataset. Our new dataset is collected from heavy vehicles equipped with multi-LiDARs and on highways as opposed to mostly urban settings in the existing datasets. The source code, including all methods and the evaluation data, will be provided upon publication. See this https URL for more details.

Title: Toward Stable and Consistent Evaluation Results: A New Methodology for Base Model Evaluation

Authors: Hongzhi Luan, Changxin Tian, Zhaoxin Huan, Xiaolu Zhang, Kunlong Chen, Zhiqiang Zhang, Jun Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.00812
Pdf URL: https://arxiv.org/pdf/2503.00812
Copy Paste: [[2503.00812]] Toward Stable and Consistent Evaluation Results: A New Methodology for Base Model Evaluation(https://arxiv.org/abs/2503.00812)
Keywords: in-context
Abstract: This paper poses two critical issues in evaluating base models (without post-training): (1) Unstable evaluation during training: in the early stages of pre-training, the models lack the capability to answer questions as required, leading to unstable evaluation results. This instability makes it difficult to provide solid conclusions to guide the training, especially for key experiments such as data ablation and scaling law. (2) Inconsistency between base and instruct models: base models generally exhibit poorer evaluation performance compared to corresponding instruct models. This gap poses a challenge for assessing whether a base model with better evaluation can truly lead to a better instruct model. To address these issues, we propose Base model Oriented Systematic Evaluation (BOSE), a method specifically designed to optimize the evaluation of base models. Specifically, BOSE introduces two key innovations: In-Context Light-instruction Prompt (ICLiP) for open-ended tasks and Blank-ppl for multi-choice tasks with candidate options, which transforms the standard perplexity (ppl) metric into a fill-in-the-blank format to mitigate early-stage evaluation fluctuations. Furthermore, we are the first to propose Kendall's rank correlation to quantitatively measure the evaluation stability and consistency. Experimental results demonstrate that BOSE significantly enhances both the stability of evaluations during pre-training and the consistency between base and instruct models, thereby providing more reliable guidance for the LLMs' training.

Title: Foundation Models Secretly Understand Neural Network Weights: Enhancing Hypernetwork Architectures with Foundation Models

Authors: Jeffrey Gu, Serena Yeung-Levy
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.00838
Pdf URL: https://arxiv.org/pdf/2503.00838
Copy Paste: [[2503.00838]] Foundation Models Secretly Understand Neural Network Weights: Enhancing Hypernetwork Architectures with Foundation Models(https://arxiv.org/abs/2503.00838)
Keywords: foundation model
Abstract: Large pre-trained models, or foundation models, have shown impressive performance when adapted to a variety of downstream tasks, often out-performing specialized models. Hypernetworks, neural networks that generate some or all of the parameters of another neural network, have become an increasingly important technique for conditioning and generalizing implicit neural representations (INRs), which represent signals or objects such as audio or 3D shapes using a neural network. However, despite the potential benefits of incorporating foundation models in hypernetwork methods, this research direction has not been investigated, likely due to the dissimilarity of the weight generation task with other visual tasks. To address this gap, we (1) show how foundation models can improve hypernetworks with Transformer-based architectures, (2) provide an empirical analysis of the benefits of foundation models for hypernetworks through the lens of the generalizable INR task, showing that leveraging foundation models improves performance, generalizability, and data efficiency across a variety of algorithms and modalities. We also provide further analysis in examining the design space of foundation model-based hypernetworks, including examining the choice of foundation models, algorithms, and the effect of scaling foundation models.

Title: CyberCScope: Mining Skewed Tensor Streams and Online Anomaly Detection in Cybersecurity Systems

Authors: Kota Nakamura, Koki Kawabata, Shungo Tanaka, Yasuko Matsubara, Yasushi Sakurai
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2503.00871
Pdf URL: https://arxiv.org/pdf/2503.00871
Copy Paste: [[2503.00871]] CyberCScope: Mining Skewed Tensor Streams and Online Anomaly Detection in Cybersecurity Systems(https://arxiv.org/abs/2503.00871)
Keywords: anomaly
Abstract: Cybersecurity systems are continuously producing a huge number of time-stamped events in the form of high-order tensors, such as {count; time, port, flow duration, packet size, . . . }, and so how can we detect anomalies/intrusions in real time? How can we identify multiple types of intrusions and capture their characteristic behaviors? The tensor data consists of categorical and continuous attributes and the data distributions of continuous attributes typically exhibit skew. These data properties require handling skewed infinite and finite dimensional spaces simultaneously. In this paper, we propose a novel streaming method, namely CyberCScope. The method effectively decomposes incoming tensors into major trends while explicitly distinguishing between categorical and skewed continuous attributes. To our knowledge, it is the first to compute hybrid skewed infinite and finite dimensional decomposition. Based on this decomposition, it streamingly finds distinct time-evolving patterns, enabling the detection of multiple types of anomalies. Extensive experiments on large-scale real datasets demonstrate that CyberCScope detects various intrusions with higher accuracy than state-of-the-art baselines while providing meaningful summaries for the intrusions that occur in practice.

Title: A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning

Authors: Shashank Gupta, Chaitanya Ahuja, Tsung-Yu Lin, Sreya Dutta Roy, Harrie Oosterhuis, Maarten de Rijke, Satya Narayan Shukla
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.00897
Pdf URL: https://arxiv.org/pdf/2503.00897
Copy Paste: [[2503.00897]] A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning(https://arxiv.org/abs/2503.00897)
Keywords: diffusion
Abstract: Reinforcement learning ( RL)-based fine-tuning has emerged as a powerful approach for aligning diffusion models with black-box objectives. Proximal policy optimization (PPO) is the most popular choice of method for policy optimization. While effective in terms of performance, PPO is highly sensitive to hyper-parameters and involves substantial computational overhead. REINFORCE, on the other hand, mitigates some computational complexities such as high memory overhead and sensitive hyper-parameter tuning, but has suboptimal performance due to high-variance and sample inefficiency. While the variance of the REINFORCE can be reduced by sampling multiple actions per input prompt and using a baseline correction term, it still suffers from sample inefficiency. To address these challenges, we systematically analyze the efficiency-effectiveness trade-off between REINFORCE and PPO, and propose leave-one-out PPO ( LOOP), a novel RL for diffusion fine-tuning method. LOOP combines variance reduction techniques from REINFORCE, such as sampling multiple actions per input prompt and a baseline correction term, with the robustness and sample efficiency of PPO via clipping and importance sampling. Our results demonstrate that LOOP effectively improves diffusion models on various black-box objectives, and achieves a better balance between computational efficiency and performance.

Title: From Poses to Identity: Training-Free Person Re-Identification via Feature Centralization

Authors: Chao Yuan, Guiwei Zhang, Changxiao Ma, Tianyi Zhang, Guanglin Niu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00938
Pdf URL: https://arxiv.org/pdf/2503.00938
Copy Paste: [[2503.00938]] From Poses to Identity: Training-Free Person Re-Identification via Feature Centralization(https://arxiv.org/abs/2503.00938)
Keywords: generative
Abstract: Person re-identification (ReID) aims to extract accurate identity representation features. However, during feature extraction, individual samples are inevitably affected by noise (background, occlusions, and model limitations). Considering that features from the same identity follow a normal distribution around identity centers after training, we propose a Training-Free Feature Centralization ReID framework (Pose2ID) by aggregating the same identity features to reduce individual noise and enhance the stability of identity representation, which preserves the feature's original distribution for following strategies such as re-ranking. Specifically, to obtain samples of the same identity, we introduce two components:Identity-Guided Pedestrian Generation: by leveraging identity features to guide the generation process, we obtain high-quality images with diverse poses, ensuring identity consistency even in complex scenarios such as infrared, and this http URL Feature Centralization: it explores each sample's potential positive samples from its neighborhood. Experiments demonstrate that our generative model exhibits strong generalization capabilities and maintains high identity consistency. With the Feature Centralization framework, we achieve impressive performance even with an ImageNet pre-trained model without ReID training, reaching mAP/Rank-1 of 52.81/78.92 on Market1501. Moreover, our method sets new state-of-the-art results across standard, cross-modality, and occluded ReID tasks, showcasing strong adaptability.

Title: Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think

Authors: Jie Tian, Xiaoye Qu, Zhenyi Lu, Wei Wei, Sichen Liu, Yu Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.00948
Pdf URL: https://arxiv.org/pdf/2503.00948
Copy Paste: [[2503.00948]] Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think(https://arxiv.org/abs/2503.00948)
Keywords: diffusion
Abstract: Image-to-Video (I2V) generation aims to synthesize a video clip according to a given image and condition (e.g., text). The key challenge of this task lies in simultaneously generating natural motions while preserving the original appearance of the images. However, current I2V diffusion models (I2V-DMs) often produce videos with limited motion degrees or exhibit uncontrollable motion that conflicts with the textual condition. To address these limitations, we propose a novel Extrapolating and Decoupling framework, which introduces model merging techniques to the I2V domain for the first time. Specifically, our framework consists of three separate stages: (1) Starting with a base I2V-DM, we explicitly inject the textual condition into the temporal module using a lightweight, learnable adapter and fine-tune the integrated model to improve motion controllability. (2) We introduce a training-free extrapolation strategy to amplify the dynamic range of the motion, effectively reversing the fine-tuning process to enhance the motion degree significantly. (3) With the above two-stage models excelling in motion controllability and degree, we decouple the relevant parameters associated with each type of motion ability and inject them into the base I2V-DM. Since the I2V-DM handles different levels of motion controllability and dynamics at various denoising time steps, we adjust the motion-aware parameters accordingly over time. Extensive qualitative and quantitative experiments have been conducted to demonstrate the superiority of our framework over existing methods.

Title: Dynamical Diffusion: Learning Temporal Dynamics with Diffusion Models

Authors: Xingzhuo Guo, Yu Zhang, Baixu Chen, Haoran Xu, Jianmin Wang, Mingsheng Long
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.00951
Pdf URL: https://arxiv.org/pdf/2503.00951
Copy Paste: [[2503.00951]] Dynamical Diffusion: Learning Temporal Dynamics with Diffusion Models(https://arxiv.org/abs/2503.00951)
Keywords: diffusion, generative
Abstract: Diffusion models have emerged as powerful generative frameworks by progressively adding noise to data through a forward process and then reversing this process to generate realistic samples. While these models have achieved strong performance across various tasks and modalities, their application to temporal predictive learning remains underexplored. Existing approaches treat predictive learning as a conditional generation problem, but often fail to fully exploit the temporal dynamics inherent in the data, leading to challenges in generating temporally coherent sequences. To address this, we introduce Dynamical Diffusion (DyDiff), a theoretically sound framework that incorporates temporally aware forward and reverse processes. Dynamical Diffusion explicitly models temporal transitions at each diffusion step, establishing dependencies on preceding states to better capture temporal dynamics. Through the reparameterization trick, Dynamical Diffusion achieves efficient training and inference similar to any standard diffusion model. Extensive experiments across scientific spatiotemporal forecasting, video prediction, and time series forecasting demonstrate that Dynamical Diffusion consistently improves performance in temporal predictive tasks, filling a crucial gap in existing methodologies. Code is available at this repository: this https URL.

Title: Using Synthetic Images to Augment Small Medical Image Datasets

Authors: Minh H. Vu, Lorenzo Tronchin, Tufve Nyholm, Tommy Löfstedt
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.00962
Pdf URL: https://arxiv.org/pdf/2503.00962
Copy Paste: [[2503.00962]] Using Synthetic Images to Augment Small Medical Image Datasets(https://arxiv.org/abs/2503.00962)
Keywords: generative
Abstract: Recent years have witnessed a growing academic and industrial interest in deep learning (DL) for medical imaging. To perform well, DL models require very large labeled datasets. However, most medical imaging datasets are small, with a limited number of annotated samples. The reason they are small is usually because delineating medical images is time-consuming and demanding for oncologists. There are various techniques that can be used to augment a dataset, for example, to apply affine transformations or elastic transformations to available images, or to add synthetic images generated by a Generative Adversarial Network (GAN). In this work, we have developed a novel conditional variant of a current GAN method, the StyleGAN2, to generate multi-modal high-resolution medical images with the purpose to augment small medical imaging datasets with these synthetic images. We use the synthetic and real images from six datasets to train models for the downstream task of semantic segmentation. The quality of the generated medical images and the effect of this augmentation on the segmentation performance were evaluated afterward. Finally, the results indicate that the downstream segmentation models did not benefit from the generated images. Further work and analyses are required to establish how this augmentation affects the segmentation performance.

Title: Molecule Generation for Target Protein Binding with Hierarchical Consistency Diffusion Model

Authors: Guanlue Li, Chenran Jiang, Ziqi Gao, Yu Liu, Chenyang Liu, Jiean Chen, Yong Huang, Jia Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.00975
Pdf URL: https://arxiv.org/pdf/2503.00975
Copy Paste: [[2503.00975]] Molecule Generation for Target Protein Binding with Hierarchical Consistency Diffusion Model(https://arxiv.org/abs/2503.00975)
Keywords: diffusion
Abstract: Effective generation of molecular structures, or new chemical entities, that bind to target proteins is crucial for lead identification and optimization in drug discovery. Despite advancements in atom- and motif-wise deep learning models for 3D molecular generation, current methods often struggle with validity and reliability. To address these issues, we develop the Atom-Motif Consistency Diffusion Model (AMDiff), utilizing a joint-training paradigm for multi-view learning. This model features a hierarchical diffusion architecture that integrates both atom- and motif-level views of molecules, allowing for comprehensive exploration of complementary information. By leveraging classifier-free guidance and incorporating binding site features as conditional inputs, AMDiff ensures robust molecule generation across diverse targets. Compared to existing approaches, AMDiff exhibits superior validity and novelty in generating molecules tailored to fit various protein pockets. Case studies targeting protein kinases, including Anaplastic Lymphoma Kinase (ALK) and Cyclin-dependent kinase 4 (CDK4), demonstrate the model's capability in structure-based de novo drug design. Overall, AMDiff bridges the gap between atom-view and motif-view drug discovery and speeds up the process of target-aware molecular generation.

Title: Underdamped Diffusion Bridges with Applications to Sampling

Authors: Denis Blessing, Julius Berner, Lorenz Richter, Gerhard Neumann
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.01006
Pdf URL: https://arxiv.org/pdf/2503.01006
Copy Paste: [[2503.01006]] Underdamped Diffusion Bridges with Applications to Sampling(https://arxiv.org/abs/2503.01006)
Keywords: diffusion, generative
Abstract: We provide a general framework for learning diffusion bridges that transport prior to target distributions. It includes existing diffusion models for generative modeling, but also underdamped versions with degenerate diffusion matrices, where the noise only acts in certain dimensions. Extending previous findings, our framework allows to rigorously show that score matching in the underdamped case is indeed equivalent to maximizing a lower bound on the likelihood. Motivated by superior convergence properties and compatibility with sophisticated numerical integration schemes of underdamped stochastic processes, we propose \emph{underdamped diffusion bridges}, where a general density evolution is learned rather than prescribed by a fixed noising process. We apply our method to the challenging task of sampling from unnormalized densities without access to samples from the target distribution. Across a diverse range of sampling problems, our approach demonstrates state-of-the-art performance, notably outperforming alternative methods, while requiring significantly fewer discretization steps and no hyperparameter tuning.

Title: Data Unlearning in Diffusion Models

Authors: Silas Alberti, Kenan Hasanaliyev, Manav Shah, Stefano Ermon
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.01034
Pdf URL: https://arxiv.org/pdf/2503.01034
Copy Paste: [[2503.01034]] Data Unlearning in Diffusion Models(https://arxiv.org/abs/2503.01034)
Keywords: diffusion
Abstract: Recent work has shown that diffusion models memorize and reproduce training data examples. At the same time, large copyright lawsuits and legislation such as GDPR have highlighted the need for erasing datapoints from diffusion models. However, retraining from scratch is often too expensive. This motivates the setting of data unlearning, i.e., the study of efficient techniques for unlearning specific datapoints from the training set. Existing concept unlearning techniques require an anchor prompt/class/distribution to guide unlearning, which is not available in the data unlearning setting. General-purpose machine unlearning techniques were found to be either unstable or failed to unlearn data. We therefore propose a family of new loss functions called Subtracted Importance Sampled Scores (SISS) that utilize importance sampling and are the first method to unlearn data with theoretical guarantees. SISS is constructed as a weighted combination between simpler objectives that are responsible for preserving model quality and unlearning the targeted datapoints. When evaluated on CelebA-HQ and MNIST, SISS achieved Pareto optimality along the quality and unlearning strength dimensions. On Stable Diffusion, SISS successfully mitigated memorization on nearly 90% of the prompts we tested.

Title: Scientific Reasoning: Assessment of Multimodal Generative LLMs

Authors: Florian Dreyer, Ekaterina Kolos, Daria Matiash
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.01064
Pdf URL: https://arxiv.org/pdf/2503.01064
Copy Paste: [[2503.01064]] Scientific Reasoning: Assessment of Multimodal Generative LLMs(https://arxiv.org/abs/2503.01064)
Keywords: generative
Abstract: Large language models (LLMs) can answer questions and reason about complex tasks, also from the scientific domain. We assess several multimodal LLMs (MLLMs) on ScienceQA and find that Gemini models show the highest accuracy with little context, and the highest textual similarity to human explanations with richer context. Adapter-tuning of smaller MLLMs did not lead to any reliable performance. Training from Gemini outputs consistently underperformed training from the original data.

Title: All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning

Authors: Gokul Swamy, Sanjiban Choudhury, Wen Sun, Zhiwei Steven Wu, J. Andrew Bagnell
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.01067
Pdf URL: https://arxiv.org/pdf/2503.01067
Copy Paste: [[2503.01067]] All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning(https://arxiv.org/abs/2503.01067)
Keywords: foundation model
Abstract: From a first-principles perspective, it may seem odd that the strongest results in foundation model fine-tuning (FT) are achieved via a relatively complex, two-stage training procedure. Specifically, one first trains a reward model (RM) on some dataset (e.g. human preferences) before using it to provide online feedback as part of a downstream reinforcement learning (RL) procedure, rather than directly optimizing the policy parameters on the dataset via offline maximum likelihood estimation. In fact, from an information-theoretic perspective, we can only lose information via passing through a reward model and cannot create any new information via on-policy sampling. To explain this discrepancy, we scrutinize several hypotheses on the value of RL in FT through both theoretical and empirical lenses. Of the hypotheses considered, we find the most support for the explanation that on problems with a generation-verification gap, the combination of the ease of learning the relatively simple RM (verifier) from the preference data, coupled with the ability of the downstream RL procedure to then filter its search space to the subset of policies (generators) that are optimal for relatively simple verifiers is what leads to the superior performance of online FT.

Title: Depth-Adaptive Graph Neural Networks via Learnable Bakry-'Emery Curvature

Authors: Asela Hevapathige, Ahad N. Zehmakan, Qing Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01079
Pdf URL: https://arxiv.org/pdf/2503.01079
Copy Paste: [[2503.01079]] Depth-Adaptive Graph Neural Networks via Learnable Bakry-'Emery Curvature(https://arxiv.org/abs/2503.01079)
Keywords: diffusion
Abstract: Graph Neural Networks (GNNs) have demonstrated strong representation learning capabilities for graph-based tasks. Recent advances on GNNs leverage geometric properties, such as curvature, to enhance its representation capabilities by modeling complex connectivity patterns and information flow within graphs. However, most existing approaches focus solely on discrete graph topology, overlooking diffusion dynamics and task-specific dependencies essential for effective learning. To address this, we propose integrating Bakry-Émery curvature, which captures both structural and task-driven aspects of information propagation. We develop an efficient, learnable approximation strategy, making curvature computation scalable for large graphs. Furthermore, we introduce an adaptive depth mechanism that dynamically adjusts message-passing layers per vertex based on its curvature, ensuring efficient propagation. Our theoretical analysis establishes a link between curvature and feature distinctiveness, showing that high-curvature vertices require fewer layers, while low-curvature ones benefit from deeper propagation. Extensive experiments on benchmark datasets validate the effectiveness of our approach, showing consistent performance improvements across diverse graph learning tasks.

Title: Fence Theorem: Preprocessing is Dual-Objective Semantic Structure Isolator in 3D Anomaly Detection

Authors: Hanzhe Liang, Jie Zhou, Xuanxin Chen, Jinbao Wang, Can Gao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01100
Pdf URL: https://arxiv.org/pdf/2503.01100
Copy Paste: [[2503.01100]] Fence Theorem: Preprocessing is Dual-Objective Semantic Structure Isolator in 3D Anomaly Detection(https://arxiv.org/abs/2503.01100)
Keywords: anomaly
Abstract: 3D anomaly detection (AD) is prominent but difficult due to lacking a unified theoretical foundation for preprocessing design. We establish the Fence Theorem, formalizing preprocessing as a dual-objective semantic isolator: (1) mitigating cross-semantic interference to the greatest extent feasible and (2) confining anomaly judgments to aligned semantic spaces wherever viable, thereby establishing intra-semantic comparability. Any preprocessing approach achieves this goal through a two-stage process of Emantic-Division and Spatial-Constraints stage. Through systematic deconstruction, we theoretically and experimentally subsume existing preprocessing methods under this theorem via tripartite evidence: qualitative analyses, quantitative studies, and mathematical proofs. Guided by the Fence Theorem, we implement Patch3D, consisting of Patch-Cutting and Patch-Matching modules, to segment semantic spaces and consolidate similar ones while independently modeling normal features within each space. Experiments on Anomaly-ShapeNet and Real3D-AD with different settings demonstrate that progressively finer-grained semantic alignment in preprocessing directly enhances point-level AD accuracy, providing inverse validation of the theorem's causal logic.

Title: Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator

Authors: Kaiwen Zheng, Yongxin Chen, Huayu Chen, Guande He, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.01103
Pdf URL: https://arxiv.org/pdf/2503.01103
Copy Paste: [[2503.01103]] Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator(https://arxiv.org/abs/2503.01103)
Keywords: diffusion, generative
Abstract: While likelihood-based generative models, particularly diffusion and autoregressive models, have achieved remarkable fidelity in visual generation, the maximum likelihood estimation (MLE) objective inherently suffers from a mode-covering tendency that limits the generation quality under limited model capacity. In this work, we propose Direct Discriminative Optimization (DDO) as a unified framework that bridges likelihood-based generative training and the GAN objective to bypass this fundamental constraint. Our key insight is to parameterize a discriminator implicitly using the likelihood ratio between a learnable target model and a fixed reference model, drawing parallels with the philosophy of Direct Preference Optimization (DPO). Unlike GANs, this parameterization eliminates the need for joint training of generator and discriminator networks, allowing for direct, efficient, and effective finetuning of a well-trained model to its full potential beyond the limits of MLE. DDO can be performed iteratively in a self-play manner for progressive model refinement, with each round requiring less than 1% of pretraining epochs. Our experiments demonstrate the effectiveness of DDO by significantly advancing the previous SOTA diffusion model EDM, reducing FID scores from 1.79/1.58 to new records of 1.30/0.97 on CIFAR-10/ImageNet-64 datasets, and by consistently improving both guidance-free and CFG-enhanced FIDs of visual autoregressive models on ImageNet 256$\times$256.

Title: VideoHandles: Editing 3D Object Compositions in Videos Using Video Generative Priors

Authors: Juil Koo, Paul Guerrero, Chun-Hao Paul Huang, Duygu Ceylan, Minhyuk Sung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01107
Pdf URL: https://arxiv.org/pdf/2503.01107
Copy Paste: [[2503.01107]] VideoHandles: Editing 3D Object Compositions in Videos Using Video Generative Priors(https://arxiv.org/abs/2503.01107)
Keywords: generative
Abstract: Generative methods for image and video editing use generative models as priors to perform edits despite incomplete information, such as changing the composition of 3D objects shown in a single image. Recent methods have shown promising composition editing results in the image setting, but in the video setting, editing methods have focused on editing object's appearance and motion, or camera motion, and as a result, methods to edit object composition in videos are still missing. We propose \name as a method for editing 3D object compositions in videos of static scenes with camera motion. Our approach allows editing the 3D position of a 3D object across all frames of a video in a temporally consistent manner. This is achieved by lifting intermediate features of a generative model to a 3D reconstruction that is shared between all frames, editing the reconstruction, and projecting the features on the edited reconstruction back to each frame. To the best of our knowledge, this is the first generative approach to edit object compositions in videos. Our approach is simple and training-free, while outperforming state-of-the-art image editing baselines.

Title: WeGen: A Unified Model for Interactive Multimodal Generation as We Chat

Authors: Zhipeng Huang, Shaobin Zhuang, Canmiao Fu, Binxin Yang, Ying Zhang, Chong Sun, Zhizheng Zhang, Yali Wang, Chen Li, Zheng-Jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01115
Pdf URL: https://arxiv.org/pdf/2503.01115
Copy Paste: [[2503.01115]] WeGen: A Unified Model for Interactive Multimodal Generation as We Chat(https://arxiv.org/abs/2503.01115)
Keywords: foundation model, generative
Abstract: Existing multimodal generative models fall short as qualified design copilots, as they often struggle to generate imaginative outputs once instructions are less detailed or lack the ability to maintain consistency with the provided references. In this work, we introduce WeGen, a model that unifies multimodal generation and understanding, and promotes their interplay in iterative generation. It can generate diverse results with high creativity for less detailed instructions. And it can progressively refine prior generation results or integrating specific contents from references following the instructions in its chat with users. During this process, it is capable of preserving consistency in the parts that the user is already satisfied with. To this end, we curate a large-scale dataset, extracted from Internet videos, containing rich object dynamics and auto-labeled dynamics descriptions by advanced foundation models to date. These two information are interleaved into a single sequence to enable WeGen to learn consistency-aware generation where the specified dynamics are generated while the consistency of unspecified content is preserved aligned with instructions. Besides, we introduce a prompt self-rewriting mechanism to enhance generation diversity. Extensive experiments demonstrate the effectiveness of unifying multimodal understanding and generation in WeGen and show it achieves state-of-the-art performance across various visual generation benchmarks. These also demonstrate the potential of WeGen as a user-friendly design copilot as desired. The code and models will be available at this https URL.

Title: ACCORD: Alleviating Concept Coupling through Dependence Regularization for Text-to-Image Diffusion Personalization

Authors: Shizhan Liu, Hao Zheng, Hang Yu, Jianguo Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01122
Pdf URL: https://arxiv.org/pdf/2503.01122
Copy Paste: [[2503.01122]] ACCORD: Alleviating Concept Coupling through Dependence Regularization for Text-to-Image Diffusion Personalization(https://arxiv.org/abs/2503.01122)
Keywords: diffusion
Abstract: Image personalization has garnered attention for its ability to customize Text-to-Image generation using only a few reference images. However, a key challenge in image personalization is the issue of conceptual coupling, where the limited number of reference images leads the model to form unwanted associations between the personalization target and other concepts. Current methods attempt to tackle this issue indirectly, leading to a suboptimal balance between text control and personalization fidelity. In this paper, we take a direct approach to the concept coupling problem through statistical analysis, revealing that it stems from two distinct sources of dependence discrepancies. We therefore propose two complementary plug-and-play loss functions: Denoising Decouple Loss and Prior Decouple loss, each designed to minimize one type of dependence discrepancy. Extensive experiments demonstrate that our approach achieves a superior trade-off between text control and personalization fidelity.

Title: DPR: Diffusion Preference-based Reward for Offline Reinforcement Learning

Authors: Teng Pang, Bingzheng Wang, Guoqiang Wu, Yilong Yin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.01143
Pdf URL: https://arxiv.org/pdf/2503.01143
Copy Paste: [[2503.01143]] DPR: Diffusion Preference-based Reward for Offline Reinforcement Learning(https://arxiv.org/abs/2503.01143)
Keywords: diffusion
Abstract: Offline preference-based reinforcement learning (PbRL) mitigates the need for reward definition, aligning with human preferences via preference-driven reward feedback without interacting with the environment. However, the effectiveness of preference-driven reward functions depends on the modeling ability of the learning model, which current MLP-based and Transformer-based methods may fail to adequately provide. To alleviate the failure of the reward function caused by insufficient modeling, we propose a novel preference-based reward acquisition method: Diffusion Preference-based Reward (DPR). Unlike previous methods using Bradley-Terry models for trajectory preferences, we use diffusion models to directly model preference distributions for state-action pairs, allowing rewards to be discriminatively obtained from these distributions. In addition, considering the particularity of preference data that only know the internal relationships of paired trajectories, we further propose Conditional Diffusion Preference-based Reward (C-DPR), which leverages relative preference information to enhance the construction of the diffusion model. We apply the above methods to existing offline reinforcement learning algorithms and a series of experiment results demonstrate that the diffusion-based reward acquisition approach outperforms previous MLP-based and Transformer-based methods.

Title: One-shot In-context Part Segmentation

Authors: Zhenqi Dai, Ting Liu, Xingxing Zhang, Yunchao Wei, Yanning Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01144
Pdf URL: https://arxiv.org/pdf/2503.01144
Copy Paste: [[2503.01144]] One-shot In-context Part Segmentation(https://arxiv.org/abs/2503.01144)
Keywords: diffusion, foundation model, in-context
Abstract: In this paper, we present the One-shot In-context Part Segmentation (OIParts) framework, designed to tackle the challenges of part segmentation by leveraging visual foundation models (VFMs). Existing training-based one-shot part segmentation methods that utilize VFMs encounter difficulties when faced with scenarios where the one-shot image and test image exhibit significant variance in appearance and perspective, or when the object in the test image is partially visible. We argue that training on the one-shot example often leads to overfitting, thereby compromising the model's generalization capability. Our framework offers a novel approach to part segmentation that is training-free, flexible, and data-efficient, requiring only a single in-context example for precise segmentation with superior generalization ability. By thoroughly exploring the complementary strengths of VFMs, specifically DINOv2 and Stable Diffusion, we introduce an adaptive channel selection approach by minimizing the intra-class distance for better exploiting these two features, thereby enhancing the discriminatory power of the extracted features for the fine-grained parts. We have achieved remarkable segmentation performance across diverse object categories. The OIParts framework not only eliminates the need for extensive labeled data but also demonstrates superior generalization ability. Through comprehensive experimentation on three benchmark datasets, we have demonstrated the superiority of our proposed method over existing part segmentation approaches in one-shot settings.

Title: CoInD: Enabling Logical Compositions in Diffusion Models

Authors: Sachit Gaudi, Gautam Sreekumar, Vishnu Boddeti
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.01145
Pdf URL: https://arxiv.org/pdf/2503.01145
Copy Paste: [[2503.01145]] CoInD: Enabling Logical Compositions in Diffusion Models(https://arxiv.org/abs/2503.01145)
Keywords: diffusion, generative
Abstract: How can we learn generative models to sample data with arbitrary logical compositions of statistically independent attributes? The prevailing solution is to sample from distributions expressed as a composition of attributes' conditional marginal distributions under the assumption that they are statistically independent. This paper shows that standard conditional diffusion models violate this assumption, even when all attribute compositions are observed during training. And, this violation is significantly more severe when only a subset of the compositions is observed. We propose CoInD to address this problem. It explicitly enforces statistical independence between the conditional marginal distributions by minimizing Fisher's divergence between the joint and marginal distributions. The theoretical advantages of CoInD are reflected in both qualitative and quantitative experiments, demonstrating a significantly more faithful and controlled generation of samples for arbitrary logical compositions of attributes. The benefit is more pronounced for scenarios that current solutions relying on the assumption of conditionally independent marginals struggle with, namely, logical compositions involving the NOT operation and when only a subset of compositions are observed during training.

Title: Unify and Anchor: A Context-Aware Transformer for Cross-Domain Time Series Forecasting

Authors: Xiaobin Hong, Jiawen Zhang, Wenzhong Li, Sanglu Lu, Jia Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.01157
Pdf URL: https://arxiv.org/pdf/2503.01157
Copy Paste: [[2503.01157]] Unify and Anchor: A Context-Aware Transformer for Cross-Domain Time Series Forecasting(https://arxiv.org/abs/2503.01157)
Keywords: foundation model
Abstract: The rise of foundation models has revolutionized natural language processing and computer vision, yet their best practices to time series forecasting remains underexplored. Existing time series foundation models often adopt methodologies from these fields without addressing the unique characteristics of time series data. In this paper, we identify two key challenges in cross-domain time series forecasting: the complexity of temporal patterns and semantic misalignment. To tackle these issues, we propose the ``Unify and Anchor" transfer paradigm, which disentangles frequency components for a unified perspective and incorporates external context as domain anchors for guided adaptation. Based on this framework, we introduce ContexTST, a Transformer-based model that employs a time series coordinator for structured representation and the Transformer blocks with a context-informed mixture-of-experts mechanism for effective cross-domain generalization. Extensive experiments demonstrate that ContexTST advances state-of-the-art forecasting performance while achieving strong zero-shot transferability across diverse domains.

Title: EasyCraft: A Robust and Efficient Framework for Automatic Avatar Crafting

Authors: Suzhen Wang, Weijie Chen, Wei Zhang, Minda Zhao, Lincheng Li, Rongsheng Zhang, Zhipeng Hu, Xin Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01158
Pdf URL: https://arxiv.org/pdf/2503.01158
Copy Paste: [[2503.01158]] EasyCraft: A Robust and Efficient Framework for Automatic Avatar Crafting(https://arxiv.org/abs/2503.01158)
Keywords: self-supervised
Abstract: Character customization, or 'face crafting,' is a vital feature in role-playing games (RPGs), enhancing player engagement by enabling the creation of personalized avatars. Existing automated methods often struggle with generalizability across diverse game engines due to their reliance on the intermediate constraints of specific image domain and typically support only one type of input, either text or image. To overcome these challenges, we introduce EasyCraft, an innovative end-to-end feedforward framework that automates character crafting by uniquely supporting both text and image inputs. Our approach employs a translator capable of converting facial images of any style into crafting parameters. We first establish a unified feature distribution in the translator's image encoder through self-supervised learning on a large-scale dataset, enabling photos of any style to be embedded into a unified feature representation. Subsequently, we map this unified feature distribution to crafting parameters specific to a game engine, a process that can be easily adapted to most game engines and thus enhances EasyCraft's generalizability. By integrating text-to-image techniques with our translator, EasyCraft also facilitates precise, text-based character crafting. EasyCraft's ability to integrate diverse inputs significantly enhances the versatility and accuracy of avatar creation. Extensive experiments on two RPG games demonstrate the effectiveness of our method, achieving state-of-the-art results and facilitating adaptability across various avatar engines.

Title: Split Gibbs Discrete Diffusion Posterior Sampling

Authors: Wenda Chu, Yang Song, Yisong Yue
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.01161
Pdf URL: https://arxiv.org/pdf/2503.01161
Copy Paste: [[2503.01161]] Split Gibbs Discrete Diffusion Posterior Sampling(https://arxiv.org/abs/2503.01161)
Keywords: diffusion
Abstract: We study the problem of posterior sampling in discrete-state spaces using discrete diffusion models. While posterior sampling methods for continuous diffusion models have achieved remarkable progress, analogous methods for discrete diffusion models remain challenging. In this work, we introduce a principled plug-and-play discrete diffusion posterior sampling algorithm based on split Gibbs sampling, which we call SG-DPS. Our algorithm enables reward-guided generation and solving inverse problems in discrete-state spaces. We demonstrate that SG-DPS converges to the true posterior distribution on synthetic benchmarks, and enjoys state-of-the-art posterior sampling performance on a range of benchmarks for discrete data, achieving up to 2x improved performance compared to existing baselines.

Title: Med-LEGO: Editing and Adapting toward Generalist Medical Image Diagnosis

Authors: Yitao Zhu, Yuan Yin, Jiaming Li, Mengjie Xu, Zihao Zhao, Honglin Xiong, Sheng Wang, Qian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01164
Pdf URL: https://arxiv.org/pdf/2503.01164
Copy Paste: [[2503.01164]] Med-LEGO: Editing and Adapting toward Generalist Medical Image Diagnosis(https://arxiv.org/abs/2503.01164)
Keywords: foundation model
Abstract: The adoption of visual foundation models has become a common practice in computer-aided diagnosis (CAD). While these foundation models provide a viable solution for creating generalist medical AI, privacy concerns make it difficult to pre-train or continuously update such models across multiple domains and datasets, leading many studies to focus on specialist models. To address this challenge, we propose Med-LEGO, a training-free framework that enables the seamless integration or updating of a generalist CAD model by combining multiple specialist models, similar to assembling LEGO bricks. Med-LEGO enhances LoRA (low-rank adaptation) by incorporating singular value decomposition (SVD) to efficiently capture the domain expertise of each specialist model with minimal additional parameters. By combining these adapted weights through simple operations, Med-LEGO allows for the easy integration or modification of specific diagnostic capabilities without the need for original data or retraining. Finally, the combined model can be further adapted to new diagnostic tasks, making it a versatile generalist model. Our extensive experiments demonstrate that Med-LEGO outperforms existing methods in both cross-domain and in-domain medical tasks while using only 0.18% of full model parameters. These merged models show better convergence and generalization to new tasks, providing an effective path toward generalist medical AI.

Title: Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data

Authors: Haoxin Li, Boyang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01167
Pdf URL: https://arxiv.org/pdf/2503.01167
Copy Paste: [[2503.01167]] Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data(https://arxiv.org/abs/2503.01167)
Keywords: generative
Abstract: Despite impressive advancements in various multimodal tasks, vision-language models (VLMs) still struggle with compositional understanding due to limited exposure to training samples that contain subtle variations within paired examples. With advances in multimodal generative models, a natural solution is to generate synthetic samples with subtle variations for training VLMs. However, generating and training on synthetic samples with subtle variations presents two challenges: difficulty in accurately creating precise variations and inconsistency in cross-modal alignment quality. To address these challenges, we propose SVD-GT (Subtle Variation Data Generation and Training), which integrates image feature injection into a text-to-image generative model to enhance the quality of synthetic variations and employs an adaptive margin loss to differentiate samples using adaptive margins, which help filter out potentially incorrect synthetic samples and focus the learning on informative hard samples. Evaluations on four compositional understanding benchmarks demonstrate that SVD-GT significantly improves the compositionality of VLMs, boosting the average accuracy of CLIP by over 8% across all benchmarks and outperforming state-of-the-art methods by 2% on three benchmarks.

Title: Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics

Authors: Siddhant Arora, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Shinji Watanabe
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.01174
Pdf URL: https://arxiv.org/pdf/2503.01174
Copy Paste: [[2503.01174]] Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics(https://arxiv.org/abs/2503.01174)
Keywords: foundation model
Abstract: The recent wave of audio foundation models (FMs) could provide new capabilities for conversational modeling. However, there have been limited efforts to evaluate these audio FMs comprehensively on their ability to have natural and interactive conversations. To engage in meaningful conversation with the end user, we would want the FMs to additionally perform a fluent succession of turns without too much overlapping speech or long stretches of silence. Inspired by this, we ask whether the recently proposed audio FMs can understand, predict, and perform turn-taking events? To answer this, we propose a novel evaluation protocol that can assess spoken dialog system's turn-taking capabilities using a supervised model as a judge that has been trained to predict turn-taking events in human-human conversations. Using this protocol, we present the first comprehensive user study that evaluates existing spoken dialogue systems on their ability to perform turn-taking events and reveal many interesting insights, such as they sometimes do not understand when to speak up, can interrupt too aggressively and rarely backchannel. We further evaluate multiple open-source and proprietary audio FMs accessible through APIs on carefully curated test benchmarks from Switchboard to measure their ability to understand and predict turn-taking events and identify significant room for improvement. We will open source our evaluation platform to promote the development of advanced conversational AI systems.

Title: SAR-W-MixMAE: SAR Foundation Model Training Using Backscatter Power Weighting

Authors: Ali Caglayan, Nevrez Imamoglu, Toru Kouyama
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.01181
Pdf URL: https://arxiv.org/pdf/2503.01181
Copy Paste: [[2503.01181]] SAR-W-MixMAE: SAR Foundation Model Training Using Backscatter Power Weighting(https://arxiv.org/abs/2503.01181)
Keywords: foundation model
Abstract: Foundation model approaches such as masked auto-encoders (MAE) or its variations are now being successfully applied to satellite imagery. Most of the ongoing technical validation of foundation models have been applied to optical images like RGB or multi-spectral images. Due to difficulty in semantic labeling to create datasets and higher noise content with respect to optical images, Synthetic Aperture Radar (SAR) data has not been explored a lot in the field for foundation models. Therefore, in this work as a pre-training approach, we explored masked auto-encoder, specifically MixMAE on Sentinel-1 SAR images and its impact on SAR image classification tasks. Moreover, we proposed to use the physical characteristic of SAR data for applying weighting parameter on the auto-encoder training loss (MSE) to reduce the effect of speckle noise and very high values on the SAR images. Proposed SAR intensity-based weighting of the reconstruction loss demonstrates promising results both on SAR pre-training and downstream tasks specifically on flood detection compared with the baseline model.

Title: Language-Assisted Feature Transformation for Anomaly Detection

Authors: EungGu Yun, Heonjin Ha, Yeongwoo Nam, Bryan Dongik Lee
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.01184
Pdf URL: https://arxiv.org/pdf/2503.01184
Copy Paste: [[2503.01184]] Language-Assisted Feature Transformation for Anomaly Detection(https://arxiv.org/abs/2503.01184)
Keywords: anomaly
Abstract: This paper introduces LAFT, a novel feature transformation method designed to incorporate user knowledge and preferences into anomaly detection using natural language. Accurately modeling the boundary of normality is crucial for distinguishing abnormal data, but this is often challenging due to limited data or the presence of nuisance attributes. While unsupervised methods that rely solely on data without user guidance are common, they may fail to detect anomalies of specific interest. To address this limitation, we propose Language-Assisted Feature Transformation (LAFT), which leverages the shared image-text embedding space of vision-language models to transform visual features according to user-defined requirements. Combined with anomaly detection methods, LAFT effectively aligns visual features with user preferences, allowing anomalies of interest to be detected. Extensive experiments on both toy and real-world datasets validate the effectiveness of our method.

Title: DifIISR: A Diffusion Model with Gradient Guidance for Infrared Image Super-Resolution

Authors: Xingyuan Li, Zirui Wang, Yang Zou, Zhixin Chen, Jun Ma, Zhiying Jiang, Long Ma, Jinyuan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01187
Pdf URL: https://arxiv.org/pdf/2503.01187
Copy Paste: [[2503.01187]] DifIISR: A Diffusion Model with Gradient Guidance for Infrared Image Super-Resolution(https://arxiv.org/abs/2503.01187)
Keywords: diffusion
Abstract: Infrared imaging is essential for autonomous driving and robotic operations as a supportive modality due to its reliable performance in challenging environments. Despite its popularity, the limitations of infrared cameras, such as low spatial resolution and complex degradations, consistently challenge imaging quality and subsequent visual tasks. Hence, infrared image super-resolution (IISR) has been developed to address this challenge. While recent developments in diffusion models have greatly advanced this field, current methods to solve it either ignore the unique modal characteristics of infrared imaging or overlook the machine perception requirements. To bridge these gaps, we propose DifIISR, an infrared image super-resolution diffusion model optimized for visual quality and perceptual performance. Our approach achieves task-based guidance for diffusion by injecting gradients derived from visual and perceptual priors into the noise during the reverse process. Specifically, we introduce an infrared thermal spectrum distribution regulation to preserve visual fidelity, ensuring that the reconstructed infrared images closely align with high-resolution images by matching their frequency components. Subsequently, we incorporate various visual foundational models as the perceptual guidance for downstream visual tasks, infusing generalizable perceptual features beneficial for detection and segmentation. As a result, our approach gains superior visual results while attaining State-Of-The-Art downstream task performance. Code is available at this https URL

Title: Enhancing Retinal Vessel Segmentation Generalization via Layout-Aware Generative Modelling

Authors: Jonathan Fhima, Jan Van Eijgen, Lennert Beeckmans, Thomas Jacobs, Moti Freiman, Luis Filipe Nakayama, Ingeborg Stalmans, Chaim Baskin, Joachim A. Behar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01190
Pdf URL: https://arxiv.org/pdf/2503.01190
Copy Paste: [[2503.01190]] Enhancing Retinal Vessel Segmentation Generalization via Layout-Aware Generative Modelling(https://arxiv.org/abs/2503.01190)
Keywords: diffusion, generative
Abstract: Generalization in medical segmentation models is challenging due to limited annotated datasets and imaging variability. To address this, we propose Retinal Layout-Aware Diffusion (RLAD), a novel diffusion-based framework for generating controllable layout-aware images. RLAD conditions image generation on multiple key layout components extracted from real images, ensuring high structural fidelity while enabling diversity in other components. Applied to retinal fundus imaging, we augmented the training datasets by synthesizing paired retinal images and vessel segmentations conditioned on extracted blood vessels from real images, while varying other layout components such as lesions and the optic disc. Experiments demonstrated that RLAD-generated data improved generalization in retinal vessel segmentation by up to 8.1%. Furthermore, we present REYIA, a comprehensive dataset comprising 586 manually segmented retinal images. To foster reproducibility and drive innovation, both our code and dataset will be made publicly accessible.

Title: Hypergraph Foundation Model

Authors: Yifan Feng, Shiquan Liu, Xiangmin Han, Shaoyi Du, Zongze Wu, Han Hu, Yue Gao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.01203
Pdf URL: https://arxiv.org/pdf/2503.01203
Copy Paste: [[2503.01203]] Hypergraph Foundation Model(https://arxiv.org/abs/2503.01203)
Keywords: foundation model
Abstract: Hypergraph neural networks (HGNNs) effectively model complex high-order relationships in domains like protein interactions and social networks by connecting multiple vertices through hyperedges, enhancing modeling capabilities, and reducing information loss. Developing foundation models for hypergraphs is challenging due to their distinct data, which includes both vertex features and intricate structural information. We present Hyper-FM, a Hypergraph Foundation Model for multi-domain knowledge extraction, featuring Hierarchical High-Order Neighbor Guided Vertex Knowledge Embedding for vertex feature representation and Hierarchical Multi-Hypergraph Guided Structural Knowledge Extraction for structural information. Additionally, we curate 10 text-attributed hypergraph datasets to advance research between HGNNs and LLMs. Experiments on these datasets show that Hyper-FM outperforms baseline methods by approximately 13.3\%, validating our approach. Furthermore, we propose the first scaling law for hypergraph foundation models, demonstrating that increasing domain diversity significantly enhances performance, unlike merely augmenting vertex and hyperedge counts. This underscores the critical role of domain diversity in scaling hypergraph models.

Title: Tera-MIND: Tera-scale mouse brain simulation via spatial mRNA-guided diffusion

Authors: Jiqing Wu, Ingrid Berg, Yawei Li, Ender Konukoglu, Viktor H. Koelzer
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01220
Pdf URL: https://arxiv.org/pdf/2503.01220
Copy Paste: [[2503.01220]] Tera-MIND: Tera-scale mouse brain simulation via spatial mRNA-guided diffusion(https://arxiv.org/abs/2503.01220)
Keywords: diffusion, generative
Abstract: Holistic 3D modeling of molecularly defined brain structures is crucial for understanding complex brain functions. Emerging tissue profiling technologies enable the construction of a comprehensive atlas of the mammalian brain with sub-cellular resolution and spatially resolved gene expression data. However, such tera-scale volumetric datasets present significant computational challenges in understanding complex brain functions within their native 3D spatial context. Here, we propose the novel generative approach $\textbf{Tera-MIND}$, which can simulate $\textbf{Tera}$-scale $\textbf{M}$ouse bra$\textbf{IN}s$ in 3D using a patch-based and boundary-aware $\textbf{D}$iffusion model. Taking spatial transcriptomic data as the conditional input, we generate virtual mouse brains with comprehensive cellular morphological detail at teravoxel scale. Through the lens of 3D $gene$-$gene$ self-attention, we identify spatial molecular interactions for key transcriptomic pathways in the murine brain, exemplified by glutamatergic and dopaminergic neuronal systems. Importantly, these $in$-$silico$ biological findings are consistent and reproducible across three tera-scale virtual mouse brains. Therefore, Tera-MIND showcases a promising path toward efficient and generative simulations of whole organ systems for biomedical research. Project website: $\href{this http URL}{https}$

Title: Enhancing Network Security Management in Water Systems using FM-based Attack Attribution

Authors: Aleksandar Avdalovic, Joseph Khoury, Ahmad Taha, Elias Bou-Harb
Subjects: cs.LG, cs.CR, eess.SY
Abstract URL: https://arxiv.org/abs/2503.01229
Pdf URL: https://arxiv.org/pdf/2503.01229
Copy Paste: [[2503.01229]] Enhancing Network Security Management in Water Systems using FM-based Attack Attribution(https://arxiv.org/abs/2503.01229)
Keywords: anomaly
Abstract: Water systems are vital components of modern infrastructure, yet they are increasingly susceptible to sophisticated cyber attacks with potentially dire consequences on public health and safety. While state-of-the-art machine learning techniques effectively detect anomalies, contemporary model-agnostic attack attribution methods using LIME, SHAP, and LEMNA are deemed impractical for large-scale, interdependent water systems. This is due to the intricate interconnectivity and dynamic interactions that define these complex environments. Such methods primarily emphasize individual feature importance while falling short of addressing the crucial sensor-actuator interactions in water systems, which limits their effectiveness in identifying root cause attacks. To this end, we propose a novel model-agnostic Factorization Machines (FM)-based approach that capitalizes on water system sensor-actuator interactions to provide granular explanations and attributions for cyber attacks. For instance, an anomaly in an actuator pump activity can be attributed to a top root cause attack candidates, a list of water pressure sensors, which is derived from the underlying linear and quadratic effects captured by our approach. We validate our method using two real-world water system specific datasets, SWaT and WADI, demonstrating its superior performance over traditional attribution methods. In multi-feature cyber attack scenarios involving intricate sensor-actuator interactions, our FM-based attack attribution method effectively ranks attack root causes, achieving approximately 20% average improvement over SHAP and LEMNA.

Title: OIPR: Evaluation for Time-series Anomaly Detection Inspired by Operator Interest

Authors: Yuhan Jing, Jingyu Wang, Lei Zhang, Haifeng Sun, Bo He, Zirui Zhuang, Chengsen Wang, Qi Qi, Jianxin Liao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.01260
Pdf URL: https://arxiv.org/pdf/2503.01260
Copy Paste: [[2503.01260]] OIPR: Evaluation for Time-series Anomaly Detection Inspired by Operator Interest(https://arxiv.org/abs/2503.01260)
Keywords: anomaly
Abstract: With the growing adoption of time-series anomaly detection (TAD) technology, numerous studies have employed deep learning-based detectors for analyzing time-series data in the fields of Internet services, industrial systems, and sensors. The selection and optimization of anomaly detectors strongly rely on the availability of an effective performance evaluation method for TAD. Since anomalies in time-series data often manifest as a sequence of points, conventional metrics that solely consider the detection of individual point are inadequate. Existing evaluation methods for TAD typically employ point-based or event-based metrics to capture the temporal context. However, point-based metrics tend to overestimate detectors that excel only in detecting long anomalies, while event-based metrics are susceptible to being misled by fragmented detection results. To address these limitations, we propose OIPR, a novel set of TAD evaluation metrics. It models the process of operators receiving detector alarms and handling faults, utilizing area under the operator interest curve to evaluate the performance of TAD algorithms. Furthermore, we build a special scenario dataset to compare the characteristics of different evaluation methods. Through experiments conducted on the special scenario dataset and five real-world datasets, we demonstrate the remarkable performance of OIPR in extreme and complex scenarios. It achieves a balance between point and event perspectives, overcoming their primary limitations and offering applicability to broader situations.

Title: Reconciling Stochastic and Deterministic Strategies for Zero-shot Image Restoration using Diffusion Model in Dual

Authors: Chong Wang, Lanqing Guo, Zixuan Fu, Siyuan Yang, Hao Cheng, Alex C. Kot, Bihan Wen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01288
Pdf URL: https://arxiv.org/pdf/2503.01288
Copy Paste: [[2503.01288]] Reconciling Stochastic and Deterministic Strategies for Zero-shot Image Restoration using Diffusion Model in Dual(https://arxiv.org/abs/2503.01288)
Keywords: diffusion, generative
Abstract: Plug-and-play (PnP) methods offer an iterative strategy for solving image restoration (IR) problems in a zero-shot manner, using a learned \textit{discriminative denoiser} as the implicit prior. More recently, a sampling-based variant of this approach, which utilizes a pre-trained \textit{generative diffusion model}, has gained great popularity for solving IR problems through stochastic sampling. The IR results using PnP with a pre-trained diffusion model demonstrate distinct advantages compared to those using discriminative denoisers, \ie improved perceptual quality while sacrificing the data fidelity. The unsatisfactory results are due to the lack of integration of these strategies in the IR tasks. In this work, we propose a novel zero-shot IR scheme, dubbed Reconciling Diffusion Model in Dual (RDMD), which leverages only a \textbf{single} pre-trained diffusion model to construct \textbf{two} complementary regularizers. Specifically, the diffusion model in RDMD will iteratively perform deterministic denoising and stochastic sampling, aiming to achieve high-fidelity image restoration with appealing perceptual quality. RDMD also allows users to customize the distortion-perception tradeoff with a single hyperparameter, enhancing the adaptability of the restoration process in different practical scenarios. Extensive experiments on several IR tasks demonstrate that our proposed method could achieve superior results compared to existing approaches on both the FFHQ and ImageNet datasets.

Title: SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance

Authors: Peishan Cong, Ziyi Wang, Yuexin Ma, Xiangyu Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01291
Pdf URL: https://arxiv.org/pdf/2503.01291
Copy Paste: [[2503.01291]] SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance(https://arxiv.org/abs/2503.01291)
Keywords: generative
Abstract: Generating reasonable and high-quality human interactive motions in a given dynamic environment is crucial for understanding, modeling, transferring, and applying human behaviors to both virtual and physical robots. In this paper, we introduce an effective method, SemGeoMo, for dynamic contextual human motion generation, which fully leverages the text-affordance-joint multi-level semantic and geometric guidance in the generation process, improving the semantic rationality and geometric correctness of generative motions. Our method achieves state-of-the-art performance on three datasets and demonstrates superior generalization capability for diverse interaction scenarios. The project page and code can be found at this https URL.

Title: PA-CLIP: Enhancing Zero-Shot Anomaly Detection through Pseudo-Anomaly Awareness

Authors: Yurui Pan, Lidong Wang, Yuchao Chen, Wenbing Zhu, Bo Peng, Mingmin Chi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01292
Pdf URL: https://arxiv.org/pdf/2503.01292
Copy Paste: [[2503.01292]] PA-CLIP: Enhancing Zero-Shot Anomaly Detection through Pseudo-Anomaly Awareness(https://arxiv.org/abs/2503.01292)
Keywords: anomaly
Abstract: In industrial anomaly detection (IAD), accurately identifying defects amidst diverse anomalies and under varying imaging conditions remains a significant challenge. Traditional approaches often struggle with high false-positive rates, frequently misclassifying normal shadows and surface deformations as defects, an issue that becomes particularly pronounced in products with complex and intricate surface features. To address these challenges, we introduce PA-CLIP, a zero-shot anomaly detection method that reduces background noise and enhances defect detection through a pseudo-anomaly-based framework. The proposed method integrates a multiscale feature aggregation strategy for capturing detailed global and local information, two memory banks for distinguishing background information, including normal patterns and pseudo-anomalies, from true anomaly features, and a decision-making module designed to minimize false positives caused by environmental variations while maintaining high defect sensitivity. Demonstrated on the MVTec AD and VisA datasets, PA-CLIP outperforms existing zero-shot methods, providing a robust solution for industrial defect detection.

Title: Fine-Grained Controllable Apparel Showcase Image Generation via Garment-Centric Outpainting

Authors: Rong Zhang, Jingnan Wang, Zhiwen Zuo, Jianfeng Dong, Wei Li, Chi Wang, Weiwei Xu, Xun Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01294
Pdf URL: https://arxiv.org/pdf/2503.01294
Copy Paste: [[2503.01294]] Fine-Grained Controllable Apparel Showcase Image Generation via Garment-Centric Outpainting(https://arxiv.org/abs/2503.01294)
Keywords: diffusion
Abstract: In this paper, we propose a novel garment-centric outpainting (GCO) framework based on the latent diffusion model (LDM) for fine-grained controllable apparel showcase image generation. The proposed framework aims at customizing a fashion model wearing a given garment via text prompts and facial images. Different from existing methods, our framework takes a garment image segmented from a dressed mannequin or a person as the input, eliminating the need for learning cloth deformation and ensuring faithful preservation of garment details. The proposed framework consists of two stages. In the first stage, we introduce a garment-adaptive pose prediction model that generates diverse poses given the garment. Then, in the next stage, we generate apparel showcase images, conditioned on the garment and the predicted poses, along with specified text prompts and facial images. Notably, a multi-scale appearance customization module (MS-ACM) is designed to allow both overall and fine-grained text-based control over the generated model's appearance. Moreover, we leverage a lightweight feature fusion operation without introducing any extra encoders or modules to integrate multiple conditions, which is more efficient. Extensive experiments validate the superior performance of our framework compared to state-of-the-art methods.

Title: MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation

Authors: Yi Wang, Mushui Liu, Wanggui He, Longxiang Zhang, Ziwei Huang, Guanghao Zhang, Fangxun Shu, Zhong Tao, Dong She, Zhelun Yu, Haoyuan Li, Weilong Dai, Mingli Song, Jie Song, Hao Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01298
Pdf URL: https://arxiv.org/pdf/2503.01298
Copy Paste: [[2503.01298]] MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation(https://arxiv.org/abs/2503.01298)
Keywords: generative
Abstract: Unified generative models have demonstrated extraordinary performance in both text and image generation. However, they tend to underperform when generating intricate images with various interwoven conditions, which is hard to solely rely on straightforward text-to-image generation. In response to this challenge, we introduce MINT, an innovative unified generative model, empowered with native multimodal chain of thought (MCoT) for enhanced image generation for the first time. Firstly, we design Mixture of Transformer Experts (MTXpert), an expert-parallel structure that effectively supports both natural language generation (NLG) and visual capabilities, while avoiding potential modality conflicts that could hinder the full potential of each modality. Building on this, we propose an innovative MCoT training paradigm, a step-by-step approach to multimodal thinking, reasoning, and reflection specifically designed to enhance image generation. This paradigm equips MINT with nuanced, element-wise decoupled alignment and a comprehensive understanding of textual and visual components. Furthermore, it fosters advanced multimodal reasoning and self-reflection, enabling the construction of images that are firmly grounded in the logical relationships between these elements. Notably, MINT has been validated to exhibit superior performance across multiple benchmarks for text-to-image (T2I) and image-to-text (I2T) tasks.

Title: OnlineAnySeg: Online Zero-Shot 3D Segmentation by Visual Foundation Model Guided 2D Mask Merging

Authors: Yijie Tang, Jiazhao Zhang, Yuqing Lan, Yulan Guo, Dezun Dong, Chenyang Zhu, Kai Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01309
Pdf URL: https://arxiv.org/pdf/2503.01309
Copy Paste: [[2503.01309]] OnlineAnySeg: Online Zero-Shot 3D Segmentation by Visual Foundation Model Guided 2D Mask Merging(https://arxiv.org/abs/2503.01309)
Keywords: foundation model
Abstract: Online 3D open-vocabulary segmentation of a progressively reconstructed scene is both a critical and challenging task for embodied applications. With the success of visual foundation models (VFMs) in the image domain, leveraging 2D priors to address 3D online segmentation has become a prominent research focus. Since segmentation results provided by 2D priors often require spatial consistency to be lifted into final 3D segmentation, an efficient method for identifying spatial overlap among 2D masks is essential - yet existing methods rarely achieve this in real time, mainly limiting its use to offline approaches. To address this, we propose an efficient method that lifts 2D masks generated by VFMs into a unified 3D instance using a hashing technique. By employing voxel hashing for efficient 3D scene querying, our approach reduces the time complexity of costly spatial overlap queries from $O(n^2)$ to $O(n)$. Accurate spatial associations further enable 3D merging of 2D masks through simple similarity-based filtering in a zero-shot manner, making our approach more robust to incomplete and noisy data. Evaluated on the ScanNet and SceneNN benchmarks, our approach achieves state-of-the-art performance in online, open-vocabulary 3D instance segmentation with leading efficiency.

Title: CacheQuant: Comprehensively Accelerated Diffusion Models

Authors: Xuewen Liu, Zhikai Li, Qingyi Gu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01323
Pdf URL: https://arxiv.org/pdf/2503.01323
Copy Paste: [[2503.01323]] CacheQuant: Comprehensively Accelerated Diffusion Models(https://arxiv.org/abs/2503.01323)
Keywords: diffusion, generative
Abstract: Diffusion models have gradually gained prominence in the field of image synthesis, showcasing remarkable generative capabilities. Nevertheless, the slow inference and complex networks, resulting from redundancy at both temporal and structural levels, hinder their low-latency applications in real-world scenarios. Current acceleration methods for diffusion models focus separately on temporal and structural levels. However, independent optimization at each level to further push the acceleration limits results in significant performance degradation. On the other hand, integrating optimizations at both levels can compound the acceleration effects. Unfortunately, we find that the optimizations at these two levels are not entirely orthogonal. Performing separate optimizations and then simply integrating them results in unsatisfactory performance. To tackle this issue, we propose CacheQuant, a novel training-free paradigm that comprehensively accelerates diffusion models by jointly optimizing model caching and quantization techniques. Specifically, we employ a dynamic programming approach to determine the optimal cache schedule, in which the properties of caching and quantization are carefully considered to minimize errors. Additionally, we propose decoupled error correction to further mitigate the coupled and accumulated errors step by step. Experimental results show that CacheQuant achieves a 5.18 speedup and 4 compression for Stable Diffusion on MS-COCO, with only a 0.02 loss in CLIP score. Our code are open-sourced: this https URL .

Title: Jailbreaking Generative AI: Empowering Novices to Conduct Phishing Attacks

Authors: Rina Mishra, Gaurav Varshney, Shreya Singh
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.01395
Pdf URL: https://arxiv.org/pdf/2503.01395
Copy Paste: [[2503.01395]] Jailbreaking Generative AI: Empowering Novices to Conduct Phishing Attacks(https://arxiv.org/abs/2503.01395)
Keywords: generative
Abstract: The rapid advancements in generative AI models, such as ChatGPT, have introduced both significant benefits and new risks within the cybersecurity landscape. This paper investigates the potential misuse of the latest AI model, ChatGPT-4o Mini, in facilitating social engineering attacks, with a particular focus on phishing, one of the most pressing cybersecurity threats today. While existing literature primarily addresses the technical aspects, such as jailbreaking techniques, none have fully explored the free and straightforward execution of a comprehensive phishing campaign by novice users using ChatGPT-4o Mini. In this study, we examine the vulnerabilities of AI-driven chatbot services in 2025, specifically how methods like jailbreaking and reverse psychology can bypass ethical safeguards, allowing ChatGPT to generate phishing content, suggest hacking tools, and assist in carrying out phishing attacks. Our findings underscore the alarming ease with which even inexperienced users can execute sophisticated phishing campaigns, emphasizing the urgent need for stronger cybersecurity measures and heightened user awareness in the age of AI.

Title: Divide and Conquer: Heterogeneous Noise Integration for Diffusion-based Adversarial Purification

Authors: Gaozheng Pei, Shaojie Lyu, Gong Chen, Ke Ma, Qianqian Xu, Yingfei Sun, Qingming Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01407
Pdf URL: https://arxiv.org/pdf/2503.01407
Copy Paste: [[2503.01407]] Divide and Conquer: Heterogeneous Noise Integration for Diffusion-based Adversarial Purification(https://arxiv.org/abs/2503.01407)
Keywords: diffusion
Abstract: Existing diffusion-based purification methods aim to disrupt adversarial perturbations by introducing a certain amount of noise through a forward diffusion process, followed by a reverse process to recover clean examples. However, this approach is fundamentally flawed: the uniform operation of the forward process across all pixels compromises normal pixels while attempting to combat adversarial perturbations, resulting in the target model producing incorrect predictions. Simply relying on low-intensity noise is insufficient for effective defense. To address this critical issue, we implement a heterogeneous purification strategy grounded in the interpretability of neural networks. Our method decisively applies higher-intensity noise to specific pixels that the target model focuses on while the remaining pixels are subjected to only low-intensity noise. This requirement motivates us to redesign the sampling process of the diffusion model, allowing for the effective removal of varying noise levels. Furthermore, to evaluate our method against strong adaptative attack, our proposed method sharply reduces time cost and memory usage through a single-step resampling. The empirical evidence from extensive experiments across three datasets demonstrates that our method outperforms most current adversarial training and purification techniques by a substantial margin.

Title: DLF: Extreme Image Compression with Dual-generative Latent Fusion

Authors: Naifu Xue, Zhaoyang Jia, Jiahao Li, Bin Li, Yuan Zhang, Yan Lu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.01428
Pdf URL: https://arxiv.org/pdf/2503.01428
Copy Paste: [[2503.01428]] DLF: Extreme Image Compression with Dual-generative Latent Fusion(https://arxiv.org/abs/2503.01428)
Keywords: diffusion, generative
Abstract: Recent studies in extreme image compression have achieved remarkable performance by compressing the tokens from generative tokenizers. However, these methods often prioritize clustering common semantics within the dataset, while overlooking the diverse details of individual objects. Consequently, this results in suboptimal reconstruction fidelity, especially at low bitrates. To address this issue, we introduce a Dual-generative Latent Fusion (DLF) paradigm. DLF decomposes the latent into semantic and detail elements, compressing them through two distinct branches. The semantic branch clusters high-level information into compact tokens, while the detail branch encodes perceptually critical details to enhance the overall fidelity. Additionally, we propose a cross-branch interactive design to reduce redundancy between the two branches, thereby minimizing the overall bit cost. Experimental results demonstrate the impressive reconstruction quality of DLF even below 0.01 bits per pixel (bpp). On the CLIC2020 test set, our method achieves bitrate savings of up to 27.93% on LPIPS and 53.55% on DISTS compared to MS-ILLM. Furthermore, DLF surpasses recent diffusion-based codecs in visual fidelity while maintaining a comparable level of generative realism. Code will be available later.

Title: Generative Human Geometry Distribution

Authors: Xiangjun Tang, Biao Zhang, Peter Wonka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01448
Pdf URL: https://arxiv.org/pdf/2503.01448
Copy Paste: [[2503.01448]] Generative Human Geometry Distribution(https://arxiv.org/abs/2503.01448)
Keywords: generative
Abstract: Realistic human geometry generation is an important yet challenging task, requiring both the preservation of fine clothing details and the accurate modeling of clothing-pose interactions. Geometry distributions, which can model the geometry of a single human as a distribution, provide a promising representation for high-fidelity synthesis. However, applying geometry distributions for human generation requires learning a dataset-level distribution over numerous individual geometry distributions. To address the resulting challenges, we propose a novel 3D human generative framework that, for the first time, models the distribution of human geometry distributions. Our framework operates in two stages: first, generating the human geometry distribution, and second, synthesizing high-fidelity humans by sampling from this distribution. We validate our method on two tasks: pose-conditioned 3D human generation and single-view-based novel pose generation. Experimental results demonstrate that our approach achieves the best quantitative results in terms of realism and geometric fidelity, outperforming state-of-the-art generative methods.

Title: Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh

Authors: Fajri Koto, Rituraj Joshi, Nurdaulet Mukhituly, Yuxia Wang, Zhuohan Xie, Rahul Pal, Daniil Orel, Parvez Mullah, Diana Turmakhan, Maiya Goloburda, Mohammed Kamran, Samujjwal Ghosh, Bokang Jia, Jonibek Mansurov, Mukhammed Togmanov, Debopriyo Banerjee, Nurkhan Laiyk, Akhmed Sakip, Xudong Han, Ekaterina Kochmar, Alham Fikri Aji, Aaryamonvikram Singh, Alok Anil Jadhav, Satheesh Katipomu, Samta Kamboj, Monojit Choudhury, Gurpreet Gosal, Gokul Ramakrishnan, Biswajit Mishra, Sarath Chandran, Avraham Sheinin, Natalia Vassilieva, Neha Sengupta, Larry Murray, Preslav Nakov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01493
Pdf URL: https://arxiv.org/pdf/2503.01493
Copy Paste: [[2503.01493]] Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh(https://arxiv.org/abs/2503.01493)
Keywords: generative
Abstract: Llama-3.1-Sherkala-8B-Chat, or Sherkala-Chat (8B) for short, is a state-of-the-art instruction-tuned open generative large language model (LLM) designed for Kazakh. Sherkala-Chat (8B) aims to enhance the inclusivity of LLM advancements for Kazakh speakers. Adapted from the LLaMA-3.1-8B model, Sherkala-Chat (8B) is trained on 45.3B tokens across Kazakh, English, Russian, and Turkish. With 8 billion parameters, it demonstrates strong knowledge and reasoning abilities in Kazakh, significantly outperforming existing open Kazakh and multilingual models of similar scale while achieving competitive performance in English. We release Sherkala-Chat (8B) as an open-weight instruction-tuned model and provide a detailed overview of its training, fine-tuning, safety alignment, and evaluation, aiming to advance research and support diverse real-world applications.

Title: Meta Learning-Driven Iterative Refinement for Robust Anomaly Detection in Industrial Inspection

Authors: Muhammad Aqeel, Shakiba Sharifi, Marco Cristani, Francesco Setti
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.01569
Pdf URL: https://arxiv.org/pdf/2503.01569
Copy Paste: [[2503.01569]] Meta Learning-Driven Iterative Refinement for Robust Anomaly Detection in Industrial Inspection(https://arxiv.org/abs/2503.01569)
Keywords: anomaly
Abstract: This study investigates the performance of robust anomaly detection models in industrial inspection, focusing particularly on their ability to handle noisy data. We propose to leverage the adaptation ability of meta learning approaches to identify and reject noisy training data to improve the learning process. In our model, we employ Model Agnostic Meta Learning (MAML) and an iterative refinement process through an Inter-Quartile Range rejection scheme to enhance their adaptability and robustness. This approach significantly improves the models capability to distinguish between normal and defective conditions. Our results of experiments conducted on well known MVTec and KSDD2 datasets demonstrate that the proposed method not only excels in environments with substantial noise but can also contribute in case of a clear training set, isolating those samples that are relatively out of distribution, thus offering significant improvements over traditional models.

Title: MRI super-resolution reconstruction using efficient diffusion probabilistic model with residual shifting

Authors: Mojtaba Safari, Shansong Wang, Zach Eidex, Qiang Li, Erik H. Middlebrooks, David S. Yu, Xiaofeng Yang
Subjects: cs.CV, physics.med-ph
Abstract URL: https://arxiv.org/abs/2503.01576
Pdf URL: https://arxiv.org/pdf/2503.01576
Copy Paste: [[2503.01576]] MRI super-resolution reconstruction using efficient diffusion probabilistic model with residual shifting(https://arxiv.org/abs/2503.01576)
Keywords: diffusion
Abstract: Objective:This study introduces a residual error-shifting mechanism that drastically reduces sampling steps while preserving critical anatomical details, thus accelerating MRI reconstruction. Approach:We propose a novel diffusion-based SR framework called Res-SRDiff, which integrates residual error shifting into the forward diffusion process. This enables efficient HR image reconstruction by aligning the degraded HR and LR this http URL evaluated Res-SRDiff on ultra-high-field brain T1 MP2RAGE maps and T2-weighted prostate images, comparing it with Bicubic, Pix2pix, CycleGAN, and a conventional denoising diffusion probabilistic model with vision transformer backbone (TM-DDPM), using quantitative metrics such as peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), gradient magnitude similarity deviation (GMSD), and learned perceptual image patch similarity (LPIPS). Main results: Res-SRDiff significantly outperformed all comparative methods in terms of PSNR, SSIM, and GMSD across both datasets, with statistically significant improvements (p-values<<0.05). The model achieved high-fidelity image restoration with only four sampling steps, drastically reducing computational time to under one second per slice, which is substantially faster than conventional TM-DDPM with around 20 seconds per slice. Qualitative analyses further demonstrated that Res-SRDiff effectively preserved fine anatomical details and lesion morphology in both brain and pelvic MRI images. Significance: Our findings show that Res-SRDiff is an efficient and accurate MRI SR method, markedly improving computational efficiency and image quality. Integrating residual error shifting into the diffusion process allows for rapid and robust HR image reconstruction, enhancing clinical MRI workflows and advancing medical imaging research. The source at:this https URL

Title: EliteKV: Scalable KV Cache Compression via RoPE Frequency Selection and Joint Low-Rank Projection

Authors: Yuhao Zhou, Sirui Song, Boyang Liu, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Zhihao Zhang, Wei Li, Xuanjing Huang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.01586
Pdf URL: https://arxiv.org/pdf/2503.01586
Copy Paste: [[2503.01586]] EliteKV: Scalable KV Cache Compression via RoPE Frequency Selection and Joint Low-Rank Projection(https://arxiv.org/abs/2503.01586)
Keywords: foundation model
Abstract: Rotary Position Embedding (RoPE) enables each attention head to capture multi-frequency information along the sequence dimension and is widely applied in foundation models. However, the nonlinearity introduced by RoPE complicates optimization of the key state in the Key-Value (KV) cache for RoPE-based attention. Existing KV cache compression methods typically store key state before rotation and apply the transformation during decoding, introducing additional computational overhead. This paper introduces EliteKV, a flexible modification framework for RoPE-based models supporting variable KV cache compression ratios. EliteKV first identifies the intrinsic frequency preference of each head using RoPElite, selectively restoring linearity to certain dimensions of key within attention computation. Building on this, joint low-rank compression of key and value enables partial cache sharing. Experimental results show that with minimal uptraining on only $0.6\%$ of the original training data, RoPE-based models achieve a $75\%$ reduction in KV cache size while preserving performance within a negligible margin. Furthermore, EliteKV consistently performs well across models of different scales within the same family.

Title: In-context Learning vs. Instruction Tuning: The Case of Small and Multilingual Language Models

Authors: David Ponce, Thierry Etchegoyhen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01611
Pdf URL: https://arxiv.org/pdf/2503.01611
Copy Paste: [[2503.01611]] In-context Learning vs. Instruction Tuning: The Case of Small and Multilingual Language Models(https://arxiv.org/abs/2503.01611)
Keywords: in-context
Abstract: Instruction following is a critical ability for Large Language Models to perform downstream tasks. The standard approach to instruction alignment has relied on a specific phase of model tuning over curated instruction datasets, optionally complemented with an alignment step over human preferences. Recent work has shown the potential of in-context learning (ICL) alternatives to guide base models towards instruction following. This type of approach is particularly relevant to extend instruction following across languages and models of varying sizes adapted to different types of usage. In this work we compare ICL and instruction fine-tuning in English, French and Spanish, on Small Language Models, and provide experimental results on applying Direct Preference Optimisation (DPO) over base models. Our results show that scenarios involving multilingual and smaller models result in downgraded ICL instruction following performance, only partially mitigated by DPO alignment. This study aims to further our understanding of current strengths and limitations of alternative methods for instruction following.

Title: A General Purpose Spectral Foundational Model for Both Proximal and Remote Sensing Spectral Imaging

Authors: William Michael Laprade, Jesper Cairo Westergaard, Svend Christensen, Mads Nielsen, Anders Bjorholm Dahl
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01628
Pdf URL: https://arxiv.org/pdf/2503.01628
Copy Paste: [[2503.01628]] A General Purpose Spectral Foundational Model for Both Proximal and Remote Sensing Spectral Imaging(https://arxiv.org/abs/2503.01628)
Keywords: foundation model
Abstract: Spectral imaging data acquired via multispectral and hyperspectral cameras can have hundreds of channels, where each channel records the reflectance at a specific wavelength and bandwidth. Time and resource constraints limit our ability to collect large spectral datasets, making it difficult to build and train predictive models from scratch. In the RGB domain, we can often alleviate some of the limitations of smaller datasets by using pretrained foundational models as a starting point. However, most existing foundation models are pretrained on large datasets of 3-channel RGB images, severely limiting their effectiveness when used with spectral imaging data. The few spectral foundation models that do exist usually have one of two limitations: (1) they are built and trained only on remote sensing data limiting their application in proximal spectral imaging, (2) they utilize the more widely available multispectral imaging datasets with less than 15 channels restricting their use with hundred-channel hyperspectral images. To alleviate these issues, we propose a large-scale foundational model and dataset built upon the masked autoencoder architecture that takes advantage of spectral channel encoding, spatial-spectral masking and ImageNet pretraining for an adaptable and robust model for downstream spectral imaging tasks.

Title: SparseMamba-PCL: Scribble-Supervised Medical Image Segmentation via SAM-Guided Progressive Collaborative Learning

Authors: Luyi Qiu, Tristan Till, Xiaobao Guo, Adams Wai-Kin Kong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01633
Pdf URL: https://arxiv.org/pdf/2503.01633
Copy Paste: [[2503.01633]] SparseMamba-PCL: Scribble-Supervised Medical Image Segmentation via SAM-Guided Progressive Collaborative Learning(https://arxiv.org/abs/2503.01633)
Keywords: foundation model
Abstract: Scribble annotations significantly reduce the cost and labor required for dense labeling in large medical datasets with complex anatomical structures. However, current scribble-supervised learning methods are limited in their ability to effectively propagate sparse annotation labels to dense segmentation masks and accurately segment object boundaries. To address these issues, we propose a Progressive Collaborative Learning framework that leverages novel algorithms and the Med-SAM foundation model to enhance information quality during training. (1) We enrich ground truth scribble segmentation labels through a new algorithm, propagating scribbles to estimate object boundaries. (2) We enhance feature representation by optimizing Med-SAM-guided training through the fusion of feature embeddings from Med-SAM and our proposed Sparse Mamba network. This enriched representation also facilitates the fine-tuning of the Med-SAM decoder with enriched scribbles. (3) For inference, we introduce a Sparse Mamba network, which is highly capable of capturing local and global dependencies by replacing the traditional sequential patch processing method with a skip-sampling procedure. Experiments on the ACDC, CHAOS, and MSCMRSeg datasets validate the effectiveness of our framework, outperforming nine state-of-the-art methods. Our code is available at \href{this https URL}{this http URL}.

Title: DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models

Authors: Zhendong Wang, Jianmin Bao, Shuyang Gu, Dong Chen, Wengang Zhou, Houqiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01645
Pdf URL: https://arxiv.org/pdf/2503.01645
Copy Paste: [[2503.01645]] DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models(https://arxiv.org/abs/2503.01645)
Keywords: diffusion
Abstract: In this paper, we present DesignDiffusion, a simple yet effective framework for the novel task of synthesizing design images from textual descriptions. A primary challenge lies in generating accurate and style-consistent textual and visual content. Existing works in a related task of visual text generation often focus on generating text within given specific regions, which limits the creativity of generation models, resulting in style or color inconsistencies between textual and visual elements if applied to design image generation. To address this issue, we propose an end-to-end, one-stage diffusion-based framework that avoids intricate components like position and layout modeling. Specifically, the proposed framework directly synthesizes textual and visual design elements from user prompts. It utilizes a distinctive character embedding derived from the visual text to enhance the input prompt, along with a character localization loss for enhanced supervision during text generation. Furthermore, we employ a self-play Direct Preference Optimization fine-tuning strategy to improve the quality and accuracy of the synthesized visual text. Extensive experiments demonstrate that DesignDiffusion achieves state-of-the-art performance in design image generation.

Title: ToLo: A Two-Stage, Training-Free Layout-To-Image Generation Framework For High-Overlap Layouts

Authors: Linhao Huang, Jing Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01667
Pdf URL: https://arxiv.org/pdf/2503.01667
Copy Paste: [[2503.01667]] ToLo: A Two-Stage, Training-Free Layout-To-Image Generation Framework For High-Overlap Layouts(https://arxiv.org/abs/2503.01667)
Keywords: diffusion
Abstract: Recent training-free layout-to-image diffusion models have demonstrated remarkable performance in generating high-quality images with controllable layouts. These models follow a one-stage framework: Encouraging the model to focus the attention map of each concept on its corresponding region by defining attention map-based losses. However, these models still struggle to accurately follow layouts with significant overlap, often leading to issues like attribute leakage and missing entities. In this paper, we propose ToLo, a two-stage, training-free layout-to-image generation framework for high-overlap layouts. Our framework consists of two stages: the aggregation stage and the separation stage, each with its own loss function based on the attention map. To provide a more effective evaluation, we partition the HRS dataset based on the Intersection over Union (IoU) of the input layouts, creating a new dataset for layout-to-image generation with varying levels of overlap. Through extensive experiments on this dataset, we demonstrate that ToLo significantly enhances the performance of existing methods when dealing with high-overlap layouts. Our code and dataset are available here: this https URL.

Title: GRNFormer: A Biologically-Guided Framework for Integrating Gene Regulatory Networks into RNA Foundation Models

Authors: Mufan Qiu, Xinyu Hu, Fengwei Zhan, Sukwon Yun, Jie Peng, Ruichen Zhang, Bhavya Kailkhura, Jiekun Yang, Tianlong Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.01682
Pdf URL: https://arxiv.org/pdf/2503.01682
Copy Paste: [[2503.01682]] GRNFormer: A Biologically-Guided Framework for Integrating Gene Regulatory Networks into RNA Foundation Models(https://arxiv.org/abs/2503.01682)
Keywords: foundation model
Abstract: Foundation models for single-cell RNA sequencing (scRNA-seq) have shown promising capabilities in capturing gene expression patterns. However, current approaches face critical limitations: they ignore biological prior knowledge encoded in gene regulatory relationships and fail to leverage multi-omics signals that could provide complementary regulatory insights. In this paper, we propose GRNFormer, a new framework that systematically integrates multi-scale Gene Regulatory Networks (GRNs) inferred from multi-omics data into RNA foundation model training. Our framework introduces two key innovations. First, we introduce a pipeline for constructing hierarchical GRNs that capture regulatory relationships at both cell-type-specific and cell-specific resolutions. Second, we design a structure-aware integration framework that addresses the information asymmetry in GRNs through two technical advances: (1) A graph topological adapter using multi-head cross-attention to weight regulatory relationships dynamically, and (2) a novel edge perturbation strategy that perturb GRNs with biologically-informed co-expression links to augment graph neural network training. Comprehensive experiments have been conducted on three representative downstream tasks across multiple model architectures to demonstrate the effectiveness of GRNFormer. It achieves consistent improvements over state-of-the-art (SoTA) baselines: $3.6\%$ increase in drug response prediction correlation, $9.6\%$ improvement in single-cell drug classification AUC, and $1.1\%$ average gain in gene perturbation prediction accuracy.

Title: Generate, Discriminate, Evolve: Enhancing Context Faithfulness via Fine-Grained Sentence-Level Self-Evolution

Authors: Kun Li, Tianhua Zhang, Yunxiang Li, Hongyin Luo, Abdalla Moustafa, Xixin Wu, James Glass, Helen Meng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.01695
Pdf URL: https://arxiv.org/pdf/2503.01695
Copy Paste: [[2503.01695]] Generate, Discriminate, Evolve: Enhancing Context Faithfulness via Fine-Grained Sentence-Level Self-Evolution(https://arxiv.org/abs/2503.01695)
Keywords: generative
Abstract: Improving context faithfulness in large language models is essential for developing trustworthy retrieval augmented generation systems and mitigating hallucinations, especially in long-form question answering (LFQA) tasks or scenarios involving knowledge conflicts. Existing methods either intervene LLMs only at inference without addressing their inherent limitations or overlook the potential for self-improvement. In this paper, we introduce GenDiE (Generate, Discriminate, Evolve), a novel self-evolving framework that enhances context faithfulness through fine-grained sentence-level optimization. GenDiE combines both generative and discriminative training, equipping LLMs with self-generation and self-scoring capabilities to facilitate iterative self-evolution. This supports both data construction for model alignment and score-guided search during inference. Furthermore, by treating each sentence in a response as an independent optimization unit, GenDiE effectively addresses the limitations of previous approaches that optimize at the holistic answer level, which may miss unfaithful details. Experiments on ASQA (in-domain LFQA) and ConFiQA (out-of-domain counterfactual QA) datasets demonstrate that GenDiE surpasses various baselines in both faithfulness and correctness, and exhibits robust performance for domain adaptation.

Title: KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation

Authors: Antoni Bigata, Michał Stypułkowski, Rodrigo Mira, Stella Bounareli, Konstantinos Vougioukas, Zoe Landgraf, Nikita Drobyshev, Maciej Zieba, Stavros Petridis, Maja Pantic
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.01715
Pdf URL: https://arxiv.org/pdf/2503.01715
Copy Paste: [[2503.01715]] KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation(https://arxiv.org/abs/2503.01715)
Keywords: diffusion
Abstract: Current audio-driven facial animation methods achieve impressive results for short videos but suffer from error accumulation and identity drift when extended to longer durations. Existing methods attempt to mitigate this through external spatial control, increasing long-term consistency but compromising the naturalness of motion. We propose KeyFace, a novel two-stage diffusion-based framework, to address these issues. In the first stage, keyframes are generated at a low frame rate, conditioned on audio input and an identity frame, to capture essential facial expressions and movements over extended periods of time. In the second stage, an interpolation model fills in the gaps between keyframes, ensuring smooth transitions and temporal coherence. To further enhance realism, we incorporate continuous emotion representations and handle a wide range of non-speech vocalizations (NSVs), such as laughter and sighs. We also introduce two new evaluation metrics for assessing lip synchronization and NSV generation. Experimental results show that KeyFace outperforms state-of-the-art methods in generating natural, coherent facial animations over extended durations, successfully encompassing NSVs and continuous emotions.

Title: Quality Measures for Dynamic Graph Generative Models

Authors: Ryien Hosseini, Filippo Simini, Venkatram Vishwanath, Rebecca Willett, Henry Hoffmann
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.01720
Pdf URL: https://arxiv.org/pdf/2503.01720
Copy Paste: [[2503.01720]] Quality Measures for Dynamic Graph Generative Models(https://arxiv.org/abs/2503.01720)
Keywords: generative
Abstract: Deep generative models have recently achieved significant success in modeling graph data, including dynamic graphs, where topology and features evolve over time. However, unlike in vision and natural language domains, evaluating generative models for dynamic graphs is challenging due to the difficulty of visualizing their output, making quantitative metrics essential. In this work, we develop a new quality metric for evaluating generative models of dynamic graphs. Current metrics for dynamic graphs typically involve discretizing the continuous-evolution of graphs into static snapshots and then applying conventional graph similarity measures. This approach has several limitations: (a) it models temporally related events as i.i.d. samples, failing to capture the non-uniform evolution of dynamic graphs; (b) it lacks a unified measure that is sensitive to both features and topology; (c) it fails to provide a scalar metric, requiring multiple metrics without clear superiority; and (d) it requires explicitly instantiating each static snapshot, leading to impractical runtime demands that hinder evaluation at scale. We propose a novel metric based on the \textit{Johnson-Lindenstrauss} lemma, applying random projections directly to dynamic graph data. This results in an expressive, scalar, and application-agnostic measure of dynamic graph similarity that overcomes the limitations of traditional methods. We also provide a comprehensive empirical evaluation of metrics for continuous-time dynamic graphs, demonstrating the effectiveness of our approach compared to existing methods. Our implementation is available at this https URL.

Title: Self-attention-based Diffusion Model for Time-series Imputation in Partial Blackout Scenarios

Authors: Mohammad Rafid Ul Islam, Prasad Tadepalli, Alan Fern
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.01737
Pdf URL: https://arxiv.org/pdf/2503.01737
Copy Paste: [[2503.01737]] Self-attention-based Diffusion Model for Time-series Imputation in Partial Blackout Scenarios(https://arxiv.org/abs/2503.01737)
Keywords: diffusion
Abstract: Missing values in multivariate time series data can harm machine learning performance and introduce bias. These gaps arise from sensor malfunctions, blackouts, and human error and are typically addressed by data imputation. Previous work has tackled the imputation of missing data in random, complete blackouts and forecasting scenarios. The current paper addresses a more general missing pattern, which we call "partial blackout," where a subset of features is missing for consecutive time steps. We introduce a two-stage imputation process using self-attention and diffusion processes to model feature and temporal correlations. Notably, our model effectively handles missing data during training, enhancing adaptability and ensuring reliable imputation and performance, even with incomplete datasets. Our experiments on benchmark and two real-world time series datasets demonstrate that our model outperforms the state-of-the-art in partial blackout scenarios and shows better scalability.

Title: VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation

Authors: Wenhao Wang, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01739
Pdf URL: https://arxiv.org/pdf/2503.01739
Copy Paste: [[2503.01739]] VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation(https://arxiv.org/abs/2503.01739)
Keywords: generative
Abstract: Text-to-video generative models convert textual prompts into dynamic visual content, offering wide-ranging applications in film production, gaming, and education. However, their real-world performance often falls short of user expectations. One key reason is that these models have not been trained on videos related to some topics users want to create. In this paper, we propose VideoUFO, the first Video dataset specifically curated to align with Users' FOcus in real-world scenarios. Beyond this, our VideoUFO also features: (1) minimal ($0.29\%$) overlap with existing video datasets, and (2) videos searched exclusively via YouTube's official API under the Creative Commons license. These two attributes provide future researchers with greater freedom to broaden their training sources. The VideoUFO comprises over $1.09$ million video clips, each paired with both a brief and a detailed caption (description). Specifically, through clustering, we first identify $1,291$ user-focused topics from the million-scale real text-to-video prompt dataset, VidProM. Then, we use these topics to retrieve videos from YouTube, split the retrieved videos into clips, and generate both brief and detailed captions for each clip. After verifying the clips with specified topics, we are left with about $1.09$ million video clips. Our experiments reveal that (1) current $16$ text-to-video models do not achieve consistent performance across all user-focused topics; and (2) a simple model trained on VideoUFO outperforms others on worst-performing topics. The dataset is publicly available at this https URL under the CC BY 4.0 License.

Title: ECG-EmotionNet: Nested Mixture of Expert (NMoE) Adaptation of ECG-Foundation Model for Driver Emotion Recognition

Authors: Nastaran Mansourian, Arash Mohammadi, M. Omair Ahmad, M.N.S. Swamy
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2503.01750
Pdf URL: https://arxiv.org/pdf/2503.01750
Copy Paste: [[2503.01750]] ECG-EmotionNet: Nested Mixture of Expert (NMoE) Adaptation of ECG-Foundation Model for Driver Emotion Recognition(https://arxiv.org/abs/2503.01750)
Keywords: foundation model
Abstract: Driver emotion recognition plays a crucial role in driver monitoring systems, enhancing human-autonomy interactions and the trustworthiness of Autonomous Driving (AD). Various physiological and behavioural modalities have been explored for this purpose, with Electrocardiogram (ECG) emerging as a standout choice for real-time emotion monitoring, particularly in dynamic and unpredictable driving conditions. Existing methods, however, often rely on multi-channel ECG signals recorded under static conditions, limiting their applicability in real-world dynamic driving scenarios. To address this limitation, the paper introduces ECG-EmotionNet, a novel architecture designed specifically for emotion recognition in dynamic driving environments. ECG-EmotionNet is constructed by adapting a recently introduced ECG Foundation Model (FM) and uniquely employs single-channel ECG signals, ensuring both robust generalizability and computational efficiency. Unlike conventional adaptation methods such as full fine-tuning, linear probing, or low-rank adaptation, we propose an intuitively pleasing alternative, referred to as the nested Mixture of Experts (MoE) adaptation. More precisely, each transformer layer of the underlying FM is treated as a separate expert, with embeddings extracted from these experts fused using trainable weights within a gating mechanism. This approach enhances the representation of both global and local ECG features, leading to a 6% improvement in accuracy and a 7% increase in the F1 score, all while maintaining computational efficiency. The effectiveness of the proposed ECG-EmotionNet architecture is evaluated using a recently introduced and challenging driver emotion monitoring dataset.

Title: Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models

Authors: Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, Huan Ling
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01774
Pdf URL: https://arxiv.org/pdf/2503.01774
Copy Paste: [[2503.01774]] Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models(https://arxiv.org/abs/2503.01774)
Keywords: diffusion
Abstract: Neural Radiance Fields and 3D Gaussian Splatting have revolutionized 3D reconstruction and novel-view synthesis task. However, achieving photorealistic rendering from extreme novel viewpoints remains challenging, as artifacts persist across representations. In this work, we introduce Difix3D+, a novel pipeline designed to enhance 3D reconstruction and novel-view synthesis through single-step diffusion models. At the core of our approach is Difix, a single-step image diffusion model trained to enhance and remove artifacts in rendered novel views caused by underconstrained regions of the 3D representation. Difix serves two critical roles in our pipeline. First, it is used during the reconstruction phase to clean up pseudo-training views that are rendered from the reconstruction and then distilled back into 3D. This greatly enhances underconstrained regions and improves the overall 3D representation quality. More importantly, Difix also acts as a neural enhancer during inference, effectively removing residual artifacts arising from imperfect 3D supervision and the limited capacity of current reconstruction models. Difix3D+ is a general solution, a single model compatible with both NeRF and 3DGS representations, and it achieves an average 2$\times$ improvement in FID score over baselines while maintaining 3D consistency.

Title: Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation

Authors: Tiansheng Wen, Yifei Wang, Zequn Zeng, Zhong Peng, Yudi Su, Xinyang Liu, Bo Chen, Hongwei Liu, Stefanie Jegelka, Chenyu You
Subjects: cs.LG, cs.AI, cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2503.01776
Pdf URL: https://arxiv.org/pdf/2503.01776
Copy Paste: [[2503.01776]] Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation(https://arxiv.org/abs/2503.01776)
Keywords: generative
Abstract: Many large-scale systems rely on high-quality deep representations (embeddings) to facilitate tasks like retrieval, search, and generative modeling. Matryoshka Representation Learning (MRL) recently emerged as a solution for adaptive embedding lengths, but it requires full model retraining and suffers from noticeable performance degradations at short lengths. In this paper, we show that sparse coding offers a compelling alternative for achieving adaptive representation with minimal overhead and higher fidelity. We propose Contrastive Sparse Representation (CSR), a method that sparsifies pre-trained embeddings into a high-dimensional but selectively activated feature space. By leveraging lightweight autoencoding and task-aware contrastive objectives, CSR preserves semantic quality while allowing flexible, cost-effective inference at different sparsity levels. Extensive experiments on image, text, and multimodal benchmarks demonstrate that CSR consistently outperforms MRL in terms of both accuracy and retrieval speed-often by large margins-while also cutting training time to a fraction of that required by MRL. Our results establish sparse coding as a powerful paradigm for adaptive representation learning in real-world applications where efficiency and fidelity are both paramount. Code is available at this https URL

Title: OFF-CLIP: Improving Normal Detection Confidence in Radiology CLIP with Simple Off-Diagonal Term Auto-Adjustment

Authors: Junhyun Park, Chanyu Moon, Donghwan Lee, Kyungsu Kim, Minho Hwang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01794
Pdf URL: https://arxiv.org/pdf/2503.01794
Copy Paste: [[2503.01794]] OFF-CLIP: Improving Normal Detection Confidence in Radiology CLIP with Simple Off-Diagonal Term Auto-Adjustment(https://arxiv.org/abs/2503.01794)
Keywords: anomaly
Abstract: Contrastive Language-Image Pre-Training (CLIP) has enabled zero-shot classification in radiology, reducing reliance on manual annotations. However, conventional contrastive learning struggles with normal case detection due to its strict intra-sample alignment, which disrupts normal sample clustering and leads to high false positives (FPs) and false negatives (FNs). To address these issues, we propose OFF-CLIP, a contrastive learning refinement that improves normal detection by introducing an off-diagonal term loss to enhance normal sample clustering and applying sentence-level text filtering to mitigate FNs by removing misaligned normal statements from abnormal reports. OFF-CLIP can be applied to radiology CLIP models without requiring any architectural modifications. Experimental results show that OFF-CLIP significantly improves normal classification, achieving a 0.61 Area under the curve (AUC) increase on VinDr-CXR over CARZero, the state-of-the-art zero-shot classification baseline, while maintaining or improving abnormal classification performance. Additionally, OFF-CLIP enhances zero-shot grounding by improving pointing game accuracy, confirming better anomaly localization. These results demonstrate OFF-CLIP's effectiveness as a robust and efficient enhancement for medical vision-language models.

Title: On the Power of Context-Enhanced Learning in LLMs

Authors: Xingyu Zhu, Abhishek Panigrahi, Sanjeev Arora
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.01821
Pdf URL: https://arxiv.org/pdf/2503.01821
Copy Paste: [[2503.01821]] On the Power of Context-Enhanced Learning in LLMs(https://arxiv.org/abs/2503.01821)
Keywords: in-context
Abstract: We formalize a new concept for LLMs, context-enhanced learning. It involves standard gradient-based learning on text except that the context is enhanced with additional data on which no auto-regressive gradients are computed. This setting is a gradient-based analog of usual in-context learning (ICL) and appears in some recent works. Using a multi-step reasoning task, we prove in a simplified setting that context-enhanced learning can be exponentially more sample-efficient than standard learning when the model is capable of ICL. At a mechanistic level, we find that the benefit of context-enhancement arises from a more accurate gradient learning signal. We also experimentally demonstrate that it appears hard to detect or recover learning materials that were used in the context during training. This may have implications for data security as well as copyright.

Title: Open-source framework for detecting bias and overfitting for large pathology images

Authors: Anders Sildnes, Nikita Shvetsov, Masoud Tafavvoghi, Vi Ngoc-Nha Tran, Kajsa Møllersen, Lill-Tove Rasmussen Busund, Thomas K. Kilvær, Lars Ailo Bongo
Subjects: cs.LG, cs.SE, eess.IV
Abstract URL: https://arxiv.org/abs/2503.01827
Pdf URL: https://arxiv.org/pdf/2503.01827
Copy Paste: [[2503.01827]] Open-source framework for detecting bias and overfitting for large pathology images(https://arxiv.org/abs/2503.01827)
Keywords: self-supervised, foundation model
Abstract: Even foundational models that are trained on datasets with billions of data samples may develop shortcuts that lead to overfitting and bias. Shortcuts are non-relevant patterns in data, such as the background color or color intensity. So, to ensure the robustness of deep learning applications, there is a need for methods to detect and remove such shortcuts. Today's model debugging methods are time consuming since they often require customization to fit for a given model architecture in a specific domain. We propose a generalized, model-agnostic framework to debug deep learning models. We focus on the domain of histopathology, which has very large images that require large models - and therefore large computation resources. It can be run on a workstation with a commodity GPU. We demonstrate that our framework can replicate non-image shortcuts that have been found in previous work for self-supervised learning models, and we also identify possible shortcuts in a foundation model. Our easy to use tests contribute to the development of more reliable, accurate, and generalizable models for WSI analysis. Our framework is available as an open-source tool available on github.

Title: Denoising Functional Maps: Diffusion Models for Shape Correspondence

Authors: Aleksei Zhuravlev, Zorah Lähner, Vladislav Golyanik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.01845
Pdf URL: https://arxiv.org/pdf/2503.01845
Copy Paste: [[2503.01845]] Denoising Functional Maps: Diffusion Models for Shape Correspondence(https://arxiv.org/abs/2503.01845)
Keywords: diffusion
Abstract: Estimating correspondences between pairs of deformable shapes remains a challenging problem. Despite substantial progress, existing methods lack broad generalization capabilities and require category-specific training data. To address these limitations, we propose a fundamentally new approach to shape correspondence based on denoising diffusion models. In our method, a diffusion model learns to directly predict the functional map, a low-dimensional representation of a point-wise map between shapes. We use a large dataset of synthetic human meshes for training and employ two steps to reduce the number of functional maps that need to be learned. First, the maps refer to a template rather than shape pairs. Second, the functional map is defined in a basis of eigenvectors of the Laplacian, which is not unique due to sign ambiguity. Therefore, we introduce an unsupervised approach to select a specific basis by correcting the signs of eigenvectors based on surface features. Our approach achieves competitive performance on standard human datasets, meshes with anisotropic connectivity, non-isometric humanoid shapes, as well as animals compared to existing descriptor-based and large-scale shape deformation methods.