2025-01-07

Title: SmartSpatial: Enhancing the 3D Spatial Arrangement Capabilities of Stable Diffusion Models and Introducing a Novel 3D Spatial Evaluation Framework

Authors: Mao Xun Huang, Hen-Hsen Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.01998
Pdf URL: https://arxiv.org/pdf/2501.01998
Copy Paste: [[2501.01998]] SmartSpatial: Enhancing the 3D Spatial Arrangement Capabilities of Stable Diffusion Models and Introducing a Novel 3D Spatial Evaluation Framework(https://arxiv.org/abs/2501.01998)
Keywords: diffusion
Abstract: Stable Diffusion models have made remarkable strides in generating photorealistic images from text prompts but often falter when tasked with accurately representing complex spatial arrangements, particularly involving intricate 3D relationships. To address this limitation, we introduce SmartSpatial, an innovative approach that enhances the spatial arrangement capabilities of Stable Diffusion models through 3D-aware conditioning and attention-guided mechanisms. SmartSpatial incorporates depth information and employs cross-attention control to ensure precise object placement, delivering notable improvements in spatial accuracy metrics. In conjunction with SmartSpatial, we present SmartSpatialEval, a comprehensive evaluation framework designed to assess spatial relationships. This framework utilizes vision-language models and graph-based dependency parsing for performance analysis. Experimental results on the COCO and SpatialPrompts datasets show that SmartSpatial significantly outperforms existing methods, setting new benchmarks for spatial arrangement accuracy in image generation.

Title: Information Subtraction: Learning Representations for Conditional Entropy

Authors: Keng Hou Leong, Yuxuan Xiu, Wai Kin (Victor) Chan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.02012
Pdf URL: https://arxiv.org/pdf/2501.02012
Copy Paste: [[2501.02012]] Information Subtraction: Learning Representations for Conditional Entropy(https://arxiv.org/abs/2501.02012)
Keywords: generative
Abstract: The representations of conditional entropy and conditional mutual information are significant in explaining the unique effects among variables. While previous studies based on conditional contrastive sampling have effectively removed information regarding discrete sensitive variables, they have not yet extended their scope to continuous cases. This paper introduces Information Subtraction, a framework designed to generate representations that preserve desired information while eliminating the undesired. We implement a generative-based architecture that outputs these representations by simultaneously maximizing an information term and minimizing another. With its flexibility in disentangling information, we can iteratively apply Information Subtraction to represent arbitrary information components between continuous variables, thereby explaining the various relationships that exist between them. Our results highlight the representations' ability to provide semantic features of conditional entropy. By subtracting sensitive and domain-specific information, our framework demonstrates effective performance in fair learning and domain generalization. The code for this paper is available at this https URL

Title: Machine Learning-Based Differential Diagnosis of Parkinson's Disease Using Kinematic Feature Extraction and Selection

Authors: Masahiro Matsumoto, Abu Saleh Musa Miah, Nobuyoshi Asai, Jungpil Shin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.02014
Pdf URL: https://arxiv.org/pdf/2501.02014
Copy Paste: [[2501.02014]] Machine Learning-Based Differential Diagnosis of Parkinson's Disease Using Kinematic Feature Extraction and Selection(https://arxiv.org/abs/2501.02014)
Keywords: generative
Abstract: Parkinson's disease (PD), the second most common neurodegenerative disorder, is characterized by dopaminergic neuron loss and the accumulation of abnormal synuclein. PD presents both motor and non-motor symptoms that progressively impair daily functioning. The severity of these symptoms is typically assessed using the MDS-UPDRS rating scale, which is subjective and dependent on the physician's experience. Additionally, PD shares symptoms with other neurodegenerative diseases, such as progressive supranuclear palsy (PSP) and multiple system atrophy (MSA), complicating accurate diagnosis. To address these diagnostic challenges, we propose a machine learning-based system for differential diagnosis of PD, PSP, MSA, and healthy controls (HC). This system utilizes a kinematic feature-based hierarchical feature extraction and selection approach. Initially, 18 kinematic features are extracted, including two newly proposed features: thumb-to-index vector velocity and acceleration, which provide insights into motor control patterns. In addition, 41 statistical features are extracted from each kinematic feature, including some new ones such as Average Absolute Change, Rhythm, Amplitude, Frequency, Standard Deviation of Frequency, and Slope. Feature selection is performed using One-way ANOVA to rank features, followed by Sequential Forward Floating Selection (SFFS) to identify the most relevant ones, aiming to reduce the computational complexity. The final feature set is used for classification, achieving a classification accuracy of 66.67% for each dataset and 88.89% for each patient, with particularly high performance for the MSA and HC groups using the SVM algorithm. This system shows potential as a rapid and accurate diagnostic tool in clinical practice, though further data collection and refinement are needed to enhance its reliability.
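Note: The One-way ANOVA ranking step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `anova_f` and `rank_features` and the toy data layout (rows of feature values plus one class label per row) are assumptions made for illustration, and the subsequent SFFS step is omitted.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups (each a list of numbers)."""
    k = len(groups)                      # number of classes
    n = sum(len(g) for g in groups)      # total number of samples
    grand = sum(sum(g) for g in groups) / n
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(X, y):
    """Return feature indices sorted from highest to lowest F statistic."""
    labels = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        groups = [[row[j] for row, lab in zip(X, y) if lab == c] for c in labels]
        scores.append(anova_f(groups))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

A wrapper method such as SFFS would then search over the top-ranked features; that search depends on the classifier and is not shown here.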

Title: 3D Cloud reconstruction through geospatially-aware Masked Autoencoders

Authors: Stella Girtsou, Emiliano Diaz Salas-Porras, Lilli Freischem, Joppe Massant, Kyriaki-Margarita Bintsi, Guiseppe Castiglione, William Jones, Michael Eisinger, Emmanuel Johnson, Anna Jungbluth
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.02035
Pdf URL: https://arxiv.org/pdf/2501.02035
Copy Paste: [[2501.02035]] 3D Cloud reconstruction through geospatially-aware Masked Autoencoders(https://arxiv.org/abs/2501.02035)
Keywords: self-supervised
Abstract: Clouds play a key role in Earth's radiation balance with complex effects that introduce large uncertainties into climate models. Real-time 3D cloud data is essential for improving climate predictions. This study leverages geostationary imagery from MSG/SEVIRI and radar reflectivity measurements of cloud profiles from CloudSat/CPR to reconstruct 3D cloud structures. We first apply self-supervised learning (SSL) methods, namely Masked Autoencoders (MAE) and geospatially-aware SatMAE, on unlabelled MSG images, and then fine-tune our models on matched image-profile pairs. Our approach outperforms state-of-the-art methods like U-Nets, and our geospatial encoding further improves prediction results, demonstrating the potential of SSL for cloud reconstruction.

Title: Advancing Pancreatic Cancer Prediction with a Next Visit Token Prediction Head on top of Med-BERT

Authors: Jianping He, Laila Rasmy, Degui Zhi, Cui Tao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.02044
Pdf URL: https://arxiv.org/pdf/2501.02044
Copy Paste: [[2501.02044]] Advancing Pancreatic Cancer Prediction with a Next Visit Token Prediction Head on top of Med-BERT(https://arxiv.org/abs/2501.02044)
Keywords: foundation model
Abstract: Background: Recently, numerous foundation models pretrained on extensive data have demonstrated efficacy in disease prediction using Electronic Health Records (EHRs). However, there remain some unanswered questions on how to best utilize such models, especially with very small fine-tuning cohorts. Methods: We utilized Med-BERT, an EHR-specific foundation model, and reformulated the disease binary prediction task into a token prediction task and a next visit mask token prediction task to align with Med-BERT's pretraining task format in order to improve the accuracy of pancreatic cancer (PaCa) prediction in both few-shot and fully supervised settings. Results: The reformulation of the task into a token prediction task, referred to as Med-BERT-Sum, demonstrates slightly superior performance in both few-shot scenarios and larger data samples. Furthermore, reformulating the prediction task as a Next Visit Mask Token Prediction task (Med-BERT-Mask) significantly outperforms the conventional Binary Classification (BC) prediction task (Med-BERT-BC) by 3% to 7% in few-shot scenarios with data sizes ranging from 10 to 500 samples. These findings highlight that aligning the downstream task with Med-BERT's pretraining objectives substantially enhances the model's predictive capabilities, thereby improving its effectiveness in predicting both rare and common diseases. Conclusion: Reformatting disease prediction tasks to align with the pretraining of foundation models enhances prediction accuracy, leading to earlier detection and timely intervention. This approach improves treatment effectiveness, survival rates, and overall patient outcomes for PaCa and potentially other cancers.

Title: Active Learning Enables Extrapolation in Molecular Generative Models

Authors: Evan R. Antoniuk, Peggy Li, Nathan Keilbart, Stephen Weitzner, Bhavya Kailkhura, Anna M. Hiszpanski
Subjects: cs.LG, cond-mat.mtrl-sci, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2501.02059
Pdf URL: https://arxiv.org/pdf/2501.02059
Copy Paste: [[2501.02059]] Active Learning Enables Extrapolation in Molecular Generative Models(https://arxiv.org/abs/2501.02059)
Keywords: generative
Abstract: Although generative models hold promise for discovering molecules with optimized desired properties, they often fail to suggest synthesizable molecules that improve upon the known molecules seen in training. We find that a key limitation is not in the molecule generation process itself, but in the poor generalization capabilities of molecular property predictors. We tackle this challenge by creating an active-learning, closed-loop molecule generation pipeline, whereby molecular generative models are iteratively refined on feedback from quantum chemical simulations to improve generalization to new chemical space. Compared against other generative model approaches, only our active learning approach generates molecules with properties that extrapolate beyond the training data (reaching up to 0.44 standard deviations beyond the training data range) and out-of-distribution molecule classification accuracy is improved by 79%. By conditioning molecular generation on thermodynamic stability data from the active-learning loop, the proportion of stable molecules generated is 3.5x higher than the next-best model.

Title: AGGA: A Dataset of Academic Guidelines for Generative AI and Large Language Models

Authors: Junfeng Jiao, Saleh Afroogh, Kevin Chen, David Atkinson, Amit Dhurandhar
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2501.02063
Pdf URL: https://arxiv.org/pdf/2501.02063
Copy Paste: [[2501.02063]] AGGA: A Dataset of Academic Guidelines for Generative AI and Large Language Models(https://arxiv.org/abs/2501.02063)
Keywords: generative
Abstract: This study introduces AGGA, a dataset comprising 80 academic guidelines for the use of Generative AIs (GAIs) and Large Language Models (LLMs) in academic settings, meticulously collected from official university websites. The dataset contains 188,674 words and serves as a valuable resource for natural language processing tasks commonly applied in requirements engineering, such as model synthesis, abstraction identification, and document structure assessment. Additionally, AGGA can be further annotated to function as a benchmark for various tasks, including ambiguity detection, requirements categorization, and the identification of equivalent requirements. Our methodologically rigorous approach ensured a thorough examination, with a selection of universities that represent a diverse range of global institutions, including top-ranked universities across six continents. The dataset captures perspectives from a variety of academic fields, including humanities, technology, and both public and private institutions, offering a broad spectrum of insights into the integration of GAIs and LLMs in academia.

Title: ArtCrafter: Text-Image Aligning Style Transfer via Embedding Reframing

Authors: Nisha Huang, Kaer Huang, Yifan Pu, Jiangshan Wang, Jie Guo, Yiqiang Yan, Xiu Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.02064
Pdf URL: https://arxiv.org/pdf/2501.02064
Copy Paste: [[2501.02064]] ArtCrafter: Text-Image Aligning Style Transfer via Embedding Reframing(https://arxiv.org/abs/2501.02064)
Keywords: diffusion
Abstract: Recent years have witnessed significant advancements in text-guided style transfer, primarily attributed to innovations in diffusion models. These models excel in conditional guidance, utilizing text or images to direct the sampling process. However, despite their capabilities, direct conditional guidance approaches often face challenges in balancing the expressiveness of textual semantics with the diversity of output results while capturing stylistic features. To address these challenges, we introduce ArtCrafter, a novel framework for text-to-image style transfer. Specifically, we introduce an attention-based style extraction module, meticulously engineered to capture the subtle stylistic elements within an image. This module features a multi-layer architecture that leverages the capabilities of perceiver attention mechanisms to integrate fine-grained information. Additionally, we present a novel text-image aligning augmentation component that adeptly balances control over both modalities, enabling the model to efficiently map image and text embeddings into a shared feature space. We achieve this through attention operations that enable smooth information flow between modalities. Lastly, we incorporate an explicit modulation that seamlessly blends multimodal enhanced embeddings with original embeddings through an embedding reframing design, empowering the model to generate diverse outputs. Extensive experiments demonstrate that ArtCrafter yields impressive results in visual stylization, exhibiting exceptional levels of stylistic intensity, controllability, and diversity.

Title: Counterfactual Explanation for Auto-Encoder Based Time-Series Anomaly Detection

Authors: Abhishek Srinivasan, Varun Singapuri Ravi, Juan Carlos Andresen, Anders Holst
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.02069
Pdf URL: https://arxiv.org/pdf/2501.02069
Copy Paste: [[2501.02069]] Counterfactual Explanation for Auto-Encoder Based Time-Series Anomaly Detection(https://arxiv.org/abs/2501.02069)
Keywords: anomaly
Abstract: The complexity of modern electro-mechanical systems requires the development of sophisticated diagnostic methods like anomaly detection capable of detecting deviations. Conventional anomaly detection approaches like signal processing and statistical modelling often struggle to effectively handle the intricacies of complex systems, particularly when dealing with multi-variate signals. In contrast, neural network-based anomaly detection methods, especially Auto-Encoders, have emerged as a compelling alternative, demonstrating remarkable performance. However, Auto-Encoders exhibit inherent opaqueness in their decision-making processes, hindering their practical implementation at scale. Addressing this opacity is essential for enhancing the interpretability and trustworthiness of anomaly detection models. In this work, we address this challenge by employing a feature selector to select features and counterfactual explanations to give a context to the model output. We tested this approach on the SKAB benchmark dataset and an industrial time-series dataset. The gradient based counterfactual explanation approach was evaluated via validity, sparsity and distance measures. Our experimental findings illustrate that our proposed counterfactual approach can offer meaningful and valuable insights into the model decision-making process, by explaining fewer signals compared to conventional approaches. These insights enhance the trustworthiness and interpretability of anomaly detection models.
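Note: The three evaluation measures named in the abstract (validity, sparsity, distance) can be sketched as follows. The abstract does not give their exact definitions, so the ones below are common conventions assumed for illustration: sparsity counts changed features, distance is the Euclidean norm of the change, and a counterfactual is valid when its anomaly score falls below the detection threshold. Inputs are treated as flat feature vectors.

```python
import math


def sparsity(x, x_cf, tol=1e-9):
    """Number of features whose value differs between x and its counterfactual."""
    return sum(abs(a - b) > tol for a, b in zip(x, x_cf))


def distance(x, x_cf):
    """Euclidean distance between the original sample and the counterfactual."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_cf)))


def is_valid(anomaly_score_cf, threshold):
    """Assumed validity rule: the counterfactual is no longer flagged as anomalous."""
    return anomaly_score_cf < threshold
```

Under these assumed definitions, a counterfactual that changes fewer signals has lower sparsity, which matches the abstract's claim of "explaining fewer signals".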

Title: Online Detection of Water Contamination Under Concept Drift

Authors: Jin Li, Kleanthis Malialis, Stelios G. Vrachimis, Marios M. Polycarpou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.02107
Pdf URL: https://arxiv.org/pdf/2501.02107
Copy Paste: [[2501.02107]] Online Detection of Water Contamination Under Concept Drift(https://arxiv.org/abs/2501.02107)
Keywords: anomaly
Abstract: Water Distribution Networks (WDNs) are vital infrastructures, and contamination poses serious public health risks. Harmful substances can interact with disinfectants like chlorine, making chlorine monitoring essential for detecting contaminants. However, chlorine sensors often become unreliable and require frequent calibration. This study introduces the Dual-Threshold Anomaly and Drift Detection (AD&DD) method, an unsupervised approach combining a dual-threshold drift detection mechanism with an LSTM-based Variational Autoencoder (LSTM-VAE) for real-time contamination detection. Tested on two realistic WDNs, AD&DD effectively identifies anomalies with sensor offsets as concept drift, and outperforms other methods. A proposed decentralized architecture enables accurate contamination detection and localization by deploying AD&DD on selected nodes.

Title: Plasma-CycleGAN: Plasma Biomarker-Guided MRI to PET Cross-modality Translation Using Conditional CycleGAN

Authors: Yanxi Chen, Yi Su, Celine Dumitrascu, Kewei Chen, David Weidman, Richard J Caselli, Nicholas Ashton, Eric M Reiman, Yalin Wang
Subjects: cs.CV, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2501.02146
Pdf URL: https://arxiv.org/pdf/2501.02146
Copy Paste: [[2501.02146]] Plasma-CycleGAN: Plasma Biomarker-Guided MRI to PET Cross-modality Translation Using Conditional CycleGAN(https://arxiv.org/abs/2501.02146)
Keywords: generative
Abstract: Cross-modality translation between MRI and PET imaging is challenging due to the distinct mechanisms underlying these modalities. Blood-based biomarkers (BBBMs) are revolutionizing Alzheimer's disease (AD) detection by identifying patients and quantifying brain amyloid levels. However, the potential of BBBMs to enhance PET image synthesis remains unexplored. In this paper, we performed a thorough study on the effect of incorporating BBBMs into deep generative models. By evaluating three widely used cross-modality translation models, we found that BBBM integration consistently enhances the generative quality across all models. By visual inspection of the generated results, we observed that PET images generated by CycleGAN exhibit the best visual fidelity. Based on these findings, we propose Plasma-CycleGAN, a novel generative model based on CycleGAN, to synthesize PET images from MRI using BBBMs as conditions. This is the first approach to integrate BBBMs in conditional cross-modality translation between MRI and PET.

Title: Generating Multimodal Images with GAN: Integrating Text, Image, and Style

Authors: Chaoyi Tan, Wenqing Zhang, Zhen Qi, Kowei Shih, Xinshi Li, Ao Xiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02167
Pdf URL: https://arxiv.org/pdf/2501.02167
Copy Paste: [[2501.02167]] Generating Multimodal Images with GAN: Integrating Text, Image, and Style(https://arxiv.org/abs/2501.02167)
Keywords: generative
Abstract: In the field of computer vision, multimodal image generation has become a research hotspot, especially the task of integrating text, image, and style. In this study, we propose a multimodal image generation method based on Generative Adversarial Networks (GAN), capable of effectively combining text descriptions, reference images, and style information to generate images that meet multimodal requirements. This method involves the design of a text encoder, an image feature extractor, and a style integration module, ensuring that the generated images maintain high quality in terms of visual content and style consistency. We also introduce multiple loss functions, including adversarial loss, text-image consistency loss, and style matching loss, to optimize the generation process. Experimental results show that our method produces images with high clarity and consistency across multiple public datasets, demonstrating significant performance improvements compared to existing methods. The outcomes of this study provide new insights into multimodal image generation and present broad application prospects.

Title: CPTuning: Contrastive Prompt Tuning for Generative Relation Extraction

Authors: Jiaxin Duan, Fengyu Lu, Junfei Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2501.02196
Pdf URL: https://arxiv.org/pdf/2501.02196
Copy Paste: [[2501.02196]] CPTuning: Contrastive Prompt Tuning for Generative Relation Extraction(https://arxiv.org/abs/2501.02196)
Keywords: generative, in-context
Abstract: Generative relation extraction (RE) commonly involves first reformulating RE as a linguistic modeling problem easily tackled with pre-trained language models (PLM) and then fine-tuning a PLM with supervised cross-entropy loss. Although they have achieved promising performance, existing approaches assume only one deterministic relation between each pair of entities without considering real scenarios where multiple relations may be valid, i.e., entity pair overlap, which limits their applicability. To address this problem, we introduce a novel contrastive prompt tuning method for RE, CPTuning, which learns to associate a candidate relation between two in-context entities with a probability mass above or below a threshold, corresponding to whether the relation exists. Beyond learning schema, CPTuning also organizes RE as a verbalized relation generation task and uses Trie-constrained decoding to ensure a model generates valid relations. It adaptively picks out the generated candidate relations with a high estimated likelihood in inference, thereby achieving multi-relation extraction. We conduct extensive experiments on four widely used datasets to validate our method. Results show that T5-large fine-tuned with CPTuning significantly outperforms previous methods in both single-relation and multiple-relation extraction.
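Note: Trie-constrained decoding, as mentioned in the abstract, restricts each generation step to tokens that continue some valid relation. The sketch below shows the general idea with greedy decoding; it is not the CPTuning implementation. The names `build_trie`, `constrained_greedy`, and the `score_fn` callback (standing in for a language model's next-token score) are hypothetical.

```python
END = "<eos>"  # hypothetical end-of-sequence marker


def build_trie(sequences):
    """Build a nested-dict trie from tokenized relation verbalizations."""
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
        node[END] = {}  # mark that a valid relation ends here
    return root


def constrained_greedy(trie, score_fn):
    """Greedy decoding in which only tokens allowed by the trie can be chosen."""
    node, output = trie, []
    while True:
        allowed = list(node.keys())
        # score_fn(prefix, token) stands in for the model's next-token score.
        tok = max(allowed, key=lambda t: score_fn(output, t))
        if tok == END:
            return output
        output.append(tok)
        node = node[tok]
```

Because every step picks from the trie's children, the decoded sequence is always one of the verbalized relations, even when the model scores an out-of-vocabulary continuation higher.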

Title: Self-Supervised Learning for Detecting AI-Generated Faces as Anomalies

Authors: Mian Zou, Baosheng Yu, Yibing Zhan, Kede Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02207
Pdf URL: https://arxiv.org/pdf/2501.02207
Copy Paste: [[2501.02207]] Self-Supervised Learning for Detecting AI-Generated Faces as Anomalies(https://arxiv.org/abs/2501.02207)
Keywords: self-supervised, anomaly
Abstract: The detection of AI-generated faces is commonly approached as a binary classification task. Nevertheless, the resulting detectors frequently struggle to adapt to novel AI face generators, which evolve rapidly. In this paper, we describe an anomaly detection method for AI-generated faces by leveraging self-supervised learning of camera-intrinsic and face-specific features purely from photographic face images. The success of our method lies in designing a pretext task that trains a feature extractor to rank four ordinal exchangeable image file format (EXIF) tags and classify artificially manipulated face images. Subsequently, we model the learned feature distribution of photographic face images using a Gaussian mixture model. Faces with low likelihoods are flagged as AI-generated. Both quantitative and qualitative experiments validate the effectiveness of our method. Our code is available at this https URL.
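Note: The final step described in the abstract, flagging samples with low likelihood under a Gaussian mixture model, can be sketched as below. This is a simplified illustration, not the authors' code: it assumes one-dimensional features and already-fitted mixture parameters, and the names `gmm_log_likelihood`, `flag_anomalies`, and the `threshold` value are hypothetical.

```python
import math


def gmm_log_likelihood(x, weights, means, stds):
    """Log-likelihood of a scalar x under a 1-D Gaussian mixture."""
    density = sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )
    return math.log(density) if density > 0 else float("-inf")


def flag_anomalies(xs, weights, means, stds, threshold):
    """Return True for every sample whose log-likelihood is below the threshold."""
    return [gmm_log_likelihood(x, weights, means, stds) < threshold for x in xs]
```

In practice the mixture would be fitted on features of photographic faces only, and the threshold would be chosen on held-out data.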

Title: Diffusion Model-Based Data Synthesis Aided Federated Semi-Supervised Learning

Authors: Zhongwei Wang, Tong Wu, Zhiyong Chen, Liang Qian, Yin Xu, Meixia Tao
Subjects: cs.LG, cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2501.02219
Pdf URL: https://arxiv.org/pdf/2501.02219
Copy Paste: [[2501.02219]] Diffusion Model-Based Data Synthesis Aided Federated Semi-Supervised Learning(https://arxiv.org/abs/2501.02219)
Keywords: diffusion
Abstract: Federated semi-supervised learning (FSSL) is primarily challenged by two factors: the scarcity of labeled data across clients and the non-independent and identically distributed (non-IID) nature of data among clients. In this paper, we propose a novel approach, diffusion model-based data synthesis aided FSSL (DDSA-FSSL), which utilizes a diffusion model (DM) to generate synthetic data, bridging the gap between heterogeneous local data distributions and the global data distribution. In DDSA-FSSL, clients address the challenge of the scarcity of labeled data by employing a federated learning-trained classifier to perform pseudo labeling for unlabeled data. The DM is then collaboratively trained using both labeled and precision-optimized pseudo-labeled data, enabling clients to generate synthetic samples for classes that are absent in their labeled datasets. This process allows clients to generate more comprehensive synthetic datasets aligned with the global distribution. Extensive experiments conducted on multiple datasets and varying non-IID distributions demonstrate the effectiveness of DDSA-FSSL, e.g., it improves accuracy from 38.46% to 52.14% on CIFAR-10 datasets with 10% labeled data.
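Note: Pseudo labeling, as mentioned in the abstract, assigns classifier predictions as labels for unlabeled samples. The abstract does not say how the pseudo-labels are "precision-optimized", so the sketch below uses a generic confidence cutoff as a stand-in assumption; the function name `pseudo_label` and the `min_confidence` parameter are hypothetical.

```python
def pseudo_label(probabilities, min_confidence=0.9):
    """Keep (sample_index, predicted_class) pairs whose top probability is high enough.

    probabilities: one list of class probabilities per unlabeled sample.
    min_confidence: assumed cutoff; low-confidence samples stay unlabeled.
    """
    selected = []
    for idx, probs in enumerate(probabilities):
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= min_confidence:
            selected.append((idx, label))
    return selected
```

Raising the cutoff trades coverage for precision: fewer samples receive labels, but those labels are more likely to be correct.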

Title: MagicFace: High-Fidelity Facial Expression Editing with Action-Unit Control

Authors: Mengting Wei, Tuomas Varanka, Xingxun Jiang, Huai-Qian Khor, Guoying Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02260
Pdf URL: https://arxiv.org/pdf/2501.02260
Copy Paste: [[2501.02260]] MagicFace: High-Fidelity Facial Expression Editing with Action-Unit Control(https://arxiv.org/abs/2501.02260)
Keywords: diffusion
Abstract: We address the problem of facial expression editing by controlling the relative variation of facial action-unit (AU) from the same person. This enables us to edit this specific person's expression in a fine-grained, continuous and interpretable manner, while preserving their identity, pose, background and detailed facial attributes. Key to our model, which we dub MagicFace, is a diffusion model conditioned on AU variations and an ID encoder to preserve facial details of high consistency. Specifically, to preserve the facial details with the input identity, we leverage the power of pretrained Stable-Diffusion models and design an ID encoder to merge appearance features through self-attention. To keep background and pose consistency, we introduce an efficient Attribute Controller by explicitly informing the model of current background and pose of the target. By injecting AU variations into a denoising UNet, our model can animate arbitrary identities with various AU combinations, yielding superior results in high-fidelity expression editing compared to other facial expression editing works. Code is publicly available at this https URL.

Title: Unsupervised Class Generation to Expand Semantic Segmentation Datasets

Authors: Javier Montalvo, Álvaro García-Martín, Pablo Carballeira, Juan C. SanMiguel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02264
Pdf URL: https://arxiv.org/pdf/2501.02264
Copy Paste: [[2501.02264]] Unsupervised Class Generation to Expand Semantic Segmentation Datasets(https://arxiv.org/abs/2501.02264)
Keywords: diffusion, generative
Abstract: Semantic segmentation is a computer vision task where classification is performed at a pixel level. Due to this, the process of labeling images for semantic segmentation is time-consuming and expensive. To mitigate this cost there has been a surge in the use of synthetically generated data -- usually created using simulators or videogames -- which, in combination with domain adaptation methods, can effectively learn how to segment real data. Still, these datasets have a particular limitation: due to their closed-set nature, it is not possible to include novel classes without modifying the tool used to generate them, which is often not public. Concurrently, generative models have made remarkable progress, particularly with the introduction of diffusion models, enabling the creation of high-quality images from text prompts without additional supervision. In this work, we propose an unsupervised pipeline that leverages Stable Diffusion and Segment Anything Module to generate class examples with an associated segmentation mask, and a method to integrate generated cutouts for novel classes in semantic segmentation datasets, all with minimal user input. Our approach aims to improve the performance of unsupervised domain adaptation methods by introducing novel samples into the training data without modifications to the underlying algorithms. With our methods, we show how models can not only effectively learn how to segment novel classes, with an average performance of 51% IoU, but also reduce errors for other, already existing classes, reaching a higher performance level overall.
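Note: The IoU (intersection over union) figure quoted in the abstract is the standard per-class segmentation metric. A minimal sketch of the per-class computation follows, assuming predictions and ground truth are given as 2-D lists of integer class ids; the function name `iou` is chosen here for illustration and is not taken from the paper.

```python
def iou(pred, target, cls):
    """Intersection over union for one class between two label maps."""
    intersection = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p_hit, t_hit = p == cls, t == cls
            intersection += p_hit and t_hit   # pixel is cls in both maps
            union += p_hit or t_hit           # pixel is cls in at least one map
    return intersection / union if union else float("nan")
```

Averaging this value over classes (and typically accumulating counts over a whole dataset rather than per image) gives the mean IoU usually reported for segmentation benchmarks.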

Title: TDM: Temporally-Consistent Diffusion Model for All-in-One Real-World Video Restoration

Authors: Yizhou Li, Zihua Liu, Yusuke Monno, Masatoshi Okutomi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02269
Pdf URL: https://arxiv.org/pdf/2501.02269
Copy Paste: [[2501.02269]] TDM: Temporally-Consistent Diffusion Model for All-in-One Real-World Video Restoration(https://arxiv.org/abs/2501.02269)
Keywords: diffusion
Abstract: In this paper, we propose the first diffusion-based all-in-one video restoration method that utilizes the power of a pre-trained Stable Diffusion and a fine-tuned ControlNet. Our method can restore various types of video degradation with a single unified model, overcoming the limitation of standard methods that require specific models for each restoration task. Our contributions include an efficient training strategy with Task Prompt Guidance (TPG) for diverse restoration tasks, an inference strategy that combines Denoising Diffusion Implicit Models (DDIM) inversion with a novel Sliding Window Cross-Frame Attention (SW-CFA) mechanism for enhanced content preservation and temporal consistency, and a scalable pipeline that makes our method all-in-one to adapt to different video restoration tasks. Through extensive experiments on five video restoration tasks, we demonstrate the superiority of our method in generalization capability to real-world videos and temporal consistency preservation over existing state-of-the-art methods. Our method advances the video restoration task by providing a unified solution that enhances video quality across multiple applications.

Title: DiffGraph: Heterogeneous Graph Diffusion Model

Authors: Zongwei Li, Lianghao Xia, Hua Hua, Shijie Zhang, Shuangyang Wang, Chao Huang
Subjects: cs.LG, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2501.02313
Pdf URL: https://arxiv.org/pdf/2501.02313
Copy Paste: [[2501.02313]] DiffGraph: Heterogeneous Graph Diffusion Model(https://arxiv.org/abs/2501.02313)
Keywords: diffusion
Abstract: Recent advances in Graph Neural Networks (GNNs) have revolutionized graph-structured data modeling, yet traditional GNNs struggle with complex heterogeneous structures prevalent in real-world scenarios. Despite progress in handling heterogeneous interactions, two fundamental challenges persist: noisy data significantly compromising embedding quality and learning performance, and existing methods' inability to capture intricate semantic transitions among heterogeneous relations, which impacts downstream predictions. To address these fundamental issues, we present the Heterogeneous Graph Diffusion Model (DiffGraph), a pioneering framework that introduces an innovative cross-view denoising strategy. This advanced approach transforms auxiliary heterogeneous data into target semantic spaces, enabling precise distillation of task-relevant information. At its core, DiffGraph features a sophisticated latent heterogeneous graph diffusion mechanism, implementing a novel forward and backward diffusion process for superior noise management. This methodology achieves simultaneous heterogeneous graph denoising and cross-type transition, while significantly simplifying graph generation through its latent-space diffusion capabilities. Through rigorous experimental validation on both public and industrial datasets, we demonstrate that DiffGraph consistently surpasses existing methods in link prediction and node classification tasks, establishing new benchmarks for robustness and efficiency in heterogeneous graph processing. The model implementation is publicly available at: this https URL.

Title: Validity Arguments For Constructed Response Scoring Using Generative Artificial Intelligence Applications

Authors: Jodi M. Casabianca, Daniel F. McCaffrey, Matthew S. Johnson, Naim Alper, Vladimir Zubenko
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2501.02334
Pdf URL: https://arxiv.org/pdf/2501.02334
Copy Paste: [[2501.02334]] Validity Arguments For Constructed Response Scoring Using Generative Artificial Intelligence Applications(https://arxiv.org/abs/2501.02334)
Keywords: generative
Abstract: The rapid advancements in large language models and generative artificial intelligence (AI) capabilities are making their broad application in the high-stakes testing context more likely. Use of generative AI in the scoring of constructed responses is particularly appealing because it reduces the effort required for handcrafting features in traditional AI scoring and might even outperform those methods. The purpose of this paper is to highlight the differences between feature-based and generative AI applications in constructed response scoring systems and to propose a set of best practices for the collection of validity evidence to support the use and interpretation of constructed response scores from scoring systems using generative AI. We compare the validity evidence needed in scoring systems using human ratings, feature-based natural language processing AI scoring engines, and generative AI. The evidence needed in the generative AI context is more extensive than in the feature-based NLP scoring context because of the lack of transparency and other concerns unique to generative AI such as consistency. Constructed response score data from standardized tests demonstrate the collection of validity evidence for different types of scoring systems and highlight the numerous complexities and considerations when making a validity argument for these scores. In addition, we discuss how the evaluation of AI scores might include a consideration of how a contributory scoring approach combining multiple AI scores (from different sources) will cover more of the construct in the absence of human ratings.

Title: CorrFill: Enhancing Faithfulness in Reference-based Inpainting with Correspondence Guidance in Diffusion Models

Authors: Kuan-Hung Liu, Cheng-Kun Yang, Min-Hung Chen, Yu-Lun Liu, Yen-Yu Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02355
Pdf URL: https://arxiv.org/pdf/2501.02355
Copy Paste: [[2501.02355]] CorrFill: Enhancing Faithfulness in Reference-based Inpainting with Correspondence Guidance in Diffusion Models(https://arxiv.org/abs/2501.02355)
Keywords: diffusion
Abstract: In the task of reference-based image inpainting, an additional reference image is provided to restore a damaged target image to its original state. The advancement of diffusion models, particularly Stable Diffusion, allows for simple formulations in this task. However, existing diffusion-based methods often lack explicit constraints on the correlation between the reference and damaged images, resulting in lower faithfulness to the reference images in the inpainting results. In this work, we propose CorrFill, a training-free module designed to enhance the awareness of geometric correlations between the reference and target images. This enhancement is achieved by guiding the inpainting process with correspondence constraints estimated during inpainting, utilizing attention masking in self-attention layers and an objective function to update the input tensor according to the constraints. Experimental results demonstrate that CorrFill significantly enhances the performance of multiple baseline diffusion-based methods, including state-of-the-art approaches, by emphasizing faithfulness to the reference images.

Title: Generalizable Origin Identification for Text-Guided Image-to-Image Diffusion Models

Authors: Wenhao Wang, Yifan Sun, Zongxin Yang, Zhentao Tan, Zhengdong Hu, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02376
Pdf URL: https://arxiv.org/pdf/2501.02376
Copy Paste: [[2501.02376]] Generalizable Origin Identification for Text-Guided Image-to-Image Diffusion Models(https://arxiv.org/abs/2501.02376)
Keywords: diffusion
Abstract: Text-guided image-to-image diffusion models excel in translating images based on textual prompts, allowing for precise and creative visual modifications. However, such a powerful technique can be misused for spreading misinformation, infringing on copyrights, and evading content tracing. This motivates us to introduce the task of origin IDentification for text-guided Image-to-image Diffusion models (ID$^2$), aiming to retrieve the original image of a given translated query. A straightforward solution to ID$^2$ involves training a specialized deep embedding model to extract and compare features from both query and reference images. However, due to visual discrepancy across generations produced by different diffusion models, this similarity-based approach fails when training on images from one model and testing on those from another, limiting its effectiveness in real-world applications. To solve this challenge of the proposed ID$^2$ task, we contribute the first dataset and a theoretically guaranteed method, both emphasizing generalizability. The curated dataset, OriPID, contains abundant Origins and guided Prompts, which can be used to train and test potential IDentification models across various diffusion models. In the method section, we first prove the existence of a linear transformation that minimizes the distance between the pre-trained Variational Autoencoder (VAE) embeddings of generated samples and their origins. Subsequently, it is demonstrated that such a simple linear transformation can be generalized across different diffusion models. Experimental results show that the proposed method achieves satisfying generalization performance, significantly surpassing similarity-based methods ($+31.6\%$ mAP), even those with generalization designs.

Title: MedSegDiffNCA: Diffusion Models With Neural Cellular Automata for Skin Lesion Segmentation

Authors: Avni Mittal, John Kalkhof, Anirban Mukhopadhyay, Arnav Bhavsar
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2501.02447
Pdf URL: https://arxiv.org/pdf/2501.02447
Copy Paste: [[2501.02447]] MedSegDiffNCA: Diffusion Models With Neural Cellular Automata for Skin Lesion Segmentation(https://arxiv.org/abs/2501.02447)
Keywords: diffusion
Abstract: Denoising Diffusion Models (DDMs) are widely used for high-quality image generation and medical image segmentation but often rely on U-Net-based architectures, leading to high computational overhead, especially with high-resolution images. This work proposes three NCA-based improvements for diffusion-based medical image segmentation. First, Multi-MedSegDiffNCA uses a multilevel NCA framework to refine rough noise estimates generated by lower-level NCA models. Second, CBAM-MedSegDiffNCA incorporates channel and spatial attention for improved segmentation. Third, MultiCBAM-MedSegDiffNCA combines these methods with a new RGB channel loss for semantic guidance. Evaluations on lesion segmentation show that MultiCBAM-MedSegDiffNCA matches U-Net-based model performance with a Dice score of 87.84% while using 60-110 times fewer parameters, offering a more efficient solution for low-resource medical settings.
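
The Dice score reported above is the standard overlap measure between a predicted and a reference mask. A minimal sketch follows, assuming binary masks stored as nested lists of 0/1; the function name `dice_score` and the empty-mask convention are illustrative assumptions, not taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient 2*|A and B| / (|A| + |B|) for two binary masks."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    # Pixels that are foreground in both masks.
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical 2x3 masks, for illustration only.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]
score = dice_score(pred_mask, true_mask)
print(f"Dice = {score:.4f}")
```

Here two foreground pixels overlap and each mask has three foreground pixels, so the score is 2*2/(3+3).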

Title: GCP: Guarded Collaborative Perception with Spatial-Temporal Aware Malicious Agent Detection

Authors: Yihang Tao, Senkang Hu, Yue Hu, Haonan An, Hangcheng Cao, Yuguang Fang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02450
Pdf URL: https://arxiv.org/pdf/2501.02450
Copy Paste: [[2501.02450]] GCP: Guarded Collaborative Perception with Spatial-Temporal Aware Malicious Agent Detection(https://arxiv.org/abs/2501.02450)
Keywords: anomaly
Abstract: Collaborative perception significantly enhances autonomous driving safety by extending each vehicle's perception range through message sharing among connected and autonomous vehicles. Unfortunately, it is also vulnerable to adversarial message attacks from malicious agents, resulting in severe performance degradation. While existing defenses employ hypothesis-and-verification frameworks to detect malicious agents based on single-shot outliers, they overlook temporal message correlations, which can be circumvented by subtle yet harmful perturbations in model input and output spaces. This paper reveals a novel blind area confusion (BAC) attack that compromises existing single-shot outlier-based detection methods. As a countermeasure, we propose GCP, a Guarded Collaborative Perception framework based on spatial-temporal aware malicious agent detection, which maintains single-shot spatial consistency through a confidence-scaled spatial concordance loss, while simultaneously examining temporal anomalies by reconstructing historical bird's eye view motion flows in low-confidence regions. We also employ a joint spatial-temporal Benjamini-Hochberg test to synthesize dual-domain anomaly results for reliable malicious agent detection. Extensive experiments demonstrate GCP's superior performance under diverse attack scenarios, achieving up to 34.69% improvements in AP@0.5 compared to the state-of-the-art CP defense strategies under BAC attacks, while maintaining consistent 5-8% improvements under other typical attacks. Code will be released at this https URL.
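
The abstract refers to a joint spatial-temporal Benjamini-Hochberg test. The sketch below shows only the standard Benjamini-Hochberg step-up procedure on a list of p-values; how GCP derives and combines spatial and temporal p-values is not described here, so the input values and the function name `benjamini_hochberg` are assumptions for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Standard BH step-up procedure.

    Returns a list of booleans, True where the null hypothesis is rejected
    (in this setting, hypothetically, where an agent would be flagged).
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k * alpha / m.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical per-agent p-values, for illustration only.
flags = benjamini_hochberg([0.02, 0.9, 0.035, 0.03], alpha=0.05)
print(flags)
```

The step-up rule matters in this example: the two smallest p-values miss their own thresholds, but the third-ranked one passes, so all three are rejected together.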

Title: Enhancing Contrastive Learning for Retinal Imaging via Adjusted Augmentation Scales

Authors: Zijie Cheng, Boxuan Li, André Altmann, Pearse A Keane, Yukun Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.02451
Pdf URL: https://arxiv.org/pdf/2501.02451
Copy Paste: [[2501.02451]] Enhancing Contrastive Learning for Retinal Imaging via Adjusted Augmentation Scales(https://arxiv.org/abs/2501.02451)
Keywords: self-supervised
Abstract: Contrastive learning, a prominent approach within self-supervised learning, has demonstrated significant effectiveness in developing generalizable models for various applications involving natural images. However, recent research indicates that these successes do not necessarily extend to the medical imaging domain. In this paper, we investigate the reasons for this suboptimal performance and hypothesize that the dense distribution of medical images poses challenges to the pretext tasks in contrastive learning, particularly in constructing positive and negative pairs. We explore model performance under different augmentation strategies and compare the results to those achieved with strong augmentations. Our study includes six publicly available datasets covering multiple clinically relevant tasks. We further assess the model's generalizability through external evaluations. The model pre-trained with weak augmentation outperforms those with strong augmentation, improving AUROC from 0.838 to 0.848 and AUPR from 0.523 to 0.597 on MESSIDOR2, and showing similar enhancements across other datasets. Our findings suggest that optimizing the scale of augmentation is critical for enhancing the efficacy of contrastive learning in medical imaging.

Title: Depth Any Camera: Zero-Shot Metric Depth Estimation from Any Camera

Authors: Yuliang Guo, Sparsh Garg, S. Mahdi H. Miangoleh, Xinyu Huang, Liu Ren
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2501.02464
Pdf URL: https://arxiv.org/pdf/2501.02464
Copy Paste: [[2501.02464]] Depth Any Camera: Zero-Shot Metric Depth Estimation from Any Camera(https://arxiv.org/abs/2501.02464)
Keywords: foundation model
Abstract: While recent depth estimation methods exhibit strong zero-shot generalization, achieving accurate metric depth across diverse camera types, particularly those with large fields of view (FoV) such as fisheye and 360-degree cameras, remains a significant challenge. This paper presents Depth Any Camera (DAC), a powerful zero-shot metric depth estimation framework that extends a perspective-trained model to effectively handle cameras with varying FoVs. The framework is designed to ensure that all existing 3D data can be leveraged, regardless of the specific camera types used in new applications. Remarkably, DAC is trained exclusively on perspective images but generalizes seamlessly to fisheye and 360-degree cameras without the need for specialized training data. DAC employs Equi-Rectangular Projection (ERP) as a unified image representation, enabling consistent processing of images with diverse FoVs. Its key components include a pitch-aware Image-to-ERP conversion for efficient online augmentation in ERP space, a FoV alignment operation to support effective training across a wide range of FoVs, and multi-resolution data augmentation to address resolution disparities between training and testing. DAC achieves state-of-the-art zero-shot metric depth estimation, improving delta-1 ($\delta_1$) accuracy by up to 50% on multiple fisheye and 360-degree datasets compared to prior metric depth foundation models, demonstrating robust generalization across camera types.
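
To make the idea of ERP as a shared image representation concrete, the sketch below maps one pixel of an ideal pinhole (perspective) image to equirectangular coordinates. It is a generic textbook mapping under assumed conventions (x right, y down, z forward, no pitch, no lens distortion), not DAC's pitch-aware Image-to-ERP conversion; the function name and parameters are hypothetical.

```python
import math


def perspective_pixel_to_erp(u, v, fx, fy, cx, cy, erp_width, erp_height):
    """Map a pinhole-image pixel (u, v) to (column, row) in an ERP image."""
    # Back-project the pixel to a ray in camera coordinates.
    x = (u - cx) / fx
    y = (v - cy) / fy
    z = 1.0
    # Longitude in (-pi, pi], latitude in [-pi/2, pi/2] (positive = down).
    lon = math.atan2(x, z)
    lat = math.atan2(y, math.hypot(x, z))
    # Assumed layout: the ERP image spans 360 x 180 degrees, centre = forward.
    col = (lon / (2.0 * math.pi) + 0.5) * erp_width
    row = (lat / math.pi + 0.5) * erp_height
    return col, row


# Hypothetical intrinsics, for illustration only.
center = perspective_pixel_to_erp(320, 240, 500.0, 500.0, 320.0, 240.0, 2048, 1024)
right45 = perspective_pixel_to_erp(820, 240, 500.0, 500.0, 320.0, 240.0, 2048, 1024)
print(center, right45)
```

The principal point lands at the centre of the ERP image, and a pixel one focal length to its right (45 degrees off axis) lands one eighth of the ERP width further right.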

Title: DeTrack: In-model Latent Denoising Learning for Visual Object Tracking

Authors: Xinyu Zhou, Jinglun Li, Lingyi Hong, Kaixun Jiang, Pinxue Guo, Weifeng Ge, Wenqiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02467
Pdf URL: https://arxiv.org/pdf/2501.02467
Copy Paste: [[2501.02467]] DeTrack: In-model Latent Denoising Learning for Visual Object Tracking(https://arxiv.org/abs/2501.02467)
Keywords: diffusion
Abstract: Previous visual object tracking methods employ image-feature regression models or coordinate autoregression models for bounding box prediction. Image-feature regression methods heavily depend on matching results and do not utilize positional prior, while the autoregressive approach can only be trained using bounding boxes available in the training set, potentially resulting in suboptimal performance during testing with unseen data. Inspired by the diffusion model, denoising learning enhances the model's robustness to unseen data. Therefore, we introduce noise to bounding boxes, generating noisy boxes for training, thus enhancing model robustness on testing data. We propose a new paradigm to formulate the visual object tracking problem as a denoising learning process. However, tracking algorithms are usually required to run in real time, and directly applying the diffusion model to object tracking would severely impair tracking speed. Therefore, we decompose the denoising learning process into every denoising block within a model, not by running the model multiple times, and thus we summarize the proposed paradigm as an in-model latent denoising learning process. Specifically, we propose a denoising Vision Transformer (ViT), which is composed of multiple denoising blocks. Template and search embeddings are projected into every denoising block as conditions. A denoising block is responsible for removing the noise in a predicted bounding box, and multiple stacked denoising blocks cooperate to accomplish the whole denoising process. Subsequently, we utilize image features and trajectory information to refine the denoised bounding box. Besides, we also utilize trajectory memory and visual memory to improve tracking stability. Experimental results validate the effectiveness of our approach, achieving competitive performance on several challenging datasets.
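
The training idea of perturbing ground-truth boxes can be sketched as follows. The abstract does not specify the noise model, so the Gaussian jitter on centre and size, the `(cx, cy, w, h)` box format, and the name `add_box_noise` are all assumptions for illustration.

```python
import random


def add_box_noise(box, noise_scale=0.1, rng=random):
    """Return a noisy copy of box = (cx, cy, w, h).

    Assumed noise model: the centre is shifted by Gaussian noise proportional
    to the box size, and width/height are rescaled by Gaussian factors.
    """
    cx, cy, w, h = box
    noisy_cx = cx + rng.gauss(0.0, noise_scale) * w
    noisy_cy = cy + rng.gauss(0.0, noise_scale) * h
    # Clamp so the perturbed box keeps a strictly positive size.
    noisy_w = max(w * (1.0 + rng.gauss(0.0, noise_scale)), 1e-6)
    noisy_h = max(h * (1.0 + rng.gauss(0.0, noise_scale)), 1e-6)
    return (noisy_cx, noisy_cy, noisy_w, noisy_h)


# A (noisy box, clean box) pair is what a denoiser would be trained on.
rng = random.Random(0)
clean_box = (0.5, 0.5, 0.2, 0.3)
noisy_box = add_box_noise(clean_box, noise_scale=0.1, rng=rng)
print(noisy_box)
```

Passing a seeded `random.Random` instance keeps the perturbation reproducible; setting `noise_scale=0` returns the clean box unchanged.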

Title: ACE++: Instruction-Based Image Creation and Editing via Context-Aware Content Filling

Authors: Chaojie Mao, Jingfeng Zhang, Yulin Pan, Zeyinzi Jiang, Zhen Han, Yu Liu, Jingren Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02487
Pdf URL: https://arxiv.org/pdf/2501.02487
Copy Paste: [[2501.02487]] ACE++: Instruction-Based Image Creation and Editing via Context-Aware Content Filling(https://arxiv.org/abs/2501.02487)
Keywords: diffusion, generative
Abstract: We report ACE++, an instruction-based diffusion framework that tackles various image generation and editing tasks. Inspired by the input format for the inpainting task proposed by FLUX.1-Fill-dev, we improve the Long-context Condition Unit (LCU) introduced in ACE and extend this input paradigm to any editing and generation tasks. To take full advantage of image generative priors, we develop a two-stage training scheme to minimize the efforts of finetuning powerful text-to-image diffusion models like FLUX.1-dev. In the first stage, we pre-train the model using task data with the 0-ref tasks from the text-to-image model. There are many models in the community based on the post-training of text-to-image foundational models that meet this training paradigm of the first stage. For example, FLUX.1-Fill-dev deals primarily with painting tasks and can be used as an initialization to accelerate the training process. In the second stage, we finetune the above model to support the general instructions using all tasks defined in ACE. To promote the widespread application of ACE++ in different scenarios, we provide a comprehensive set of models that cover both full finetuning and lightweight finetuning, while considering general applicability and applicability in vertical scenarios. The qualitative analysis showcases the superiority of ACE++ in terms of generating image quality and prompt following ability.

Title: Layout2Scene: 3D Semantic Layout Guided Scene Generation via Geometry and Appearance Diffusion Priors

Authors: Minglin Chen, Longguang Wang, Sheng Ao, Ye Zhang, Kai Xu, Yulan Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02519
Pdf URL: https://arxiv.org/pdf/2501.02519
Copy Paste: [[2501.02519]] Layout2Scene: 3D Semantic Layout Guided Scene Generation via Geometry and Appearance Diffusion Priors(https://arxiv.org/abs/2501.02519)
Keywords: diffusion
Abstract: 3D scene generation conditioned on text prompts has significantly progressed due to the development of 2D diffusion generation models. However, the textual description of 3D scenes is inherently inaccurate and lacks fine-grained control during training, leading to implausible scene generation. As an intuitive and feasible solution, the 3D layout allows for precise specification of object locations within the scene. To this end, we present a text-to-scene generation method (namely, Layout2Scene) using additional semantic layout as the prompt to inject precise control of 3D object positions. Specifically, we first introduce a scene hybrid representation to decouple objects and backgrounds, which is initialized via a pre-trained text-to-3D model. Then, we propose a two-stage scheme to optimize the geometry and appearance of the initialized scene separately. To fully leverage 2D diffusion priors in geometry and appearance generation, we introduce a semantic-guided geometry diffusion model and a semantic-geometry guided diffusion model which are finetuned on a scene dataset. Extensive experiments demonstrate that our method can generate more plausible and realistic scenes as compared to state-of-the-art approaches. Furthermore, the generated scene allows for flexible yet precise editing, thereby facilitating multiple downstream applications.

Title: Face-MakeUp: Multimodal Facial Prompts for Text-to-Image Generation

Authors: Dawei Dai, Mingming Jia, Yinxiu Zhou, Hang Xing, Chenghang Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.02523
Pdf URL: https://arxiv.org/pdf/2501.02523
Copy Paste: [[2501.02523]] Face-MakeUp: Multimodal Facial Prompts for Text-to-Image Generation(https://arxiv.org/abs/2501.02523)
Keywords: diffusion
Abstract: Facial images have extensive practical applications. Although the current large-scale text-image diffusion models exhibit strong generation capabilities, it is challenging to generate the desired facial images using only a text prompt. Image prompts are a logical choice. However, current methods of this type generally focus on the general domain. In this paper, we aim to optimize image makeup techniques to generate the desired facial images. Specifically, (1) we built a dataset of 4 million high-quality face image-text pairs (FaceCaptionHQ-4M) based on LAION-Face to train our Face-MakeUp model; (2) to maintain consistency with the reference facial image, we extract/learn multi-scale content features and pose features for the facial image, integrating these into the diffusion model to enhance the preservation of facial identity features for diffusion models. Validation on two face-related test datasets demonstrates that our Face-MakeUp can achieve the best comprehensive performance. All codes are available at: this https URL

Title: Vision-Driven Prompt Optimization for Large Language Models in Multimodal Generative Tasks

Authors: Leo Franklin, Apiradee Boonmee, Kritsada Wongsuwan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02527
Pdf URL: https://arxiv.org/pdf/2501.02527
Copy Paste: [[2501.02527]] Vision-Driven Prompt Optimization for Large Language Models in Multimodal Generative Tasks(https://arxiv.org/abs/2501.02527)
Keywords: generative
Abstract: Vision generation remains a challenging frontier in artificial intelligence, requiring seamless integration of visual understanding and generative capabilities. In this paper, we propose a novel framework, Vision-Driven Prompt Optimization (VDPO), that leverages Large Language Models (LLMs) to dynamically generate textual prompts from visual inputs, guiding high-fidelity image synthesis. VDPO combines a visual embedding prompt tuner, a textual instruction generator, and a vision generation module to achieve state-of-the-art performance in diverse vision generation tasks. Extensive experiments on benchmarks such as COCO and Sketchy demonstrate that VDPO consistently outperforms existing methods, achieving significant improvements in FID, LPIPS, and BLEU/CIDEr scores. Additional analyses reveal the scalability, robustness, and generalization capabilities of VDPO, making it a versatile solution for in-domain and out-of-domain tasks. Human evaluations further validate the practical superiority of VDPO in generating visually appealing and semantically coherent outputs.

Title: Decoding fMRI Data into Captions using Prefix Language Modeling

Authors: Vyacheslav Shen, Kassymzhomart Kunanbayev, Dae-Shik Kim
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2501.02570
Pdf URL: https://arxiv.org/pdf/2501.02570
Copy Paste: [[2501.02570]] Decoding fMRI Data into Captions using Prefix Language Modeling(https://arxiv.org/abs/2501.02570)
Keywords: diffusion
Abstract: With the advancements in Large Language and Latent Diffusion models, brain decoding has achieved remarkable results in recent years. The works on the NSD dataset, with stimuli images from the COCO dataset, leverage the embeddings from the CLIP model for image reconstruction and GIT for captioning. However, the current captioning approach introduces the challenge of potential data contamination given that the GIT model was trained on the COCO dataset. In this work, we present an alternative method for decoding brain signals into image captions by predicting a DINOv2 model's embedding of an image from the corresponding fMRI signal and then providing its [CLS] token as the prefix to the GPT-2 language model, which decreases computational requirements considerably. Additionally, instead of the commonly used Linear Regression, we explore a 3D Convolutional Neural Network mapping of fMRI signals to the image embedding space to better account for the positional information of voxels.

Title: LeetDecoding: A PyTorch Library for Exponentially Decaying Causal Linear Attention with CUDA Implementations

Authors: Jiaping Wang, Simiao Zhang, Qiao-Chu He, Yifan Chen
Subjects: cs.LG, cs.CL, cs.MS
Abstract URL: https://arxiv.org/abs/2501.02573
Pdf URL: https://arxiv.org/pdf/2501.02573
Copy Paste: [[2501.02573]] LeetDecoding: A PyTorch Library for Exponentially Decaying Causal Linear Attention with CUDA Implementations(https://arxiv.org/abs/2501.02573)
Keywords: generative
Abstract: The machine learning and data science community has made significant but scattered progress in accelerating transformer-based large language models (LLMs), and one promising approach is to replace the original causal attention in a generative pre-trained transformer (GPT) with exponentially decaying causal linear attention. In this paper, we present LeetDecoding, which is the first Python package that provides a large set of computation routines for this fundamental operator. The launch of LeetDecoding was motivated by the current lack of (1) clear understanding of the complexity regarding this operator, (2) a comprehensive collection of existing computation methods (usually spread in seemingly unrelated fields), and (3) CUDA implementations for fast inference on GPU. LeetDecoding's design is easy to integrate with existing linear-attention LLMs, and allows researchers to benchmark and evaluate new computation methods for exponentially decaying causal linear attention. The usage of LeetDecoding does not require any knowledge of GPU programming and the underlying complexity analysis, intentionally making LeetDecoding accessible to LLM practitioners. The source code of LeetDecoding is provided at this GitHub repository (this https URL), and users can simply install LeetDecoding by the command pip install leet-decoding.
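
For readers unfamiliar with the operator, the sketch below computes exponentially decaying causal linear attention in two equivalent ways: a direct quadratic-time sum and a linear-time recurrence. It assumes the common unnormalized form o_t = sum over s <= t of gamma^(t-s) (q_t . k_s) v_s; it is a plain-Python illustration and does not use or reproduce the LeetDecoding API, and both function names are hypothetical.

```python
def decayed_attention_quadratic(Q, K, V, gamma):
    """Direct evaluation: O(T^2) pairwise sums with decay gamma^(t-s)."""
    d_v = len(V[0])
    out = []
    for t in range(len(Q)):
        o = [0.0] * d_v
        for s in range(t + 1):  # causal: only positions s <= t contribute
            dot = sum(q * k for q, k in zip(Q[t], K[s]))
            weight = gamma ** (t - s) * dot
            for j in range(d_v):
                o[j] += weight * V[s][j]
        out.append(o)
    return out


def decayed_attention_recurrent(Q, K, V, gamma):
    """Recurrent evaluation: S_t = gamma * S_{t-1} + k_t v_t^T, o_t = q_t^T S_t."""
    d_k, d_v = len(K[0]), len(V[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gamma * state[i][j] + k[i] * v[j]
        out.append([sum(q[i] * state[i][j] for i in range(d_k))
                    for j in range(d_v)])
    return out


# Hypothetical toy inputs: sequence length 3, key and value dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out_quadratic = decayed_attention_quadratic(Q, K, V, gamma=0.5)
out_recurrent = decayed_attention_recurrent(Q, K, V, gamma=0.5)
print(out_recurrent)
```

The recurrence keeps a single d_k x d_v state matrix, which is why this operator admits constant-memory decoding per step.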

Title: DepthMaster: Taming Diffusion Models for Monocular Depth Estimation

Authors: Ziyang Song, Zerong Wang, Bo Li, Hao Zhang, Ruijie Zhu, Li Liu, Peng-Tao Jiang, Tianzhu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02576
Pdf URL: https://arxiv.org/pdf/2501.02576
Copy Paste: [[2501.02576]] DepthMaster: Taming Diffusion Models for Monocular Depth Estimation(https://arxiv.org/abs/2501.02576)
Keywords: diffusion, generative
Abstract: Monocular depth estimation within the diffusion-denoising paradigm demonstrates impressive generalization ability but suffers from low inference speed. Recent methods adopt a single-step deterministic paradigm to improve inference efficiency while maintaining comparable performance. However, they overlook the gap between generative and discriminative features, leading to suboptimal results. In this work, we propose DepthMaster, a single-step diffusion model designed to adapt generative features for the discriminative depth estimation task. First, to mitigate overfitting to texture details introduced by generative features, we propose a Feature Alignment module, which incorporates high-quality semantic features to enhance the denoising network's representation capability. Second, to address the lack of fine-grained details in the single-step deterministic framework, we propose a Fourier Enhancement module to adaptively balance low-frequency structure and high-frequency details. We adopt a two-stage training strategy to fully leverage the potential of the two modules. In the first stage, we focus on learning the global scene structure with the Feature Alignment module, while in the second stage, we exploit the Fourier Enhancement module to improve the visual quality. Through these efforts, our model achieves state-of-the-art performance in terms of generalization and detail preservation, outperforming other diffusion-based methods across various datasets. Our project page can be found at this https URL.

Title: Representation Learning of Lab Values via Masked AutoEncoder

Authors: David Restrepo, Chenwei Wu, Yueran Jia, Jaden K. Sun, Jack Gallifant, Catherine G. Bielick, Yugang Jia, Leo A. Celi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.02648
Pdf URL: https://arxiv.org/pdf/2501.02648
Copy Paste: [[2501.02648]] Representation Learning of Lab Values via Masked AutoEncoder(https://arxiv.org/abs/2501.02648)
Keywords: self-supervised, foundation model
Abstract: Accurate imputation of missing laboratory values in electronic health records (EHRs) is critical to enable robust clinical predictions and reduce biases in AI systems in healthcare. Existing methods, such as variational autoencoders (VAEs) and decision tree-based approaches such as XGBoost, struggle to model the complex temporal and contextual dependencies in EHR data, particularly in underrepresented groups. In this work, we propose Lab-MAE, a novel transformer-based masked autoencoder framework that leverages self-supervised learning for the imputation of continuous sequential lab values. Lab-MAE introduces a structured encoding scheme that jointly models laboratory test values and their corresponding timestamps, enabling explicit capture of temporal dependencies. Empirical evaluation on the MIMIC-IV dataset demonstrates that Lab-MAE significantly outperforms state-of-the-art baselines such as XGBoost across multiple metrics, including root mean square error (RMSE), R-squared (R2), and Wasserstein distance (WD). Notably, Lab-MAE achieves equitable performance across demographic groups of patients, advancing fairness in clinical predictions. We further investigate the role of follow-up laboratory values as potential shortcut features, revealing Lab-MAE's robustness in scenarios where such data is unavailable. The findings suggest that our transformer-based architecture, adapted to the characteristics of the EHR data, offers a foundation model for more accurate and fair clinical imputation models. In addition, we measure and compare the carbon footprint of Lab-MAE with the baseline XGBoost model, highlighting its environmental requirements.

Title: A New Interpretation of the Certainty-Equivalence Approach for PAC Reinforcement Learning with a Generative Model

Authors: Shivaram Kalyanakrishnan, Sheel Shah, Santhosh Kumar Guguloth
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2501.02652
Pdf URL: https://arxiv.org/pdf/2501.02652
Copy Paste: [[2501.02652]] A New Interpretation of the Certainty-Equivalence Approach for PAC Reinforcement Learning with a Generative Model(https://arxiv.org/abs/2501.02652)
Keywords: generative
Abstract: Reinforcement learning (RL) enables an agent interacting with an unknown MDP $M$ to optimise its behaviour by observing transitions sampled from $M$. A natural entity that emerges in the agent's reasoning is $\widehat{M}$, the maximum likelihood estimate of $M$ based on the observed transitions. The well-known certainty-equivalence method (CEM) dictates that the agent update its behaviour to $\widehat{\pi}$, which is an optimal policy for $\widehat{M}$. Not only is CEM intuitive, it has been shown to enjoy minimax-optimal sample complexity in some regions of the parameter space for PAC RL with a generative model (Agarwal et al., 2020). A seemingly unrelated algorithm is the "trajectory tree method" (TTM) (Kearns et al., 1999), originally developed for efficient decision-time planning in large POMDPs. This paper presents a theoretical investigation that stems from the surprising finding that CEM may indeed be viewed as an application of TTM. The qualitative benefits of this view are (1) new and simple proofs of sample complexity upper bounds for CEM, in fact under a (2) weaker assumption on the rewards than is prevalent in the current literature. Our analysis applies to both non-stationary and stationary MDPs. Quantitatively, we obtain (3) improvements in the sample-complexity upper bounds for CEM both for non-stationary and stationary MDPs, in the regime that the "mistake probability" $\delta$ is small. Additionally, we show (4) a lower bound on the sample complexity for finite-horizon MDPs, which establishes the minimax-optimality of our upper bound for non-stationary MDPs in the small-$\delta$ regime.
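
The certainty-equivalence method described above can be sketched in a few lines: build the empirical model from sampled transitions, then plan in it as if it were the true MDP. The sketch assumes a small tabular, stationary, discounted MDP, empirical-mean rewards, and value iteration as the planner; these choices and all function names are illustrative assumptions, not the paper's analysis setting.

```python
def estimate_mdp(samples, n_states, n_actions):
    """Empirical model from (s, a, r, s_next) samples: frequencies and mean rewards."""
    counts = [[[0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    reward_sum = [[0.0] * n_actions for _ in range(n_states)]
    for s, a, r, s_next in samples:
        counts[s][a][s_next] += 1
        reward_sum[s][a] += r
    P_hat = [[None] * n_actions for _ in range(n_states)]
    R_hat = [[0.0] * n_actions for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            n = sum(counts[s][a])
            if n == 0:
                # Assumption: a generative model lets us sample every pair.
                raise ValueError(f"no samples for state {s}, action {a}")
            P_hat[s][a] = [c / n for c in counts[s][a]]
            R_hat[s][a] = reward_sum[s][a] / n
    return P_hat, R_hat


def q_value(P, R, V, s, a, gamma):
    """One-step lookahead value of taking action a in state s."""
    return R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))


def certainty_equivalence_policy(samples, n_states, n_actions, gamma=0.9, iters=500):
    """Plan in the estimated MDP with value iteration; return (policy, values)."""
    P_hat, R_hat = estimate_mdp(samples, n_states, n_actions)
    V = [0.0] * n_states
    for _ in range(iters):
        V = [max(q_value(P_hat, R_hat, V, s, a, gamma) for a in range(n_actions))
             for s in range(n_states)]
    policy = [max(range(n_actions),
                  key=lambda a: q_value(P_hat, R_hat, V, s, a, gamma))
              for s in range(n_states)]
    return policy, V


# Hypothetical 2-state, 2-action MDP: staying in state 1 (action 0) pays reward 1.
samples = [(0, 0, 0.0, 0), (0, 1, 0.0, 1), (1, 0, 1.0, 1), (1, 1, 0.0, 0)]
policy, values = certainty_equivalence_policy(samples, n_states=2, n_actions=2)
print(policy, values)
```

In this toy case the estimated model is exact, so the returned policy moves from state 0 to state 1 and then stays there.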

Title: GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking

Authors: Weikang Bian, Zhaoyang Huang, Xiaoyu Shi, Yijin Li, Fu-Yun Wang, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02690
Pdf URL: https://arxiv.org/pdf/2501.02690
Copy Paste: [[2501.02690]] GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking(https://arxiv.org/abs/2501.02690)
Keywords: diffusion
Abstract: 4D video control is essential in video generation as it enables the use of sophisticated lens techniques, such as multi-camera shooting and dolly zoom, which are currently unsupported by existing methods. Training a video Diffusion Transformer (DiT) directly to control 4D content requires expensive multi-view videos. Inspired by Monocular Dynamic novel View Synthesis (MDVS) that optimizes a 4D representation and renders videos according to different 4D elements, such as camera pose and object motion editing, we bring pseudo 4D Gaussian fields to video generation. Specifically, we propose a novel framework that constructs a pseudo 4D Gaussian field with dense 3D point tracking and renders the Gaussian field for all video frames. Then we finetune a pretrained DiT to generate videos following the guidance of the rendered video, dubbed as GS-DiT. To boost the training of the GS-DiT, we also propose an efficient Dense 3D Point Tracking (D3D-PT) method for the pseudo 4D Gaussian field construction. Our D3D-PT outperforms SpatialTracker, the state-of-the-art sparse 3D point tracking method, in accuracy and accelerates the inference speed by two orders of magnitude. During the inference stage, GS-DiT can generate videos with the same dynamic content while adhering to different camera parameters, addressing a significant limitation of current video generation models. GS-DiT demonstrates strong generalization capabilities and extends the 4D controllability of Gaussian splatting to video generation beyond just camera poses. It supports advanced cinematic effects through the manipulation of the Gaussian field and camera intrinsics, making it a powerful tool for creative video production. Demos are available at this https URL.

Title: Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment

Authors: Jiaze Li, Haoran Xu, Shiding Zhu, Junwei He, Haozhao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02706
Pdf URL: https://arxiv.org/pdf/2501.02706
Copy Paste: [[2501.02706]] Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment(https://arxiv.org/abs/2501.02706)
Keywords: diffusion
Abstract: The rapid development of diffusion models has greatly advanced AI-generated videos in terms of length and consistency recently, yet assessing AI-generated videos still remains challenging. Previous approaches have often focused on User-Generated Content (UGC), but few have targeted AI-Generated Video Quality Assessment methods. In this work, we introduce MSA-VQA, a Multilevel Semantic-Aware Model for AI-Generated Video Quality Assessment, which leverages CLIP-based semantic supervision and cross-attention mechanisms. Our hierarchical framework analyzes video content at three levels: frame, segment, and video. We propose a Prompt Semantic Supervision Module using the text encoder of CLIP to ensure semantic consistency between videos and conditional prompts. Additionally, we propose the Semantic Mutation-aware Module to capture subtle variations between frames. Extensive experiments demonstrate that our method achieves state-of-the-art results.

Title: Brick-Diffusion: Generating Long Videos with Brick-to-Wall Denoising

Authors: Yunlong Yuan, Yuanfan Guo, Chunwei Wang, Hang Xu, Li Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02741
Pdf URL: https://arxiv.org/pdf/2501.02741
Copy Paste: [[2501.02741]] Brick-Diffusion: Generating Long Videos with Brick-to-Wall Denoising(https://arxiv.org/abs/2501.02741)
Keywords: diffusion
Abstract: Recent advances in diffusion models have greatly improved text-driven video generation. However, training models for long video generation demands significant computational power and extensive data, leading most video diffusion models to be limited to a small number of frames. Existing training-free methods that attempt to generate long videos using pre-trained short video diffusion models often struggle with issues such as insufficient motion dynamics and degraded video fidelity. In this paper, we present Brick-Diffusion, a novel, training-free approach capable of generating long videos of arbitrary length. Our method introduces a brick-to-wall denoising strategy, where the latent is denoised in segments, with a stride applied in subsequent iterations. This process mimics the construction of a staggered brick wall, where each brick represents a denoised segment, enabling communication between frames and improving overall video quality. Through quantitative and qualitative evaluations, we demonstrate that Brick-Diffusion outperforms existing baseline methods in generating high-fidelity videos.
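
Note: the staggered "brick wall" layout mentioned in this abstract can be pictured with a small indexing sketch. The abstract does not give the actual segment length, stride rule or denoiser, so everything below is an assumption: segments have a fixed length, and the segment boundaries are shifted by a hypothetical `stride` at each denoising iteration so that neighbouring segments overlap different frames over time. Only the frame index bookkeeping is shown, not the diffusion model.

```python
def brick_segments(num_frames, seg_len, offset):
    """Split frame indices [0, num_frames) into consecutive segments.
    A non-zero offset shortens the first segment, which shifts every
    later boundary, like a half brick at the start of a wall row."""
    bounds = []
    start = 0
    end = min(offset if offset > 0 else seg_len, num_frames)
    while start < num_frames:
        bounds.append((start, end))
        start, end = end, min(end + seg_len, num_frames)
    return bounds


def brick_schedule(num_frames, seg_len, stride, num_steps):
    """Segment boundaries for each denoising iteration (hypothetical rule:
    the offset advances by `stride` per iteration, modulo the segment length)."""
    return [brick_segments(num_frames, seg_len, (step * stride) % seg_len)
            for step in range(num_steps)]


schedule = brick_schedule(num_frames=16, seg_len=8, stride=4, num_steps=3)
```

With these toy values the rows alternate between `[(0, 8), (8, 16)]` and `[(0, 4), (4, 12), (12, 16)]`, so frames 7 and 8 are separated in one iteration and share a segment in the next.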

Title: GraphDART: Graph Distillation for Efficient Advanced Persistent Threat Detection

Authors: Saba Fathi Rabooki, Bowen Li, Falih Gozi Febrinanto, Ciyuan Peng, Elham Naghizade, Fengling Han, Feng Xia
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2501.02796
Pdf URL: https://arxiv.org/pdf/2501.02796
Copy Paste: [[2501.02796]] GraphDART: Graph Distillation for Efficient Advanced Persistent Threat Detection(https://arxiv.org/abs/2501.02796)
Keywords: anomaly
Abstract: Cyber-physical-social systems (CPSSs) have emerged in many applications over recent decades, requiring increased attention to security concerns. The rise of sophisticated threats like Advanced Persistent Threats (APTs) makes ensuring security in CPSSs particularly challenging. Provenance graph analysis has proven effective for tracing and detecting anomalies within systems, but the sheer size and complexity of these graphs hinder the efficiency of existing methods, especially those relying on graph neural networks (GNNs). To address these challenges, we present GraphDART, a modular framework designed to distill provenance graphs into compact yet informative representations, enabling scalable and effective anomaly detection. GraphDART can take advantage of diverse graph distillation techniques, including classic and modern graph distillation methods, to condense large provenance graphs while preserving essential structural and contextual information. This approach significantly reduces computational overhead, allowing GNNs to learn from distilled graphs efficiently and enhance detection performance. Extensive evaluations on benchmark datasets demonstrate the robustness of GraphDART in detecting malicious activities across cyber-physical-social systems. By optimizing computational efficiency, GraphDART provides a scalable and practical solution to safeguard interconnected environments against APTs.

Title: First-place Solution for Streetscape Shop Sign Recognition Competition

Authors: Bin Wang, Li Jing
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02811
Pdf URL: https://arxiv.org/pdf/2501.02811
Copy Paste: [[2501.02811]] First-place Solution for Streetscape Shop Sign Recognition Competition(https://arxiv.org/abs/2501.02811)
Keywords: self-supervised
Abstract: Text recognition technology applied to street-view storefront signs is increasingly utilized across various practical domains, including map navigation, smart city planning analysis, and business value assessments in commercial districts. This technology holds significant research and commercial potential. Nevertheless, it faces numerous challenges. Street view images often contain signboards with complex designs and diverse text styles, complicating the text recognition process. A notable advancement in this field was introduced by our team in a recent competition. We developed a novel multistage approach that integrates multimodal feature fusion, extensive self-supervised training, and a Transformer-based large model. Furthermore, innovative techniques such as BoxDQN, which relies on reinforcement learning, and text rectification methods were employed, leading to impressive outcomes. Comprehensive experiments have validated the effectiveness of these methods, showcasing our potential to enhance text recognition capabilities in complex urban environments.

Title: InpDiffusion: Image Inpainting Localization via Conditional Diffusion Models

Authors: Kai Wang, Shaozhang Niu, Qixian Hao, Jiwei Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.02816
Pdf URL: https://arxiv.org/pdf/2501.02816
Copy Paste: [[2501.02816]] InpDiffusion: Image Inpainting Localization via Conditional Diffusion Models(https://arxiv.org/abs/2501.02816)
Keywords: diffusion
Abstract: As artificial intelligence advances rapidly, particularly with the advent of GANs and diffusion models, the accuracy of Image Inpainting Localization (IIL) has become increasingly challenging. Current IIL methods face two main challenges: a tendency towards overconfidence, leading to incorrect predictions; and difficulty in detecting subtle tampering boundaries in inpainted images. In response, we propose a new paradigm that treats IIL as a conditional mask generation task utilizing diffusion models. Our method, InpDiffusion, utilizes the denoising process enhanced by the integration of image semantic conditions to progressively refine predictions. During denoising, we employ edge conditions and introduce a novel edge supervision strategy to enhance the model's perception of edge details in inpainted objects. Balancing the diffusion model's stochastic sampling with edge supervision of tampered image regions mitigates the risk of incorrect predictions from overconfidence and prevents the loss of subtle boundaries that can result from overly stochastic processes. Furthermore, we propose an innovative Dual-stream Multi-scale Feature Extractor (DMFE) for extracting multi-scale features, enhancing feature representation by considering both semantic and edge conditions of the inpainted images. Extensive experiments across challenging datasets demonstrate that the InpDiffusion significantly outperforms existing state-of-the-art methods in IIL tasks, while also showcasing excellent generalization capabilities and robustness.

Title: Large Language Models for Video Surveillance Applications

Authors: Ulindu De Silva, Leon Fernando, Billy Lau Pik Lik, Zann Koh, Sam Conrad Joyce, Belinda Yuen, Chau Yuen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02850
Pdf URL: https://arxiv.org/pdf/2501.02850
Copy Paste: [[2501.02850]] Large Language Models for Video Surveillance Applications(https://arxiv.org/abs/2501.02850)
Keywords: generative
Abstract: The rapid increase in video content production has resulted in enormous data volumes, creating significant challenges for efficient analysis and resource management. To address this, robust video analysis tools are essential. This paper presents an innovative proof of concept using Generative Artificial Intelligence (GenAI) in the form of Vision Language Models to enhance the downstream video analysis process. Our tool generates customized textual summaries based on user-defined queries, providing focused insights within extensive video datasets. Unlike traditional methods that offer generic summaries or limited action recognition, our approach utilizes Vision Language Models to extract relevant information, improving analysis precision and efficiency. The proposed method produces textual summaries from extensive CCTV footage, which can then be stored for an indefinite time in a very small storage space compared to videos, allowing users to quickly navigate and verify significant events without exhaustive manual review. Qualitative evaluations result in 80% and 70% accuracy in temporal and spatial quality and consistency of the pipeline respectively.

Title: Seeing the Whole in the Parts in Self-Supervised Representation Learning

Authors: Arthur Aubret, Céline Teulière, Jochen Triesch
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2501.02860
Pdf URL: https://arxiv.org/pdf/2501.02860
Copy Paste: [[2501.02860]] Seeing the Whole in the Parts in Self-Supervised Representation Learning(https://arxiv.org/abs/2501.02860)
Keywords: self-supervised
Abstract: Recent successes in self-supervised learning (SSL) model spatial co-occurrences of visual features either by masking portions of an image or by aggressively cropping it. Here, we propose a new way to model spatial co-occurrences by aligning local representations (before pooling) with a global image representation. We present CO-SSL, a family of instance discrimination methods and show that it outperforms previous methods on several datasets, including ImageNet-1K where it achieves 71.5% of Top-1 accuracy with 100 pre-training epochs. CO-SSL is also more robust to noise corruption, internal corruption, small adversarial attacks, and large training crop sizes. Our analysis further indicates that CO-SSL learns highly redundant local representations, which offers an explanation for its robustness. Overall, our work suggests that aligning local and global representations may be a powerful principle of unsupervised category learning.

Title: Conditional Mutual Information Based Diffusion Posterior Sampling for Solving Inverse Problems

Authors: Shayan Mohajer Hamidi, En-Hui Yang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2501.02880
Pdf URL: https://arxiv.org/pdf/2501.02880
Copy Paste: [[2501.02880]] Conditional Mutual Information Based Diffusion Posterior Sampling for Solving Inverse Problems(https://arxiv.org/abs/2501.02880)
Keywords: diffusion
Abstract: Inverse problems are prevalent across various disciplines in science and engineering. In the field of computer vision, tasks such as inpainting, deblurring, and super-resolution are commonly formulated as inverse problems. Recently, diffusion models (DMs) have emerged as a promising approach for addressing noisy linear inverse problems, offering effective solutions without requiring additional task-specific training. Specifically, with the prior provided by DMs, one can sample from the posterior by finding the likelihood. Since the likelihood is intractable, it is often approximated in the literature. However, this approximation compromises the quality of the generated images. To overcome this limitation and improve the effectiveness of DMs in solving inverse problems, we propose an information-theoretic approach. Specifically, we maximize the conditional mutual information $\mathrm{I}(\boldsymbol{x}_0; \boldsymbol{y} | \boldsymbol{x}_t)$, where $\boldsymbol{x}_0$ represents the reconstructed signal, $\boldsymbol{y}$ is the measurement, and $\boldsymbol{x}_t$ is the intermediate signal at stage $t$. This ensures that the intermediate signals $\boldsymbol{x}_t$ are generated in a way that the final reconstructed signal $\boldsymbol{x}_0$ retains as much information as possible about the measurement $\boldsymbol{y}$. We demonstrate that this method can be seamlessly integrated with recent approaches and, once incorporated, enhances their performance both qualitatively and quantitatively.
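
Note: for reference, the quantity maximized in this abstract is the standard conditional mutual information. The identities below are the textbook definition, written with the abstract's symbols; they are not a derivation from the paper, and $H(\cdot \mid \cdot)$ denotes conditional (differential) entropy.

```latex
\begin{align}
\mathrm{I}(\boldsymbol{x}_0; \boldsymbol{y} \mid \boldsymbol{x}_t)
  &= H(\boldsymbol{x}_0 \mid \boldsymbol{x}_t)
   - H(\boldsymbol{x}_0 \mid \boldsymbol{y}, \boldsymbol{x}_t) \\
  &= \mathbb{E}_{p(\boldsymbol{x}_0, \boldsymbol{y}, \boldsymbol{x}_t)}
     \left[ \log \frac{p(\boldsymbol{x}_0, \boldsymbol{y} \mid \boldsymbol{x}_t)}
                      {p(\boldsymbol{x}_0 \mid \boldsymbol{x}_t)\, p(\boldsymbol{y} \mid \boldsymbol{x}_t)} \right]
\end{align}
```

The first form makes the intent visible: the quantity is large when knowing $\boldsymbol{y}$ in addition to $\boldsymbol{x}_t$ removes much of the remaining uncertainty about $\boldsymbol{x}_0$.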

Title: FoundPAD: Foundation Models Reloaded for Face Presentation Attack Detection

Authors: Guray Ozgur, Eduarda Caldeira, Tahar Chettaoui, Fadi Boutros, Raghavendra Ramachandra, Naser Damer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02892
Pdf URL: https://arxiv.org/pdf/2501.02892
Copy Paste: [[2501.02892]] FoundPAD: Foundation Models Reloaded for Face Presentation Attack Detection(https://arxiv.org/abs/2501.02892)
Keywords: foundation model
Abstract: Although face recognition systems have seen a massive performance enhancement in recent years, they are still targeted by threats such as presentation attacks, leading to the need for generalizable presentation attack detection (PAD) algorithms. Current PAD solutions suffer from two main problems: low generalization to unknown scenarios and large training data requirements. Foundation models (FM) are pre-trained on extensive datasets, achieving remarkable results when generalizing to unseen domains and allowing for efficient task-specific adaption even when little training data are available. In this work, we recognize the potential of FMs to address common PAD problems and tackle the PAD task with an adapted FM for the first time. The FM under consideration is adapted with LoRA weights while simultaneously training a classification header. The resultant architecture, FoundPAD, is highly generalizable to unseen domains, achieving competitive results in several settings under different data availability scenarios and even when using synthetic training data. To encourage reproducibility and facilitate further research in PAD, we publicly release the implementation of FoundPAD at this https URL.
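
Note: the LoRA adaptation mentioned in this abstract keeps a pretrained weight matrix frozen and learns a low-rank update. The sketch below shows only the generic LoRA formula W' = W + (alpha / r) * B A on plain Python lists; the matrix shapes, the scaling convention and the function names are illustrative assumptions, not FoundPAD's released implementation.

```python
def matmul(A, B):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def lora_effective_weight(W, A, B, alpha):
    """Effective weight W + (alpha / r) * B @ A.
    W is d_out x d_in (frozen), A is r x d_in and B is d_out x r (trainable)."""
    r = len(A)  # rank of the update
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (toy values)
A = [[1.0, 2.0]]               # rank-1 down-projection
B = [[1.0], [0.0]]             # rank-1 up-projection
W_adapted = lora_effective_weight(W, A, B, alpha=2.0)
```

Only `A` and `B` would receive gradients during adaptation, which is why the number of trainable parameters stays small relative to `W`.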

Title: Skillful High-Resolution Ensemble Precipitation Forecasting with an Integrated Deep Learning Framework

Authors: Shuangshuang He, Hongli Liang, Yuanting Zhang, Xingyuan Yuan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2501.02905
Pdf URL: https://arxiv.org/pdf/2501.02905
Copy Paste: [[2501.02905]] Skillful High-Resolution Ensemble Precipitation Forecasting with an Integrated Deep Learning Framework(https://arxiv.org/abs/2501.02905)
Keywords: diffusion
Abstract: High-resolution precipitation forecasts are crucial for providing accurate weather prediction and supporting effective responses to extreme weather events. Traditional numerical models struggle with stochastic subgrid-scale processes, while recent deep learning models often produce blurry results. To address these challenges, we propose a physics-inspired deep learning framework for high-resolution (0.05° × 0.05°) ensemble precipitation forecasting. Trained on ERA5 and CMPA high-resolution precipitation datasets, the framework integrates deterministic and probabilistic components. The deterministic model, based on a 3D SwinTransformer, captures average precipitation at mesoscale resolution and incorporates strategies to enhance performance, particularly for moderate to heavy rainfall. The probabilistic model employs conditional diffusion in latent space to account for uncertainties in residual precipitation at convective scales. During inference, ensemble members are generated by repeatedly sampling latent variables, enabling the model to represent precipitation uncertainty. Our model significantly enhances spatial resolution and forecast accuracy. Rank histogram shows that the ensemble system is reliable and unbiased. In a case study of heavy precipitation in southern China, the model outputs align more closely with observed precipitation distributions than ERA5, demonstrating superior capability in capturing extreme precipitation events. Additionally, 5-day real-time forecasts show good performance in terms of CSI scores.
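
Note: the two verification tools named in this abstract, the critical success index (CSI) and the rank histogram, have standard definitions that fit in a few lines. The sketch below uses those textbook definitions on flat lists of values; the threshold, the toy data and the handling of ties (ignored here) are assumptions and do not reflect the paper's evaluation protocol.

```python
def csi(forecast, observed, threshold):
    """Critical success index: hits / (hits + misses + false alarms),
    where an event means the value reaches the threshold."""
    hits = misses = false_alarms = 0
    for f, o in zip(forecast, observed):
        f_event, o_event = f >= threshold, o >= threshold
        if f_event and o_event:
            hits += 1
        elif o_event:
            misses += 1
        elif f_event:
            false_alarms += 1
    denom = hits + misses + false_alarms
    return hits / denom if denom else float("nan")


def rank_histogram(ensembles, observations):
    """Count how often the observation falls at each rank among the
    ensemble members. A flat histogram suggests a reliable ensemble."""
    n_members = len(ensembles[0])
    counts = [0] * (n_members + 1)
    for members, obs in zip(ensembles, observations):
        rank = sum(1 for m in members if m < obs)  # ties are ignored in this sketch
        counts[rank] += 1
    return counts


score = csi([0, 5, 12, 20], [0, 11, 15, 3], threshold=10)
hist = rank_histogram([[1, 2, 3], [1, 2, 3], [1, 2, 3]], [0.5, 2.5, 4])
```

Precipitation fields contain many exact zeros, so a real rank histogram usually needs random tie-breaking, which this sketch omits.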

Title: Pointmap-Conditioned Diffusion for Consistent Novel View Synthesis

Authors: Thang-Anh-Quan Nguyen, Nathan Piasco, Luis Roldão, Moussab Bennehar, Dzmitry Tsishkou, Laurent Caraffa, Jean-Philippe Tarel, Roland Brémond
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02913
Pdf URL: https://arxiv.org/pdf/2501.02913
Copy Paste: [[2501.02913]] Pointmap-Conditioned Diffusion for Consistent Novel View Synthesis(https://arxiv.org/abs/2501.02913)
Keywords: diffusion, generative
Abstract: In this paper, we present PointmapDiffusion, a novel framework for single-image novel view synthesis (NVS) that utilizes pre-trained 2D diffusion models. Our method is the first to leverage pointmaps (i.e. rasterized 3D scene coordinates) as a conditioning signal, capturing geometric prior from the reference images to guide the diffusion process. By embedding reference attention blocks and a ControlNet for pointmap features, our model balances between generative capability and geometric consistency, enabling accurate view synthesis across varying viewpoints. Extensive experiments on diverse real-world datasets demonstrate that PointmapDiffusion achieves high-quality, multi-view consistent results with significantly fewer trainable parameters compared to other baselines for single-image NVS tasks.

Title: Unsupervised Tomato Split Anomaly Detection using Hyperspectral Imaging and Variational Autoencoders

Authors: Mahmoud Abdulsalam, Usman Zahidi, Bradley Hurst, Simon Pearson, Grzegorz Cielniak, James Brown
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.02921
Pdf URL: https://arxiv.org/pdf/2501.02921
Copy Paste: [[2501.02921]] Unsupervised Tomato Split Anomaly Detection using Hyperspectral Imaging and Variational Autoencoders(https://arxiv.org/abs/2501.02921)
Keywords: anomaly
Abstract: Tomato anomalies/damages pose a significant challenge in greenhouse farming. While this method of cultivation benefits from efficient resource utilization, anomalies can significantly degrade the quality of farm produce. A common anomaly associated with tomatoes is splitting, characterized by the development of cracks on the tomato skin, which degrades its quality. Detecting this type of anomaly is challenging due to dynamic variations in appearance and sizes, compounded by dataset scarcity. We address this problem in an unsupervised manner by utilizing a tailored variational autoencoder (VAE) with hyperspectral input. Preliminary analysis of the dataset enabled us to select the optimal range of wavelengths for detecting this anomaly. Our findings indicate that the 530nm - 550nm range is suitable for identifying tomato dry splits. The analysis of the reconstruction loss allows us not only to detect the anomalies but also, to some degree, to estimate the anomalous regions.

Title: The Tabular Foundation Model TabPFN Outperforms Specialized Time Series Forecasting Models Based on Simple Features

Authors: Shi Bin Hoo, Samuel Müller, David Salinas, Frank Hutter
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.02945
Pdf URL: https://arxiv.org/pdf/2501.02945
Copy Paste: [[2501.02945]] The Tabular Foundation Model TabPFN Outperforms Specialized Time Series Forecasting Models Based on Simple Features(https://arxiv.org/abs/2501.02945)
Keywords: foundation model
Abstract: Foundation models have become popular in forecasting due to their ability to make accurate predictions, even with minimal fine-tuning on specific datasets. In this paper, we demonstrate how the newly released regression variant of TabPFN, a general tabular foundation model, can be applied to time series forecasting. We propose a straightforward approach, TabPFN-TS, which pairs TabPFN with simple feature engineering to achieve strong forecasting performance. Despite its simplicity and with only 11M parameters, TabPFN-TS outperforms Chronos-Mini, a model of similar size, and matches or even slightly outperforms Chronos-Large, which has 65-fold more parameters. A key strength of our method lies in its reliance solely on artificial data during pre-training, avoiding the need for large training datasets and eliminating the risk of benchmark contamination.
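
Note: the abstract describes recasting forecasting as tabular regression with "simple feature engineering" but does not list the features. The sketch below shows one plausible shape of that recasting under assumed features (a running time index plus sine and cosine of a seasonal phase). The feature set, the `period` parameter and the function names are hypothetical, and no TabPFN API is called; any tabular regressor could be fitted on the resulting table.

```python
import math


def make_features(t, period):
    """Hypothetical per-timestep features: running index and seasonal phase."""
    phase = 2.0 * math.pi * (t % period) / period
    return [float(t), math.sin(phase), math.cos(phase)]


def to_tabular(series, horizon, period):
    """Turn a univariate series into a regression table.
    Training rows are observed timesteps; test rows are the future
    timesteps whose targets the regressor should predict."""
    n = len(series)
    X_train = [make_features(t, period) for t in range(n)]
    y_train = list(series)
    X_test = [make_features(t, period) for t in range(n, n + horizon)]
    return X_train, y_train, X_test


X_train, y_train, X_test = to_tabular([1.0, 2.0, 3.0, 4.0], horizon=2, period=4)
```

Each forecast step becomes an independent row, so the regressor needs no recurrent state; all temporal structure has to be carried by the features.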

Title: SceneVTG++: Controllable Multilingual Visual Text Generation in the Wild

Authors: Jiawei Liu, Yuanzhi Zhu, Feiyu Gao, Zhibo Yang, Peng Wang, Junyang Lin, Xinggang Wang, Wenyu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02962
Pdf URL: https://arxiv.org/pdf/2501.02962
Copy Paste: [[2501.02962]] SceneVTG++: Controllable Multilingual Visual Text Generation in the Wild(https://arxiv.org/abs/2501.02962)
Keywords: diffusion
Abstract: Generating visual text in natural scene images is a challenging task with many unsolved problems. Different from generating text on artificially designed images (such as posters, covers, cartoons, etc.), the text in natural scene images needs to meet the following four key criteria: (1) Fidelity: the generated text should appear as realistic as a photograph and be completely accurate, with no errors in any of the strokes. (2) Reasonability: the text should be generated on reasonable carrier areas (such as boards, signs, walls, etc.), and the generated text content should also be relevant to the scene. (3) Utility: the generated text can facilitate the training of natural scene OCR (Optical Character Recognition) tasks. (4) Controllability: the attributes of the text (such as font and color) should be controllable. In this paper, we propose a two-stage method, SceneVTG++, which simultaneously satisfies the four aspects mentioned above. SceneVTG++ consists of a Text Layout and Content Generator (TLCG) and a Controllable Local Text Diffusion (CLTD). The former utilizes the world knowledge of multimodal large language models to find reasonable text areas and recommend text content according to the natural scene background images, while the latter generates controllable multilingual text based on the diffusion model. Through extensive experiments, we verified the effectiveness of TLCG and CLTD separately, and demonstrated the state-of-the-art text generation performance of SceneVTG++. In addition, the generated images have superior utility in OCR tasks like text detection and text recognition. Codes and datasets will be available.

Title: Human Gaze Boosts Object-Centered Representation Learning

Authors: Timothy Schaumlöffel, Arthur Aubret, Gemma Roig, Jochen Triesch
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2501.02966
Pdf URL: https://arxiv.org/pdf/2501.02966
Copy Paste: [[2501.02966]] Human Gaze Boosts Object-Centered Representation Learning(https://arxiv.org/abs/2501.02966)
Keywords: self-supervised
Abstract: Recent self-supervised learning (SSL) models trained on human-like egocentric visual inputs substantially underperform on image recognition tasks compared to humans. These models train on raw, uniform visual inputs collected from head-mounted cameras. This is different from humans, as the anatomical structure of the retina and visual cortex relatively amplifies the central visual information, i.e. around humans' gaze location. This selective amplification in humans likely aids in forming object-centered visual representations. Here, we investigate whether focusing on central visual information boosts egocentric visual object learning. We simulate 5 months of egocentric visual experience using the large-scale Ego4D dataset and generate gaze locations with a human gaze prediction model. To account for the importance of central vision in humans, we crop the visual area around the gaze location. Finally, we train a time-based SSL model on these modified inputs. Our experiments demonstrate that focusing on central vision leads to better object-centered representations. Our analysis shows that the SSL model leverages the temporal dynamics of the gaze movements to build stronger visual representations. Overall, our work marks a significant step toward bio-inspired learning of visual representations.

Title: LOHA: Direct Graph Spectral Contrastive Learning Between Low-pass and High-pass Views

Authors: Ziyun Zou, Yinghui Jiang, Lian Shen, Juan Liu, Xiangrong Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2501.02969
Pdf URL: https://arxiv.org/pdf/2501.02969
Copy Paste: [[2501.02969]] LOHA: Direct Graph Spectral Contrastive Learning Between Low-pass and High-pass Views(https://arxiv.org/abs/2501.02969)
Keywords: self-supervised
Abstract: Spectral Graph Neural Networks effectively handle graphs with different homophily levels, with low-pass filters mining feature smoothness and high-pass filters capturing differences. While these distinct filters could naturally form two opposite views for self-supervised learning, the commonalities between the counterparts for the same node remain unexplored, leading to suboptimal performance. In this paper, a simple yet effective self-supervised contrastive framework, LOHA, is proposed to address this gap. LOHA optimally leverages low-pass and high-pass views by embracing "harmony in diversity". Rather than solely maximizing the difference between these distinct views, which may lead to feature separation, LOHA harmonizes the diversity by treating the propagation of graph signals from both views as a composite feature. Specifically, a novel high-dimensional feature named spectral signal trend is proposed to serve as the basis for the composite feature, which remains relatively unaffected by changing filters and focuses solely on original feature differences. LOHA achieves an average performance improvement of 2.8% over runner-up models on 9 real-world datasets with varying homophily levels. Notably, LOHA even surpasses fully-supervised models on several datasets, which underscores the potential of LOHA in advancing the efficacy of spectral GNNs for diverse graph structures.

Title: STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

Authors: Rui Xie, Yinhong Liu, Penghao Zhou, Chen Zhao, Jun Zhou, Kai Zhang, Zhenyu Zhang, Jian Yang, Zhenheng Yang, Ying Tai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.02976
Pdf URL: https://arxiv.org/pdf/2501.02976
Copy Paste: [[2501.02976]] STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution(https://arxiv.org/abs/2501.02976)
Keywords: diffusion, generative
Abstract: Image diffusion models have been adapted for real-world video super-resolution to tackle over-smoothing issues in GAN-based methods. However, these models struggle to maintain temporal consistency, as they are trained on static images, limiting their ability to capture temporal dynamics effectively. Integrating text-to-video (T2V) models into video super-resolution for improved temporal modeling is straightforward. However, two key challenges remain: artifacts introduced by complex degradations in real-world scenarios, and compromised fidelity due to the strong generative capacity of powerful T2V models (e.g., CogVideoX-5B). To enhance the spatio-temporal quality of restored videos, we introduce STAR (Spatial-Temporal Augmentation with T2V models for Real-world video super-resolution), a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency. Specifically, we introduce a Local Information Enhancement Module (LIEM) before the global attention block to enrich local details and mitigate degradation artifacts. Moreover, we propose a Dynamic Frequency (DF) Loss to reinforce fidelity, guiding the model to focus on different frequency components across diffusion steps. Extensive experiments demonstrate that STAR outperforms state-of-the-art methods on both synthetic and real-world datasets.

Title: TransPixar: Advancing Text-to-Video Generation with Transparency

Authors: Luozhou Wang, Yijun Li, Zhifei Chen, Jui-Hsien Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yingcong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.03006
Pdf URL: https://arxiv.org/pdf/2501.03006
Copy Paste: [[2501.03006]] TransPixar: Advancing Text-to-Video Generation with Transparency(https://arxiv.org/abs/2501.03006)
Keywords: diffusion, generative
Abstract: Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes. We introduce TransPixar, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixar leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixar preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data. Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.
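
Note: the role of the alpha channel described in this abstract can be made concrete with the standard "over" compositing rule, out = alpha * foreground + (1 - alpha) * background. The sketch below applies it to a single pixel with straight (non-premultiplied) alpha and channel values in [0, 1]; it illustrates how an RGBA frame is used downstream and is not part of TransPixar's method. The function name is hypothetical.

```python
def composite_over(fg_rgb, alpha, bg_rgb):
    """Blend one foreground pixel over a background pixel.
    alpha = 1.0 is fully opaque, alpha = 0.0 is fully transparent."""
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg_rgb, bg_rgb))


# A faint red smoke pixel (25% opaque) over a blue background.
pixel = composite_over((1.0, 0.0, 0.0), 0.25, (0.0, 0.0, 1.0))
```

A generated RGB frame whose alpha is misaligned with its content would produce halos under this blend, which is why the abstract stresses consistency between the RGB and alpha channels.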

Title: Sentiment-guided Commonsense-aware Response Generation for Mental Health Counseling

Authors: Aseem Srivastava, Gauri Naik, Alison Cerezo, Tanmoy Chakraborty, Md. Shad Akhtar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2501.03088
Pdf URL: https://arxiv.org/pdf/2501.03088
Copy Paste: [[2501.03088]] Sentiment-guided Commonsense-aware Response Generation for Mental Health Counseling(https://arxiv.org/abs/2501.03088)
Keywords: foundation model
Abstract: The crisis of mental health issues is escalating. Effective counseling serves as a critical lifeline for individuals suffering from conditions like PTSD, stress, etc. Therapists forge a crucial therapeutic bond with clients, steering them towards positivity. Unfortunately, the massive shortage of professionals, high costs, and mental health stigma pose significant barriers to consulting therapists. As a substitute, Virtual Mental Health Assistants (VMHAs) have emerged in the digital healthcare space. However, most existing VMHAs lack the commonsense to understand the nuanced sentiments of clients to generate effective responses. To this end, we propose EmpRes, a novel sentiment-guided mechanism incorporating commonsense awareness for generating responses. By leveraging foundation models and harnessing commonsense knowledge, EmpRes aims to generate responses that effectively shape the client's sentiment towards positivity. We evaluate the performance of EmpRes on HOPE, a benchmark counseling dataset, and observe a remarkable performance improvement compared to the existing baselines across a suite of qualitative and quantitative metrics. Moreover, our extensive empirical analysis and human evaluation show that the generation ability of EmpRes is well-suited and, in some cases, surpasses the gold standard. Further, we deploy EmpRes as a chat interface for users seeking mental health support. We address the deployed system's effectiveness through an exhaustive user study with a significant positive response. Our findings show that 91% of users find the system effective, 80% express satisfaction, and over 85.45% convey a willingness to continue using the interface and recommend it to others, demonstrating the practical applicability of EmpRes in addressing the pressing challenges of mental health support, emphasizing user feedback, and ethical considerations in a real-world context.

Title: CAT: Content-Adaptive Image Tokenization

Authors: Junhong Shen, Kushal Tirumala, Michihiro Yasunaga, Ishan Misra, Luke Zettlemoyer, Lili Yu, Chunting Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.03120
Pdf URL: https://arxiv.org/pdf/2501.03120
Copy Paste: [[2501.03120]] CAT: Content-Adaptive Image Tokenization(https://arxiv.org/abs/2501.03120)
Keywords: diffusion
Abstract: Most existing image tokenizers encode images into a fixed number of tokens or patches, overlooking the inherent variability in image complexity. To address this, we introduce Content-Adaptive Tokenizer (CAT), which dynamically adjusts representation capacity based on the image content and encodes simpler images into fewer tokens. We design a caption-based evaluation system that leverages large language models (LLMs) to predict content complexity and determine the optimal compression ratio for a given image, taking into account factors critical to human perception. Trained on images with diverse compression ratios, CAT demonstrates robust performance in image reconstruction. We also utilize its variable-length latent representations to train Diffusion Transformers (DiTs) for ImageNet generation. By optimizing token allocation, CAT improves the FID score over fixed-ratio baselines trained with the same flops and boosts the inference throughput by 18.5%.

Title: Segment Anything Model for Zero-shot Single Particle Tracking in Liquid Phase Transmission Electron Microscopy

Authors: Risha Goel, Zain Shabeeb, Isabel Panicker, Vida Jamali
Subjects: cs.CV, physics.data-an
Abstract URL: https://arxiv.org/abs/2501.03153
Pdf URL: https://arxiv.org/pdf/2501.03153
Copy Paste: [[2501.03153]] Segment Anything Model for Zero-shot Single Particle Tracking in Liquid Phase Transmission Electron Microscopy(https://arxiv.org/abs/2501.03153)
Keywords: foundation model
Abstract: Liquid phase transmission electron microscopy (LPTEM) offers an unparalleled combination of spatial and temporal resolution, making it a promising tool for single particle tracking at the nanoscale. However, the absence of a standardized framework for identifying and tracking nanoparticles in noisy LPTEM videos has impeded progress in the field to develop this technique as a single particle tracking tool. To address this, we leveraged Segment Anything Model 2 (SAM 2), released by Meta, which is a foundation model developed for segmenting videos and images. Here, we demonstrate that SAM 2 can successfully segment LPTEM videos in a zero-shot manner and without requiring fine-tuning. Building on this capability, we introduce SAM4EM, a comprehensive framework that integrates promptable video segmentation with particle tracking and statistical analysis, providing an end-to-end LPTEM analysis framework for single particle tracking. SAM4EM achieves nearly 50-fold higher accuracy in segmenting and analyzing LPTEM videos compared to state-of-the-art methods, paving the way for broader applications of LPTEM in nanoscale imaging.

Title: Deep-Relative-Trust-Based Diffusion for Decentralized Deep Learning

Authors: Muyun Li, Aaron Fainman, Stefan Vlaski
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2501.03162
Pdf URL: https://arxiv.org/pdf/2501.03162
Copy Paste: [[2501.03162]] Deep-Relative-Trust-Based Diffusion for Decentralized Deep Learning(https://arxiv.org/abs/2501.03162)
Keywords: diffusion
Abstract: Decentralized learning strategies allow a collection of agents to learn efficiently from local data sets without the need for central aggregation or orchestration. Current decentralized learning paradigms typically rely on an averaging mechanism to encourage agreement in the parameter space. We argue that in the context of deep neural networks, which are often over-parameterized, encouraging consensus of the neural network outputs, as opposed to their parameters, can be more appropriate. This motivates the development of a new decentralized learning algorithm, termed DRT diffusion, based on deep relative trust (DRT), a recently introduced similarity measure for neural networks. We provide convergence analysis for the proposed strategy, and numerically establish its benefit to generalization, especially with sparse topologies, in an image classification task.
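
As an illustration of the parameter-space averaging that the abstract describes as the current paradigm (not of DRT diffusion itself), the sketch below runs one common form of it, an adapt-then-combine diffusion update, on a toy problem. Everything here is an assumption made for illustration: each agent holds a single scalar parameter and a local quadratic loss 0.5 * (w - t_k)^2, the agents sit on a ring with at least three members, and each averages uniformly with its two neighbors. The name `atc_diffusion` is hypothetical.

```python
def atc_diffusion(targets, step_size=0.1, iterations=200):
    """Adapt-then-combine diffusion on a ring of agents (toy scalar example).

    Agent k minimizes the local loss 0.5 * (w - targets[k]) ** 2.
    Assumes len(targets) >= 3 so that each agent has two distinct neighbors.
    """
    n = len(targets)
    weights = [0.0] * n
    for _ in range(iterations):
        # Adapt: every agent takes a local gradient step on its own data.
        adapted = [w - step_size * (w - t) for w, t in zip(weights, targets)]
        # Combine: every agent averages with its left and right ring neighbors.
        weights = [
            (adapted[(k - 1) % n] + adapted[k] + adapted[(k + 1) % n]) / 3.0
            for k in range(n)
        ]
    return weights


final = atc_diffusion([0.0, 1.0, 2.0, 3.0])
print(final)
```

With these settings the agents end up close to the average of the local targets, even though no agent ever sees more than its neighbors' parameters. The abstract's argument is that, for over-parameterized networks, agreement of this kind may be better enforced on network outputs than on the parameters themselves.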

Title: Semantic Captioning: Benchmark Dataset and Graph-Aware Few-Shot In-Context Learning for SQL2Text

Authors: Ali Al-Lawati, Jason Lucas, Prasenjit Mitra
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2501.03166
Pdf URL: https://arxiv.org/pdf/2501.03166
Copy Paste: [[2501.03166]] Semantic Captioning: Benchmark Dataset and Graph-Aware Few-Shot In-Context Learning for SQL2Text(https://arxiv.org/abs/2501.03166)
Keywords: in-context
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance in various NLP tasks, including semantic parsing, which translates natural language into formal code representations. However, the reverse process, translating code into natural language, termed semantic captioning, has received less attention. This task is becoming increasingly important as LLMs are integrated into platforms for code generation, security analysis, and educational purposes. In this paper, we focus on the captioning of SQL queries (SQL2Text) to address the critical need for understanding and explaining SQL queries in an era where LLM-generated code poses potential security risks. We repurpose Text2SQL datasets for SQL2Text by introducing an iterative ICL prompt using GPT-4o to generate multiple additional utterances, which enhances the robustness of the datasets for the reverse task. We conduct our experiments using in-context learning (ICL) based on different sample selection methods, emphasizing smaller, more computationally efficient LLMs. Our findings demonstrate that leveraging the inherent graph properties of SQL for ICL sample selection significantly outperforms random selection by up to 39% on BLEU score and provides better results than alternative methods. Dataset and code are published at this https URL.

Title: MObI: Multimodal Object Inpainting Using Diffusion Models

Authors: Alexandru Buburuzan, Anuj Sharma, John Redford, Puneet K. Dokania, Romain Mueller
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.03173
Pdf URL: https://arxiv.org/pdf/2501.03173
Copy Paste: [[2501.03173]] MObI: Multimodal Object Inpainting Using Diffusion Models(https://arxiv.org/abs/2501.03173)
Keywords: diffusion
Abstract: Safety-critical applications, such as autonomous driving, require extensive multimodal data for rigorous testing. Methods based on synthetic data are gaining prominence due to the cost and complexity of gathering real-world data but require a high degree of realism and controllability in order to be useful. This paper introduces MObI, a novel framework for Multimodal Object Inpainting that leverages a diffusion model to create realistic and controllable object inpaintings across perceptual modalities, demonstrated for both camera and lidar simultaneously. Using a single reference RGB image, MObI enables objects to be seamlessly inserted into existing multimodal scenes at a 3D location specified by a bounding box, while maintaining semantic consistency and multimodal coherence. Unlike traditional inpainting methods that rely solely on edit masks, our 3D bounding box conditioning gives objects accurate spatial positioning and realistic scaling. As a result, our approach can be used to insert novel objects flexibly into multimodal scenes, providing significant advantages for testing perception models.

Title: Leveraging Explainable AI for LLM Text Attribution: Differentiating Human-Written and Multiple LLMs-Generated Text

Authors: Ayat Najjar, Huthaifa I. Ashqar, Omar Darwish, Eman Hammad
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2501.03212
Pdf URL: https://arxiv.org/pdf/2501.03212
Copy Paste: [[2501.03212]] Leveraging Explainable AI for LLM Text Attribution: Differentiating Human-Written and Multiple LLMs-Generated Text(https://arxiv.org/abs/2501.03212)
Keywords: generative
Abstract: The development of Generative AI Large Language Models (LLMs) has raised the alarm about identifying whether content was produced by generative AI or by humans. In one case, issues arise when students rely heavily on such tools in a manner that can affect the development of their writing or coding skills. Other issues of plagiarism also apply. This study aims to support efforts to detect and identify textual content generated using LLM tools. We hypothesize that LLM-generated text is detectable by machine learning (ML), and investigate ML models that can recognize and differentiate texts generated by multiple LLM tools. We leverage several ML and Deep Learning (DL) algorithms, such as Random Forest (RF) and Recurrent Neural Networks (RNN), and utilize Explainable Artificial Intelligence (XAI) to understand the important features in attribution. Our method is divided into 1) binary classification, to differentiate between human-written and AI-generated text, and 2) multi-class classification, to differentiate between human-written text and the text generated by five different LLM tools (ChatGPT, LLaMA, Google Bard, Claude, and Perplexity). Results show high accuracy in both the multi-class and binary classification. Our model outperformed GPTZero, with 98.5% accuracy versus 78.3%. Notably, GPTZero was unable to recognize about 4.2% of the observations, but our model was able to recognize the complete test dataset. XAI results showed that understanding feature importance across different classes enables detailed author/source profiles. This further aids attribution and supports plagiarism detection by highlighting unique stylistic and structural elements, helping to verify content originality.

Title: ProTracker: Probabilistic Integration for Robust and Accurate Point Tracking

Authors: Tingyang Zhang, Chen Wang, Zhiyang Dou, Qingzhe Gao, Jiahui Lei, Baoquan Chen, Lingjie Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2501.03220
Pdf URL: https://arxiv.org/pdf/2501.03220
Copy Paste: [[2501.03220]] ProTracker: Probabilistic Integration for Robust and Accurate Point Tracking(https://arxiv.org/abs/2501.03220)
Keywords: self-supervised
Abstract: In this paper, we propose ProTracker, a novel framework for robust and accurate long-term dense tracking of arbitrary points in videos. The key idea of our method is incorporating probabilistic integration to refine multiple predictions from both optical flow and semantic features for robust short-term and long-term tracking. Specifically, we integrate optical flow estimations in a probabilistic manner, producing smooth and accurate trajectories by maximizing the likelihood of each prediction. To effectively re-localize challenging points that disappear and reappear due to occlusion, we further incorporate long-term feature correspondence into our flow predictions for continuous trajectory generation. Extensive experiments show that ProTracker achieves the state-of-the-art performance among unsupervised and self-supervised approaches, and even outperforms supervised methods on several benchmarks. Our code and model will be publicly available upon publication.
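
One simple way to read "integrating predictions in a probabilistic manner by maximizing the likelihood" is inverse-variance fusion. The sketch below assumes, purely for illustration, that each prediction of a point coordinate is an independent Gaussian estimate given as a (mean, variance) pair; under that assumption the maximum-likelihood estimate is the precision-weighted mean. This is not claimed to be ProTracker's actual formulation, and `fuse_gaussian_estimates` is a hypothetical name.

```python
def fuse_gaussian_estimates(estimates):
    """Fuse independent Gaussian estimates of one scalar quantity.

    `estimates` is a list of (mean, variance) pairs with variance > 0.
    Returns the maximum-likelihood (inverse-variance weighted) mean and
    the variance of the fused estimate.
    """
    if not estimates:
        raise ValueError("need at least one estimate")
    if any(var <= 0.0 for _, var in estimates):
        raise ValueError("variances must be positive")
    precisions = [1.0 / var for _, var in estimates]
    total_precision = sum(precisions)
    fused_mean = sum(p * m for p, (m, _) in zip(precisions, estimates)) / total_precision
    return fused_mean, 1.0 / total_precision


# Two equally uncertain predictions of the same x-coordinate.
print(fuse_gaussian_estimates([(0.0, 1.0), (2.0, 1.0)]))
# A confident prediction dominates a very uncertain one.
print(fuse_gaussian_estimates([(0.0, 0.01), (10.0, 100.0)]))
```

The fused variance is never larger than the smallest input variance, which is the sense in which combining several noisy trackers can give a smoother trajectory than any single one.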

Title: BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning

Authors: Beichen Zhang, Yuhong Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Haodong Duan, Yuhang Cao, Dahua Lin, Jiaqi Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2501.03226
Pdf URL: https://arxiv.org/pdf/2501.03226
Copy Paste: [[2501.03226]] BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning(https://arxiv.org/abs/2501.03226)
Keywords: in-context
Abstract: Cutting-edge large language models (LLMs) demonstrate promising performance in solving complex math problems with a divide-and-conquer pipeline and the assistance of in-context learning (ICL) examples. However, their potential for improvement is limited by two critical problems within their ICL examples: granularity mismatch and the ensuing negative-effect noise problem. Specifically, the LLMs are capable of the dividing process yet mostly fail due to inaccurate reasoning within a few conquer steps, while the ICL examples retrieved at question granularity sometimes lack relevant steps for a specific challenging reasoning step. Further, this disconnect may hinder correct reasoning due to its irrelevance. To this end, we focus on improving the reasoning quality within each step and present BoostStep. BoostStep aligns the granularity of retrieval and reasoning at the step level, and provides highly related ICL examples for each reasoning step with a novel 'first-try' strategy. BoostStep provides more relevant examples than the coarse question-grained strategy, steadily enhancing the model's reasoning quality within each step. BoostStep is a general and robust reasoning-enhancing method that not only improves standalone reasoning performance but also integrates seamlessly with Monte Carlo Tree Search (MCTS) methods to refine both candidate generation and decision-making. Quantitatively, it improves GPT-4o and Qwen2.5-Math-72B by 3.6% and 2.0% respectively on various mathematical benchmarks, and yields a 7.5% gain when combined with MCTS.

Title: Gaussian Masked Autoencoders

Authors: Jathushan Rajasegaran, Xinlei Chen, Rulilong Li, Christoph Feichtenhofer, Jitendra Malik, Shiry Ginosar
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2501.03229
Pdf URL: https://arxiv.org/pdf/2501.03229
Copy Paste: [[2501.03229]] Gaussian Masked Autoencoders(https://arxiv.org/abs/2501.03229)
Keywords: self-supervised
Abstract: This paper explores Masked Autoencoders (MAE) with Gaussian Splatting. While reconstructive self-supervised learning frameworks such as MAE learn good semantic abstractions, they are not trained for explicit spatial awareness. Our approach, named Gaussian Masked Autoencoder, or GMAE, aims to learn semantic abstractions and spatial understanding jointly. Like MAE, it reconstructs the image end-to-end in the pixel space, but beyond MAE, it also introduces an intermediate, 3D Gaussian-based representation and renders images via splatting. We show that GMAE can enable various zero-shot learning capabilities of spatial understanding (e.g., figure-ground segmentation, image layering, edge detection, etc.) while preserving the high-level semantics of self-supervised representation quality from MAE. To our knowledge, we are the first to employ Gaussian primitives in an image representation learning framework beyond optimization-based single-scene reconstructions. We believe GMAE will inspire further research in this direction and contribute to developing next-generation techniques for modeling high-fidelity visual data. More details at this https URL