2024-12-17

Title: Scalable Early Childhood Reading Performance Prediction

Authors: Zhongkai Shangguan, Zanming Huang, Eshed Ohn-Bar, Ola Ozernov-Palchik, Derek Kosty, Michael Stoolmiller, Hank Fien
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.10401
Pdf URL: https://arxiv.org/pdf/2412.10401
Copy Paste: [[2412.10401]] Scalable Early Childhood Reading Performance Prediction(https://arxiv.org/abs/2412.10401)
Keywords: self-supervised
Abstract: Models for student reading performance can empower educators and institutions to proactively identify at-risk students, thereby enabling early and tailored instructional interventions. However, there are no suitable publicly available educational datasets for modeling and predicting future reading performance. In this work, we introduce the Enhanced Core Reading Instruction ECRI dataset, a novel large-scale longitudinal tabular dataset collected across 44 schools with 6,916 students and 172 teachers. We leverage the dataset to empirically evaluate the ability of state-of-the-art machine learning models to recognize early childhood educational patterns in multivariate and partial measurements. Specifically, we demonstrate a simple self-supervised strategy in which a Multi-Layer Perception (MLP) network is pre-trained over masked inputs to outperform several strong baselines while generalizing over diverse educational settings. To facilitate future developments in precise modeling and responsible use of models for individualized and early intervention strategies, our data and code are available at this https URL.

Title: Generative Adversarial Reviews: When LLMs Become the Critic

Authors: Nicolas Bougie, Narimasa Watanabe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10415
Pdf URL: https://arxiv.org/pdf/2412.10415
Copy Paste: [[2412.10415]] Generative Adversarial Reviews: When LLMs Become the Critic(https://arxiv.org/abs/2412.10415)
Keywords: generative
Abstract: The peer review process is fundamental to scientific progress, determining which papers meet the quality standards for publication. Yet, the rapid growth of scholarly production and increasing specialization in knowledge areas strain traditional scientific feedback mechanisms. In light of this, we introduce Generative Agent Reviewers (GAR), leveraging LLM-empowered agents to simulate faithful peer reviewers. To enable generative reviewers, we design an architecture that extends a large language model with memory capabilities and equips agents with reviewer personas derived from historical data. Central to this approach is a graph-based representation of manuscripts, condensing content and logically organizing information - linking ideas with evidence and technical details. GAR's review process leverages external knowledge to evaluate paper novelty, followed by detailed assessment using the graph representation and multi-round assessment. Finally, a meta-reviewer aggregates individual reviews to predict the acceptance decision. Our experiments demonstrate that GAR performs comparably to human reviewers in providing detailed feedback and predicting paper outcomes. Beyond mere performance comparison, we conduct insightful experiments, such as evaluating the impact of reviewer expertise and examining fairness in reviews. By offering early expert-level feedback, typically restricted to a limited group of researchers, GAR democratizes access to transparent and in-depth evaluation.

Title: GPTDrawer: Enhancing Visual Synthesis through ChatGPT

Authors: Kun Li, Xinwei Chen, Tianyou Song, Hansong Zhang, Wenzhe Zhang, Qing Shan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10429
Pdf URL: https://arxiv.org/pdf/2412.10429
Copy Paste: [[2412.10429]] GPTDrawer: Enhancing Visual Synthesis through ChatGPT(https://arxiv.org/abs/2412.10429)
Keywords: diffusion, generative
Abstract: In the burgeoning field of AI-driven image generation, the quest for precision and relevance in response to textual prompts remains paramount. This paper introduces GPTDrawer, an innovative pipeline that leverages the generative prowess of GPT-based models to enhance the visual synthesis process. Our methodology employs a novel algorithm that iteratively refines input prompts using keyword extraction, semantic analysis, and image-text congruence evaluation. By integrating ChatGPT for natural language processing and Stable Diffusion for image generation, GPTDrawer produces a batch of images that undergo successive refinement cycles, guided by cosine similarity metrics until a threshold of semantic alignment is attained. The results demonstrate a marked improvement in the fidelity of images generated in accordance with user-defined prompts, showcasing the system's ability to interpret and visualize complex semantic constructs. The implications of this work extend to various applications, from creative arts to design automation, setting a new benchmark for AI-assisted creative processes.

Title: SVGFusion: Scalable Text-to-SVG Generation via Vector Space Diffusion

Authors: Ximing Xing, Juncheng Hu, Jing Zhang, Dong Xu, Qian Yu
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10437
Pdf URL: https://arxiv.org/pdf/2412.10437
Copy Paste: [[2412.10437]] SVGFusion: Scalable Text-to-SVG Generation via Vector Space Diffusion(https://arxiv.org/abs/2412.10437)
Keywords: diffusion
Abstract: The generation of Scalable Vector Graphics (SVG) assets from textual data remains a significant challenge, largely due to the scarcity of high-quality vector datasets and the limitations in scalable vector representations required for modeling intricate graphic distributions. This work introduces SVGFusion, a Text-to-SVG model capable of scaling to real-world SVG data without reliance on a text-based discrete language model or prolonged SDS optimization. The essence of SVGFusion is to learn a continuous latent space for vector graphics with a popular Text-to-Image framework. Specifically, SVGFusion consists of two modules: a Vector-Pixel Fusion Variational Autoencoder (VP-VAE) and a Vector Space Diffusion Transformer (VS-DiT). VP-VAE takes both the SVGs and corresponding rasterizations as inputs and learns a continuous latent space, whereas VS-DiT learns to generate a latent code within this space based on the text prompt. Based on VP-VAE, a novel rendering sequence modeling strategy is proposed to enable the latent space to embed the knowledge of construction logics in SVGs. This empowers the model to achieve human-like design capabilities in vector graphics, while systematically preventing occlusion in complex graphic compositions. Moreover, our SVGFusion's ability can be continuously improved by leveraging the scalability of the VS-DiT by adding more VS-DiT blocks. A large-scale SVG dataset is collected to evaluate the effectiveness of our proposed method. Extensive experimentation has confirmed the superiority of our SVGFusion over existing SVG generation methods, achieving enhanced quality and generalizability, thereby establishing a novel framework for SVG content creation. Code, model, and data will be released at: \href{this https URL}{this https URL}

Title: CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMs

Authors: Yihan Cao, Jiazhao Zhang, Zhinan Yu, Shuzhen Liu, Zheng Qin, Qin Zou, Bo Du, Kai Xu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2412.10439
Pdf URL: https://arxiv.org/pdf/2412.10439
Copy Paste: [[2412.10439]] CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMs(https://arxiv.org/abs/2412.10439)
Keywords: foundation model
Abstract: Object goal navigation (ObjectNav) is a fundamental task of embodied AI that requires the agent to find a target object in unseen environments. This task is particularly challenging as it demands both perceptual and cognitive processes for effective perception and decision-making. While perception has gained significant progress powered by the rapidly developed visual foundation models, the progress on the cognitive side remains limited to either implicitly learning from massive navigation demonstrations or explicitly leveraging pre-defined heuristic rules. Inspired by neuroscientific evidence that humans consistently update their cognitive states while searching for objects in unseen environments, we present CogNav, which attempts to model this cognitive process with the help of large language models. Specifically, we model the cognitive process with a finite state machine composed of cognitive states ranging from exploration to identification. The transitions between the states are determined by a large language model based on an online built heterogeneous cognitive map containing spatial and semantic information of the scene being explored. Extensive experiments on both synthetic and real-world environments demonstrate that our cognitive modeling significantly improves ObjectNav efficiency, with human-like navigation behaviors. In an open-vocabulary and zero-shot setting, our method advances the SOTA of the HM3D benchmark from 69.3% to 87.2%. The code and data will be released.

Title: Unlocking Visual Secrets: Inverting Features with Diffusion Priors for Image Reconstruction

Authors: Sai Qian Zhang, Ziyun Li, Chuan Guo, Saeed Mahloujifar, Deeksha Dangwal, Edward Suh, Barbara De Salvo, Chiao Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10448
Pdf URL: https://arxiv.org/pdf/2412.10448
Copy Paste: [[2412.10448]] Unlocking Visual Secrets: Inverting Features with Diffusion Priors for Image Reconstruction(https://arxiv.org/abs/2412.10448)
Keywords: diffusion
Abstract: Inverting visual representations within deep neural networks (DNNs) presents a challenging and important problem in the field of security and privacy for deep learning. The main goal is to invert the features of an unidentified target image generated by a pre-trained DNN, aiming to reconstruct the original image. Feature inversion holds particular significance in understanding the privacy leakage inherent in contemporary split DNN execution techniques, as well as in various applications based on the extracted DNN features. In this paper, we explore the use of diffusion models, a promising technique for image synthesis, to enhance feature inversion quality. We also investigate the potential of incorporating alternative forms of prior knowledge, such as textual prompts and cross-frame temporal correlations, to further improve the quality of inverted features. Our findings reveal that diffusion models can effectively leverage hidden information from the DNN features, resulting in superior reconstruction performance compared to previous methods. This research offers valuable insights into how diffusion models can enhance privacy and security within applications that are reliant on DNN features.

Title: Geo-LLaVA: A Large Multi-Modal Model for Solving Geometry Math Problems with Meta In-Context Learning

Authors: Shihao Xu, Yiyang Luo, Wei Shi
Subjects: cs.CV, cs.AI, cs.CG
Abstract URL: https://arxiv.org/abs/2412.10455
Pdf URL: https://arxiv.org/pdf/2412.10455
Copy Paste: [[2412.10455]] Geo-LLaVA: A Large Multi-Modal Model for Solving Geometry Math Problems with Meta In-Context Learning(https://arxiv.org/abs/2412.10455)
Keywords: in-context
Abstract: Geometry mathematics problems pose significant challenges for large language models (LLMs) because they involve visual elements and spatial reasoning. Current methods primarily rely on symbolic character awareness to address these problems. Considering geometry problem solving is a relatively nascent field with limited suitable datasets and currently almost no work on solid geometry problem solving, we collect a geometry question-answer dataset by sourcing geometric data from Chinese high school education websites, referred to as GeoMath. It contains solid geometry questions and answers with accurate reasoning steps as compensation for existing plane geometry datasets. Additionally, we propose a Large Multi-modal Model (LMM) framework named Geo-LLaVA, which incorporates retrieval augmentation with supervised fine-tuning (SFT) in the training stage, called meta-training, and employs in-context learning (ICL) during inference to improve performance. Our fine-tuned model with ICL attains the state-of-the-art performance of 65.25% and 42.36% on selected questions of the GeoQA dataset and GeoMath dataset respectively with proper inference steps. Notably, our model initially endows the ability to solve solid geometry problems and supports the generation of reasonable solid geometry picture descriptions and problem-solving steps. Our research sets the stage for further exploration of LLMs in multi-modal math problem-solving, particularly in geometry math problems.

Title: Explaining Model Overfitting in CNNs via GMM Clustering

Authors: Hui Dou, Xinyu Mu, Mengjun Yi, Feng Han, Jian Zhao, Furao Shen
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2412.10457
Pdf URL: https://arxiv.org/pdf/2412.10457
Copy Paste: [[2412.10457]] Explaining Model Overfitting in CNNs via GMM Clustering(https://arxiv.org/abs/2412.10457)
Keywords: anomaly
Abstract: Convolutional Neural Networks (CNNs) have demonstrated remarkable prowess in the field of computer vision. However, their opaque decision-making processes pose significant challenges for practical applications. In this study, we provide quantitative metrics for assessing CNN filters by clustering the feature maps corresponding to individual filters in the model via Gaussian Mixture Model (GMM). By analyzing the clustering results, we screen out some anomaly filters associated with outlier samples. We further analyze the relationship between the anomaly filters and model overfitting, proposing three hypotheses. This method is universally applicable across diverse CNN architectures without modifications, as evidenced by its successful application to models like AlexNet and LeNet-5. We present three meticulously designed experiments demonstrating our hypotheses from the perspectives of model behavior, dataset characteristics, and filter impacts. Through this work, we offer a novel perspective for evaluating the CNN performance and gain new insights into the operational behavior of model overfitting.

Title: Dynamic Entity-Masked Graph Diffusion Model for histopathological image Representation Learning

Authors: Zhenfeng Zhuang, Min Cen, Yanfeng Li, Fangyu Zhou, Lequan Yu, Baptiste Magnier, Liansheng Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10482
Pdf URL: https://arxiv.org/pdf/2412.10482
Copy Paste: [[2412.10482]] Dynamic Entity-Masked Graph Diffusion Model for histopathological image Representation Learning(https://arxiv.org/abs/2412.10482)
Keywords: diffusion, self-supervised
Abstract: Significant disparities between the features of natural images and those inherent to histopathological images make it challenging to directly apply and transfer pre-trained models from natural images to histopathology tasks. Moreover, the frequent lack of annotations in histopathology patch images has driven researchers to explore self-supervised learning methods like mask reconstruction for learning representations from large amounts of unlabeled data. Crucially, previous mask-based efforts in self-supervised learning have often overlooked the spatial interactions among entities, which are essential for constructing accurate representations of pathological entities. To address these challenges, constructing graphs of entities is a promising approach. In addition, the diffusion reconstruction strategy has recently shown superior performance through its random intensity noise addition technique to enhance the robust learned representation. Therefore, we introduce H-MGDM, a novel self-supervised Histopathology image representation learning method through the Dynamic Entity-Masked Graph Diffusion Model. Specifically, we propose to use complementary subgraphs as latent diffusion conditions and self-supervised targets respectively during pre-training. We note that the graph can embed entities' topological relationships and enhance representation. Dynamic conditions and targets can improve pathological fine reconstruction. Our model has conducted pretraining experiments on three large histopathological datasets. The advanced predictive performance and interpretability of H-MGDM are clearly evaluated on comprehensive downstream tasks such as classification and survival analysis on six datasets. Our code will be publicly available at this https URL.

Title: CognitionCapturer: Decoding Visual Stimuli From Human EEG Signal With Multimodal Information

Authors: Kaifan Zhang, Lihuo He, Xin Jiang, Wen Lu, Di Wang, Xinbo Gao
Subjects: cs.CV, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/2412.10489
Pdf URL: https://arxiv.org/pdf/2412.10489
Copy Paste: [[2412.10489]] CognitionCapturer: Decoding Visual Stimuli From Human EEG Signal With Multimodal Information(https://arxiv.org/abs/2412.10489)
Keywords: diffusion, generative
Abstract: Electroencephalogram (EEG) signals have attracted significant attention from researchers due to their non-invasive nature and high temporal sensitivity in decoding visual stimuli. However, most recent studies have focused solely on the relationship between EEG and image data pairs, neglecting the valuable ``beyond-image-modality" information embedded in EEG signals. This results in the loss of critical multimodal information in EEG. To address this limitation, we propose CognitionCapturer, a unified framework that fully leverages multimodal data to represent EEG signals. Specifically, CognitionCapturer trains Modality Expert Encoders for each modality to extract cross-modal information from the EEG modality. Then, it introduces a diffusion prior to map the EEG embedding space to the CLIP embedding space, followed by using a pretrained generative model, the proposed framework can reconstruct visual stimuli with high semantic and structural fidelity. Notably, the framework does not require any fine-tuning of the generative models and can be extended to incorporate more modalities. Through extensive experiments, we demonstrate that CognitionCapturer outperforms state-of-the-art methods both qualitatively and quantitatively. Code: this https URL.

Title: SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation

Authors: Runtao Liu, Chen I Chieh, Jindong Gu, Jipeng Zhang, Renjie Pi, Qifeng Chen, Philip Torr, Ashkan Khakzar, Fabio Pizzati
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10493
Pdf URL: https://arxiv.org/pdf/2412.10493
Copy Paste: [[2412.10493]] SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation(https://arxiv.org/abs/2412.10493)
Keywords: generative
Abstract: Text-to-image (T2I) models have become widespread, but their limited safety guardrails expose end users to harmful content and potentially allow for model misuse. Current safety measures are typically limited to text-based filtering or concept removal strategies, able to remove just a few concepts from the model's generative capabilities. In this work, we introduce SafetyDPO, a method for safety alignment of T2I models through Direct Preference Optimization (DPO). We enable the application of DPO for safety purposes in T2I models by synthetically generating a dataset of harmful and safe image-text pairs, which we call CoProV2. Using a custom DPO strategy and this dataset, we train safety experts, in the form of low-rank adaptation (LoRA) matrices, able to guide the generation process away from specific safety-related concepts. Then, we merge the experts into a single LoRA using a novel merging strategy for optimal scaling performance. This expert-based approach enables scalability, allowing us to remove 7 times more harmful concepts from T2I models compared to baselines. SafetyDPO consistently outperforms the state-of-the-art on many benchmarks and establishes new practices for safety alignment in T2I networks. Code and data will be shared at this https URL.

Title: SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device

Authors: Yushu Wu, Zhixing Zhang, Yanyu Li, Yanwu Xu, Anil Kag, Yang Sui, Huseyin Coskun, Ke Ma, Aleksei Lebedev, Ju Hu, Dimitris Metaxas, Yanzhi Wang, Sergey Tulyakov, Jian Ren
Subjects: cs.CV, cs.AI, cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2412.10494
Pdf URL: https://arxiv.org/pdf/2412.10494
Copy Paste: [[2412.10494]] SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device(https://arxiv.org/abs/2412.10494)
Keywords: diffusion
Abstract: We have witnessed the unprecedented success of diffusion-based video generation over the past year. Recently proposed models from the community have wielded the power to generate cinematic and high-resolution videos with smooth motions from arbitrary input prompts. However, as a supertask of image generation, video generation models require more computation and are thus hosted mostly on cloud servers, limiting broader adoption among content creators. In this work, we propose a comprehensive acceleration framework to bring the power of the large-scale video diffusion model to the hands of edge users. From the network architecture scope, we initialize from a compact image backbone and search out the design and arrangement of temporal layers to maximize hardware efficiency. In addition, we propose a dedicated adversarial fine-tuning algorithm for our efficient model and reduce the denoising steps to 4. Our model, with only 0.6B parameters, can generate a 5-second video on an iPhone 16 PM within 5 seconds. Compared to server-side models that take minutes on powerful GPUs to generate a single video, we accelerate the generation by magnitudes while delivering on-par quality.

Title: Towards Using Machine Learning to Generatively Simulate EV Charging in Urban Areas

Authors: Marek Miltner, Jakub Zíka, Daniel Vašata, Artem Bryksa, Magda Friedjungová, Ondřej Štogl, Ram Rajagopal, Oldřich Starý
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.10531
Pdf URL: https://arxiv.org/pdf/2412.10531
Copy Paste: [[2412.10531]] Towards Using Machine Learning to Generatively Simulate EV Charging in Urban Areas(https://arxiv.org/abs/2412.10531)
Keywords: generative
Abstract: This study addresses the challenge of predicting electric vehicle (EV) charging profiles in urban locations with limited data. Utilizing a neural network architecture, we aim to uncover latent charging profiles influenced by spatio-temporal factors. Our model focuses on peak power demand and daily load shapes, providing insights into charging behavior. Our results indicate significant impacts from the type of Basic Administrative Units on predicted load curves, which contributes to the understanding and optimization of EV charging infrastructure in urban settings and allows Distribution System Operators (DSO) to more efficiently plan EV charging infrastructure expansion.

Title: Too Big to Fool: Resisting Deception in Language Models

Authors: Mohammad Reza Samsami, Mats Leon Richter, Juan Rodriguez, Megh Thakkar, Sarath Chandar, Maxime Gasse
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10558
Pdf URL: https://arxiv.org/pdf/2412.10558
Copy Paste: [[2412.10558]] Too Big to Fool: Resisting Deception in Language Models(https://arxiv.org/abs/2412.10558)
Keywords: in-context
Abstract: Large language models must balance their weight-encoded knowledge with in-context information from prompts to generate accurate responses. This paper investigates this interplay by analyzing how models of varying capacities within the same family handle intentionally misleading in-context information. Our experiments demonstrate that larger models exhibit higher resilience to deceptive prompts, showcasing an advanced ability to interpret and integrate prompt information with their internal knowledge. Furthermore, we find that larger models outperform smaller ones in following legitimate instructions, indicating that their resilience is not due to disregarding in-context information. We also show that this phenomenon is likely not a result of memorization but stems from the models' ability to better leverage implicit task-relevant information from the prompt alongside their internally stored knowledge.

Title: Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics

Authors: Sara Ghazanfari, Siddharth Garg, Nicolas Flammarion, Prashanth Krishnamurthy, Farshad Khorrami, Francesco Croce
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10594
Pdf URL: https://arxiv.org/pdf/2412.10594
Copy Paste: [[2412.10594]] Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics(https://arxiv.org/abs/2412.10594)
Keywords: generative
Abstract: Human perception of similarity across uni- and multimodal inputs is highly complex, making it challenging to develop automated metrics that accurately mimic it. General purpose vision-language models, such as CLIP and large multi-modal models (LMMs), can be applied as zero-shot perceptual metrics, and several recent works have developed models specialized in narrow perceptual tasks. However, the extent to which existing perceptual metrics align with human perception remains unclear. To investigate this question, we introduce UniSim-Bench, a benchmark encompassing 7 multi-modal perceptual similarity tasks, with a total of 25 datasets. Our evaluation reveals that while general-purpose models perform reasonably well on average, they often lag behind specialized models on individual tasks. Conversely, metrics fine-tuned for specific tasks fail to generalize well to unseen, though related, tasks. As a first step towards a unified multi-task perceptual similarity metric, we fine-tune both encoder-based and generative vision-language models on a subset of the UniSim-Bench tasks. This approach yields the highest average performance, and in some cases, even surpasses taskspecific models. Nevertheless, these models still struggle with generalization to unseen tasks, highlighting the ongoing challenge of learning a robust, unified perceptual similarity metric capable of capturing the human notion of similarity. The code and models are available at this https URL.

Title: EvalGIM: A Library for Evaluating Generative Image Models

Authors: Melissa Hall, Oscar Mañas, Reyhane Askari, Mark Ibrahim, Candace Ross, Pietro Astolfi, Tariq Berrada Ifriqi, Marton Havasi, Yohann Benchetrit, Karen Ullrich, Carolina Braga, Abhishek Charnalia, Maeve Ryan, Mike Rabbat, Michal Drozdzal, Jakob Verbeek, Adriana Romero Soriano
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10604
Pdf URL: https://arxiv.org/pdf/2412.10604
Copy Paste: [[2412.10604]] EvalGIM: A Library for Evaluating Generative Image Models(https://arxiv.org/abs/2412.10604)
Keywords: generative
Abstract: As the use of text-to-image generative models increases, so does the adoption of automatic benchmarking methods used in their evaluation. However, while metrics and datasets abound, there are few unified benchmarking libraries that provide a framework for performing evaluations across many datasets and metrics. Furthermore, the rapid introduction of increasingly robust benchmarking methods requires that evaluation libraries remain flexible to new datasets and metrics. Finally, there remains a gap in synthesizing evaluations in order to deliver actionable takeaways about model performance. To enable unified, flexible, and actionable evaluations, we introduce EvalGIM (pronounced ''EvalGym''), a library for evaluating generative image models. EvalGIM contains broad support for datasets and metrics used to measure quality, diversity, and consistency of text-to-image generative models. In addition, EvalGIM is designed with flexibility for user customization as a top priority and contains a structure that allows plug-and-play additions of new datasets and metrics. To enable actionable evaluation insights, we introduce ''Evaluation Exercises'' that highlight takeaways for specific evaluation questions. The Evaluation Exercises contain easy-to-use and reproducible implementations of two state-of-the-art evaluation methods of text-to-image generative models: consistency-diversity-realism Pareto Fronts and disaggregated measurements of performance disparities across groups. EvalGIM also contains Evaluation Exercises that introduce two new analysis methods for text-to-image generative models: robustness analyses of model rankings and balanced evaluations across different prompt styles. We encourage text-to-image model exploration with EvalGIM and invite contributions at this https URL.

Title: CATALOG: A Camera Trap Language-guided Contrastive Learning Model

Authors: Julian D. Santamaria, Claudia Isaza, Jhony H. Giraldo
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10624
Pdf URL: https://arxiv.org/pdf/2412.10624
Copy Paste: [[2412.10624]] CATALOG: A Camera Trap Language-guided Contrastive Learning Model(https://arxiv.org/abs/2412.10624)
Keywords: foundation model
Abstract: Foundation Models (FMs) have been successful in various computer vision tasks like image classification, object detection and image segmentation. However, these tasks remain challenging when these models are tested on datasets with different distributions from the training dataset, a problem known as domain shift. This is especially problematic for recognizing animal species in camera-trap images where we have variability in factors like lighting, camouflage and occlusions. In this paper, we propose the Camera Trap Language-guided Contrastive Learning (CATALOG) model to address these issues. Our approach combines multiple FMs to extract visual and textual features from camera-trap data and uses a contrastive loss function to train the model. We evaluate CATALOG on two benchmark datasets and show that it outperforms previous state-of-the-art methods in camera-trap image recognition, especially when the training and testing data have different animal species or come from different geographical areas. Our approach demonstrates the potential of using FMs in combination with multi-modal fusion and contrastive learning for addressing domain shifts in camera-trap image recognition. The code of CATALOG is publicly available at this https URL.

Title: Control of Overfitting with Physics

Authors: Sergei V. Kozyrev, Ilya A Lopatin, Alexander N Pechen
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2412.10716
Pdf URL: https://arxiv.org/pdf/2412.10716
Copy Paste: [[2412.10716]] Control of Overfitting with Physics(https://arxiv.org/abs/2412.10716)
Keywords: generative
Abstract: While there are many works on the applications of machine learning, not so many of them are trying to understand the theoretical justifications to explain their efficiency. In this work, overfitting control (or generalization property) in machine learning is explained using analogies from physics and biology. For stochastic gradient Langevin dynamics, we show that the Eyring formula of kinetic theory allows to control overfitting in the algorithmic stability approach - when wide minima of the risk function with low free energy correspond to low overfitting. For the generative adversarial network (GAN) model, we establish an analogy between GAN and the predator-prey model in biology. An application of this analogy allows us to explain the selection of wide likelihood maxima and overfitting reduction for GANs.

Title: Diagnosing Unknown Attacks in Smart Homes Using Abductive Reasoning

Authors: Kushal Ramkumar, Wanling Cai, John McCarthy, Gavin Doherty, Bashar Nuseibeh, Liliana Pasquale
Subjects: cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2412.10738
Pdf URL: https://arxiv.org/pdf/2412.10738
Copy Paste: [[2412.10738]] Diagnosing Unknown Attacks in Smart Homes Using Abductive Reasoning(https://arxiv.org/abs/2412.10738)
Keywords: anomaly
Abstract: Security attacks are rising, as evidenced by the number of reported vulnerabilities. Among them, unknown attacks, including new variants of existing attacks, technical blind spots or previously undiscovered attacks, challenge enduring security. This is due to the limited number of techniques that diagnose these attacks and enable the selection of adequate security controls. In this paper, we propose an automated technique that detects and diagnoses unknown attacks by identifying the class of attack and the violated security requirements, enabling the selection of adequate security controls. Our technique combines anomaly detection to detect unknown attacks with abductive reasoning to diagnose them. We first model the behaviour of the smart home and its requirements as a logic program in Answer Set Programming (ASP). We then apply Z-Score thresholding to the anomaly scores of an Isolation Forest trained using unlabeled data to simulate unknown attack scenarios. Finally, we encode the network anomaly in the logic program and perform abduction by refutation to identify the class of attack and the security requirements that this anomaly may violate. We demonstrate our technique using a smart home scenario, where we detect and diagnose anomalies in network traffic. We evaluate the precision, recall and F1-score of the anomaly detector and the diagnosis technique against 18 attacks from the ground truth labels provided by two datasets, CICIoT2023 and IoT-23. Our experiments show that the anomaly detector effectively identifies anomalies when the network traces are strong indicators of an attack. When provided with sufficient contextual data, the diagnosis logic effectively identifies true anomalies, and reduces the number of false positives reported by anomaly detectors. Finally, we discuss how our technique can support the selection of adequate security controls.

Title: NeuralPLexer3: Physio-Realistic Biomolecular Complex Structure Prediction with Flow Models

Authors: Zhuoran Qiao, Feizhi Ding, Thomas Dresselhaus, Mia A. Rosenfeld, Xiaotian Han, Owen Howell, Aniketh Iyengar, Stephen Opalenski, Anders S. Christensen, Sai Krishna Sirumalla, Frederick R. Manby, Thomas F. Miller III, Matthew Welborn
Subjects: cs.LG, physics.chem-ph, q-bio.BM
Abstract URL: https://arxiv.org/abs/2412.10743
Pdf URL: https://arxiv.org/pdf/2412.10743
Copy Paste: [[2412.10743]] NeuralPLexer3: Physio-Realistic Biomolecular Complex Structure Prediction with Flow Models(https://arxiv.org/abs/2412.10743)
Keywords: generative
Abstract: Structure determination is essential to a mechanistic understanding of diseases and the development of novel therapeutics. Machine-learning-based structure prediction methods have made significant advancements by computationally predicting protein and bioassembly structures from sequences and molecular topology alone. Despite substantial progress in the field, challenges remain to deliver structure prediction models to real-world drug discovery. Here, we present NeuralPLexer3 -- a physics-inspired flow-based generative model that achieves state-of-the-art prediction accuracy on key biomolecular interaction types and improves training and sampling efficiency compared to its predecessors and alternative methodologies. Examined through newly developed benchmarking strategies, NeuralPLexer3 excels in vital areas that are crucial to structure-based drug design, such as physical validity and ligand-induced conformational changes.

Title: Neural Network Meta Classifier: Improving the Reliability of Anomaly Segmentation

Authors: Jurica Runtas, Tomislav Petkovic
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10765
Pdf URL: https://arxiv.org/pdf/2412.10765
Copy Paste: [[2412.10765]] Neural Network Meta Classifier: Improving the Reliability of Anomaly Segmentation(https://arxiv.org/abs/2412.10765)
Keywords: anomaly
Abstract: Deep neural networks (DNNs) are a contemporary solution for semantic segmentation and are usually trained to operate on a predefined closed set of classes. In open-set environments, it is possible to encounter semantically unknown objects or anomalies. Road driving is an example of such an environment in which, from a safety standpoint, it is important to ensure that a DNN indicates it is operating outside of its learned semantic domain. One possible approach to anomaly segmentation is entropy maximization, which is paired with a logistic regression based post-processing step called meta classification, which is in turn used to improve the reliability of detection of anomalous pixels. We propose to substitute the logistic regression meta classifier with a more expressive lightweight fully connected neural network. We analyze advantages and drawbacks of the proposed neural network meta classifier and demonstrate its better performance over logistic regression. We also introduce the concept of informative out-of-distribution examples which we show to improve training results when using entropy maximization in practice. Finally, we discuss the loss of interpretability and show that the behavior of logistic regression and neural network is strongly correlated.

Title: Sample-efficient Unsupervised Policy Cloning from Ensemble Self-supervised Labeled Videos

Authors: Xin Liu, Yaran Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10778
Pdf URL: https://arxiv.org/pdf/2412.10778
Copy Paste: [[2412.10778]] Sample-efficient Unsupervised Policy Cloning from Ensemble Self-supervised Labeled Videos(https://arxiv.org/abs/2412.10778)
Keywords: self-supervised
Abstract: Current advanced policy learning methodologies have demonstrated the ability to develop expert-level strategies when provided enough information. However, their requirements, including task-specific rewards, expert-labeled trajectories, and huge environmental interactions, can be expensive or even unavailable in many scenarios. In contrast, humans can efficiently acquire skills within a few trials and errors by imitating easily accessible internet video, in the absence of any other supervision. In this paper, we try to let machines replicate this efficient watching-and-learning process through Unsupervised Policy from Ensemble Self-supervised labeled Videos (UPESV), a novel framework to efficiently learn policies from videos without any other expert supervision. UPESV trains a video labeling model to infer the expert actions in expert videos, through several organically combined self-supervised tasks. Each task performs its own duties, and they together enable the model to make full use of both expert videos and reward-free interactions for advanced dynamics understanding and robust prediction. Simultaneously, UPESV clones a policy from the labeled expert videos, in turn collecting environmental interactions for self-supervised tasks. After a sample-efficient and unsupervised (i.e., reward-free) training process, an advanced video-imitated policy is obtained. Extensive experiments in sixteen challenging procedurally-generated environments demonstrate that the proposed UPESV achieves state-of-the-art few-shot policy learning (outperforming five current advanced baselines on 12/16 tasks) without exposure to any other supervision except videos. Detailed analysis is also provided, verifying the necessity of each self-supervised task employed in UPESV.

Title: Video Diffusion Transformers are In-Context Learners

Authors: Zhengcong Fei, Di Qiu, Changqian Yu, Debang Li, Mingyuan Fan, Xiang Wen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10783
Pdf URL: https://arxiv.org/pdf/2412.10783
Copy Paste: [[2412.10783]] Video Diffusion Transformers are In-Context Learners(https://arxiv.org/abs/2412.10783)
Keywords: diffusion, in-context
Abstract: This paper investigates a solution for enabling in-context capabilities of video diffusion transformers, with minimal tuning required for activation. Specifically, we propose a simple pipeline to leverage in-context generation: ($\textbf{i}$) concatenate videos along spacial or time dimension, ($\textbf{ii}$) jointly caption multi-scene video clips from one source, and ($\textbf{iii}$) apply task-specific fine-tuning using carefully curated small datasets. Through a series of diverse controllable tasks, we demonstrate qualitatively that existing advanced text-to-video models can effectively perform in-context generation. Notably, it allows for the creation of consistent multi-scene videos exceeding 30 seconds in duration, without additional computational overhead. Importantly, this method requires no modifications to the original models, results in high-fidelity video outputs that better align with prompt specifications and maintain role consistency. Our framework presents a valuable tool for the research community and offers critical insights for advancing product-level controllable video generation systems. The data, code, and model weights are publicly available at: \url{this https URL}.

Title: StyleDiT: A Unified Framework for Diverse Child and Partner Faces Synthesis with Style Latent Diffusion Transformer

Authors: Pin-Yen Chiu, Dai-Jie Wu, Po-Hsun Chu, Chia-Hsuan Hsu, Hsiang-Chen Chiu, Chih-Yu Wang, Jun-Cheng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10785
Pdf URL: https://arxiv.org/pdf/2412.10785
Copy Paste: [[2412.10785]] StyleDiT: A Unified Framework for Diverse Child and Partner Faces Synthesis with Style Latent Diffusion Transformer(https://arxiv.org/abs/2412.10785)
Keywords: diffusion
Abstract: Kinship face synthesis is a challenging problem due to the scarcity and low quality of the available kinship data. Existing methods often struggle to generate descendants with both high diversity and fidelity while precisely controlling facial attributes such as age and gender. To address these issues, we propose the Style Latent Diffusion Transformer (StyleDiT), a novel framework that integrates the strengths of StyleGAN with the diffusion model to generate high-quality and diverse kinship faces. In this framework, the rich facial priors of StyleGAN enable fine-grained attribute control, while our conditional diffusion model is used to sample a StyleGAN latent aligned with the kinship relationship of conditioning images by utilizing the advantage of modeling complex kinship relationship distribution. StyleGAN then handles latent decoding for final face generation. Additionally, we introduce the Relational Trait Guidance (RTG) mechanism, enabling independent control of influencing conditions, such as each parent's facial image. RTG also enables a fine-grained adjustment between the diversity and fidelity in synthesized faces. Furthermore, we extend the application to an unexplored domain: predicting a partner's facial images using a child's image and one parent's image within the same framework. Extensive experiments demonstrate that our StyleDiT outperforms existing methods by striking an excellent balance between generating diverse and high-fidelity kinship faces.

Title: Optimizing Few-Step Sampler for Diffusion Probabilistic Model

Authors: Jen-Yuan Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10786
Pdf URL: https://arxiv.org/pdf/2412.10786
Copy Paste: [[2412.10786]] Optimizing Few-Step Sampler for Diffusion Probabilistic Model(https://arxiv.org/abs/2412.10786)
Keywords: diffusion
Abstract: Diffusion Probabilistic Models (DPMs) have demonstrated exceptional capability of generating high-quality and diverse images, but their practical application is hindered by the intensive computational cost during inference. The DPM generation process requires solving a Probability-Flow Ordinary Differential Equation (PF-ODE), which involves discretizing the integration domain into intervals for numerical approximation. This corresponds to the sampling schedule of a diffusion ODE solver, and we notice the solution from a first-order solver can be expressed as a convex combination of model outputs at all scheduled time-steps. We derive an upper bound for the discretization error of the sampling schedule, which can be efficiently optimized with Monte-Carlo estimation. Building on these theoretical results, we purpose a two-phase alternating optimization algorithm. In Phase-1, the sampling schedule is optimized for the pre-trained DPM; in Phase-2, the DPM further tuned on the selected time-steps. Experiments on a pre-trained DPM for ImageNet64 dataset demonstrate the purposed method consistently improves the baseline across various number of sampling steps.

Title: Diffusion-based Method for Satellite Pattern-of-Life Identification

Authors: Yongchao Ye, Xinting Zhu, Xuejin Shen, Xiaoyu Chen, Lishuai Li, S. Joe Qin
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2412.10814
Pdf URL: https://arxiv.org/pdf/2412.10814
Copy Paste: [[2412.10814]] Diffusion-based Method for Satellite Pattern-of-Life Identification(https://arxiv.org/abs/2412.10814)
Keywords: diffusion
Abstract: Satellite pattern-of-life (PoL) identification is crucial for space safety and satellite monitoring, involving the analysis of typical satellite behaviors such as station-keeping, drift, etc. However, existing PoL identification methods remain underdeveloped due to the complexity of aerospace systems, variability in satellite behaviors, and fluctuating observation sampling rates. In a first attempt, we developed a domain expertise-informed machine learning method (Expert-ML) to combine satellite orbital movement knowledge and machine learning models. The Expert-ML method achieved high accuracy results in simulation data and real-world data with normal sampling rate. However, this approach lacks of generality as it requires domain expertise and its performance degraded significantly when data sampling rate varied. To achieve generality, we propose a novel diffusion-based PoL identification method. Distinct from prior approaches, the proposed method leverages a diffusion model to achieve end-to-end identification without manual refinement or domain-specific knowledge. Specifically, we employ a multivariate time-series encoder to capture hidden representations of satellite positional data. The encoded features are subsequently incorporated as conditional information in the denoising process to generate PoL labels. Through experimentation across real-world satellite settings, our proposed diffusion-based method demonstrates its high identification quality and provides a robust solution even with reduced data sampling rates, indicating its great potential in practical satellite behavior pattern identification, tracking and related mission deployment.

Title: Diffusion Model from Scratch

Authors: Wang Zhen, Dong Yunyun
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10824
Pdf URL: https://arxiv.org/pdf/2412.10824
Copy Paste: [[2412.10824]] Diffusion Model from Scratch(https://arxiv.org/abs/2412.10824)
Keywords: diffusion, generative
Abstract: Diffusion generative models are currently the most popular generative models. However, their underlying modeling process is quite complex, and starting directly with the seminal paper Denoising Diffusion Probability Model (DDPM) can be challenging. This paper aims to assist readers in building a foundational understanding of generative models by tracing the evolution from VAEs to DDPM through detailed mathematical derivations and a problem-oriented analytical approach. It also explores the core ideas and improvement strategies of current mainstream methodologies, providing guidance for undergraduate and graduate students interested in learning about diffusion models.

Title: Unbiased General Annotated Dataset Generation

Authors: Dengyang Jiang, Haoyu Wang, Lei Zhang, Wei Wei, Guang Dai, Mengmeng Wang, Jingdong Wang, Yanning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10831
Pdf URL: https://arxiv.org/pdf/2412.10831
Copy Paste: [[2412.10831]] Unbiased General Annotated Dataset Generation(https://arxiv.org/abs/2412.10831)
Keywords: diffusion, foundation model
Abstract: Pre-training backbone networks on a general annotated dataset (e.g., ImageNet) that comprises numerous manually collected images with category annotations has proven to be indispensable for enhancing the generalization capacity of downstream visual tasks. However, those manually collected images often exhibit bias, which is non-transferable across either categories or domains, thus causing the model's generalization capacity degeneration. To mitigate this problem, we present an unbiased general annotated dataset generation framework (ubGen). Instead of expensive manual collection, we aim at directly generating unbiased images with category annotations. To achieve this goal, we propose to leverage the advantage of a multimodal foundation model (e.g., CLIP), in terms of aligning images in an unbiased semantic space defined by language. Specifically, we develop a bi-level semantic alignment loss, which not only forces all generated images to be consistent with the semantic distribution of all categories belonging to the target dataset in an adversarial learning manner, but also requires each generated image to match the semantic description of its category name. In addition, we further cast an existing image quality scoring model into a quality assurance loss to preserve the quality of the generated image. By leveraging these two loss functions, we can obtain an unbiased image generation model by simply fine-tuning a pre-trained diffusion model using only all category names in the target dataset as input. Experimental results confirm that, compared with the manually labeled dataset or other synthetic datasets, the utilization of our generated unbiased datasets leads to stable generalization capacity enhancement of different backbone networks across various tasks, especially in tasks where the manually labeled samples are scarce.

Title: Zigzag Diffusion Sampling: The Path to Success Is Zigzag

Authors: Lichen Bai, Shitong Shao, Zikai Zhou, Zipeng Qi, Zhiqiang Xu, Haoyi Xiong, Zeke Xie
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10891
Pdf URL: https://arxiv.org/pdf/2412.10891
Copy Paste: [[2412.10891]] Zigzag Diffusion Sampling: The Path to Success Is Zigzag(https://arxiv.org/abs/2412.10891)
Keywords: diffusion, generative
Abstract: Diffusion models, the most popular generative paradigm so far, can inject conditional information into the generation path to guide the latent towards desired directions. However, existing text-to-image diffusion models often fail to maintain high image quality and high prompt-image alignment for those challenging prompts. To mitigate this issue and enhance existing pretrained diffusion models, we mainly made three contributions in this paper. First, we theoretically and empirically demonstrate that the conditional guidance gap between the denoising and inversion processes captures prompt-related semantic information. Second, motivated by theoretical analysis, we derive Zigzag Diffusion Sampling (Z-Sampling), a novel sampling method that leverages the guidance gap to accumulate semantic information step-by-step throughout the entire generation process, leading to improved sampling results. Moreover, as a plug-and-play method, Z-Sampling can be generally applied to various diffusion models (e.g., accelerated ones and Transformer-based ones) with very limited coding and computational costs. Third, our extensive experiments demonstrate that Z-Sampling can generally and significantly enhance generation quality across various benchmark datasets, diffusion models, and performance evaluation metrics. For example, Z-Sampling can even make DreamShaper achieve the HPSv2 winning rate higher than 94% over the original results. Moreover, Z-Sampling can further enhance existing diffusion models combined with other orthogonal methods, including Diffusion-DPO.

Title: Know Unreported Roadway Incidents in Real-time: A Deep Learning Framework for Early Traffic Anomaly Detection

Authors: Haocheng Duan, Hao Wu, Sean Qian
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10892
Pdf URL: https://arxiv.org/pdf/2412.10892
Copy Paste: [[2412.10892]] Know Unreported Roadway Incidents in Real-time: A Deep Learning Framework for Early Traffic Anomaly Detection(https://arxiv.org/abs/2412.10892)
Keywords: anomaly
Abstract: Conventional automatic incident detection (AID) has relied heavily on all incident reports exclusively for training and evaluation. However, these reports suffer from a number of issues, such as delayed reports, inaccurate descriptions, false alarms, missing reports, and incidents that do not necessarily influence traffic. Relying on these reports to train or calibrate AID models hinders their ability to detect traffic anomalies effectively and timely, even leading to convergence issues in the model training process. Moreover, conventional AID models are not inherently designed to capture the early indicators of any generic incidents. It remains unclear how far ahead an AID model can report incidents. The AID applications in the literature are also spatially limited because the data used by most models is often limited to specific test road segments. To solve these problems, we propose a deep learning framework utilizing prior domain knowledge and model-designing strategies. This allows the model to detect a broader range of anomalies, not only incidents that significantly influence traffic flow but also early characteristics of incidents along with historically unreported anomalies. We specially design the model to target the early-stage detection/prediction of an incident. Additionally, unlike most conventional AID studies, we use widely available data, enhancing our method's scalability. The experimental results across numerous road segments on different maps demonstrate that our model leads to more effective and early anomaly detection. Our framework does not focus on stacking or tweaking various deep learning models; instead, it focuses on model design and training strategies to improve early detection performance.

Title: Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning

Authors: Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer, Alejandro J. Ruiz, Calla Beauregard, Ashley Fehr, Mikaela Irene Fudolig, Bradford Demarest, Yoshi Meke Bird, Milo Z. Trujillo, Christopher M. Danforth, Peter Sheridan Dodds
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10924
Pdf URL: https://arxiv.org/pdf/2412.10924
Copy Paste: [[2412.10924]] Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning(https://arxiv.org/abs/2412.10924)
Keywords: generative
Abstract: Tokenization is a necessary component within the current architecture of many language models, including the transformer-based large language models (LLMs) of Generative AI, yet its impact on the model's cognition is often overlooked. We argue that LLMs demonstrate that the Distributional Hypothesis (DM) is sufficient for reasonably human-like language performance, and that the emergence of human-meaningful linguistic units among tokens motivates linguistically-informed interventions in existing, linguistically-agnostic tokenization techniques, particularly with respect to their roles as (1) semantic primitives and as (2) vehicles for conveying salient distributional patterns from human language to the model. We explore tokenizations from a BPE tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken; and the information in exemplar token vectors as they move through the layers of a RoBERTa (large) model. Besides creating sub-optimal semantic building blocks and obscuring the model's access to the necessary distributional patterns, we describe how tokenization pretraining can be a backdoor for bias and other unwanted content, which current alignment practices may not remediate. Additionally, we relay evidence that the tokenization algorithm's objective function impacts the LLM's cognition, despite being meaningfully insulated from the main system intelligence.

Title: Video Representation Learning with Joint-Embedding Predictive Architectures

Authors: Katrina Drozdov, Ravid Shwartz-Ziv, Yann LeCun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.10925
Pdf URL: https://arxiv.org/pdf/2412.10925
Copy Paste: [[2412.10925]] Video Representation Learning with Joint-Embedding Predictive Architectures(https://arxiv.org/abs/2412.10925)
Keywords: self-supervised, generative
Abstract: Video representation learning is an increasingly important topic in machine learning research. We present Video JEPA with Variance-Covariance Regularization (VJ-VCR): a joint-embedding predictive architecture for self-supervised video representation learning that employs variance and covariance regularization to avoid representation collapse. We show that hidden representations from our VJ-VCR contain abstract, high-level information about the input data. Specifically, they outperform representations obtained from a generative baseline on downstream tasks that require understanding of the underlying dynamics of moving objects in the videos. Additionally, we explore different ways to incorporate latent variables into the VJ-VCR framework that capture information about uncertainty in the future in non-deterministic settings.

Title: Progressive Compression with Universally Quantized Diffusion Models

Authors: Yibo Yang, Justus C. Will, Stephan Mandt
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.10935
Pdf URL: https://arxiv.org/pdf/2412.10935
Copy Paste: [[2412.10935]] Progressive Compression with Universally Quantized Diffusion Models(https://arxiv.org/abs/2412.10935)
Keywords: diffusion, generative
Abstract: Diffusion probabilistic models have achieved mainstream success in many generative modeling tasks, from image generation to inverse problem solving. A distinct feature of these models is that they correspond to deep hierarchical latent variable models optimizing a variational evidence lower bound (ELBO) on the data likelihood. Drawing on a basic connection between likelihood modeling and compression, we explore the potential of diffusion models for progressive coding, resulting in a sequence of bits that can be incrementally transmitted and decoded with progressively improving reconstruction quality. Unlike prior work based on Gaussian diffusion or conditional diffusion models, we propose a new form of diffusion model with uniform noise in the forward process, whose negative ELBO corresponds to the end-to-end compression cost using universal quantization. We obtain promising first results on image compression, achieving competitive rate-distortion and rate-realism results on a wide range of bit-rates with a single model, bringing neural codecs a step closer to practical deployment.

Title: SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer

Authors: Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, Emad Barsoum
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.10958
Pdf URL: https://arxiv.org/pdf/2412.10958
Copy Paste: [[2412.10958]] SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer(https://arxiv.org/abs/2412.10958)
Keywords: generative
Abstract: Efficient image tokenization with high compression ratios remains a critical challenge for training generative models. We present SoftVQ-VAE, a continuous image tokenizer that leverages soft categorical posteriors to aggregate multiple codewords into each latent token, substantially increasing the representation capacity of the latent space. When applied to Transformer-based architectures, our approach compresses 256x256 and 512x512 images using as few as 32 or 64 1-dimensional tokens. Not only does SoftVQ-VAE show consistent and high-quality reconstruction, more importantly, it also achieves state-of-the-art and significantly faster image generation results across different denoising-based generative models. Remarkably, SoftVQ-VAE improves inference throughput by up to 18x for generating 256x256 images and 55x for 512x512 images while achieving competitive FID scores of 1.78 and 2.21 for SiT-XL. It also improves the training efficiency of the generative models by reducing the number of training iterations by 2.3x while maintaining comparable performance. With its fully-differentiable design and semantic-rich latent space, our experiment demonstrates that SoftVQ-VQE achieves efficient tokenization without compromising generation quality, paving the way for more efficient generative models. Code and model are released.

Title: Can LLMs Help Create Grammar?: Automating Grammar Creation for Endangered Languages with In-Context Learning

Authors: Piyapath T Spencer, Nanthipat Kongborrirak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.10960
Pdf URL: https://arxiv.org/pdf/2412.10960
Copy Paste: [[2412.10960]] Can LLMs Help Create Grammar?: Automating Grammar Creation for Endangered Languages with In-Context Learning(https://arxiv.org/abs/2412.10960)
Keywords: in-context
Abstract: Yes! In the present-day documenting and preserving endangered languages, the application of Large Language Models (LLMs) presents a promising approach. This paper explores how LLMs, particularly through in-context learning, can assist in generating grammatical information for low-resource languages with limited amount of data. We takes Moklen as a case study to evaluate the efficacy of LLMs in producing coherent grammatical rules and lexical entries using only bilingual dictionaries and parallel sentences of the unknown language without building the model from scratch. Our methodology involves organising the existing linguistic data and prompting to efficiently enable to generate formal XLE grammar. Our results demonstrate that LLMs can successfully capture key grammatical structures and lexical information, although challenges such as the potential for English grammatical biases remain. This study highlights the potential of LLMs to enhance language documentation efforts, providing a cost-effective solution for generating linguistic data and contributing to the preservation of endangered languages.

Title: FlowDock: Geometric Flow Matching for Generative Protein-Ligand Docking and Affinity Prediction

Authors: Alex Morehead, Jianlin Cheng
Subjects: cs.LG, cs.AI, q-bio.BM, q-bio.QM
Abstract URL: https://arxiv.org/abs/2412.10966
Pdf URL: https://arxiv.org/pdf/2412.10966
Copy Paste: [[2412.10966]] FlowDock: Geometric Flow Matching for Generative Protein-Ligand Docking and Affinity Prediction(https://arxiv.org/abs/2412.10966)
Keywords: generative
Abstract: Powerful generative models of protein-ligand structure have recently been proposed, but few of these methods support both flexible protein-ligand docking and affinity estimation. Of those that do, none can directly model multiple binding ligands concurrently or have been rigorously benchmarked on pharmacologically relevant drug targets, hindering their widespread adoption in drug discovery efforts. In this work, we propose FlowDock, a deep geometric generative model based on conditional flow matching that learns to directly map unbound (apo) structures to their bound (holo) counterparts for an arbitrary number of binding ligands. Furthermore, FlowDock provides predicted structural confidence scores and binding affinity values with each of its generated protein-ligand complex structures, enabling fast virtual screening of new (multi-ligand) drug targets. For the commonly-used PoseBusters Benchmark dataset, FlowDock achieves a 51% blind docking success rate using unbound (apo) protein input structures and without any information derived from multiple sequence alignments, and for the challenging new DockGen-E dataset, FlowDock matches the performance of single-sequence Chai-1 for binding pocket generalization. Additionally, in the ligand category of the 16th community-wide Critical Assessment of Techniques for Structure Prediction (CASP16), FlowDock ranked among the top-5 methods for pharmacological binding affinity estimation across 140 protein-ligand complexes, demonstrating the efficacy of its learned representations in virtual screening. Source code, data, and pre-trained models are available at this https URL.

Title: DCSEG: Decoupled 3D Open-Set Segmentation using Gaussian Splatting

Authors: Luis Wiedmann, Luca Wiehe, David Rozenberszki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.10972
Pdf URL: https://arxiv.org/pdf/2412.10972
Copy Paste: [[2412.10972]] DCSEG: Decoupled 3D Open-Set Segmentation using Gaussian Splatting(https://arxiv.org/abs/2412.10972)
Keywords: foundation model
Abstract: Open-set 3D segmentation represents a major point of interest for multiple downstream robotics and augmented/virtual reality applications. Recent advances introduce 3D Gaussian Splatting as a computationally efficient representation of the underlying scene. They enable the rendering of novel views while achieving real-time display rates and matching the quality of computationally far more expensive methods. We present a decoupled 3D segmentation pipeline to ensure modularity and adaptability to novel 3D representations and semantic segmentation foundation models. The pipeline proposes class-agnostic masks based on a 3D reconstruction of the scene. Given the resulting class-agnostic masks, we use a class-aware 2D foundation model to add class annotations to the 3D masks. We test this pipeline with 3D Gaussian Splatting and different 2D segmentation models and achieve better performance than more tailored approaches while also significantly increasing the modularity.

Title: PromptV: Leveraging LLM-powered Multi-Agent Prompting for High-quality Verilog Generation

Authors: Zhendong Mi, Renming Zheng, Haowen Zhong, Yue Sun, Shaoyi Huang
Subjects: cs.LG, cs.AI, cs.AR, cs.SE
Abstract URL: https://arxiv.org/abs/2412.11014
Pdf URL: https://arxiv.org/pdf/2412.11014
Copy Paste: [[2412.11014]] PromptV: Leveraging LLM-powered Multi-Agent Prompting for High-quality Verilog Generation(https://arxiv.org/abs/2412.11014)
Keywords: generative
Abstract: Recent advances in agentic LLMs have demonstrated remarkable automated Verilog code generation capabilities. However, existing approaches either demand substantial computational resources or rely on LLM-assisted single-agent prompt learning techniques, which we observe for the first time has a degeneration issue - characterized by deteriorating generative performance and diminished error detection and correction capabilities. This paper proposes a novel multi-agent prompt learning framework to address these limitations and enhance code generation quality. We show for the first time that multi-agent architectures can effectively mitigate the degeneration risk while improving code error correction capabilities, resulting in higher-quality Verilog code generation. Experimental results show that the proposed method could achieve 96.4% and 96.5% pass@10 scores on VerilogEval Machine and Human benchmarks, respectively while attaining 100% Syntax and 99.9% Functionality pass@5 metrics on the RTLLM benchmark.

Title: Exploring Diffusion and Flow Matching Under Generator Matching

Authors: Zeeshan Patel, James DeLoye, Lance Mathias
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2412.11024
Pdf URL: https://arxiv.org/pdf/2412.11024
Copy Paste: [[2412.11024]] Exploring Diffusion and Flow Matching Under Generator Matching(https://arxiv.org/abs/2412.11024)
Keywords: diffusion, generative
Abstract: In this paper, we present a comprehensive theoretical comparison of diffusion and flow matching under the Generator Matching framework. Despite their apparent differences, both diffusion and flow matching can be viewed under the unified framework of Generator Matching. By recasting both diffusion and flow matching under the same generative Markov framework, we provide theoretical insights into why flow matching models can be more robust empirically and how novel model classes can be constructed by mixing deterministic and stochastic components. Our analysis offers a fresh perspective on the relationships between state-of-the-art generative modeling paradigms.

Title: AURORA: Automated Unleash of 3D Room Outlines for VR Applications

Authors: Huijun Han, Yongqing Liang, Yuanlong Zhou, Wenping Wang, Edgar J. Rojas-Munoz, Xin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11033
Pdf URL: https://arxiv.org/pdf/2412.11033
Copy Paste: [[2412.11033]] AURORA: Automated Unleash of 3D Room Outlines for VR Applications(https://arxiv.org/abs/2412.11033)
Keywords: foundation model
Abstract: Creating realistic VR experiences is challenging due to the labor-intensive process of accurately replicating real-world details into virtual scenes, highlighting the need for automated methods that maintain spatial accuracy and provide design flexibility. In this paper, we propose AURORA, a novel method that leverages RGB-D images to automatically generate both purely virtual reality (VR) scenes and VR scenes combined with real-world elements. This approach can benefit designers by streamlining the process of converting real-world details into virtual scenes. AURORA integrates advanced techniques in image processing, segmentation, and 3D reconstruction to efficiently create realistic and detailed interior designs from real-world environments. The design of this integration ensures optimal performance and precision, addressing key challenges in automated indoor design generation by uniquely combining and leveraging the strengths of foundation models. We demonstrate the effectiveness of our approach through experiments, both on self-captured data and public datasets, showcasing its potential to enhance virtual reality (VR) applications by providing interior designs that conform to real-world positioning.

Title: Semantic Steganography: A Framework for Robust and High-Capacity Information Hiding using Large Language Models

Authors: Minhao Bai, Jinshuai Yang, Kaiyi Pang, Yongfeng Huang, Yue Gao
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.11043
Pdf URL: https://arxiv.org/pdf/2412.11043
Copy Paste: [[2412.11043]] Semantic Steganography: A Framework for Robust and High-Capacity Information Hiding using Large Language Models(https://arxiv.org/abs/2412.11043)
Keywords: generative
Abstract: In the era of Large Language Models (LLMs), generative linguistic steganography has become a prevalent technique for hiding information within model-generated texts. However, traditional steganography methods struggle to effectively align steganographic texts with original model-generated texts due to the lower entropy of the predicted probability distribution of LLMs. This results in a decrease in embedding capacity and poses challenges for decoding stegos in real-world communication channels. To address these challenges, we propose a semantic steganography framework based on LLMs, which construct a semantic space and map secret messages onto this space using ontology-entity trees. This framework offers robustness and reliability for transmission in complex channels, as well as resistance to text rendering and word blocking. Additionally, the stegos generated by our framework are indistinguishable from the covers and achieve a higher embedding capacity compared to state-of-the-art steganography methods, while producing higher quality stegos.

Title: Understanding and Mitigating Memorization in Diffusion Models for Tabular Data

Authors: Zhengyu Fang, Zhimeng Jiang, Huiyuan Chen, Xiao Li, Jing Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.11044
Pdf URL: https://arxiv.org/pdf/2412.11044
Copy Paste: [[2412.11044]] Understanding and Mitigating Memorization in Diffusion Models for Tabular Data(https://arxiv.org/abs/2412.11044)
Keywords: diffusion
Abstract: Tabular data generation has attracted significant research interest in recent years, with the tabular diffusion models greatly improving the quality of synthetic data. However, while memorization, where models inadvertently replicate exact or near-identical training data, has been thoroughly investigated in image and text generation, its effects on tabular data remain largely unexplored. In this paper, we conduct the first comprehensive investigation of memorization phenomena in diffusion models for tabular data. Our empirical analysis reveals that memorization appears in tabular diffusion models and increases with larger training epochs. We further examine the influence of factors such as dataset sizes, feature dimensions, and different diffusion models on memorization. Additionally, we provide a theoretical explanation for why memorization occurs in tabular diffusion models. To address this issue, we propose TabCutMix, a simple yet effective data augmentation technique that exchanges randomly selected feature segments between random same-class training sample pairs. Building upon this, we introduce TabCutMixPlus, an enhanced method that clusters features based on feature correlations and ensures that features within the same cluster are exchanged together during augmentation. This clustering mechanism mitigates out-of-distribution (OOD) generation issues by maintaining feature coherence. Experimental results across various datasets and diffusion models demonstrate that TabCutMix effectively mitigates memorization while maintaining high-quality data generation.

Title: DisCo-DSO: Coupling Discrete and Continuous Optimization for Efficient Generative Design in Hybrid Spaces

Authors: Jacob F. Pettit, Chak Shing Lee, Jiachen Yang, Alex Ho, Daniel Faissol, Brenden Petersen, Mikel Landajuela
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2412.11051
Pdf URL: https://arxiv.org/pdf/2412.11051
Copy Paste: [[2412.11051]] DisCo-DSO: Coupling Discrete and Continuous Optimization for Efficient Generative Design in Hybrid Spaces(https://arxiv.org/abs/2412.11051)
Keywords: generative
Abstract: We consider the challenge of black-box optimization within hybrid discrete-continuous and variable-length spaces, a problem that arises in various applications, such as decision tree learning and symbolic regression. We propose DisCo-DSO (Discrete-Continuous Deep Symbolic Optimization), a novel approach that uses a generative model to learn a joint distribution over discrete and continuous design variables to sample new hybrid designs. In contrast to standard decoupled approaches, in which the discrete and continuous variables are optimized separately, our joint optimization approach uses fewer objective function evaluations, is robust against non-differentiable objectives, and learns from prior samples to guide the search, leading to significant improvement in performance and sample efficiency. Our experiments on a diverse set of optimization tasks demonstrate that the advantages of DisCo-DSO become increasingly evident as the complexity of the problem increases. In particular, we illustrate DisCo-DSO's superiority over the state-of-the-art methods for interpretable reinforcement learning with decision trees.

Title: SHMT: Self-supervised Hierarchical Makeup Transfer via Latent Diffusion Models

Authors: Zhaoyang Sun, Shengwu Xiong, Yaxiong Chen, Fei Du, Weihua Chen, Fan Wang, Yi Rong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11058
Pdf URL: https://arxiv.org/pdf/2412.11058
Copy Paste: [[2412.11058]] SHMT: Self-supervised Hierarchical Makeup Transfer via Latent Diffusion Models(https://arxiv.org/abs/2412.11058)
Keywords: diffusion, self-supervised
Abstract: This paper studies the challenging task of makeup transfer, which aims to apply diverse makeup styles precisely and naturally to a given facial image. Due to the absence of paired data, current methods typically synthesize sub-optimal pseudo ground truths to guide the model training, resulting in low makeup fidelity. Additionally, different makeup styles generally have varying effects on the person face, but existing methods struggle to deal with this diversity. To address these issues, we propose a novel Self-supervised Hierarchical Makeup Transfer (SHMT) method via latent diffusion models. Following a "decoupling-and-reconstruction" paradigm, SHMT works in a self-supervised manner, freeing itself from the misguidance of imprecise pseudo-paired data. Furthermore, to accommodate a variety of makeup styles, hierarchical texture details are decomposed via a Laplacian pyramid and selectively introduced to the content representation. Finally, we design a novel Iterative Dual Alignment (IDA) module that dynamically adjusts the injection condition of the diffusion model, allowing the alignment errors caused by the domain gap between content and makeup representations to be corrected. Extensive quantitative and qualitative analyses demonstrate the effectiveness of our method. Our code is available at \url{this https URL}.

Title: CFSynthesis: Controllable and Free-view 3D Human Video Synthesis

Authors: Cui Liyuan, Xu Xiaogang, Dong Wenqi, Yang Zesong, Bao Hujun, Cui Zhaopeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11067
Pdf URL: https://arxiv.org/pdf/2412.11067
Copy Paste: [[2412.11067]] CFSynthesis: Controllable and Free-view 3D Human Video Synthesis(https://arxiv.org/abs/2412.11067)
Keywords: diffusion
Abstract: Human video synthesis aims to create lifelike characters in various environments, with wide applications in VR, storytelling, and content creation. While 2D diffusion-based methods have made significant progress, they struggle to generalize to complex 3D poses and varying scene backgrounds. To address these limitations, we introduce CFSynthesis, a novel framework for generating high-quality human videos with customizable attributes, including identity, motion, and scene configurations. Our method leverages a texture-SMPL-based representation to ensure consistent and stable character appearances across free viewpoints. Additionally, we introduce a novel foreground-background separation strategy that effectively decomposes the scene as foreground and background, enabling seamless integration of user-defined backgrounds. Experimental results on multiple datasets show that CFSynthesis not only achieves state-of-the-art performance in complex human animations but also adapts effectively to 3D motions in free-view and user-specified scenarios.

Title: HC-LLM: Historical-Constrained Large Language Models for Radiology Report Generation

Authors: Tengfei Liu, Jiapu Wang, Yongli Hu, Mingjie Li, Junfei Yi, Xiaojun Chang, Junbin Gao, Baocai Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11070
Pdf URL: https://arxiv.org/pdf/2412.11070
Copy Paste: [[2412.11070]] HC-LLM: Historical-Constrained Large Language Models for Radiology Report Generation(https://arxiv.org/abs/2412.11070)
Keywords: in-context
Abstract: Radiology report generation (RRG) models typically focus on individual exams, often overlooking the integration of historical visual or textual data, which is crucial for patient follow-ups. Traditional methods usually struggle with long sequence dependencies when incorporating historical information, but large language models (LLMs) excel at in-context learning, making them well-suited for analyzing longitudinal medical data. In light of this, we propose a novel Historical-Constrained Large Language Models (HC-LLM) framework for RRG, empowering LLMs with longitudinal report generation capabilities by constraining the consistency and differences between longitudinal images and their corresponding reports. Specifically, our approach extracts both time-shared and time-specific features from longitudinal chest X-rays and diagnostic reports to capture disease progression. Then, we ensure consistent representation by applying intra-modality similarity constraints and aligning various features across modalities with multimodal contrastive and structural constraints. These combined constraints effectively guide the LLMs in generating diagnostic reports that accurately reflect the progression of the disease, achieving state-of-the-art results on the Longitudinal-MIMIC dataset. Notably, our approach performs well even without historical data during testing and can be easily adapted to other multimodal large models, enhancing its versatility.

Title: Edge Contrastive Learning: An Augmentation-Free Graph Contrastive Learning Model

Authors: Yujun Li, Hongyuan Zhang, Yuan Yuan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.11075
Pdf URL: https://arxiv.org/pdf/2412.11075
Copy Paste: [[2412.11075]] Edge Contrastive Learning: An Augmentation-Free Graph Contrastive Learning Model(https://arxiv.org/abs/2412.11075)
Keywords: self-supervised
Abstract: Graph contrastive learning (GCL) aims to learn representations from unlabeled graph data in a self-supervised manner and has developed rapidly in recent years. However, edgelevel contrasts are not well explored by most existing GCL methods. Most studies in GCL only regard edges as auxiliary information while updating node features. One of the primary obstacles of edge-based GCL is the heavy computation burden. To tackle this issue, we propose a model that can efficiently learn edge features for GCL, namely AugmentationFree Edge Contrastive Learning (AFECL) to achieve edgeedge contrast. AFECL depends on no augmentation consisting of two parts. Firstly, we design a novel edge feature generation method, where edge features are computed by embedding concatenation of their connected nodes. Secondly, an edge contrastive learning scheme is developed, where edges connecting the same nodes are defined as positive pairs, and other edges are defined as negative pairs. Experimental results show that compared with recent state-of-the-art GCL methods or even some supervised GNNs, AFECL achieves SOTA performance on link prediction and semi-supervised node classification of extremely scarce labels. The source code is available at this https URL.

Title: EquiFlow: Equivariant Conditional Flow Matching with Optimal Transport for 3D Molecular Conformation Prediction

Authors: Qingwen Tian, Yuxin Xu, Yixuan Yang, Zhen Wang, Ziqi Liu, Pengju Yan, Xiaolin Li
Subjects: cs.LG, physics.chem-ph, q-bio.BM
Abstract URL: https://arxiv.org/abs/2412.11082
Pdf URL: https://arxiv.org/pdf/2412.11082
Copy Paste: [[2412.11082]] EquiFlow: Equivariant Conditional Flow Matching with Optimal Transport for 3D Molecular Conformation Prediction(https://arxiv.org/abs/2412.11082)
Keywords: diffusion
Abstract: Molecular 3D conformations play a key role in determining how molecules interact with other molecules or protein surfaces. Recent deep learning advancements have improved conformation prediction, but slow training speeds and difficulties in utilizing high-degree features limit performance. We propose EquiFlow, an equivariant conditional flow matching model with optimal transport. EquiFlow uniquely applies conditional flow matching in molecular 3D conformation prediction, leveraging simulation-free training to address slow training speeds. It uses a modified Equiformer model to encode Cartesian molecular conformations along with their atomic and bond properties into higher-degree embeddings. Additionally, EquiFlow employs an ODE solver, providing faster inference speeds compared to diffusion models with SDEs. Experiments on the QM9 dataset show that EquiFlow predicts small molecule conformations more accurately than current state-of-the-art models.

Title: BarcodeMamba: State Space Models for Biodiversity Analysis

Authors: Tiancheng Gao, Graham W. Taylor
Subjects: cs.LG, q-bio.GN, q-bio.QM
Abstract URL: https://arxiv.org/abs/2412.11084
Pdf URL: https://arxiv.org/pdf/2412.11084
Copy Paste: [[2412.11084]] BarcodeMamba: State Space Models for Biodiversity Analysis(https://arxiv.org/abs/2412.11084)
Keywords: self-supervised, foundation model
Abstract: DNA barcodes are crucial in biodiversity analysis for building automatic identification systems that recognize known species and discover unseen species. Unlike human genome modeling, barcode-based invertebrate identification poses challenges in the vast diversity of species and taxonomic complexity. Among Transformer-based foundation models, BarcodeBERT excelled in species-level identification of invertebrates, highlighting the effectiveness of self-supervised pretraining on barcode-specific datasets. Recently, structured state space models (SSMs) have emerged, with a time complexity that scales sub-quadratically with the context length. SSMs provide an efficient parameterization of sequence modeling relative to attention-based architectures. Given the success of Mamba and Mamba-2 in natural language, we designed BarcodeMamba, a performant and efficient foundation model for DNA barcodes in biodiversity analysis. We conducted a comprehensive ablation study on the impacts of self-supervised training and tokenization methods, and compared both versions of Mamba layers in terms of expressiveness and their capacity to identify "unseen" species held back from training. Our study shows that BarcodeMamba has better performance than BarcodeBERT even when using only 8.3% as many parameters, and improves accuracy to 99.2% on species-level accuracy in linear probing without fine-tuning for "seen" species. In our scaling study, BarcodeMamba with 63.6% of BarcodeBERT's parameters achieved 70.2% genus-level accuracy in 1-nearest neighbor (1-NN) probing for unseen species. The code repository to reproduce our experiments is available at this https URL.

Title: GraphMoRE: Mitigating Topological Heterogeneity via Mixture of Riemannian Experts

Authors: Zihao Guo, Qingyun Sun, Haonan Yuan, Xingcheng Fu, Min Zhou, Yisen Gao, Jianxin Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11085
Pdf URL: https://arxiv.org/pdf/2412.11085
Copy Paste: [[2412.11085]] GraphMoRE: Mitigating Topological Heterogeneity via Mixture of Riemannian Experts(https://arxiv.org/abs/2412.11085)
Keywords: foundation model
Abstract: Real-world graphs have inherently complex and diverse topological patterns, known as topological heterogeneity. Most existing works learn graph representation in a single constant curvature space that is insufficient to match the complex geometric shapes, resulting in low-quality embeddings with high distortion. This also constitutes a critical challenge for graph foundation models, which are expected to uniformly handle a wide variety of diverse graph data. Recent studies have indicated that product manifold gains the possibility to address topological heterogeneity. However, the product manifold is still homogeneous, which is inadequate and inflexible for representing the mixed heterogeneous topology. In this paper, we propose a novel Graph Mixture of Riemannian Experts (GraphMoRE) framework to effectively tackle topological heterogeneity by personalized fine-grained topology geometry pattern preservation. Specifically, to minimize the embedding distortion, we propose a topology-aware gating mechanism to select the optimal embedding space for each node. By fusing the outputs of diverse Riemannian experts with learned gating weights, we construct personalized mixed curvature spaces for nodes, effectively embedding the graph into a heterogeneous manifold with varying curvatures at different points. Furthermore, to fairly measure pairwise distances between different embedding spaces, we present a concise and effective alignment strategy. Extensive experiments on real-world and synthetic datasets demonstrate that our method achieves superior performance with lower distortion, highlighting its potential for modeling complex graphs with topological heterogeneity, and providing a novel architectural perspective for graph foundation models.

Title: DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes

Authors: Jinxiu Liu, Shaoheng Lin, Yinxiao Li, Ming-Hsuan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11100
Pdf URL: https://arxiv.org/pdf/2412.11100
Copy Paste: [[2412.11100]] DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes(https://arxiv.org/abs/2412.11100)
Keywords: diffusion
Abstract: The increasing demand for immersive AR/VR applications and spatial intelligence has heightened the need to generate high-quality scene-level and 360° panoramic video. However, most video diffusion models are constrained by limited resolution and aspect ratio, which restricts their applicability to scene-level dynamic content synthesis. In this work, we propose the DynamicScaler, addressing these challenges by enabling spatially scalable and panoramic dynamic scene synthesis that preserves coherence across panoramic scenes of arbitrary size. Specifically, we introduce a Offset Shifting Denoiser, facilitating efficient, synchronous, and coherent denoising panoramic dynamic scenes via a diffusion model with fixed resolution through a seamless rotating Window, which ensures seamless boundary transitions and consistency across the entire panoramic space, accommodating varying resolutions and aspect ratios. Additionally, we employ a Global Motion Guidance mechanism to ensure both local detail fidelity and global motion continuity. Extensive experiments demonstrate our method achieves superior content and motion quality in panoramic scene-level video generation, offering a training-free, efficient, and scalable solution for immersive dynamic scene creation with constant VRAM consumption regardless of the output video resolution. Our project page is available at \url{this https URL}.

Title: SpearBot: Leveraging Large Language Models in a Generative-Critique Framework for Spear-Phishing Email Generation

Authors: Qinglin Qi, Yun Luo, Yijia Xu, Wenbo Guo, Yong Fang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2412.11109
Pdf URL: https://arxiv.org/pdf/2412.11109
Copy Paste: [[2412.11109]] SpearBot: Leveraging Large Language Models in a Generative-Critique Framework for Spear-Phishing Email Generation(https://arxiv.org/abs/2412.11109)
Keywords: generative
Abstract: Large Language Models (LLMs) are increasingly capable, aiding in tasks such as content generation, yet they also pose risks, particularly in generating harmful spear-phishing emails. These emails, crafted to entice clicks on malicious URLs, threaten personal information security. This paper proposes an adversarial framework, SpearBot, which utilizes LLMs to generate spear-phishing emails with various phishing strategies. Through specifically crafted jailbreak prompts, SpearBot circumvents security policies and introduces other LLM instances as critics. When a phishing email is identified by the critic, SpearBot refines the generated email based on the critique feedback until it can no longer be recognized as phishing, thereby enhancing its deceptive quality. To evaluate the effectiveness of SpearBot, we implement various machine-based defenders and assess how well the phishing emails generated could deceive them. Results show these emails often evade detection to a large extent, underscoring their deceptive quality. Additionally, human evaluations of the emails' readability and deception are conducted through questionnaires, confirming their convincing nature and the significant potential harm of the generated phishing emails.

Title: AD-LLM: Benchmarking Large Language Models for Anomaly Detection

Authors: Tiankai Yang, Yi Nian, Shawn Li, Ruiyao Xu, Yuangang Li, Jiaqi Li, Zhuo Xiao, Xiyang Hu, Ryan Rossi, Kaize Ding, Xia Hu, Yue Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11142
Pdf URL: https://arxiv.org/pdf/2412.11142
Copy Paste: [[2412.11142]] AD-LLM: Benchmarking Large Language Models for Anomaly Detection(https://arxiv.org/abs/2412.11142)
Keywords: anomaly
Abstract: Anomaly detection (AD) is an important machine learning task with many real-world uses, including fraud detection, medical diagnosis, and industrial monitoring. Within natural language processing (NLP), AD helps detect issues like spam, misinformation, and unusual user activity. Although large language models (LLMs) have had a strong impact on tasks such as text generation and summarization, their potential in AD has not been studied enough. This paper introduces AD-LLM, the first benchmark that evaluates how LLMs can help with NLP anomaly detection. We examine three key tasks: (i) zero-shot detection, using LLMs' pre-trained knowledge to perform AD without tasks-specific training; (ii) data augmentation, generating synthetic data and category descriptions to improve AD models; and (iii) model selection, using LLMs to suggest unsupervised AD models. Through experiments with different datasets, we find that LLMs can work well in zero-shot AD, that carefully designed augmentation methods are useful, and that explaining model selection for specific datasets remains challenging. Based on these results, we outline six future research directions on LLMs for AD.

Title: Redefining Normal: A Novel Object-Level Approach for Multi-Object Novelty Detection

Authors: Mohammadreza Salehi, Nikolaos Apostolikas, Efstratios Gavves, Cees G. M. Snoek, Yuki M. Asano
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11148
Pdf URL: https://arxiv.org/pdf/2412.11148
Copy Paste: [[2412.11148]] Redefining Normal: A Novel Object-Level Approach for Multi-Object Novelty Detection(https://arxiv.org/abs/2412.11148)
Keywords: self-supervised
Abstract: In the realm of novelty detection, accurately identifying outliers in data without specific class information poses a significant challenge. While current methods excel in single-object scenarios, they struggle with multi-object situations due to their focus on individual objects. Our paper suggests a novel approach: redefining `normal' at the object level in training datasets. Rather than the usual image-level view, we consider the most dominant object in a dataset as the norm, offering a perspective that is more effective for real-world scenarios. Adapting to our object-level definition of `normal', we modify knowledge distillation frameworks, where a student network learns from a pre-trained teacher network. Our first contribution, DeFeND(Dense Feature Fine-tuning on Normal Data), integrates dense feature fine-tuning into the distillation process, allowing the teacher network to focus on object-level features with a self-supervised loss. The second is masked knowledge distillation, where the student network works with partially hidden inputs, honing its ability to deduce and generalize from incomplete data. This approach not only fares well in single-object novelty detection but also considerably surpasses existing methods in multi-object contexts. The implementation is available at: this https URL

Title: Dual-Schedule Inversion: Training- and Tuning-Free Inversion for Real Image Editing

Authors: Jiancheng Huang, Yi Huang, Jianzhuang Liu, Donghao Zhou, Yifan Liu, Shifeng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11152
Pdf URL: https://arxiv.org/pdf/2412.11152
Copy Paste: [[2412.11152]] Dual-Schedule Inversion: Training- and Tuning-Free Inversion for Real Image Editing(https://arxiv.org/abs/2412.11152)
Keywords: diffusion
Abstract: Text-conditional image editing is a practical AIGC task that has recently emerged with great commercial and academic value. For real image editing, most diffusion model-based methods use DDIM Inversion as the first stage before editing. However, DDIM Inversion often results in reconstruction failure, leading to unsatisfactory performance for downstream editing. To address this problem, we first analyze why the reconstruction via DDIM Inversion fails. We then propose a new inversion and sampling method named Dual-Schedule Inversion. We also design a classifier to adaptively combine Dual-Schedule Inversion with different editing methods for user-friendly image editing. Our work can achieve superior reconstruction and editing performance with the following advantages: 1) It can reconstruct real images perfectly without fine-tuning, and its reversibility is guaranteed mathematically. 2) The edited object/scene conforms to the semantics of the text prompt. 3) The unedited parts of the object/scene retain the original identity.

Title: OTLRM: Orthogonal Learning-based Low-Rank Metric for Multi-Dimensional Inverse Problems

Authors: Xiangming Wang, Haijin Zeng, Jiaoyang Chen, Sheng Liu, Yongyong Chen, Guoqing Chao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2412.11165
Pdf URL: https://arxiv.org/pdf/2412.11165
Copy Paste: [[2412.11165]] OTLRM: Orthogonal Learning-based Low-Rank Metric for Multi-Dimensional Inverse Problems(https://arxiv.org/abs/2412.11165)
Keywords: generative
Abstract: In real-world scenarios, complex data such as multispectral images and multi-frame videos inherently exhibit robust low-rank property. This property is vital for multi-dimensional inverse problems, such as tensor completion, spectral imaging reconstruction, and multispectral image denoising. Existing tensor singular value decomposition (t-SVD) definitions rely on hand-designed or pre-given transforms, which lack flexibility for defining tensor nuclear norm (TNN). The TNN-regularized optimization problem is solved by the singular value thresholding (SVT) operator, which leverages the t-SVD framework to obtain the low-rank tensor. However, it is quite complicated to introduce SVT into deep neural networks due to the numerical instability problem in solving the derivatives of the eigenvectors. In this paper, we introduce a novel data-driven generative low-rank t-SVD model based on the learnable orthogonal transform, which can be naturally solved under its representation. Prompted by the linear algebra theorem of the Householder transformation, our learnable orthogonal transform is achieved by constructing an endogenously orthogonal matrix adaptable to neural networks, optimizing it as arbitrary orthogonal matrices. Additionally, we propose a low-rank solver as a generalization of SVT, which utilizes an efficient representation of generative networks to obtain low-rank structures. Extensive experiments highlight its significant restoration enhancements.

Title: OccScene: Semantic Occupancy-based Cross-task Mutual Learning for 3D Scene Generation

Authors: Bohan Li, Xin Jin, Jianan Wang, Yukai Shi, Yasheng Sun, Xiaofeng Wang, Zhuang Ma, Baao Xie, Chao Ma, Xiaokang Yang, Wenjun Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11183
Pdf URL: https://arxiv.org/pdf/2412.11183
Copy Paste: [[2412.11183]] OccScene: Semantic Occupancy-based Cross-task Mutual Learning for 3D Scene Generation(https://arxiv.org/abs/2412.11183)
Keywords: diffusion
Abstract: Recent diffusion models have demonstrated remarkable performance in both 3D scene generation and perception tasks. Nevertheless, existing methods typically separate these two processes, acting as a data augmenter to generate synthetic data for downstream perception tasks. In this work, we propose OccScene, a novel mutual learning paradigm that integrates fine-grained 3D perception and high-quality generation in a unified framework, achieving a cross-task win-win effect. OccScene generates new and consistent 3D realistic scenes only depending on text prompts, guided with semantic occupancy in a joint-training diffusion framework. To align the occupancy with the diffusion latent, a Mamba-based Dual Alignment module is introduced to incorporate fine-grained semantics and geometry as perception priors. Within OccScene, the perception module can be effectively improved with customized and diverse generated scenes, while the perception priors in return enhance the generation performance for mutual benefits. Extensive experiments show that OccScene achieves realistic 3D scene generation in broad indoor and outdoor scenarios, while concurrently boosting the perception models to achieve substantial performance improvements in the 3D perception task of semantic occupancy prediction.

Title: ViPOcc: Leveraging Visual Priors from Vision Foundation Models for Single-View 3D Occupancy Prediction

Authors: Yi Feng, Yu Han, Xijing Zhang, Tanghui Li, Yanting Zhang, Rui Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11210
Pdf URL: https://arxiv.org/pdf/2412.11210
Copy Paste: [[2412.11210]] ViPOcc: Leveraging Visual Priors from Vision Foundation Models for Single-View 3D Occupancy Prediction(https://arxiv.org/abs/2412.11210)
Keywords: foundation model
Abstract: Inferring the 3D structure of a scene from a single image is an ill-posed and challenging problem in the field of vision-centric autonomous driving. Existing methods usually employ neural radiance fields to produce voxelized 3D occupancy, lacking instance-level semantic reasoning and temporal photometric consistency. In this paper, we propose ViPOcc, which leverages the visual priors from vision foundation models (VFMs) for fine-grained 3D occupancy prediction. Unlike previous works that solely employ volume rendering for RGB and depth image reconstruction, we introduce a metric depth estimation branch, in which an inverse depth alignment module is proposed to bridge the domain gap in depth distribution between VFM predictions and the ground truth. The recovered metric depth is then utilized in temporal photometric alignment and spatial geometric alignment to ensure accurate and consistent 3D occupancy prediction. Additionally, we also propose a semantic-guided non-overlapping Gaussian mixture sampler for efficient, instance-aware ray sampling, which addresses the redundant and imbalanced sampling issue that still exists in previous state-of-the-art methods. Extensive experiments demonstrate the superior performance of ViPOcc in both 3D occupancy prediction and depth estimation tasks on the KITTI-360 and KITTI Raw datasets. Our code is available at: \url{this https URL}.

Title: GenLit: Reformulating Single-Image Relighting as Video Generation

Authors: Shrisha Bharadwaj, Haiwen Feng, Victoria Abrevaya, Michael J. Black
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2412.11224
Pdf URL: https://arxiv.org/pdf/2412.11224
Copy Paste: [[2412.11224]] GenLit: Reformulating Single-Image Relighting as Video Generation(https://arxiv.org/abs/2412.11224)
Keywords: diffusion, foundation model
Abstract: Manipulating the illumination within a single image represents a fundamental challenge in computer vision and graphics. This problem has been traditionally addressed using inverse rendering techniques, which require explicit 3D asset reconstruction and costly ray tracing simulations. Meanwhile, recent advancements in visual foundation models suggest that a new paradigm could soon be practical and possible -- one that replaces explicit physical models with networks that are trained on massive amounts of image and video data. In this paper, we explore the potential of exploiting video diffusion models, and in particular Stable Video Diffusion (SVD), in understanding the physical world to perform relighting tasks given a single image. Specifically, we introduce GenLit, a framework that distills the ability of a graphics engine to perform light manipulation into a video generation model, enabling users to directly insert and manipulate a point light in the 3D world within a given image and generate the results directly as a video sequence. We find that a model fine-tuned on only a small synthetic dataset (270 objects) is able to generalize to real images, enabling single-image relighting with realistic ray tracing effects and cast shadows. These results reveal the ability of video foundation models to capture rich information about lighting, material, and shape. Our findings suggest that such models, with minimal training, can be used for physically-based rendering without explicit physically asset reconstruction and complex ray tracing. This further suggests the potential of such models for controllable and physically accurate image synthesis tasks.

Title: Learning Set Functions with Implicit Differentiation

Authors: Gözde Özcan, Chengzhi Shi, Stratis Ioannidis
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11239
Pdf URL: https://arxiv.org/pdf/2412.11239
Copy Paste: [[2412.11239]] Learning Set Functions with Implicit Differentiation(https://arxiv.org/abs/2412.11239)
Keywords: anomaly
Abstract: Ou et al. (2022) introduce the problem of learning set functions from data generated by a so-called optimal subset oracle. Their approach approximates the underlying utility function with an energy-based model, whose parameters are estimated via mean-field variational inference. Ou et al. (2022) show this reduces to fixed point iterations; however, as the number of iterations increases, automatic differentiation quickly becomes computationally prohibitive due to the size of the Jacobians that are stacked during backpropagation. We address this challenge with implicit differentiation and examine the convergence conditions for the fixed-point iterations. We empirically demonstrate the efficiency of our method on synthetic and real-world subset selection applications including product recommendation, set anomaly detection and compound selection tasks.

Title: Wasserstein Bounds for generative diffusion models with Gaussian tail targets

Authors: Xixian Wang, Zhongjian Wang
Subjects: cs.LG, math.AP, math.NA
Abstract URL: https://arxiv.org/abs/2412.11251
Pdf URL: https://arxiv.org/pdf/2412.11251
Copy Paste: [[2412.11251]] Wasserstein Bounds for generative diffusion models with Gaussian tail targets(https://arxiv.org/abs/2412.11251)
Keywords: diffusion, generative
Abstract: We present an estimate of the Wasserstein distance between the data distribution and the generation of score-based generative models, assuming an $\epsilon$-accurate approximation of the score and a Gaussian-type tail behavior of the data distribution. The complexity bound in dimension is $O(\sqrt{d})$, with a logarithmic constant. Such Gaussian tail assumption applies to the distribution of a compact support target with early stopping technique and the Bayesian posterior with a bounded observation operator. Corresponding convergence and complexity bounds are derived. The crux of the analysis lies in the Lipchitz bound of the score, which is related to the Hessian estimate of a viscous Hamilton-Jacobi equation (vHJ). This latter is demonstrated by employing a dimension independent kernel estimate. Consequently, our complexity bound scales linearly (up to a logarithmic constant) with the square root of the trace of the covariance operator, which relates to the invariant distribution of forward process. Our analysis also extends to the probabilistic flow ODE, as the sampling process.

Title: Wearable Accelerometer Foundation Models for Health via Knowledge Distillation

Authors: Salar Abbaspourazad, Anshuman Mishra, Joseph Futoma, Andrew C. Miller, Ian Shapiro
Subjects: cs.LG, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/2412.11276
Pdf URL: https://arxiv.org/pdf/2412.11276
Copy Paste: [[2412.11276]] Wearable Accelerometer Foundation Models for Health via Knowledge Distillation(https://arxiv.org/abs/2412.11276)
Keywords: self-supervised, foundation model
Abstract: Modern wearable devices can conveniently and continuously record various biosignals in the many different environments of daily living, ultimately enabling a rich view of individual health. However, not all biosignals are the same: high-fidelity measurements, such as photoplethysmography (PPG), contain more physiological information, but require optical sensors with a high power footprint. In a resource-constrained setting, such biosignals may be unavailable. Alternatively, a lower-fidelity biosignal, such as accelerometry that captures minute cardiovascular information during low-motion periods, has a significantly smaller power footprint and is available in almost any wearable device. Here, we demonstrate that we can distill representational knowledge across biosignals, i.e., from PPG to accelerometry, using 20 million minutes of unlabeled data, collected from ~172K participants in the Apple Heart and Movement Study under informed consent. We first pre-train PPG encoders via self-supervised learning, and then distill their representational knowledge to accelerometry encoders. We demonstrate strong cross-modal alignment on unseen data, e.g., 99.2% top-1 accuracy for retrieving PPG embeddings from accelerometry embeddings. We show that distilled accelerometry encoders have significantly more informative representations compared to self-supervised or supervised encoders trained directly on accelerometry data, observed by at least 23%-49% improved performance for predicting heart rate and heart rate variability. We also show that distilled accelerometry encoders are readily predictive of a wide array of downstream health targets, i.e., they are generalist foundation models. We believe accelerometry foundation models for health may unlock new opportunities for developing digital biomarkers from any wearable device, and help individuals track their health more frequently and conveniently.

Title: VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping

Authors: Hao Shao, Shulun Wang, Yang Zhou, Guanglu Song, Dailan He, Shuo Qin, Zhuofan Zong, Bingqi Ma, Yu Liu, Hongsheng Li
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2412.11279
Pdf URL: https://arxiv.org/pdf/2412.11279
Copy Paste: [[2412.11279]] VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping(https://arxiv.org/abs/2412.11279)
Keywords: diffusion
Abstract: Video face swapping is becoming increasingly popular across various applications, yet existing methods primarily focus on static images and struggle with video face swapping because of temporal consistency and complex scenarios. In this paper, we present the first diffusion-based framework specifically designed for video face swapping. Our approach introduces a novel image-video hybrid training framework that leverages both abundant static image data and temporal video sequences, addressing the inherent limitations of video-only training. The framework incorporates a specially designed diffusion model coupled with a VidFaceVAE that effectively processes both types of data to better maintain temporal coherence of the generated videos. To further disentangle identity and pose features, we construct the Attribute-Identity Disentanglement Triplet (AIDT) Dataset, where each triplet has three face images, with two images sharing the same pose and two sharing the same identity. Enhanced with a comprehensive occlusion augmentation, this dataset also improves robustness against occlusions. Additionally, we integrate 3D reconstruction techniques as input conditioning to our network for handling large pose variations. Extensive experiments demonstrate that our framework achieves superior performance in identity preservation, temporal consistency, and visual quality compared to existing methods, while requiring fewer inference steps. Our approach effectively mitigates key challenges in video face swapping, including temporal flickering, identity preservation, and robustness to occlusions and pose variations.

Title: Detecting Daily Living Gait Amid Huntington's Disease Chorea using a Foundation Deep Learning Model

Authors: Dafna Schwartz, Lori Quinn, Nora E. Fritz, Lisa M. Muratori, Jeffery M. Hausdorff, Ran Gilad Bachrach
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11286
Pdf URL: https://arxiv.org/pdf/2412.11286
Copy Paste: [[2412.11286]] Detecting Daily Living Gait Amid Huntington's Disease Chorea using a Foundation Deep Learning Model(https://arxiv.org/abs/2412.11286)
Keywords: self-supervised, foundation model, generative
Abstract: Wearable sensors offer a non-invasive way to collect physical activity (PA) data, with walking as a key component. Existing models often struggle to detect gait bouts in individuals with neurodegenerative diseases (NDDs) involving involuntary movements. We developed J-Net, a deep learning model inspired by U-Net, which uses a pre-trained self-supervised foundation model fine-tuned with Huntington`s disease (HD) in-lab data and paired with a segmentation head for gait detection. J-Net processes wrist-worn accelerometer data to detect gait during daily living. We evaluated J-Net on in-lab and daily-living data from HD, Parkinson`s disease (PD), and controls. J-Net achieved a 10-percentage point improvement in ROC-AUC for HD over existing methods, reaching 0.97 for in-lab data. In daily-living environments, J-Net estimates showed no significant differences in median daily walking time between HD and controls (p = 0.23), in contrast to other models, which indicated counterintuitive results (p < 0.005). Walking time measured by J-Net correlated with the UHDRS-TMS clinical severity score (r=-0.52; p=0.02), confirming its clinical relevance. Fine-tuning J-Net on PD data also improved gait detection over current methods. J-Net`s architecture effectively addresses the challenges of gait detection in severe chorea and offers robust performance in daily living. The dataset and J-Net model are publicly available, providing a resource for further research into NDD-related gait impairments.

Title: Grassmannian Geometry Meets Dynamic Mode Decomposition in DMD-GEN: A New Metric for Mode Collapse in Time Series Generative Models

Authors: Amime Mohamed Aboussalah, Yassine Abbahaddou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.11292
Pdf URL: https://arxiv.org/pdf/2412.11292
Copy Paste: [[2412.11292]] Grassmannian Geometry Meets Dynamic Mode Decomposition in DMD-GEN: A New Metric for Mode Collapse in Time Series Generative Models(https://arxiv.org/abs/2412.11292)
Keywords: diffusion, generative
Abstract: Generative models like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) often fail to capture the full diversity of their training data, leading to mode collapse. While this issue is well-explored in image generation, it remains underinvestigated for time series data. We introduce a new definition of mode collapse specific to time series and propose a novel metric, DMD-GEN, to quantify its severity. Our metric utilizes Dynamic Mode Decomposition (DMD), a data-driven technique for identifying coherent spatiotemporal patterns, and employs Optimal Transport between DMD eigenvectors to assess discrepancies between the underlying dynamics of the original and generated data. This approach not only quantifies the preservation of essential dynamic characteristics but also provides interpretability by pinpointing which modes have collapsed. We validate DMD-GEN on both synthetic and real-world datasets using various generative models, including TimeGAN, TimeVAE, and DiffusionTS. The results demonstrate that DMD-GEN correlates well with traditional evaluation metrics for static data while offering the advantage of applicability to dynamic data. This work offers for the first time a definition of mode collapse for time series, improving understanding, and forming the basis of our tool for assessing and improving generative models in the time series domain.

Title: Segment-Level Diffusion: A Framework for Controllable Long-Form Generation with Diffusion Language Models

Authors: Xiaochen Zhu, Georgi Karadzhov, Chenxi Whitehouse, Andreas Vlachos
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11333
Pdf URL: https://arxiv.org/pdf/2412.11333
Copy Paste: [[2412.11333]] Segment-Level Diffusion: A Framework for Controllable Long-Form Generation with Diffusion Language Models(https://arxiv.org/abs/2412.11333)
Keywords: diffusion
Abstract: Diffusion models have shown promise in text generation but often struggle with generating long, coherent, and contextually accurate text. Token-level diffusion overlooks word-order dependencies and enforces short output windows, while passage-level diffusion struggles with learning robust representation for long-form text. To address these challenges, we propose Segment-Level Diffusion (SLD), a framework that enhances diffusion-based text generation through text segmentation, robust representation training with adversarial and contrastive learning, and improved latent-space guidance. By segmenting long-form outputs into separate latent representations and decoding them with an autoregressive decoder, SLD simplifies diffusion predictions and improves scalability. Experiments on XSum, ROCStories, DialogSum, and DeliData demonstrate that SLD achieves competitive or superior performance in fluency, coherence, and contextual compatibility across automatic and human evaluation metrics comparing with other diffusion and autoregressive baselines. Ablation studies further validate the effectiveness of our segmentation and representation learning strategies.

Title: ChatTime: A Unified Multimodal Time Series Foundation Model Bridging Numerical and Textual Data

Authors: Chengsen Wang, Qi Qi, Jingyu Wang, Haifeng Sun, Zirui Zhuang, Jinming Wu, Lei Zhang, Jianxin Liao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.11376
Pdf URL: https://arxiv.org/pdf/2412.11376
Copy Paste: [[2412.11376]] ChatTime: A Unified Multimodal Time Series Foundation Model Bridging Numerical and Textual Data(https://arxiv.org/abs/2412.11376)
Keywords: foundation model
Abstract: Human experts typically integrate numerical and textual multimodal information to analyze time series. However, most traditional deep learning predictors rely solely on unimodal numerical data, using a fixed-length window for training and prediction on a single dataset, and cannot adapt to different scenarios. The powered pre-trained large language model has introduced new opportunities for time series analysis. Yet, existing methods are either inefficient in training, incapable of handling textual information, or lack zero-shot forecasting capability. In this paper, we innovatively model time series as a foreign language and construct ChatTime, a unified framework for time series and text processing. As an out-of-the-box multimodal time series foundation model, ChatTime provides zero-shot forecasting capability and supports bimodal input/output for both time series and text. We design a series of experiments to verify the superior performance of ChatTime across multiple tasks and scenarios, and create four multimodal datasets to address data gaps. The experimental results demonstrate the potential and utility of ChatTime.

Title: Adapting Segment Anything Model (SAM) to Experimental Datasets via Fine-Tuning on GAN-based Simulation: A Case Study in Additive Manufacturing

Authors: Anika Tabassum, Amirkoushyar Ziabari
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2412.11381
Pdf URL: https://arxiv.org/pdf/2412.11381
Copy Paste: [[2412.11381]] Adapting Segment Anything Model (SAM) to Experimental Datasets via Fine-Tuning on GAN-based Simulation: A Case Study in Additive Manufacturing(https://arxiv.org/abs/2412.11381)
Keywords: generative
Abstract: Industrial X-ray computed tomography (XCT) is a powerful tool for non-destructive characterization of materials and manufactured components. XCT commonly accompanied by advanced image analysis and computer vision algorithms to extract relevant information from the images. Traditional computer vision models often struggle due to noise, resolution variability, and complex internal structures, particularly in scientific imaging applications. State-of-the-art foundational models, like the Segment Anything Model (SAM)-designed for general-purpose image segmentation-have revolutionized image segmentation across various domains, yet their application in specialized fields like materials science remains under-explored. In this work, we explore the application and limitations of SAM for industrial X-ray CT inspection of additive manufacturing components. We demonstrate that while SAM shows promise, it struggles with out-of-distribution data, multiclass segmentation, and computational efficiency during fine-tuning. To address these issues, we propose a fine-tuning strategy utilizing parameter-efficient techniques, specifically Conv-LoRa, to adapt SAM for material-specific datasets. Additionally, we leverage generative adversarial network (GAN)-generated data to enhance the training process and improve the model's segmentation performance on complex X-ray CT data. Our experimental results highlight the importance of tailored segmentation models for accurate inspection, showing that fine-tuning SAM on domain-specific scientific imaging data significantly improves performance. However, despite improvements, the model's ability to generalize across diverse datasets remains limited, highlighting the need for further research into robust, scalable solutions for domain-specific segmentation tasks.

Title: Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes

Authors: Antonio Carlos Rivera, Anthony Moore, Steven Robinson
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11396
Pdf URL: https://arxiv.org/pdf/2412.11396
Copy Paste: [[2412.11396]] Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes(https://arxiv.org/abs/2412.11396)
Keywords: generative
Abstract: Object-aware reasoning in vision-language tasks poses significant challenges for current models, particularly in handling unseen objects, reducing hallucinations, and capturing fine-grained relationships in complex visual scenes. To address these limitations, we propose the Vision-Aware Retrieval-Augmented Prompting (VRAP) framework, a generative approach that enhances Large Vision-Language Models (LVLMs) by integrating retrieval-augmented object tags into their prompts. VRAP introduces a novel pipeline where structured tags, including objects, attributes, and relationships, are extracted using pretrained visual encoders and scene graph parsers. These tags are enriched with external knowledge and incorporated into the LLM's input, enabling detailed and accurate reasoning. We evaluate VRAP across multiple vision-language benchmarks, including VQAv2, GQA, VizWiz, and COCO, achieving state-of-the-art performance in fine-grained reasoning and multimodal understanding. Additionally, our ablation studies highlight the importance of retrieval-augmented tags and contrastive learning, while human evaluations confirm VRAP's ability to generate accurate, detailed, and contextually relevant responses. Notably, VRAP achieves a 40% reduction in inference latency by eliminating runtime retrieval. These results demonstrate that VRAP is a robust and efficient framework for advancing object-aware multimodal reasoning.

Title: Quantization of Climate Change Impacts on Renewable Energy Generation Capacity: A Super-Resolution Recurrent Diffusion Model

Authors: Xiaochong Dong, Jun Dan, Yingyun Sun, Yang Liu, Xuemin Zhang, Shengwei Mei
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2412.11399
Pdf URL: https://arxiv.org/pdf/2412.11399
Copy Paste: [[2412.11399]] Quantization of Climate Change Impacts on Renewable Energy Generation Capacity: A Super-Resolution Recurrent Diffusion Model(https://arxiv.org/abs/2412.11399)
Keywords: diffusion, generative
Abstract: Driven by global climate change and the ongoing energy transition, the coupling between power supply capabilities and meteorological factors has become increasingly significant. Over the long term, accurately quantifying the power generation capacity of renewable energy under the influence of climate change is essential for the development of sustainable power systems. However, due to interdisciplinary differences in data requirements, climate data often lacks the necessary hourly resolution to capture the short-term variability and uncertainties of renewable energy resources. To address this limitation, a super-resolution recurrent diffusion model (SRDM) has been developed to enhance the temporal resolution of climate data and model the short-term uncertainty. The SRDM incorporates a pre-trained decoder and a denoising network, that generates long-term, high-resolution climate data through a recurrent coupling mechanism. The high-resolution climate data is then converted into power value using the mechanism model, enabling the simulation of wind and photovoltaic (PV) power generation capacity on future long-term scales. Case studies were conducted in the Ejina region of Inner Mongolia, China, using fifth-generation reanalysis (ERA5) and coupled model intercomparison project (CMIP6) data under two climate pathways: SSP126 and SSP585. The results demonstrate that the SRDM outperforms existing generative models in generating super-resolution climate data. For the Ejina region, under a high-emission pathway, the annual utilization hours of wind power are projected to decrease by 2.82 hours/year, while those for PV power are projected to decrease by 0.26 hours/year. Furthermore, the research highlights the estimation biases introduced when low-resolution climate data is used for power conversion.

Title: Scaled Conjugate Gradient Method for Nonconvex Optimization in Deep Neural Networks

Authors: Naoki Sato, Koshiro Izumi, Hideaki Iiduka
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.11400
Pdf URL: https://arxiv.org/pdf/2412.11400
Copy Paste: [[2412.11400]] Scaled Conjugate Gradient Method for Nonconvex Optimization in Deep Neural Networks(https://arxiv.org/abs/2412.11400)
Keywords: generative
Abstract: A scaled conjugate gradient method that accelerates existing adaptive methods utilizing stochastic gradients is proposed for solving nonconvex optimization problems with deep neural networks. It is shown theoretically that, whether with constant or diminishing learning rates, the proposed method can obtain a stationary point of the problem. Additionally, its rate of convergence with diminishing learning rates is verified to be superior to that of the conjugate gradient method. The proposed method is shown to minimize training loss functions faster than the existing adaptive methods in practical applications of image and text classification. Furthermore, in the training of generative adversarial networks, one version of the proposed method achieved the lowest Frechet inception distance score among those of the adaptive methods.

Title: Biased or Flawed? Mitigating Stereotypes in Generative Language Models by Addressing Task-Specific Flaws

Authors: Akshita Jha, Sanchit Kabra, Chandan K. Reddy
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.11414
Pdf URL: https://arxiv.org/pdf/2412.11414
Copy Paste: [[2412.11414]] Biased or Flawed? Mitigating Stereotypes in Generative Language Models by Addressing Task-Specific Flaws(https://arxiv.org/abs/2412.11414)
Keywords: generative
Abstract: Recent studies have shown that generative language models often reflect and amplify societal biases in their outputs. However, these studies frequently conflate observed biases with other task-specific shortcomings, such as comprehension failure. For example, when a model misinterprets a text and produces a response that reinforces a stereotype, it becomes difficult to determine whether the issue arises from inherent bias or from a misunderstanding of the given content. In this paper, we conduct a multi-faceted evaluation that distinctly disentangles bias from flaws within the reading comprehension task. We propose a targeted stereotype mitigation framework that implicitly mitigates observed stereotypes in generative models through instruction-tuning on general-purpose datasets. We reduce stereotypical outputs by over 60% across multiple dimensions -- including nationality, age, gender, disability, and physical appearance -- by addressing comprehension-based failures, and without relying on explicit debiasing techniques. We evaluate several state-of-the-art generative models to demonstrate the effectiveness of our approach while maintaining the overall utility. Our findings highlight the need to critically disentangle the concept of `bias' from other types of errors to build more targeted and effective mitigation strategies. CONTENT WARNING: Some examples contain offensive stereotypes.

Title: Category Level 6D Object Pose Estimation from a Single RGB Image using Diffusion

Authors: Adam Bethell, Ravi Garg, Ian Reid
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11420
Pdf URL: https://arxiv.org/pdf/2412.11420
Copy Paste: [[2412.11420]] Category Level 6D Object Pose Estimation from a Single RGB Image using Diffusion(https://arxiv.org/abs/2412.11420)
Keywords: diffusion
Abstract: Estimating the 6D pose and 3D size of an object from an image is a fundamental task in computer vision. Most current approaches are restricted to specific instances with known models or require ground truth depth information or point cloud captures from LIDAR. We tackle the harder problem of pose estimation for category-level objects from a single RGB image. We propose a novel solution that eliminates the need for specific object models or depth information. Our method utilises score-based diffusion models to generate object pose hypotheses to model the distribution of possible poses for the object. Unlike previous methods that rely on costly trained likelihood estimators to remove outliers before pose aggregation using mean pooling, we introduce a simpler approach using Mean Shift to estimate the mode of the distribution as the final pose estimate. Our approach outperforms the current state-of-the-art on the REAL275 dataset by a significant margin.

Title: Nearly Zero-Cost Protection Against Mimicry by Personalized Diffusion Models

Authors: Namhyuk Ahn, KiYoon Yoo, Wonhyuk Ahn, Daesik Kim, Seung-Hun Nam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11423
Pdf URL: https://arxiv.org/pdf/2412.11423
Copy Paste: [[2412.11423]] Nearly Zero-Cost Protection Against Mimicry by Personalized Diffusion Models(https://arxiv.org/abs/2412.11423)
Keywords: diffusion
Abstract: Recent advancements in diffusion models revolutionize image generation but pose risks of misuse, such as replicating artworks or generating deepfakes. Existing image protection methods, though effective, struggle to balance protection efficacy, invisibility, and latency, thus limiting practical use. We introduce perturbation pre-training to reduce latency and propose a mixture-of-perturbations approach that dynamically adapts to input images to minimize performance degradation. Our novel training strategy computes protection loss across multiple VAE feature spaces, while adaptive targeted protection at inference enhances robustness and invisibility. Experiments show comparable protection performance with improved invisibility and drastically reduced inference time. The code and demo are available at \url{this https URL}

Title: Towards Scientific Discovery with Generative AI: Progress, Opportunities, and Challenges

Authors: Chandan K Reddy, Parshin Shojaee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11427
Pdf URL: https://arxiv.org/pdf/2412.11427
Copy Paste: [[2412.11427]] Towards Scientific Discovery with Generative AI: Progress, Opportunities, and Challenges(https://arxiv.org/abs/2412.11427)
Keywords: generative
Abstract: Scientific discovery is a complex cognitive process that has driven human knowledge and technological progress for centuries. While artificial intelligence (AI) has made significant advances in automating aspects of scientific reasoning, simulation, and experimentation, we still lack integrated AI systems capable of performing autonomous long-term scientific research and discovery. This paper examines the current state of AI for scientific discovery, highlighting recent progress in large language models and other AI techniques applied to scientific tasks. We then outline key challenges and promising research directions toward developing more comprehensive AI systems for scientific discovery, including the need for science-focused AI agents, improved benchmarks and evaluation metrics, multimodal scientific representations, and unified frameworks combining reasoning, theorem proving, and data-driven modeling. Addressing these challenges could lead to transformative AI tools to accelerate progress across disciplines towards scientific discovery.

Title: View Transformation Robustness for Multi-View 3D Object Reconstruction with Reconstruction Error-Guided View Selection

Authors: Qi Zhang, Zhouhang Luo, Tao Yu, Hui Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11428
Pdf URL: https://arxiv.org/pdf/2412.11428
Copy Paste: [[2412.11428]] View Transformation Robustness for Multi-View 3D Object Reconstruction with Reconstruction Error-Guided View Selection(https://arxiv.org/abs/2412.11428)
Keywords: diffusion
Abstract: View transformation robustness (VTR) is critical for deep-learning-based multi-view 3D object reconstruction models, which indicates the methods' stability under inputs with various view transformations. However, existing research seldom focused on view transformation robustness in multi-view 3D object reconstruction. One direct way to improve the models' VTR is to produce data with more view transformations and add them to model training. Recent progress on large vision models, particularly Stable Diffusion models, has provided great potential for generating 3D models or synthesizing novel view images with only a single image input. Directly deploying these models at inference consumes heavy computation resources and their robustness to view transformations is not guaranteed either. To fully utilize the power of Stable Diffusion models without extra inference computation burdens, we propose to generate novel views with Stable Diffusion models for better view transformation robustness. Instead of synthesizing random views, we propose a reconstruction error-guided view selection method, which considers the reconstruction errors' spatial distribution of the 3D predictions and chooses the views that could cover the reconstruction errors as much as possible. The methods are trained and tested on sets with large view transformations to validate the 3D reconstruction models' robustness to view transformations. Extensive experiments demonstrate that the proposed method can outperform state-of-the-art 3D reconstruction methods and other view transformation robustness comparison methods.

Title: Bayesian Flow Is All You Need to Sample Out-of-Distribution Chemical Spaces

Authors: Nianze Tao
Subjects: cs.LG, cs.AI, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2412.11439
Pdf URL: https://arxiv.org/pdf/2412.11439
Copy Paste: [[2412.11439]] Bayesian Flow Is All You Need to Sample Out-of-Distribution Chemical Spaces(https://arxiv.org/abs/2412.11439)
Keywords: diffusion
Abstract: Generating novel molecules with higher properties than the training space, namely the out-of-distribution generation, is important for ${de~novo}$ drug design. However, it is not easy for distribution learning-based models, for example diffusion models, to solve this challenge as these methods are designed to fit the distribution of training data as close as possible. In this paper, we show that Bayesian flow network is capable of effortlessly generating high quality out-of-distribution samples that meet several scenarios. We introduce a semi-autoregressive training/sampling method that helps to enhance the model performance and surpass the state-of-the-art models.

Title: UIBDiffusion: Universal Imperceptible Backdoor Attack for Diffusion Models

Authors: Yuning Han, Bingyin Zhao, Rui Chu, Feng Luo, Biplab Sikdar, Yingjie Lao
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2412.11441
Pdf URL: https://arxiv.org/pdf/2412.11441
Copy Paste: [[2412.11441]] UIBDiffusion: Universal Imperceptible Backdoor Attack for Diffusion Models(https://arxiv.org/abs/2412.11441)
Keywords: diffusion
Abstract: Recent studies show that diffusion models (DMs) are vulnerable to backdoor attacks. Existing backdoor attacks impose unconcealed triggers (e.g., a gray box and eyeglasses) that contain evident patterns, rendering remarkable attack effects yet easy detection upon human inspection and defensive algorithms. While it is possible to improve stealthiness by reducing the strength of the backdoor, doing so can significantly compromise its generality and effectiveness. In this paper, we propose UIBDiffusion, the universal imperceptible backdoor attack for diffusion models, which allows us to achieve superior attack and generation performance while evading state-of-the-art defenses. We propose a novel trigger generation approach based on universal adversarial perturbations (UAPs) and reveal that such perturbations, which are initially devised for fooling pre-trained discriminative models, can be adapted as potent imperceptible backdoor triggers for DMs. We evaluate UIBDiffusion on multiple types of DMs with different kinds of samplers across various datasets and targets. Experimental results demonstrate that UIBDiffusion brings three advantages: 1) Universality, the imperceptible trigger is universal (i.e., image and model agnostic) where a single trigger is effective to any images and all diffusion models with different samplers; 2) Utility, it achieves comparable generation quality (e.g., FID) and even better attack success rate (i.e., ASR) at low poison rates compared to the prior works; and 3) Undetectability, UIBDiffusion is plausible to human perception and can bypass Elijah and TERD, the SOTA defenses against backdoors for DMs. We will release our backdoor triggers and code.

Title: MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes

Authors: Ruijie Lu, Yixin Chen, Junfeng Ni, Baoxiong Jia, Yu Liu, Diwen Wan, Gang Zeng, Siyuan Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11457
Pdf URL: https://arxiv.org/pdf/2412.11457
Copy Paste: [[2412.11457]] MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes(https://arxiv.org/abs/2412.11457)
Keywords: diffusion
Abstract: Repurposing pre-trained diffusion models has been proven to be effective for NVS. However, these methods are mostly limited to a single object; directly applying such methods to compositional multi-object scenarios yields inferior results, especially incorrect object placement and inconsistent shape and appearance under novel views. How to enhance and systematically evaluate the cross-view consistency of such models remains under-explored. To address this issue, we propose MOVIS to enhance the structural awareness of the view-conditioned diffusion model for multi-object NVS in terms of model inputs, auxiliary tasks, and training strategy. First, we inject structure-aware features, including depth and object mask, into the denoising U-Net to enhance the model's comprehension of object instances and their spatial relationships. Second, we introduce an auxiliary task requiring the model to simultaneously predict novel view object masks, further improving the model's capability in differentiating and placing objects. Finally, we conduct an in-depth analysis of the diffusion sampling process and carefully devise a structure-guided timestep sampling scheduler during training, which balances the learning of global object placement and fine-grained detail recovery. To systematically evaluate the plausibility of synthesized images, we propose to assess cross-view consistency and novel view object placement alongside existing image-level NVS metrics. Extensive experiments on challenging synthetic and realistic datasets demonstrate that our method exhibits strong generalization capabilities and produces consistent novel view synthesis, highlighting its potential to guide future 3D-aware multi-object NVS tasks.

Title: Understanding Knowledge Hijack Mechanism in In-context Learning through Associative Memory

Authors: Shuo Wang, Issei Sato
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2412.11459
Pdf URL: https://arxiv.org/pdf/2412.11459
Copy Paste: [[2412.11459]] Understanding Knowledge Hijack Mechanism in In-context Learning through Associative Memory(https://arxiv.org/abs/2412.11459)
Keywords: in-context
Abstract: In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks without fine-tuning by leveraging contextual information provided within a prompt. However, ICL relies not only on contextual clues but also on the global knowledge acquired during pretraining for the next token prediction. Analyzing this process has been challenging due to the complex computational circuitry of LLMs. This paper investigates the balance between in-context information and pretrained bigram knowledge in token prediction, focusing on the induction head mechanism, a key component in ICL. Leveraging the fact that a two-layer transformer can implement the induction head mechanism with associative memories, we theoretically analyze the logits when a two-layer transformer is given prompts generated by a bigram model. In the experiments, we design specific prompts to evaluate whether the outputs of a two-layer transformer align with the theoretical results.

Title: Unsupervised Anomaly Detection for Tabular Data Using Noise Evaluation

Authors: Wei Dai, Kai Hwang, Jicong Fan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11461
Pdf URL: https://arxiv.org/pdf/2412.11461
Copy Paste: [[2412.11461]] Unsupervised Anomaly Detection for Tabular Data Using Noise Evaluation(https://arxiv.org/abs/2412.11461)
Keywords: anomaly
Abstract: Unsupervised anomaly detection (UAD) plays an important role in modern data analytics and it is crucial to provide simple yet effective and guaranteed UAD algorithms for real applications. In this paper, we present a novel UAD method for tabular data by evaluating how much noise is in the data. Specifically, we propose to learn a deep neural network from the clean (normal) training dataset and a noisy dataset, where the latter is generated by adding highly diverse noises to the clean data. The neural network can learn a reliable decision boundary between normal data and anomalous data when the diversity of the generated noisy data is sufficiently high so that the hard abnormal samples lie in the noisy region. Importantly, we provide theoretical guarantees, proving that the proposed method can detect anomalous data successfully, although the method does not utilize any real anomalous data in the training stage. Extensive experiments through more than 60 benchmark datasets demonstrate the effectiveness of the proposed method in comparison to 12 baselines of UAD. Our method obtains a 92.27\% AUC score and a 1.68 ranking score on average. Moreover, compared to the state-of-the-art UAD methods, our method is easier to implement.

Title: FedCAR: Cross-client Adaptive Re-weighting for Generative Models in Federated Learning

Authors: Minjun Kim, Minjee Kim, Jinhoon Jeong
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.11463
Pdf URL: https://arxiv.org/pdf/2412.11463
Copy Paste: [[2412.11463]] FedCAR: Cross-client Adaptive Re-weighting for Generative Models in Federated Learning(https://arxiv.org/abs/2412.11463)
Keywords: generative
Abstract: Generative models trained on multi-institutional datasets can provide an enriched understanding through diverse data distributions. However, training the models on medical images is often challenging due to hospitals' reluctance to share data for privacy reasons. Federated learning(FL) has emerged as a privacy-preserving solution for training distributed datasets across data centers by aggregating model weights from multiple clients instead of sharing raw data. Previous research has explored the adaptation of FL to generative models, yet effective aggregation algorithms specifically tailored for generative models remain unexplored. We hereby propose a novel algorithm aimed at improving the performance of generative models within FL. Our approach adaptively re-weights the contribution of each client, resulting in well-trained shared parameters. In each round, the server side measures the distribution distance between fake images generated by clients instead of directly comparing the Fréchet Inception Distance per client, thereby enhancing efficiency of the learning. Experimental results on three public chest X-ray datasets show superior performance in medical image generation, outperforming both centralized learning and conventional FL algorithms. Our code is available at this https URL.

Title: IGR: Improving Diffusion Model for Garment Restoration from Person Image

Authors: Le Shen, Rong Huang, Zhijie Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11513
Pdf URL: https://arxiv.org/pdf/2412.11513
Copy Paste: [[2412.11513]] IGR: Improving Diffusion Model for Garment Restoration from Person Image(https://arxiv.org/abs/2412.11513)
Keywords: diffusion
Abstract: Garment restoration, the inverse of virtual try-on task, focuses on restoring standard garment from a person image, requiring accurate capture of garment details. However, existing methods often fail to preserve the identity of the garment or rely on complex processes. To address these limitations, we propose an improved diffusion model for restoring authentic garments. Our approach employs two garment extractors to independently capture low-level features and high-level semantics from the person image. Leveraging a pretrained latent diffusion model, these features are integrated into the denoising process through garment fusion blocks, which combine self-attention and cross-attention layers to align the restored garment with the person image. Furthermore, a coarse-to-fine training strategy is introduced to enhance the fidelity and authenticity of the generated garments. Experimental results demonstrate that our model effectively preserves garment identity and generates high-quality restorations, even in challenging scenarios such as complex garments or those with occlusions.

Title: LineArt: A Knowledge-guided Training-free High-quality Appearance Transfer for Design Drawing with Diffusion Model

Authors: Xi Wang, Hongzhen Li, Heng Fang, Yichen Peng, Haoran Xie, Xi Yang, Chuntao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11519
Pdf URL: https://arxiv.org/pdf/2412.11519
Copy Paste: [[2412.11519]] LineArt: A Knowledge-guided Training-free High-quality Appearance Transfer for Design Drawing with Diffusion Model(https://arxiv.org/abs/2412.11519)
Keywords: diffusion
Abstract: Image rendering from line drawings is vital in design and image generation technologies reduce costs, yet professional line drawings demand preserving complex details. Text prompts struggle with accuracy, and image translation struggles with consistency and fine-grained control. We present LineArt, a framework that transfers complex appearance onto detailed design drawings, facilitating design and artistic creation. It generates high-fidelity appearance while preserving structural accuracy by simulating hierarchical visual cognition and integrating human artistic experience to guide the diffusion process. LineArt overcomes the limitations of current methods in terms of difficulty in fine-grained control and style degradation in design drawings. It requires no precise 3D modeling, physical property specs, or network training, making it more convenient for design tasks. LineArt consists of two stages: a multi-frequency lines fusion module to supplement the input design drawing with detailed structural information and a two-part painting process for Base Layer Shaping and Surface Layer Coloring. We also present a new design drawing dataset ProLines for evaluation. The experiments show that LineArt performs better in accuracy, realism, and material precision compared to SOTAs.

Title: EditSplat: Multi-View Fusion and Attention-Guided Optimization for View-Consistent 3D Scene Editing with 3D Gaussian Splatting

Authors: Dong In Lee, Hyeongcheol Park, Jiyoung Seo, Eunbyung Park, Hyunje Park, Ha Dam Baek, Shin Sangheon, Sangmin kim, Sangpil Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11520
Pdf URL: https://arxiv.org/pdf/2412.11520
Copy Paste: [[2412.11520]] EditSplat: Multi-View Fusion and Attention-Guided Optimization for View-Consistent 3D Scene Editing with 3D Gaussian Splatting(https://arxiv.org/abs/2412.11520)
Keywords: diffusion
Abstract: Recent advancements in 3D editing have highlighted the potential of text-driven methods in real-time, user-friendly AR/VR applications. However, current methods rely on 2D diffusion models without adequately considering multi-view information, resulting in multi-view inconsistency. While 3D Gaussian Splatting (3DGS) significantly improves rendering quality and speed, its 3D editing process encounters difficulties with inefficient optimization, as pre-trained Gaussians retain excessive source information, hindering optimization. To address these limitations, we propose \textbf{EditSplat}, a novel 3D editing framework that integrates Multi-view Fusion Guidance (MFG) and Attention-Guided Trimming (AGT). Our MFG ensures multi-view consistency by incorporating essential multi-view information into the diffusion process, leveraging classifier-free guidance from the text-to-image diffusion model and the geometric properties of 3DGS. Additionally, our AGT leverages the explicit representation of 3DGS to selectively prune and optimize 3D Gaussians, enhancing optimization efficiency and enabling precise, semantically rich local edits. Through extensive qualitative and quantitative evaluations, EditSplat achieves superior multi-view consistency and editing quality over existing methods, significantly enhancing overall efficiency.

Title: Towards a Speech Foundation Model for Singapore and Beyond

Authors: Muhammad Huzaifah, Tianchi Liu, Hardik B. Sailor, Kye Min Tan, Tarun K. Vangani, Qiongqiong Wang, Jeremy H. M. Wong, Nancy F. Chen, Ai Ti Aw
Subjects: cs.CL, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2412.11538
Pdf URL: https://arxiv.org/pdf/2412.11538
Copy Paste: [[2412.11538]] Towards a Speech Foundation Model for Singapore and Beyond(https://arxiv.org/abs/2412.11538)
Keywords: self-supervised, foundation model
Abstract: This technical report describes the MERaLiON Speech Encoder, a foundation model designed to support a wide range of downstream speech applications. Developed as part of Singapore's National Multimodal Large Language Model Programme, the MERaLiON Speech Encoder is tailored to address the speech processing needs in Singapore and the surrounding Southeast Asian region. The model currently supports mainly English, including the variety spoken in Singapore. We are actively expanding our datasets to gradually cover other languages in subsequent releases. The MERaLiON Speech Encoder was pre-trained from scratch on 200K hours of unlabelled speech data using a self-supervised learning approach based on masked language modelling. We describe our training procedure and hyperparameter tuning experiments in detail below. Our evaluation demonstrates improvements to spontaneous and Singapore speech benchmarks for speech recognition, while remaining competitive to other state-of-the-art speech encoders across ten other speech tasks. We commit to releasing our model, supporting broader research endeavours, both in Singapore and beyond.

Title: MPQ-DM: Mixed Precision Quantization for Extremely Low Bit Diffusion Models

Authors: Weilun Feng, Haotong Qin, Chuanguang Yang, Zhulin An, Libo Huang, Boyu Diao, Fei Wang, Renshuai Tao, Yongjun Xu, Michele Magno
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11549
Pdf URL: https://arxiv.org/pdf/2412.11549
Copy Paste: [[2412.11549]] MPQ-DM: Mixed Precision Quantization for Extremely Low Bit Diffusion Models(https://arxiv.org/abs/2412.11549)
Keywords: diffusion
Abstract: Diffusion models have received wide attention in generation tasks. However, the expensive computation cost prevents the application of diffusion models in resource-constrained scenarios. Quantization emerges as a practical solution that significantly saves storage and computation by reducing the bit-width of parameters. However, the existing quantization methods for diffusion models still cause severe degradation in performance, especially under extremely low bit-widths (2-4 bit). The primary decrease in performance comes from the significant discretization of activation values at low bit quantization. Too few activation candidates are unfriendly for outlier significant weight channel quantization, and the discretized features prevent stable learning over different time steps of the diffusion model. This paper presents MPQ-DM, a Mixed-Precision Quantization method for Diffusion Models. The proposed MPQ-DM mainly relies on two techniques:(1) To mitigate the quantization error caused by outlier severe weight channels, we propose an Outlier-Driven Mixed Quantization (OMQ) technique that uses $Kurtosis$ to quantify outlier salient channels and apply optimized intra-layer mixed-precision bit-width allocation to recover accuracy performance within target efficiency.(2) To robustly learn representations crossing time steps, we construct a Time-Smoothed Relation Distillation (TRD) scheme between the quantized diffusion model and its full-precision counterpart, transferring discrete and continuous latent to a unified relation space to reduce the representation inconsistency. Comprehensive experiments demonstrate that MPQ-DM achieves significant accuracy gains under extremely low bit-widths compared with SOTA quantization methods. MPQ-DM achieves a 58\% FID decrease under W2A4 setting compared with baseline, while all other methods even collapse.

Title: Aligning Visual and Semantic Interpretability through Visually Grounded Concept Bottleneck Models

Authors: Patrick Knab, Katharina Prasse, Sascha Marton, Christian Bartelt, Margret Keuper
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11576
Pdf URL: https://arxiv.org/pdf/2412.11576
Copy Paste: [[2412.11576]] Aligning Visual and Semantic Interpretability through Visually Grounded Concept Bottleneck Models(https://arxiv.org/abs/2412.11576)
Keywords: foundation model
Abstract: The performance of neural networks increases steadily, but our understanding of their decision-making lags behind. Concept Bottleneck Models (CBMs) address this issue by incorporating human-understandable concepts into the prediction process, thereby enhancing transparency and interpretability. Since existing approaches often rely on large language models (LLMs) to infer concepts, their results may contain inaccurate or incomplete mappings, especially in complex visual domains. We introduce visually Grounded Concept Bottleneck Models (GCBM), which derive concepts on the image level using segmentation and detection foundation models. Our method generates inherently interpretable concepts, which can be grounded in the input image using attribution methods, allowing interpretations to be traced back to the image plane. We show that GCBM concepts are meaningful interpretability vehicles, which aid our understanding of model embedding spaces. GCBMs allow users to control the granularity, number, and naming of concepts, providing flexibility and are easily adaptable to new datasets without pre-training or additional data needed. Prediction accuracy is within 0.3-6% of the linear probe and GCBMs perform especially well for fine-grained classification interpretability on CUB, due to their dataset specificity. Our code is available on this https URL.

Title: StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors

Authors: Xiaokun Sun, Zeyu Cai, Zhenyu Zhang, Ying Tai, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11586
Pdf URL: https://arxiv.org/pdf/2412.11586
Copy Paste: [[2412.11586]] StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors(https://arxiv.org/abs/2412.11586)
Keywords: diffusion, generative
Abstract: While haircut indicates distinct personality, existing avatar generation methods fail to model practical hair due to the general or entangled representation. We propose StrandHead, a novel text to 3D head avatar generation method capable of generating disentangled 3D hair with strand representation. Without using 3D data for supervision, we demonstrate that realistic hair strands can be generated from prompts by distilling 2D generative diffusion models. To this end, we propose a series of reliable priors on shape initialization, geometric primitives, and statistical haircut features, leading to a stable optimization and text-aligned performance. Extensive experiments show that StrandHead achieves the state-of-the-art reality and diversity of generated 3D head and hair. The generated 3D hair can also be easily implemented in the Unreal Engine for physical simulation and other applications. The code will be available at this https URL.

Title: VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis

Authors: Zhipeng Chen, Lan Yang, Yonggang Qi, Honggang Zhang, Kaiyue Pang, Ke Li, Yi-Zhe Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11594
Pdf URL: https://arxiv.org/pdf/2412.11594
Copy Paste: [[2412.11594]] VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis(https://arxiv.org/abs/2412.11594)
Keywords: diffusion, generative
Abstract: Despite the rapid advancements in text-to-image (T2I) synthesis, enabling precise visual control remains a significant challenge. Existing works attempted to incorporate multi-facet controls (text and sketch), aiming to enhance the creative control over generated images. However, our pilot study reveals that the expressive power of humans far surpasses the capabilities of current methods. Users desire a more versatile approach that can accommodate their diverse creative intents, ranging from controlling individual subjects to manipulating the entire scene composition. We present VersaGen, a generative AI agent that enables versatile visual control in T2I synthesis. VersaGen admits four types of visual controls: i) single visual subject; ii) multiple visual subjects; iii) scene background; iv) any combination of the three above or merely no control at all. We train an adaptor upon a frozen T2I model to accommodate the visual information into the text-dominated diffusion process. We introduce three optimization strategies during the inference phase of VersaGen to improve generation results and enhance user experience. Comprehensive experiments on COCO and Sketchy validate the effectiveness and flexibility of VersaGen, as evidenced by both qualitative and quantitative results.

Title: 3D$^2$-Actor: Learning Pose-Conditioned 3D-Aware Denoiser for Realistic Gaussian Avatar Modeling

Authors: Zichen Tang, Hongyu Yang, Hanchen Zhang, Jiaxin Chen, Di Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11599
Pdf URL: https://arxiv.org/pdf/2412.11599
Copy Paste: [[2412.11599]] 3D$^2$-Actor: Learning Pose-Conditioned 3D-Aware Denoiser for Realistic Gaussian Avatar Modeling(https://arxiv.org/abs/2412.11599)
Keywords: diffusion
Abstract: Advancements in neural implicit representations and differentiable rendering have markedly improved the ability to learn animatable 3D avatars from sparse multi-view RGB videos. However, current methods that map observation space to canonical space often face challenges in capturing pose-dependent details and generalizing to novel poses. While diffusion models have demonstrated remarkable zero-shot capabilities in 2D image generation, their potential for creating animatable 3D avatars from 2D inputs remains underexplored. In this work, we introduce 3D$^2$-Actor, a novel approach featuring a pose-conditioned 3D-aware human modeling pipeline that integrates iterative 2D denoising and 3D rectifying steps. The 2D denoiser, guided by pose cues, generates detailed multi-view images that provide the rich feature set necessary for high-fidelity 3D reconstruction and pose rendering. Complementing this, our Gaussian-based 3D rectifier renders images with enhanced 3D consistency through a two-stage projection strategy and a novel local coordinate representation. Additionally, we propose an innovative sampling strategy to ensure smooth temporal continuity across frames in video synthesis. Our method effectively addresses the limitations of traditional numerical solutions in handling ill-posed mappings, producing realistic and animatable 3D human avatars. Experimental results demonstrate that 3D$^2$-Actor excels in high-fidelity avatar modeling and robustly generalizes to novel poses. Code is available at: this https URL.

Title: VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting

Authors: Muhammet Furkan Ilaslan, Ali Koksal, Kevin Qinhong Lin, Burak Satar, Mike Zheng Shou, Qianli Xu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2412.11621
Pdf URL: https://arxiv.org/pdf/2412.11621
Copy Paste: [[2412.11621]] VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting(https://arxiv.org/abs/2412.11621)
Keywords: diffusion
Abstract: Large Language Model (LLM)-based agents have shown promise in procedural tasks, but the potential of multimodal instructions augmented by texts and videos to assist users remains under-explored. To address this gap, we propose the Visually Grounded Text-Video Prompting (VG-TVP) method which is a novel LLM-empowered Multimodal Procedural Planning (MPP) framework. It generates cohesive text and video procedural plans given a specified high-level objective. The main challenges are achieving textual and visual informativeness, temporal coherence, and accuracy in procedural plans. VG-TVP leverages the zero-shot reasoning capability of LLMs, the video-to-text generation ability of the video captioning models, and the text-to-video generation ability of diffusion models. VG-TVP improves the interaction between modalities by proposing a novel Fusion of Captioning (FoC) method and using Text-to-Video Bridge (T2V-B) and Video-to-Text Bridge (V2T-B). They allow LLMs to guide the generation of visually-grounded text plans and textual-grounded video plans. To address the scarcity of datasets suitable for MPP, we have curated a new dataset called Daily-Life Task Procedural Plans (Daily-PP). We conduct comprehensive experiments and benchmarks to evaluate human preferences (regarding textual and visual informativeness, temporal coherence, and plan accuracy). Our VG-TVP method outperforms unimodal baselines on the Daily-PP dataset.

Title: Predicting the Original Appearance of Damaged Historical Documents

Authors: Zhenhua Yang, Dezhi Peng, Yongxin Shi, Yuyi Zhang, Chongyu Liu, Lianwen Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11634
Pdf URL: https://arxiv.org/pdf/2412.11634
Copy Paste: [[2412.11634]] Predicting the Original Appearance of Damaged Historical Documents(https://arxiv.org/abs/2412.11634)
Keywords: diffusion
Abstract: Historical documents encompass a wealth of cultural treasures but suffer from severe damages including character missing, paper damage, and ink erosion over time. However, existing document processing methods primarily focus on binarization, enhancement, etc., neglecting the repair of these damages. To this end, we present a new task, termed Historical Document Repair (HDR), which aims to predict the original appearance of damaged historical documents. To fill the gap in this field, we propose a large-scale dataset HDR28K and a diffusion-based network DiffHDR for historical document repair. Specifically, HDR28K contains 28,552 damaged-repaired image pairs with character-level annotations and multi-style degradations. Moreover, DiffHDR augments the vanilla diffusion framework with semantic and spatial information and a meticulously designed character perceptual loss for contextual and visual coherence. Experimental results demonstrate that the proposed DiffHDR trained using HDR28K significantly surpasses existing approaches and exhibits remarkable performance in handling real damaged documents. Notably, DiffHDR can also be extended to document editing and text block generation, showcasing its high flexibility and generalization capacity. We believe this study could pioneer a new direction of document processing and contribute to the inheritance of invaluable cultures and civilizations. The dataset and code is available at this https URL.

Title: IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation

Authors: Yiren Song, Pei Yang, Hai Ci, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11638
Pdf URL: https://arxiv.org/pdf/2412.11638
Copy Paste: [[2412.11638]] IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation(https://arxiv.org/abs/2412.11638)
Keywords: generative
Abstract: Recently, zero-shot methods like InstantID have revolutionized identity-preserving generation. Unlike multi-image finetuning approaches such as DreamBooth, these zero-shot methods leverage powerful facial encoders to extract identity information from a single portrait photo, enabling efficient identity-preserving generation through a single inference pass. However, this convenience introduces new threats to the facial identity protection. This paper aims to safeguard portrait photos from unauthorized encoder-based customization. We introduce IDProtector, an adversarial noise encoder that applies imperceptible adversarial noise to portrait photos in a single forward pass. Our approach offers universal protection for portraits against multiple state-of-the-art encoder-based methods, including InstantID, IP-Adapter, and PhotoMaker, while ensuring robustness to common image transformations such as JPEG compression, resizing, and affine transformations. Experiments across diverse portrait datasets and generative models reveal that IDProtector generalizes effectively to unseen data and even closed-source proprietary models.

Title: SE-GCL: An Event-Based Simple and Effective Graph Contrastive Learning for Text Representation

Authors: Tao Meng, Wei Ai, Jianbin Li, Ze Wang, Yuntao Shou, Keqin Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11652
Pdf URL: https://arxiv.org/pdf/2412.11652
Copy Paste: [[2412.11652]] SE-GCL: An Event-Based Simple and Effective Graph Contrastive Learning for Text Representation(https://arxiv.org/abs/2412.11652)
Keywords: self-supervised
Abstract: Text representation learning is significant as the cornerstone of natural language processing. In recent years, graph contrastive learning (GCL) has been widely used in text representation learning due to its ability to represent and capture complex text information in a self-supervised setting. However, current mainstream graph contrastive learning methods often require the incorporation of domain knowledge or cumbersome computations to guide the data augmentation process, which significantly limits the application efficiency and scope of GCL. Additionally, many methods learn text representations only by constructing word-document relationships, which overlooks the rich contextual semantic information in the text. To address these issues and exploit representative textual semantics, we present an event-based, simple, and effective graph contrastive learning (SE-GCL) for text representation. Precisely, we extract event blocks from text and construct internal relation graphs to represent inter-semantic interconnections, which can ensure that the most critical semantic information is preserved. Then, we devise a streamlined, unsupervised graph contrastive learning framework to leverage the complementary nature of the event semantic and structural information for intricate feature data capture. In particular, we introduce the concept of an event skeleton for core representation semantics and simplify the typically complex data augmentation techniques found in existing graph contrastive learning to boost algorithmic efficiency. We employ multiple loss functions to prompt diverse embeddings to converge or diverge within a confined distance in the vector space, ultimately achieving a harmonious equilibrium. We conducted experiments on the proposed SE-GCL on four standard data sets (AG News, 20NG, SougouNews, and THUCNews) to verify its effectiveness in text representation learning.

Title: Self-Adaptive Paraphrasing and Preference Learning for Improved Claim Verifiability

Authors: Amelie Wührl, Roman Klinger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11653
Pdf URL: https://arxiv.org/pdf/2412.11653
Copy Paste: [[2412.11653]] Self-Adaptive Paraphrasing and Preference Learning for Improved Claim Verifiability(https://arxiv.org/abs/2412.11653)
Keywords: generative
Abstract: In fact-checking, structure and phrasing of claims critically influence a model's ability to predict verdicts accurately. Social media content in particular rarely serves as optimal input for verification systems, which necessitates pre-processing to extract the claim from noisy context before fact checking. Prior work suggests extracting a claim representation that humans find to be checkworthy and verifiable. This has two limitations: (1) the format may not be optimal for a fact-checking model, and (2), it requires annotated data to learn the extraction task from. We address both issues and propose a method to extract claims that is not reliant on labeled training data. Instead, our self-adaptive approach only requires a black-box fact checking model and a generative language model (LM). Given a tweet, we iteratively optimize the LM to generate a claim paraphrase that increases the performance of a fact checking model. By learning from preference pairs, we align the LM to the fact checker using direct preference optimization. We show that this novel setup extracts a claim paraphrase that is more verifiable than their original social media formulations, and is on par with competitive baselines. For refuted claims, our method consistently outperforms all baselines.

Title: $\texttt{DINO-Foresight}$: Looking into the Future with DINO

Authors: Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11673
Pdf URL: https://arxiv.org/pdf/2412.11673
Copy Paste: [[2412.11673]] $\texttt{DINO-Foresight}$: Looking into the Future with DINO(https://arxiv.org/abs/2412.11673)
Keywords: self-supervised, foundation model
Abstract: Predicting future dynamics is crucial for applications like autonomous driving and robotics, where understanding the environment is key. Existing pixel-level methods are computationally expensive and often focus on irrelevant details. To address these challenges, we introduce $\texttt{DINO-Foresight}$, a novel framework that operates in the semantic feature space of pretrained Vision Foundation Models (VFMs). Our approach trains a masked feature transformer in a self-supervised manner to predict the evolution of VFM features over time. By forecasting these features, we can apply off-the-shelf, task-specific heads for various scene understanding tasks. In this framework, VFM features are treated as a latent space, to which different heads attach to perform specific tasks for future-frame analysis. Extensive experiments show that our framework outperforms existing methods, demonstrating its robustness and scalability. Additionally, we highlight how intermediate transformer representations in $\texttt{DINO-Foresight}$ improve downstream task performance, offering a promising path for the self-supervised enhancement of VFM features. We provide the implementation code at this https URL .

Title: CiTrus: Squeezing Extra Performance out of Low-data Bio-signal Transfer Learning

Authors: Eloy Geenjaar, Lie Lu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2412.11695
Pdf URL: https://arxiv.org/pdf/2412.11695
Copy Paste: [[2412.11695]] CiTrus: Squeezing Extra Performance out of Low-data Bio-signal Transfer Learning(https://arxiv.org/abs/2412.11695)
Keywords: self-supervised
Abstract: Transfer learning for bio-signals has recently become an important technique to improve prediction performance on downstream tasks with small bio-signal datasets. Recent works have shown that pre-training a neural network model on a large dataset (e.g. EEG) with a self-supervised task, replacing the self-supervised head with a linear classification head, and fine-tuning the model on different downstream bio-signal datasets (e.g., EMG or ECG) can dramatically improve the performance on those datasets. In this paper, we propose a new convolution-transformer hybrid model architecture with masked auto-encoding for low-data bio-signal transfer learning, introduce a frequency-based masked auto-encoding task, employ a more comprehensive evaluation framework, and evaluate how much and when (multimodal) pre-training improves fine-tuning performance. We also introduce a dramatically more performant method of aligning a downstream dataset with a different temporal length and sampling rate to the original pre-training dataset. Our findings indicate that the convolution-only part of our hybrid model can achieve state-of-the-art performance on some low-data downstream tasks. The performance is often improved even further with our full model. In the case of transformer-based models we find that pre-training especially improves performance on downstream datasets, multimodal pre-training often increases those gains further, and our frequency-based pre-training performs the best on average for the lowest and highest data regimes.

Title: On Large Language Models in Mission-Critical IT Governance: Are We Ready Yet?

Authors: Matteo Esposito, Francesco Palagiano, Valentina Lenarduzzi, Davide Taibi
Subjects: cs.CR, cs.AI, cs.ET, cs.SE
Abstract URL: https://arxiv.org/abs/2412.11698
Pdf URL: https://arxiv.org/pdf/2412.11698
Copy Paste: [[2412.11698]] On Large Language Models in Mission-Critical IT Governance: Are We Ready Yet?(https://arxiv.org/abs/2412.11698)
Keywords: generative
Abstract: Context. The security of critical infrastructure has been a fundamental concern since the advent of computers, and this concern has only intensified in today's cyber warfare landscape. Protecting mission-critical systems (MCSs), including essential assets like healthcare, telecommunications, and military coordination, is vital for national security. These systems require prompt and comprehensive governance to ensure their resilience, yet recent events have shown that meeting these demands is increasingly challenging. Aim. Building on prior research that demonstrated the potential of GAI, particularly Large Language Models (LLMs), in improving risk analysis tasks, we aim to explore practitioners' perspectives, specifically developers and security personnel, on using generative AI (GAI) in the governance of IT MCSs seeking to provide insights and recommendations for various stakeholders, including researchers, practitioners, and policymakers. Method. We designed a survey to collect practical experiences, concerns, and expectations of practitioners who develop and implement security solutions in the context of MCSs. Analyzing this data will help identify key trends, challenges, and opportunities for introducing GAIs in this niche domain. Conclusions and Future Works. Our findings highlight that the safe use of LLMs in MCS governance requires interdisciplinary collaboration. Researchers should focus on designing regulation-oriented models and focus on accountability; practitioners emphasize data protection and transparency, while policymakers must establish a unified AI framework with global benchmarks to ensure ethical and secure LLMs-based MCS governance.

Title: AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration

Authors: Wenhao Sun, Rong-Cheng Tu, Jingyi Liao, Zhao Jin, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11706
Pdf URL: https://arxiv.org/pdf/2412.11706
Copy Paste: [[2412.11706]] AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration(https://arxiv.org/abs/2412.11706)
Keywords: diffusion
Abstract: Video Diffusion Transformers (DiTs) have demonstrated significant potential for generating high-fidelity videos but are computationally intensive. Existing acceleration methods include distillation, which requires costly retraining, and feature caching, which is highly sensitive to network architecture. Recent token reduction methods are training-free and architecture-agnostic, offering greater flexibility and wider applicability. However, they enforce the same sequence length across different components, constraining their acceleration potential. We observe that intra-sequence redundancy in video DiTs varies across features, blocks, and denoising timesteps. Building on this observation, we propose Asymmetric Reduction and Restoration (AsymRnR), a training-free approach to accelerate video DiTs. It offers a flexible and adaptive strategy that reduces the number of tokens based on their redundancy to enhance both acceleration and generation quality. We further propose matching cache to facilitate faster processing. Integrated into state-of-the-art video DiTs, AsymRnR achieves a superior speedup without compromising the quality.

Title: Re-Attentional Controllable Video Diffusion Editing

Authors: Yuanzhi Wang, Yong Li, Mengyi Liu, Xiaoya Zhang, Xin Liu, Zhen Cui, Antoni B. Chan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11710
Pdf URL: https://arxiv.org/pdf/2412.11710
Copy Paste: [[2412.11710]] Re-Attentional Controllable Video Diffusion Editing(https://arxiv.org/abs/2412.11710)
Keywords: diffusion
Abstract: Editing videos with textual guidance has garnered popularity due to its streamlined process which mandates users to solely edit the text prompt corresponding to the source video. Recent studies have explored and exploited large-scale text-to-image diffusion models for text-guided video editing, resulting in remarkable video editing capabilities. However, they may still suffer from some limitations such as mislocated objects, incorrect number of objects. Therefore, the controllability of video editing remains a formidable challenge. In this paper, we aim to challenge the above limitations by proposing a Re-Attentional Controllable Video Diffusion Editing (ReAtCo) method. Specially, to align the spatial placement of the target objects with the edited text prompt in a training-free manner, we propose a Re-Attentional Diffusion (RAD) to refocus the cross-attention activation responses between the edited text prompt and the target video during the denoising stage, resulting in a spatially location-aligned and semantically high-fidelity manipulated video. In particular, to faithfully preserve the invariant region content with less border artifacts, we propose an Invariant Region-guided Joint Sampling (IRJS) strategy to mitigate the intrinsic sampling errors w.r.t the invariant regions at each denoising timestep and constrain the generated content to be harmonized with the invariant region content. Experimental results verify that ReAtCo consistently improves the controllability of video diffusion editing and achieves superior video editing performance.

Title: Transferable Adversarial Face Attack with Text Controlled Attribute

Authors: Wenyun Li, Zheng Zhang, Xiangyuan Lan, Dongmei Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11735
Pdf URL: https://arxiv.org/pdf/2412.11735
Copy Paste: [[2412.11735]] Transferable Adversarial Face Attack with Text Controlled Attribute(https://arxiv.org/abs/2412.11735)
Keywords: generative
Abstract: Traditional adversarial attacks typically produce adversarial examples under norm-constrained conditions, whereas unrestricted adversarial examples are free-form with semantically meaningful perturbations. Current unrestricted adversarial impersonation attacks exhibit limited control over adversarial face attributes and often suffer from low transferability. In this paper, we propose a novel Text Controlled Attribute Attack (TCA$^2$) to generate photorealistic adversarial impersonation faces guided by natural language. Specifically, the category-level personal softmax vector is employed to precisely guide the impersonation attacks. Additionally, we propose both data and model augmentation strategies to achieve transferable attacks on unknown target models. Finally, a generative model, \textit{i.e}, Style-GAN, is utilized to synthesize impersonated faces with desired attributes. Extensive experiments on two high-resolution face recognition datasets validate that our TCA$^2$ method can generate natural text-guided adversarial impersonation faces with high transferability. We also evaluate our method on real-world face recognition systems, \textit{i.e}, Face++ and Aliyun, further demonstrating the practical potential of our approach.

Title: Generative Inbetweening through Frame-wise Conditions-Driven Video Generation

Authors: Tianyi Zhu, Dongwei Ren, Qilong Wang, Xiaohe Wu, Wangmeng Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11755
Pdf URL: https://arxiv.org/pdf/2412.11755
Copy Paste: [[2412.11755]] Generative Inbetweening through Frame-wise Conditions-Driven Video Generation(https://arxiv.org/abs/2412.11755)
Keywords: generative
Abstract: Generative inbetweening aims to generate intermediate frame sequences by utilizing two key frames as input. Although remarkable progress has been made in video generation models, generative inbetweening still faces challenges in maintaining temporal stability due to the ambiguous interpolation path between two key frames. This issue becomes particularly severe when there is a large motion gap between input frames. In this paper, we propose a straightforward yet highly effective Frame-wise Conditions-driven Video Generation (FCVG) method that significantly enhances the temporal stability of interpolated video frames. Specifically, our FCVG provides an explicit condition for each frame, making it much easier to identify the interpolation path between two input frames and thus ensuring temporally stable production of visually plausible video frames. To achieve this, we suggest extracting matched lines from two input frames that can then be easily interpolated frame by frame, serving as frame-wise conditions seamlessly integrated into existing video generation models. In extensive evaluations covering diverse scenarios such as natural landscapes, complex human poses, camera movements and animations, existing methods often exhibit incoherent transitions across frames. In contrast, our FCVG demonstrates the capability to generate temporally stable videos using both linear and non-linear interpolation curves. Our project page and code are available at \url{this https URL}.

Title: IDEA-Bench: How Far are Generative Models from Professional Designing?

Authors: Chen Liang, Lianghua Huang, Jingwu Fang, Huanzhang Dou, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Junge Zhang, Xin Zhao, Yu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11767
Pdf URL: https://arxiv.org/pdf/2412.11767
Copy Paste: [[2412.11767]] IDEA-Bench: How Far are Generative Models from Professional Designing?(https://arxiv.org/abs/2412.11767)
Keywords: generative
Abstract: Real-world design tasks - such as picture book creation, film storyboard development using character sets, photo retouching, visual effects, and font transfer - are highly diverse and complex, requiring deep interpretation and extraction of various elements from instructions, descriptions, and reference images. The resulting images often implicitly capture key features from references or user inputs, making it challenging to develop models that can effectively address such varied tasks. While existing visual generative models can produce high-quality images based on prompts, they face significant limitations in professional design scenarios that involve varied forms and multiple inputs and outputs, even when enhanced with adapters like ControlNets and LoRAs. To address this, we introduce IDEA-Bench, a comprehensive benchmark encompassing 100 real-world design tasks, including rendering, visual effects, storyboarding, picture books, fonts, style-based, and identity-preserving generation, with 275 test cases to thoroughly evaluate a model's general-purpose generation capabilities. Notably, even the best-performing model only achieves 22.48 on IDEA-Bench, while the best general-purpose model only achieves 6.81. We provide a detailed analysis of these results, highlighting the inherent challenges and providing actionable directions for improvement. Additionally, we provide a subset of 18 representative tasks equipped with multimodal large language model (MLLM)-based auto-evaluation techniques to facilitate rapid model development and comparison. We releases the benchmark data, evaluation toolkits, and an online leaderboard at this https URL, aiming to drive the advancement of generative models toward more versatile and applicable intelligent design systems.

Title: No More Adam: Learning Rate Scaling at Initialization is All You Need

Authors: Minghao Xu, Lichuan Xiang, Xu Cai, Hongkai Wen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11768
Pdf URL: https://arxiv.org/pdf/2412.11768
Copy Paste: [[2412.11768]] No More Adam: Learning Rate Scaling at Initialization is All You Need(https://arxiv.org/abs/2412.11768)
Keywords: diffusion
Abstract: In this work, we question the necessity of adaptive gradient methods for training deep neural networks. SGD-SaI is a simple yet effective enhancement to stochastic gradient descent with momentum (SGDM). SGD-SaI performs learning rate Scaling at Initialization (SaI) to distinct parameter groups, guided by their respective gradient signal-to-noise ratios (g-SNR). By adjusting learning rates without relying on adaptive second-order momentum, SGD-SaI helps prevent training imbalances from the very first iteration and cuts the optimizer's memory usage by half compared to AdamW. Despite its simplicity and efficiency, SGD-SaI consistently matches or outperforms AdamW in training a variety of Transformer-based tasks, effectively overcoming a long-standing challenge of using SGD for training Transformers. SGD-SaI excels in ImageNet-1K classification with Vision Transformers(ViT) and GPT-2 pretraining for large language models (LLMs, transformer decoder-only), demonstrating robustness to hyperparameter variations and practicality for diverse applications. We further tested its robustness on tasks like LoRA fine-tuning for LLMs and diffusion models, where it consistently outperforms state-of-the-art optimizers. From a memory efficiency perspective, SGD-SaI achieves substantial memory savings for optimizer states, reducing memory usage by 5.93 GB for GPT-2 (1.5B parameters) and 25.15 GB for Llama2-7B compared to AdamW in full-precision training settings.

Title: InterDyn: Controllable Interactive Dynamics with Video Diffusion Models

Authors: Rick Akkerman, Haiwen Feng, Michael J. Black, Dimitrios Tzionas, Victoria Fernández Abrevaya
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11785
Pdf URL: https://arxiv.org/pdf/2412.11785
Copy Paste: [[2412.11785]] InterDyn: Controllable Interactive Dynamics with Video Diffusion Models(https://arxiv.org/abs/2412.11785)
Keywords: diffusion, foundation model, generative
Abstract: Predicting the dynamics of interacting objects is essential for both humans and intelligent systems. However, existing approaches are limited to simplified, toy settings and lack generalizability to complex, real-world environments. Recent advances in generative models have enabled the prediction of state transitions based on interventions, but focus on generating a single future state which neglects the continuous motion and subsequent dynamics resulting from the interaction. To address this gap, we propose InterDyn, a novel framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor. Our key insight is that large video foundation models can act as both neural renderers and implicit physics simulators by learning interactive dynamics from large-scale video data. To effectively harness this capability, we introduce an interactive control mechanism that conditions the video generation process on the motion of the driving entity. Qualitative results demonstrate that InterDyn generates plausible, temporally consistent videos of complex object interactions while generalizing to unseen objects. Quantitative evaluations show that InterDyn outperforms baselines that focus on static state transitions. This work highlights the potential of leveraging video generative models as implicit physics engines.

Title: Scalable Temporal Anomaly Causality Discovery in Large Systems: Achieving Computational Efficiency with Binary Anomaly Flag Data

Authors: Mulugeta Weldezgina Asres, Christian Walter Omlin, The CMS-HCAL Collaboration
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2412.11800
Pdf URL: https://arxiv.org/pdf/2412.11800
Copy Paste: [[2412.11800]] Scalable Temporal Anomaly Causality Discovery in Large Systems: Achieving Computational Efficiency with Binary Anomaly Flag Data(https://arxiv.org/abs/2412.11800)
Keywords: anomaly
Abstract: Extracting anomaly causality facilitates diagnostics once monitoring systems detect system faults. Identifying anomaly causes in large systems involves investigating a more extensive set of monitoring variables across multiple subsystems. However, learning causal graphs comes with a significant computational burden that restrains the applicability of most existing methods in real-time and large-scale deployments. In addition, modern monitoring applications for large systems often generate large amounts of binary alarm flags, and the distinct characteristics of binary anomaly data -- the meaning of state transition and data sparsity -- challenge existing causality learning mechanisms. This study proposes an anomaly causal discovery approach (AnomalyCD), addressing the accuracy and computational challenges of generating causal graphs from binary flag data sets. The AnomalyCD framework presents several strategies, such as anomaly flag characteristics incorporating causality testing, sparse data and link compression, and edge pruning adjustment approaches. We validate the performance of this framework on two datasets: monitoring sensor data of the readout-box system of the Compact Muon Solenoid experiment at CERN, and a public data set for information technology monitoring. The results demonstrate the considerable reduction of the computation overhead and moderate enhancement of the accuracy of temporal causal discovery on binary anomaly data sets.

Title: AMI-Net: Adaptive Mask Inpainting Network for Industrial Anomaly Detection and Localization

Authors: Wei Luo, Haiming Yao, Wenyong Yu, Zhengyong Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11802
Pdf URL: https://arxiv.org/pdf/2412.11802
Copy Paste: [[2412.11802]] AMI-Net: Adaptive Mask Inpainting Network for Industrial Anomaly Detection and Localization(https://arxiv.org/abs/2412.11802)
Keywords: anomaly
Abstract: Unsupervised visual anomaly detection is crucial for enhancing industrial production quality and efficiency. Among unsupervised methods, reconstruction approaches are popular due to their simplicity and effectiveness. The key aspect of reconstruction methods lies in the restoration of anomalous regions, which current methods have not satisfactorily achieved. To tackle this issue, we introduce a novel \uline{A}daptive \uline{M}ask \uline{I}npainting \uline{Net}work (AMI-Net) from the perspective of adaptive mask-inpainting. In contrast to traditional reconstruction methods that treat non-semantic image pixels as targets, our method uses a pre-trained network to extract multi-scale semantic features as reconstruction targets. Given the multiscale nature of industrial defects, we incorporate a training strategy involving random positional and quantitative masking. Moreover, we propose an innovative adaptive mask generator capable of generating adaptive masks that effectively mask anomalous regions while preserving normal regions. In this manner, the model can leverage the visible normal global contextual information to restore the masked anomalous regions, thereby effectively suppressing the reconstruction of defects. Extensive experimental results on the MVTec AD and BTAD industrial datasets validate the effectiveness of the proposed method. Additionally, AMI-Net exhibits exceptional real-time performance, striking a favorable balance between detection accuracy and speed, rendering it highly suitable for industrial applications. Code is available at: this https URL

Title: ColorFlow: Retrieval-Augmented Image Sequence Colorization

Authors: Junhao Zhuang, Xuan Ju, Zhaoyang Zhang, Yong Liu, Shiyi Zhang, Chun Yuan, Ying Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11815
Pdf URL: https://arxiv.org/pdf/2412.11815
Copy Paste: [[2412.11815]] ColorFlow: Retrieval-Augmented Image Sequence Colorization(https://arxiv.org/abs/2412.11815)
Keywords: diffusion, generative, in-context
Abstract: Automatic black-and-white image sequence colorization while preserving character and object identity (ID) is a complex task with significant market demand, such as in cartoon or comic series colorization. Despite advancements in visual colorization using large-scale generative models like diffusion models, challenges with controllability and identity consistency persist, making current solutions unsuitable for industrial this http URL address this, we propose ColorFlow, a three-stage diffusion-based framework tailored for image sequence colorization in industrial applications. Unlike existing methods that require per-ID finetuning or explicit ID embedding extraction, we propose a novel robust and generalizable Retrieval Augmented Colorization pipeline for colorizing images with relevant color references. Our pipeline also features a dual-branch design: one branch for color identity extraction and the other for colorization, leveraging the strengths of diffusion models. We utilize the self-attention mechanism in diffusion models for strong in-context learning and color identity matching. To evaluate our model, we introduce ColorFlow-Bench, a comprehensive benchmark for reference-based colorization. Results show that ColorFlow outperforms existing models across multiple metrics, setting a new standard in sequential image colorization and potentially benefiting the art industry. We release our codes and models on our project page: this https URL.

Title: Spatiotemporal Blind-Spot Network with Calibrated Flow Alignment for Self-Supervised Video Denoising

Authors: Zikang Chen, Tao Jiang, Xiaowan Hu, Wang Zhang, Huaqiu Li, Haoqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11820
Pdf URL: https://arxiv.org/pdf/2412.11820
Copy Paste: [[2412.11820]] Spatiotemporal Blind-Spot Network with Calibrated Flow Alignment for Self-Supervised Video Denoising(https://arxiv.org/abs/2412.11820)
Keywords: self-supervised
Abstract: Self-supervised video denoising aims to remove noise from videos without relying on ground truth data, leveraging the video itself to recover clean frames. Existing methods often rely on simplistic feature stacking or apply optical flow without thorough analysis. This results in suboptimal utilization of both inter-frame and intra-frame information, and it also neglects the potential of optical flow alignment under self-supervised conditions, leading to biased and insufficient denoising outcomes. To this end, we first explore the practicality of optical flow in the self-supervised setting and introduce a SpatioTemporal Blind-spot Network (STBN) for global frame feature utilization. In the temporal domain, we utilize bidirectional blind-spot feature propagation through the proposed blind-spot alignment block to ensure accurate temporal alignment and effectively capture long-range dependencies. In the spatial domain, we introduce the spatial receptive field expansion module, which enhances the receptive field and improves global perception capabilities. Additionally, to reduce the sensitivity of optical flow estimation to noise, we propose an unsupervised optical flow distillation mechanism that refines fine-grained inter-frame interactions during optical flow alignment. Our method demonstrates superior performance across both synthetic and real-world video denoising datasets. The source code is publicly available at this https URL.

Title: Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture

Authors: Jingze Shi, Bingheng Wu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2412.11834
Pdf URL: https://arxiv.org/pdf/2412.11834
Copy Paste: [[2412.11834]] Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture(https://arxiv.org/abs/2412.11834)
Keywords: foundation model
Abstract: In order to make the foundation model more efficient and effective, our idea is combining sequence transformation and state transformation. First, we prove the availability of rotary position embedding in the state space duality algorithm, which reduces the perplexity of the hybrid quadratic causal self-attention and state space duality by more than 4%, to ensure that the combining sequence transformation unifies position encoding. Second, we propose dynamic mask attention, which maintains 100% accuracy in the more challenging multi-query associative recall task, improving by more than 150% compared to quadratic causal self-attention and state space duality, to ensure that the combining sequence transformation selectively filters relevant information. Third, we design cross domain mixture of experts, which makes the computational speed of expert retrieval with more than 1024 experts 8 to 10 times faster than the mixture of experts, to ensure that the combining state transformation quickly retrieval mixture. Finally, we summarize these matrix algorithms that can form the foundation model: Wonderful Matrices, which can be a competitor to popular model architectures.

Title: A Benchmark and Robustness Study of In-Context-Learning with Large Language Models in Music Entity Detection

Authors: Simon Hachmeier, Robert Jäschke
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2412.11851
Pdf URL: https://arxiv.org/pdf/2412.11851
Copy Paste: [[2412.11851]] A Benchmark and Robustness Study of In-Context-Learning with Large Language Models in Music Entity Detection(https://arxiv.org/abs/2412.11851)
Keywords: in-context
Abstract: Detecting music entities such as song titles or artist names is a useful application to help use cases like processing music search queries or analyzing music consumption on the web. Recent approaches incorporate smaller language models (SLMs) like BERT and achieve high results. However, further research indicates a high influence of entity exposure during pre-training on the performance of the models. With the advent of large language models (LLMs), these outperform SLMs in a variety of downstream tasks. However, researchers are still divided if this is applicable to tasks like entity detection in texts due to issues like hallucination. In this paper, we provide a novel dataset of user-generated metadata and conduct a benchmark and a robustness study using recent LLMs with in-context-learning (ICL). Our results indicate that LLMs in the ICL setting yield higher performance than SLMs. We further uncover the large impact of entity exposure on the best performing LLM in our study.

Title: Event-based Motion Deblurring via Multi-Temporal Granularity Fusion

Authors: Xiaopeng Lin, Hongwei Ren, Yulong Huang, Zunchang Liu, Yue Zhou, Haotian Fu, Biao Pan, Bojun Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11866
Pdf URL: https://arxiv.org/pdf/2412.11866
Copy Paste: [[2412.11866]] Event-based Motion Deblurring via Multi-Temporal Granularity Fusion(https://arxiv.org/abs/2412.11866)
Keywords: diffusion
Abstract: Conventional frame-based cameras inevitably produce blurry effects due to motion occurring during the exposure time. Event camera, a bio-inspired sensor offering continuous visual information could enhance the deblurring performance. Effectively utilizing the high-temporal-resolution event data is crucial for extracting precise motion information and enhancing deblurring performance. However, existing event-based image deblurring methods usually utilize voxel-based event representations, losing the fine-grained temporal details that are mathematically essential for fast motion deblurring. In this paper, we first introduce point cloud-based event representation into the image deblurring task and propose a Multi-Temporal Granularity Network (MTGNet). It combines the spatially dense but temporally coarse-grained voxel-based event representation and the temporally fine-grained but spatially sparse point cloud-based event. To seamlessly integrate such complementary representations, we design a Fine-grained Point Branch. An Aggregation and Mapping Module (AMM) is proposed to align the low-level point-based features with frame-based features and an Adaptive Feature Diffusion Module (AFDM) is designed to manage the resolution discrepancies between event data and image data by enriching the sparse point feature. Extensive subjective and objective evaluations demonstrate that our method outperforms current state-of-the-art approaches on both synthetic and real-world datasets.

Title: PunchBench: Benchmarking MLLMs in Multimodal Punchline Comprehension

Authors: Kun Ouyang, Yuanxin Liu, Shicheng Li, Yi Liu, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11906
Pdf URL: https://arxiv.org/pdf/2412.11906
Copy Paste: [[2412.11906]] PunchBench: Benchmarking MLLMs in Multimodal Punchline Comprehension(https://arxiv.org/abs/2412.11906)
Keywords: in-context
Abstract: Multimodal punchlines, which involve humor or sarcasm conveyed in image-caption pairs, are a popular way of communication on online multimedia platforms. With the rapid development of multimodal large language models (MLLMs), it is essential to assess their ability to effectively comprehend these punchlines. However, existing benchmarks on punchline comprehension suffer from three major limitations: 1) language shortcuts that allow models to solely rely on text, 2) lack of question diversity, and 3) narrow focus on a specific domain of multimodal content (e.g., cartoon). To address these limitations, we introduce a multimodal \textbf{Punch}line comprehension \textbf{Bench}mark, named \textbf{PunchBench}, which is tailored for accurate and comprehensive evaluation of punchline comprehension. To enhance the evaluation accuracy, we generate synonymous and antonymous captions by modifying original captions, which mitigates the impact of shortcuts in the captions. To provide a comprehensive evaluation, PunchBench incorporates diverse question formats and image-captions from various domains. On this basis, we conduct extensive evaluations and reveal a significant gap between state-of-the-art MLLMs and humans in punchline comprehension. To improve punchline comprehension, we propose Simple-to-Complex Chain-of-Question (SC-CoQ) strategy, enabling the models to incrementally address complicated questions by first mastering simple ones. SC-CoQ effectively enhances the performance of various MLLMs on PunchBench, surpassing in-context learning and chain-of-thought.

Title: CharacterBench: Benchmarking Character Customization of Large Language Models

Authors: Jinfeng Zhou, Yongkang Huang, Bosi Wen, Guanqun Bi, Yuxuan Chen, Pei Ke, Zhuang Chen, Xiyao Xiao, Libiao Peng, Kuntian Tang, Rongsheng Zhang, Le Zhang, Tangjie Lv, Zhipeng Hu, Hongning Wang, Minlie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11912
Pdf URL: https://arxiv.org/pdf/2412.11912
Copy Paste: [[2412.11912]] CharacterBench: Benchmarking Character Customization of Large Language Models(https://arxiv.org/abs/2412.11912)
Keywords: generative
Abstract: Character-based dialogue (aka role-playing) enables users to freely customize characters for interaction, which often relies on LLMs, raising the need to evaluate LLMs' character customization capability. However, existing benchmarks fail to ensure a robust evaluation as they often only involve a single character category or evaluate limited dimensions. Moreover, the sparsity of character features in responses makes feature-focused generative evaluation both ineffective and inefficient. To address these issues, we propose CharacterBench, the largest bilingual generative benchmark, with 22,859 human-annotated samples covering 3,956 characters from 25 detailed character categories. We define 11 dimensions of 6 aspects, classified as sparse and dense dimensions based on whether character features evaluated by specific dimensions manifest in each response. We enable effective and efficient evaluation by crafting tailored queries for each dimension to induce characters' responses related to specific dimensions. Further, we develop CharacterJudge model for cost-effective and stable evaluations. Experiments show its superiority over SOTA automatic judges (e.g., GPT-4) and our benchmark's potential to optimize LLMs' character customization. Our repository is at this https URL.

Title: RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation

Authors: Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yongkang Wu, Zhonghua Li, Qi Ye, Zhicheng Dou
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2412.11919
Pdf URL: https://arxiv.org/pdf/2412.11919
Copy Paste: [[2412.11919]] RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation(https://arxiv.org/abs/2412.11919)
Keywords: generative
Abstract: Large language models (LLMs) exhibit remarkable generative capabilities but often suffer from hallucinations. Retrieval-augmented generation (RAG) offers an effective solution by incorporating external knowledge, but existing methods still face several limitations: additional deployment costs of separate retrievers, redundant input tokens from retrieved text chunks, and the lack of joint optimization of retrieval and generation. To address these issues, we propose \textbf{RetroLLM}, a unified framework that integrates retrieval and generation into a single, cohesive process, enabling LLMs to directly generate fine-grained evidence from the corpus with constrained decoding. Moreover, to mitigate false pruning in the process of constrained evidence generation, we introduce (1) hierarchical FM-Index constraints, which generate corpus-constrained clues to identify a subset of relevant documents before evidence generation, reducing irrelevant decoding space; and (2) a forward-looking constrained decoding strategy, which considers the relevance of future sequences to improve evidence accuracy. Extensive experiments on five open-domain QA datasets demonstrate RetroLLM's superior performance across both in-domain and out-of-domain tasks. The code is available at \url{this https URL}.

Title: PICLe: Pseudo-Annotations for In-Context Learning in Low-Resource Named Entity Detection

Authors: Sepideh Mamooler, Syrielle Montariol, Alexander Mathis, Antoine Bosselut
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2412.11923
Pdf URL: https://arxiv.org/pdf/2412.11923
Copy Paste: [[2412.11923]] PICLe: Pseudo-Annotations for In-Context Learning in Low-Resource Named Entity Detection(https://arxiv.org/abs/2412.11923)
Keywords: in-context
Abstract: In-context learning (ICL) enables Large Language Models (LLMs) to perform tasks using few demonstrations, facilitating task adaptation when labeled examples are hard to obtain. However, ICL is sensitive to the choice of demonstrations, and it remains unclear which demonstration attributes enable in-context generalization. In this work, we conduct a perturbation study of in-context demonstrations for low-resource Named Entity Detection (NED). Our surprising finding is that in-context demonstrations with partially correct annotated entity mentions can be as effective for task transfer as fully correct demonstrations. Based off our findings, we propose Pseudo-annotated In-Context Learning (PICLe), a framework for in-context learning with noisy, pseudo-annotated demonstrations. PICLe leverages LLMs to annotate many demonstrations in a zero-shot first pass. We then cluster these synthetic demonstrations, sample specific sets of in-context demonstrations from each cluster, and predict entity mentions using each set independently. Finally, we use self-verification to select the final set of entity mentions. We evaluate PICLe on five biomedical NED datasets and show that, with zero human annotation, PICLe outperforms ICL in low-resource settings where limited gold examples can be used as in-context demonstrations.

Title: Advancing Comprehensive Aesthetic Insight with Multi-Scale Text-Guided Self-Supervised Learning

Authors: Yuti Liu, Shice Liu, Junyuan Gao, Pengtao Jiang, Hao Zhang, Jinwei Chen, Bo Li
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2412.11952
Pdf URL: https://arxiv.org/pdf/2412.11952
Copy Paste: [[2412.11952]] Advancing Comprehensive Aesthetic Insight with Multi-Scale Text-Guided Self-Supervised Learning(https://arxiv.org/abs/2412.11952)
Keywords: self-supervised, in-context
Abstract: Image Aesthetic Assessment (IAA) is a vital and intricate task that entails analyzing and assessing an image's aesthetic values, and identifying its highlights and areas for improvement. Traditional methods of IAA often concentrate on a single aesthetic task and suffer from inadequate labeled datasets, thus impairing in-depth aesthetic comprehension. Despite efforts to overcome this challenge through the application of Multi-modal Large Language Models (MLLMs), such models remain underdeveloped for IAA purposes. To address this, we propose a comprehensive aesthetic MLLM capable of nuanced aesthetic insight. Central to our approach is an innovative multi-scale text-guided self-supervised learning technique. This technique features a multi-scale feature alignment module and capitalizes on a wealth of unlabeled data in a self-supervised manner to structurally and functionally enhance aesthetic ability. The empirical evidence indicates that accompanied with extensive instruct-tuning, our model sets new state-of-the-art benchmarks across multiple tasks, including aesthetic scoring, aesthetic commenting, and personalized image aesthetic assessment. Remarkably, it also demonstrates zero-shot learning capabilities in the emerging task of aesthetic suggesting. Furthermore, for personalized image aesthetic assessment, we harness the potential of in-context learning and showcase its inherent advantages.

Title: DARWIN 1.5: Large Language Models as Materials Science Adapted Learners

Authors: Tong Xie, Yuwei Wan, Yixuan Liu, Yuchen Zeng, Wenjie Zhang, Chunyu Kit, Dongzhan Zhou, Bram Hoex
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2412.11970
Pdf URL: https://arxiv.org/pdf/2412.11970
Copy Paste: [[2412.11970]] DARWIN 1.5: Large Language Models as Materials Science Adapted Learners(https://arxiv.org/abs/2412.11970)
Keywords: foundation model
Abstract: Materials discovery and design aim to find components and structures with desirable properties over highly complex and diverse search spaces. Traditional solutions, such as high-throughput simulations and machine learning (ML), often rely on complex descriptors, which hinder generalizability and transferability across tasks. Moreover, these descriptors may deviate from experimental data due to inevitable defects and purity issues in the real world, which may reduce their effectiveness in practical applications. To address these challenges, we propose Darwin 1.5, an open-source large language model (LLM) tailored for materials science. By leveraging natural language as input, Darwin eliminates the need for task-specific descriptors and enables a flexible, unified approach to material property prediction and discovery. We employ a two-stage training strategy combining question-answering (QA) fine-tuning with multi-task learning (MTL) to inject domain-specific knowledge in various modalities and facilitate cross-task knowledge transfer. Through our strategic approach, we achieved a significant enhancement in the prediction accuracy of LLMs, with a maximum improvement of 60\% compared to LLaMA-7B base models. It further outperforms traditional machine learning models on various tasks in material science, showcasing the potential of LLMs to provide a more versatile and scalable foundation model for materials discovery and design.

Title: Controllable Shadow Generation with Single-Step Diffusion Models from Synthetic Data

Authors: Onur Tasar, Clément Chadebec, Benjamin Aubin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11972
Pdf URL: https://arxiv.org/pdf/2412.11972
Copy Paste: [[2412.11972]] Controllable Shadow Generation with Single-Step Diffusion Models from Synthetic Data(https://arxiv.org/abs/2412.11972)
Keywords: diffusion
Abstract: Realistic shadow generation is a critical component for high-quality image compositing and visual effects, yet existing methods suffer from certain limitations: Physics-based approaches require a 3D scene geometry, which is often unavailable, while learning-based techniques struggle with control and visual artifacts. We introduce a novel method for fast, controllable, and background-free shadow generation for 2D object images. We create a large synthetic dataset using a 3D rendering engine to train a diffusion model for controllable shadow generation, generating shadow maps for diverse light source parameters. Through extensive ablation studies, we find that rectified flow objective achieves high-quality results with just a single sampling step enabling real-time applications. Furthermore, our experiments demonstrate that the model generalizes well to real-world images. To facilitate further research in evaluating quality and controllability in shadow generation, we release a new public benchmark containing a diverse set of object images and shadow maps in various settings. The project page is available at this https URL

Title: Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection

Authors: Beomseok Lee, Marco Gaido, Ioan Calapodescu, Laurent Besacier, Matteo Negri
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2412.11978
Pdf URL: https://arxiv.org/pdf/2412.11978
Copy Paste: [[2412.11978]] Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection(https://arxiv.org/abs/2412.11978)
Keywords: foundation model
Abstract: While crowdsourcing is an established solution for facilitating and scaling the collection of speech data, the involvement of non-experts necessitates protocols to ensure final data quality. To reduce the costs of these essential controls, this paper investigates the use of Speech Foundation Models (SFMs) to automate the validation process, examining for the first time the cost/quality trade-off in data acquisition. Experiments conducted on French, German, and Korean data demonstrate that SFM-based validation has the potential to reduce reliance on human validation, resulting in an estimated cost saving of over 40.0% without degrading final data quality. These findings open new opportunities for more efficient, cost-effective, and scalable speech data acquisition.

Title: SAMIC: Segment Anything with In-Context Spatial Prompt Engineering

Authors: Savinay Nagendra, Kashif Rashid, Chaopeng Shen, Daniel Kifer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.11998
Pdf URL: https://arxiv.org/pdf/2412.11998
Copy Paste: [[2412.11998]] SAMIC: Segment Anything with In-Context Spatial Prompt Engineering(https://arxiv.org/abs/2412.11998)
Keywords: foundation model, in-context
Abstract: Few-shot segmentation is the problem of learning to identify specific types of objects (e.g., airplanes) in images from a small set of labeled reference images. The current state of the art is driven by resource-intensive construction of models for every new domain-specific application. Such models must be trained on enormous labeled datasets of unrelated objects (e.g., cars, trains, animals) so that their ``knowledge'' can be transferred to new types of objects. In this paper, we show how to leverage existing vision foundation models (VFMs) to reduce the incremental cost of creating few-shot segmentation models for new domains. Specifically, we introduce SAMIC, a small network that learns how to prompt VFMs in order to segment new types of objects in domain-specific applications. SAMIC enables any task to be approached as a few-shot learning problem. At 2.6 million parameters, it is 94% smaller than the leading models (e.g., having ResNet 101 backbone with 45+ million parameters). Even using 1/5th of the training data provided by one-shot benchmarks, SAMIC is competitive with, or sets the state of the art, on a variety of few-shot and semantic segmentation datasets including COCO-$20^i$, Pascal-$5^i$, PerSeg, FSS-1000, and NWPU VHR-10.

Title: FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning

Authors: Gaojian Wang, Feng Lin, Tong Wu, Zhenguang Liu, Zhongjie Ba, Kui Ren
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2412.12032
Pdf URL: https://arxiv.org/pdf/2412.12032
Copy Paste: [[2412.12032]] FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning(https://arxiv.org/abs/2412.12032)
Keywords: diffusion, self-supervised, foundation model
Abstract: This work asks: with abundant, unlabeled real faces, how to learn a robust and transferable facial representation that boosts various face security tasks with respect to generalization performance? We make the first attempt and propose a self-supervised pretraining framework to learn fundamental representations of real face images, FSFM, that leverages the synergy between masked image modeling (MIM) and instance discrimination (ID). We explore various facial masking strategies for MIM and present a simple yet powerful CRFR-P masking, which explicitly forces the model to capture meaningful intra-region consistency and challenging inter-region coherency. Furthermore, we devise the ID network that naturally couples with MIM to establish underlying local-to-global correspondence via tailored self-distillation. These three learning objectives, namely 3C, empower encoding both local features and global semantics of real faces. After pretraining, a vanilla ViT serves as a universal vision foundation model for downstream face security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forgery detection. Extensive experiments on 10 public datasets demonstrate that our model transfers better than supervised pretraining, visual and facial self-supervised learning arts, and even outperforms task-specialized SOTA methods.

Title: A LoRA is Worth a Thousand Pictures

Authors: Chenxi Liu, Towaki Takikawa, Alec Jacobson
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12048
Pdf URL: https://arxiv.org/pdf/2412.12048
Copy Paste: [[2412.12048]] A LoRA is Worth a Thousand Pictures(https://arxiv.org/abs/2412.12048)
Keywords: diffusion
Abstract: Recent advances in diffusion models and parameter-efficient fine-tuning (PEFT) have made text-to-image generation and customization widely accessible, with Low Rank Adaptation (LoRA) able to replicate an artist's style or subject using minimal data and computation. In this paper, we examine the relationship between LoRA weights and artistic styles, demonstrating that LoRA weights alone can serve as an effective descriptor of style, without the need for additional image generation or knowledge of the original training set. Our findings show that LoRA weights yield better performance in clustering of artistic styles compared to traditional pre-trained features, such as CLIP and DINO, with strong structural similarities between LoRA-based and conventional image-based embeddings observed both qualitatively and quantitatively. We identify various retrieval scenarios for the growing collection of customized models and show that our approach enables more accurate retrieval in real-world settings where knowledge of the training images is unavailable and additional generation is required. We conclude with a discussion on potential future applications, such as zero-shot LoRA fine-tuning and model attribution.

Title: CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology

Authors: Yuxuan Sun, Yixuan Si, Chenglu Zhu, Xuan Gong, Kai Zhang, Pingyi Chen, Ye Zhang, Zhongyi Shui, Tao Lin, Lin Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12077
Pdf URL: https://arxiv.org/pdf/2412.12077
Copy Paste: [[2412.12077]] CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology(https://arxiv.org/abs/2412.12077)
Keywords: foundation model
Abstract: The emergence of large multimodal models (LMMs) has brought significant advancements to pathology. Previous research has primarily focused on separately training patch-level and whole-slide image (WSI)-level models, limiting the integration of learned knowledge across patches and WSIs, and resulting in redundant models. In this work, we introduce CPath-Omni, the first 15-billion-parameter LMM designed to unify both patch and WSI level image analysis, consolidating a variety of tasks at both levels, including classification, visual question answering, captioning, and visual referring prompting. Extensive experiments demonstrate that CPath-Omni achieves state-of-the-art (SOTA) performance across seven diverse tasks on 39 out of 42 datasets, outperforming or matching task-specific models trained for individual tasks. Additionally, we develop a specialized pathology CLIP-based visual processor for CPath-Omni, CPath-CLIP, which, for the first time, integrates different vision models and incorporates a large language model as a text encoder to build a more powerful CLIP model, which achieves SOTA performance on nine zero-shot and four few-shot datasets. Our findings highlight CPath-Omni's ability to unify diverse pathology tasks, demonstrating its potential to streamline and advance the field of foundation model in pathology.

Title: IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations

Authors: Zhibing Li, Tong Wu, Jing Tan, Mengchen Zhang, Jiaqi Wang, Dahua Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12083
Pdf URL: https://arxiv.org/pdf/2412.12083
Copy Paste: [[2412.12083]] IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations(https://arxiv.org/abs/2412.12083)
Keywords: diffusion
Abstract: Capturing geometric and material information from images remains a fundamental challenge in computer vision and graphics. Traditional optimization-based methods often require hours of computational time to reconstruct geometry, material properties, and environmental lighting from dense multi-view inputs, while still struggling with inherent ambiguities between lighting and material. On the other hand, learning-based approaches leverage rich material priors from existing 3D object datasets but face challenges with maintaining multi-view consistency. In this paper, we introduce IDArb, a diffusion-based model designed to perform intrinsic decomposition on an arbitrary number of images under varying illuminations. Our method achieves accurate and multi-view consistent estimation on surface normals and material properties. This is made possible through a novel cross-view, cross-domain attention module and an illumination-augmented, view-adaptive training strategy. Additionally, we introduce ARB-Objaverse, a new dataset that provides large-scale multi-view intrinsic data and renderings under diverse lighting conditions, supporting robust training. Extensive experiments demonstrate that IDArb outperforms state-of-the-art methods both qualitatively and quantitatively. Moreover, our approach facilitates a range of downstream tasks, including single-image relighting, photometric stereo, and 3D reconstruction, highlighting its broad applications in realistic 3D content creation.

Title: Wonderland: Navigating 3D Scenes from a Single Image

Authors: Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N. Plataniotis, Sergey Tulyakov, Jian Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12091
Pdf URL: https://arxiv.org/pdf/2412.12091
Copy Paste: [[2412.12091]] Wonderland: Navigating 3D Scenes from a Single Image(https://arxiv.org/abs/2412.12091)
Keywords: diffusion
Abstract: This paper addresses a challenging question: How can we efficiently create high-quality, wide-scope 3D scenes from a single arbitrary image? Existing methods face several constraints, such as requiring multi-view data, time-consuming per-scene optimization, low visual quality in backgrounds, and distorted reconstructions in unseen areas. We propose a novel pipeline to overcome these limitations. Specifically, we introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splattings for the scenes in a feed-forward manner. The video diffusion model is designed to create videos precisely following specified camera trajectories, allowing it to generate compressed video latents that contain multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive training strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets demonstrate that our model significantly outperforms existing methods for single-view 3D scene generation, particularly with out-of-domain images. For the first time, we demonstrate that a 3D reconstruction model can be effectively built upon the latent space of a diffusion model to realize efficient 3D scene generation.

Title: CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models

Authors: Felix Taubner, Ruihang Zhang, Mathieu Tuli, David B. Lindell
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12093
Pdf URL: https://arxiv.org/pdf/2412.12093
Copy Paste: [[2412.12093]] CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models(https://arxiv.org/abs/2412.12093)
Keywords: diffusion, generative
Abstract: Reconstructing photorealistic and dynamic portrait avatars from images is essential to many applications including advertising, visual effects, and virtual reality. Depending on the application, avatar reconstruction involves different capture setups and constraints $-$ for example, visual effects studios use camera arrays to capture hundreds of reference images, while content creators may seek to animate a single portrait image downloaded from the internet. As such, there is a large and heterogeneous ecosystem of methods for avatar reconstruction. Techniques based on multi-view stereo or neural rendering achieve the highest quality results, but require hundreds of reference images. Recent generative models produce convincing avatars from a single reference image, but visual fidelity yet lags behind multi-view techniques. Here, we present CAP4D: an approach that uses a morphable multi-view diffusion model to reconstruct photoreal 4D (dynamic 3D) portrait avatars from any number of reference images (i.e., one to 100) and animate and render them in real time. Our approach demonstrates state-of-the-art performance for single-, few-, and multi-image 4D portrait avatar reconstruction, and takes steps to bridge the gap in visual fidelity between single-image and multi-view reconstruction techniques.

Title: Causal Diffusion Transformers for Generative Modeling

Authors: Chaorui Deng, Deyao Zh, Kunchang Li, Shi Guan, Haoqi Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2412.12095
Pdf URL: https://arxiv.org/pdf/2412.12095
Copy Paste: [[2412.12095]] Causal Diffusion Transformers for Generative Modeling(https://arxiv.org/abs/2412.12095)
Keywords: diffusion, generative, in-context
Abstract: We introduce Causal Diffusion as the autoregressive (AR) counterpart of Diffusion models. It is a next-token(s) forecasting framework that is friendly to both discrete and continuous modalities and compatible with existing next-token prediction models like LLaMA and GPT. While recent works attempt to combine diffusion with AR models, we show that introducing sequential factorization to a diffusion model can substantially improve its performance and enables a smooth transition between AR and diffusion generation modes. Hence, we propose CausalFusion - a decoder-only transformer that dual-factorizes data across sequential tokens and diffusion noise levels, leading to state-of-the-art results on the ImageNet generation benchmark while also enjoying the AR advantage of generating an arbitrary number of tokens for in-context reasoning. We further demonstrate CausalFusion's multimodal capabilities through a joint image generation and captioning model, and showcase CausalFusion's ability for zero-shot in-context image manipulations. We hope that this work could provide the community with a fresh perspective on training multimodal models over discrete and continuous data.