2024-11-26

Title: Adaptively Controllable Diffusion Model for Efficient Conditional Image Generation

Authors: Yucheng Xing, Xiaodong Liu, Xin Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15199
Pdf URL: https://arxiv.org/pdf/2411.15199
Copy Paste: [[2411.15199]] Adaptively Controllable Diffusion Model for Efficient Conditional Image Generation(https://arxiv.org/abs/2411.15199)
Keywords: diffusion, generative
Abstract: With the development of artificial intelligence, more and more attention has been put onto generative models, which represent the creativity, a very important aspect of intelligence. In recent years, diffusion models have been studied and proven to be more reasonable and effective than previous methods. However, common diffusion frameworks suffer from controllability problems. Although extra conditions have been considered by some work to guide the diffusion process for a specific target generation, it only controls the generation result but not its process. In this work, we propose a new adaptive framework, $\textit{Adaptively Controllable Diffusion (AC-Diff) Model}$, to automatically and fully control the generation process, including not only the type of generation result but also the length and parameters of the generation process. Both inputs and conditions will be first fed into a $\textit{Conditional Time-Step (CTS) Module}$ to determine the number of steps needed for a generation. Then according to the length of the process, the diffusion rate parameters will be estimated through our $\textit{Adaptive Hybrid Noise Schedule (AHNS) Module}$. We further train the network with the corresponding adaptive sampling mechanism to learn how to adjust itself according to the conditions for the overall performance improvement. To enable its practical applications, AC-Diff is expected to largely reduce the average number of generation steps and execution time while maintaining the same performance as done in the literature diffusion models.

Title: Self-Supervised Conditional Distribution Learning on Graphs

Authors: Jie Chen, Hua Mao, Yuanbiao Gou, Zhu Wang, Xi Peng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15206
Pdf URL: https://arxiv.org/pdf/2411.15206
Copy Paste: [[2411.15206]] Self-Supervised Conditional Distribution Learning on Graphs(https://arxiv.org/abs/2411.15206)
Keywords: self-supervised
Abstract: Graph contrastive learning (GCL) has shown promising performance in semisupervised graph classification. However, existing studies still encounter significant challenges in GCL. First, successive layers in graph neural network (GNN) tend to produce more similar node embeddings, while GCL aims to increase the dissimilarity between negative pairs of node embeddings. This inevitably results in a conflict between the message-passing mechanism of GNNs and the contrastive learning of negative pairs via intraviews. Second, leveraging the diversity and quantity of data provided by graph-structured data augmentations while preserving intrinsic semantic information is challenging. In this paper, we propose a self-supervised conditional distribution learning (SSCDL) method designed to learn graph representations from graph-structured data for semisupervised graph classification. Specifically, we present an end-to-end graph representation learning model to align the conditional distributions of weakly and strongly augmented features over the original features. This alignment effectively reduces the risk of disrupting intrinsic semantic information through graph-structured data augmentation. To avoid conflict between the message-passing mechanism and contrastive learning of negative pairs, positive pairs of node representations are retained for measuring the similarity between the original features and the corresponding weakly augmented features. Extensive experiments with several benchmark graph datasets demonstrate the effectiveness of the proposed SSCDL method.

Title: Quantized symbolic time series approximation

Authors: Erin Carson, Xinye Chen, Cheng Kang
Subjects: cs.LG, eess.SP, stat.ML
Abstract URL: https://arxiv.org/abs/2411.15209
Pdf URL: https://arxiv.org/pdf/2411.15209
Copy Paste: [[2411.15209]] Quantized symbolic time series approximation(https://arxiv.org/abs/2411.15209)
Keywords: anomaly
Abstract: Time series are ubiquitous in numerous science and engineering domains, e.g., signal processing, bioinformatics, and astronomy. Previous work has verified the efficacy of symbolic time series representation in a variety of engineering applications due to its storage efficiency and numerosity reduction. The most recent symbolic aggregate approximation technique, ABBA, has been shown to preserve essential shape information of time series and improve downstream applications, e.g., neural network inference regarding prediction and anomaly detection in time series. Motivated by the emergence of high-performance hardware which enables efficient computation for low bit-width representations, we present a new quantization-based ABBA symbolic approximation technique, QABBA, which exhibits improved storage efficiency while retaining the original speed and accuracy of symbolic reconstruction. We prove an upper bound for the error arising from quantization and discuss how the number of bits should be chosen to balance this with other errors. An application of QABBA with large language models (LLMs) for time series regression is also presented, and its utility is investigated. By representing the symbolic chain of patterns on time series, QABBA not only avoids the training of embedding from scratch, but also achieves a new state-of-the-art on Monash regression dataset. The symbolic approximation to the time series offers a more efficient way to fine-tune LLMs on the time series regression task which contains various application domains. We further present a set of extensive experiments performed across various well-established datasets to demonstrate the advantages of the QABBA method for symbolic approximation.

Title: S$^2$ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning

Authors: Mingze Yin, Hanjing Zhou, Jialu Wu, Yiheng Zhu, Yuxuan Zhan, Zitai Kong, Hongxia Xu, Chang-Yu Hsieh, Jintai Chen, Tingjun Hou, Jian Wu
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2411.15215
Pdf URL: https://arxiv.org/pdf/2411.15215
Copy Paste: [[2411.15215]] S$^2$ALM: Sequence-Structure Pre-trained Large Language Model for Comprehensive Antibody Representation Learning(https://arxiv.org/abs/2411.15215)
Keywords: foundation model
Abstract: Antibodies safeguard our health through their precise and potent binding to specific antigens, demonstrating promising therapeutic efficacy in the treatment of numerous diseases, including COVID-19. Recent advancements in biomedical language models have shown the great potential to interpret complex biological structures and functions. However, existing antibody specific models have a notable limitation that they lack explicit consideration for antibody structural information, despite the fact that both 1D sequence and 3D structure carry unique and complementary insights into antibody behavior and functionality. This paper proposes Sequence-Structure multi-level pre-trained Antibody Language Model (S$^2$ALM), combining holistic sequential and structural information in one unified, generic antibody foundation model. We construct a hierarchical pre-training paradigm incorporated with two customized multi-level training objectives to facilitate the modeling of comprehensive antibody representations. S$^2$ALM's representation space uncovers inherent functional binding mechanisms, biological evolution properties and structural interaction patterns. Pre-trained over 75 million sequences and 11.7 million structures, S$^2$ALM can be adopted for diverse downstream tasks: accurately predicting antigen-antibody binding affinities, precisely distinguishing B cell maturation stages, identifying antibody crucial binding positions, and specifically designing novel coronavirus-binding antibodies. Remarkably, S$^2$ALM outperforms well-established and renowned baselines and sets new state-of-the-art performance across extensive antibody specific understanding and generation tasks. S$^2$ALM's ability to model comprehensive and generalized representations further positions its potential to advance real-world therapeutic antibody development, potentially addressing unmet academic, industrial, and clinical needs.

Title: Sampling with Adaptive Variance for Multimodal Distributions

Authors: Björn Engquist, Kui Ren, Yunan Yang
Subjects: cs.LG, math.NA, stat.CO, stat.ML
Abstract URL: https://arxiv.org/abs/2411.15220
Pdf URL: https://arxiv.org/pdf/2411.15220
Copy Paste: [[2411.15220]] Sampling with Adaptive Variance for Multimodal Distributions(https://arxiv.org/abs/2411.15220)
Keywords: diffusion
Abstract: We propose and analyze a class of adaptive sampling algorithms for multimodal distributions on a bounded domain, which share a structural resemblance to the classic overdamped Langevin dynamics. We first demonstrate that this class of linear dynamics with adaptive diffusion coefficients and vector fields can be interpreted and analyzed as weighted Wasserstein gradient flows of the Kullback--Leibler (KL) divergence between the current distribution and the target Gibbs distribution, which directly leads to the exponential convergence of both the KL and $\chi^2$ divergences, with rates depending on the weighted Wasserstein metric and the Gibbs potential. We then show that a derivative-free version of the dynamics can be used for sampling without gradient information of the Gibbs potential and that for Gibbs distributions with nonconvex potentials, this approach could achieve significantly faster convergence than the classical overdamped Langevin dynamics. A comparison of the mean transition times between local minima of a nonconvex potential further highlights the better efficiency of the derivative-free dynamics in sampling.

Title: IterIS: Iterative Inference-Solving Alignment for LoRA Merging

Authors: Hongxu Chen, Runshi Li, Bowei Zhu, Zhen Wang, Long Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15231
Pdf URL: https://arxiv.org/pdf/2411.15231
Copy Paste: [[2411.15231]] IterIS: Iterative Inference-Solving Alignment for LoRA Merging(https://arxiv.org/abs/2411.15231)
Keywords: diffusion
Abstract: Low-rank adaptations (LoRA) are widely used to fine-tune large models across various domains for specific downstream tasks. While task-specific LoRAs are often available, concerns about data privacy and intellectual property can restrict access to training data, limiting the acquisition of a multi-task model through gradient-based training. In response, LoRA merging presents an effective solution by combining multiple LoRAs into a unified adapter while maintaining data privacy. Prior works on LoRA merging primarily frame it as an optimization problem, yet these approaches face several limitations, including the rough assumption about input features utilized in optimization, massive sample requirements, and the unbalanced optimization objective. These limitations can significantly degrade performance. To address these, we propose a novel optimization-based method, named IterIS: 1) We formulate LoRA merging as an advanced optimization problem to mitigate the rough assumption. Additionally, we employ an iterative inference-solving framework in our algorithm. It can progressively refine the optimization objective for improved performance. 2) We introduce an efficient regularization term to reduce the need for massive sample requirements (requiring only 1-5% of the unlabeled samples compared to prior methods). 3) We utilize adaptive weights in the optimization objective to mitigate potential unbalances in LoRA merging process. Our method demonstrates significant improvements over multiple baselines and state-of-the-art methods in composing tasks for text-to-image diffusion, vision-language models, and large language models. Furthermore, our layer-wise algorithm can achieve convergence with minimal steps, ensuring efficiency in both memory and computation.

Title: BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models

Authors: Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, Yiming Xiao
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.15232
Pdf URL: https://arxiv.org/pdf/2411.15232
Copy Paste: [[2411.15232]] BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models(https://arxiv.org/abs/2411.15232)
Keywords: self-supervised
Abstract: Recent advancements in vision-language models (VLMs), such as CLIP, have demonstrated substantial success in self-supervised representation learning for vision tasks. However, effectively adapting VLMs to downstream applications remains challenging, as their accuracy often depends on time-intensive and expertise-demanding prompt engineering, while full model fine-tuning is costly. This is particularly true for biomedical images, which, unlike natural images, typically suffer from limited annotated datasets, unintuitive image contrasts, and nuanced visual features. Recent prompt learning techniques, such as Context Optimization (CoOp) intend to tackle these issues, but still fall short in generalizability. Meanwhile, explorations in prompt learning for biomedical image analysis are still highly limited. In this work, we propose BiomedCoOp, a novel prompt learning framework that enables efficient adaptation of BiomedCLIP for accurate and highly generalizable few-shot biomedical image classification. Our approach achieves effective prompt context learning by leveraging semantic consistency with average prompt ensembles from Large Language Models (LLMs) and knowledge distillation with a statistics-based prompt selection strategy. We conducted comprehensive validation of our proposed framework on 11 medical datasets across 9 modalities and 10 organs against existing state-of-the-art methods, demonstrating significant improvements in both accuracy and generalizability. The code will be publicly available at this https URL.

Title: Text Embedding is Not All You Need: Attention Control for Text-to-Image Semantic Alignment with Text Self-Attention Maps

Authors: Jeeyung Kim, Erfan Esmaeili, Qiang Qiu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15236
Pdf URL: https://arxiv.org/pdf/2411.15236
Copy Paste: [[2411.15236]] Text Embedding is Not All You Need: Attention Control for Text-to-Image Semantic Alignment with Text Self-Attention Maps(https://arxiv.org/abs/2411.15236)
Keywords: diffusion
Abstract: In text-to-image diffusion models, the cross-attention map of each text token indicates the specific image regions attended. Comparing these maps of syntactically related tokens provides insights into how well the generated image reflects the text prompt. For example, in the prompt, "a black car and a white clock", the cross-attention maps for "black" and "car" should focus on overlapping regions to depict a black car, while "car" and "clock" should not. Incorrect overlapping in the maps generally produces generation flaws such as missing objects and incorrect attribute binding. Our study makes the key observations investigating this issue in the existing text-to-image models:(1) the similarity in text embeddings between different tokens -- used as conditioning inputs -- can cause their cross-attention maps to focus on the same image regions; and (2) text embeddings often fail to faithfully capture syntactic relations already within text attention maps. As a result, such syntactic relationships can be overlooked in cross-attention module, leading to inaccurate image generation. To address this, we propose a method that directly transfers syntactic relations from the text attention maps to the cross-attention module via a test-time optimization. Our approach leverages this inherent yet unexploited information within text attention maps to enhance image-text semantic alignment across diverse prompts, without relying on external guidance.

Title: Faithful Label-free Knowledge Distillation

Authors: Evelyn J. Mannix, Liam Hodgkinson, Howard Bondell
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15239
Pdf URL: https://arxiv.org/pdf/2411.15239
Copy Paste: [[2411.15239]] Faithful Label-free Knowledge Distillation(https://arxiv.org/abs/2411.15239)
Keywords: foundation model
Abstract: Knowledge distillation approaches are model compression techniques, with the goal of training a highly performant student model by using a teacher network that is larger or contains a different inductive bias. These approaches are particularly useful when applied to large computer vision foundation models, which can be compressed into smaller variants that retain desirable properties such as improved robustness. This paper presents a label-free knowledge distillation approach called Teacher in the Middle (TinTeM), which improves on previous methods by learning an approximately orthogonal mapping from the latent space of the teacher to the student network. This produces a more faithful student, which better replicates the behavior of the teacher network across a range of benchmarks testing model robustness, generalisability and out-of-distribution detection. It is further shown that knowledge distillation with TinTeM on task specific datasets leads to more accurate models with greater generalisability and OOD detection performance, and that this technique provides a competitive pathway for training highly performant lightweight models on small datasets.

Title: Is Attention All You Need For Actigraphy? Foundation Models of Wearable Accelerometer Data for Mental Health Research

Authors: Franklin Y. Ruan, Aiwei Zhang, Jenny Y. Oh, SouYoung Jin, Nicholas C Jacobson
Subjects: cs.LG, cs.AI, cs.HC, q-bio.QM
Abstract URL: https://arxiv.org/abs/2411.15240
Pdf URL: https://arxiv.org/pdf/2411.15240
Copy Paste: [[2411.15240]] Is Attention All You Need For Actigraphy? Foundation Models of Wearable Accelerometer Data for Mental Health Research(https://arxiv.org/abs/2411.15240)
Keywords: foundation model
Abstract: Wearable accelerometry (actigraphy) has provided valuable data for clinical insights since the 1970s and is increasingly important as wearable devices continue to become widespread. The effectiveness of actigraphy in research and clinical contexts is heavily dependent on the modeling architecture utilized. To address this, we developed the Pretrained Actigraphy Transformer (PAT)--the first pretrained and fully attention-based model designed specifically to handle actigraphy. PAT was pretrained on actigraphy from 29,307 participants in NHANES, enabling it to deliver state-of-the-art performance when fine-tuned across various actigraphy prediction tasks in the mental health domain, even in data-limited scenarios. For example, when trained to predict benzodiazepine usage using actigraphy from only 500 labeled participants, PAT achieved an 8.8 percentage-point AUC improvement over the best baseline. With fewer than 2 million parameters and built-in model explainability, PAT is robust yet easy to deploy in health research settings. GitHub: this https URL

Title: Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward

Authors: Zhiwei Jia, Yuesong Nan, Huixi Zhao, Gengdai Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.15247
Pdf URL: https://arxiv.org/pdf/2411.15247
Copy Paste: [[2411.15247]] Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward(https://arxiv.org/abs/2411.15247)
Keywords: diffusion
Abstract: Recent research has shown that fine-tuning diffusion models (DMs) with arbitrary rewards, including non-differentiable ones, is feasible with reinforcement learning (RL) techniques, enabling flexible model alignment. However, applying existing RL methods to timestep-distilled DMs is challenging for ultra-fast ($\le2$-step) image generation. Our analysis suggests several limitations of policy-based RL methods such as PPO or DPO toward this goal. Based on the insights, we propose fine-tuning DMs with learned differentiable surrogate rewards. Our method, named LaSRO, learns surrogate reward models in the latent space of SDXL to convert arbitrary rewards into differentiable ones for efficient reward gradient guidance. LaSRO leverages pre-trained latent DMs for reward modeling and specifically targets image generation $\le2$ steps for reward optimization, enhancing generalizability and efficiency. LaSRO is effective and stable for improving ultra-fast image generation with different reward objectives, outperforming popular RL methods including PPO and DPO. We further show LaSRO's connection to value-based RL, providing theoretical insights. See our webpage at this https URL.

Title: TPLogAD: Unsupervised Log Anomaly Detection Based on Event Templates and Key Parameters

Authors: Jiawei Lu, Chengrong Wu
Subjects: cs.LG, cs.AI, cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2411.15250
Pdf URL: https://arxiv.org/pdf/2411.15250
Copy Paste: [[2411.15250]] TPLogAD: Unsupervised Log Anomaly Detection Based on Event Templates and Key Parameters(https://arxiv.org/abs/2411.15250)
Keywords: anomaly
Abstract: Log-system is an important mechanism for recording the runtime status and events of Web service systems, and anomaly detection in logs is an effective method of detecting problems. However, manual anomaly detection in logs is inefficient, error-prone, and unrealistic. Existing log anomaly detection methods either use the indexes of event templates, or form vectors by embedding the fixed string part of the template as a sentence, or use time parameters for sequence analysis. However, log entries often contain features and semantic information that cannot be fully represented by these methods, resulting in missed and false alarms. In this paper, we propose TPLogAD, a universal unsupervised method for analyzing unstructured logs, which performs anomaly detection based on event templates and key parameters. The itemplate2vec and para2vec included in TPLogAD are two efficient and easy-to-implement semantic representation methods for logs, detecting anomalies in event templates and parameters respectively, which has not been achieved in previous work. Additionally, TPLogAD can avoid the interference of log diversity and dynamics on anomaly detection. Our experiments on four public log datasets show that TPLogAD outperforms existing log anomaly detection methods.

Title: LocRef-Diffusion:Tuning-Free Layout and Appearance-Guided Generation

Authors: Fan Deng, Yaguang Wu, Xinyang Yu, Xiangjun Huang, Jian Yang, Guangyu Yan, Qiang Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15252
Pdf URL: https://arxiv.org/pdf/2411.15252
Copy Paste: [[2411.15252]] LocRef-Diffusion:Tuning-Free Layout and Appearance-Guided Generation(https://arxiv.org/abs/2411.15252)
Keywords: diffusion
Abstract: Recently, text-to-image models based on diffusion have achieved remarkable success in generating high-quality images. However, the challenge of personalized, controllable generation of instances within these images remains an area in need of further development. In this paper, we present LocRef-Diffusion, a novel, tuning-free model capable of personalized customization of multiple instances' appearance and position within an image. To enhance the precision of instance placement, we introduce a Layout-net, which controls instance generation locations by leveraging both explicit instance layout information and an instance region cross-attention module. To improve the appearance fidelity to reference images, we employ an appearance-net that extracts instance appearance features and integrates them into the diffusion model through cross-attention mechanisms. We conducted extensive experiments on the COCO and OpenImages datasets, and the results demonstrate that our proposed method achieves state-of-the-art performance in layout and appearance guided generation.

Title: VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing

Authors: Jiahao Hu, Tianxiong Zhong, Xuebo Wang, Boyuan Jiang, Xingye Tian, Fei Yang, Pengfei Wan, Di Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15260
Pdf URL: https://arxiv.org/pdf/2411.15260
Copy Paste: [[2411.15260]] VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing(https://arxiv.org/abs/2411.15260)
Keywords: diffusion
Abstract: Diffusion-based image editing models have made remarkable progress in recent years. However, achieving high-quality video editing remains a significant challenge. One major hurdle is the absence of open-source, large-scale video editing datasets based on real-world data, as constructing such datasets is both time-consuming and costly. Moreover, video data requires a significantly larger number of tokens for representation, which substantially increases the training costs for video editing models. Lastly, current video editing models offer limited interactivity, often making it difficult for users to express their editing requirements effectively in a single attempt. To address these challenges, this paper introduces a dataset VIVID-10M and a baseline model VIVID. VIVID-10M is the first large-scale hybrid image-video local editing dataset aimed at reducing data construction and model training costs, which comprises 9.7M samples that encompass a wide range of video editing tasks. VIVID is a Versatile and Interactive VIdeo local eDiting model trained on VIVID-10M, which supports entity addition, modification, and deletion. At its core, a keyframe-guided interactive video editing mechanism is proposed, enabling users to iteratively edit keyframes and propagate it to other frames, thereby reducing latency in achieving desired outcomes. Extensive experimental evaluations show that our approach achieves state-of-the-art performance in video local editing, surpassing baseline methods in both automated metrics and user studies. The VIVID-10M dataset and the VIVID editing model will be available at \url{this https URL}.

Title: MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation

Authors: Weijia Wu, Mingyu Liu, Zeyu Zhu, Xi Xia, Haoen Feng, Wen Wang, Kevin Qinghong Lin, Chunhua Shen, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15262
Pdf URL: https://arxiv.org/pdf/2411.15262
Copy Paste: [[2411.15262]] MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation(https://arxiv.org/abs/2411.15262)
Keywords: diffusion
Abstract: Recent advancements in video generation models, like Stable Video Diffusion, show promising results, but primarily focus on short, single-scene videos. These models struggle with generating long videos that involve multiple scenes, coherent narratives, and consistent characters. Furthermore, there is no publicly available dataset tailored for the analysis, evaluation, and training of long video generation models. In this paper, we present MovieBench: A Hierarchical Movie-Level Dataset for Long Video Generation, which addresses these challenges by providing unique contributions: (1) movie-length videos featuring rich, coherent storylines and multi-scene narratives, (2) consistency of character appearance and audio across scenes, and (3) hierarchical data structure contains high-level movie information and detailed shot-level descriptions. Experiments demonstrate that MovieBench brings some new insights and challenges, such as maintaining character ID consistency across multiple scenes for various characters. The dataset will be public and continuously maintained, aiming to advance the field of long video generation. Data can be found at: this https URL.

Title: Derivative-Free Diffusion Manifold-Constrained Gradient for Unified XAI

Authors: Won Jun Kim, Hyungjin Chung, Jaemin Kim, Sangmin Lee, Byeongsu Sim, Jong Chul Ye
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15265
Pdf URL: https://arxiv.org/pdf/2411.15265
Copy Paste: [[2411.15265]] Derivative-Free Diffusion Manifold-Constrained Gradient for Unified XAI(https://arxiv.org/abs/2411.15265)
Keywords: diffusion
Abstract: Gradient-based methods are a prototypical family of explainability techniques, especially for image-based models. Nonetheless, they have several shortcomings in that they (1) require white-box access to models, (2) are vulnerable to adversarial attacks, and (3) produce attributions that lie off the image manifold, leading to explanations that are not actually faithful to the model and do not align well with human perception. To overcome these challenges, we introduce Derivative-Free Diffusion Manifold-Constrainted Gradients (FreeMCG), a novel method that serves as an improved basis for explainability of a given neural network than the traditional gradient. Specifically, by leveraging ensemble Kalman filters and diffusion models, we derive a derivative-free approximation of the model's gradient projected onto the data manifold, requiring access only to the model's outputs. We demonstrate the effectiveness of FreeMCG by applying it to both counterfactual generation and feature attribution, which have traditionally been treated as distinct tasks. Through comprehensive evaluation on both tasks, counterfactual explanation and feature attribution, we show that our method yields state-of-the-art results while preserving the essential properties expected of XAI tools.

Title: EADReg: Probabilistic Correspondence Generation with Efficient Autoregressive Diffusion Model for Outdoor Point Cloud Registration

Authors: Linrui Gong, Jiuming Liu, Junyi Ma, Lihao Liu, Yaonan Wang, Hesheng Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15271
Pdf URL: https://arxiv.org/pdf/2411.15271
Copy Paste: [[2411.15271]] EADReg: Probabilistic Correspondence Generation with Efficient Autoregressive Diffusion Model for Outdoor Point Cloud Registration(https://arxiv.org/abs/2411.15271)
Keywords: diffusion
Abstract: Diffusion models have shown the great potential in the point cloud registration (PCR) task, especially for enhancing the robustness to challenging cases. However, existing diffusion-based PCR methods primarily focus on instance-level scenarios and struggle with outdoor LiDAR points, where the sparsity, irregularity, and huge point scale inherent in LiDAR points pose challenges to establishing dense global point-to-point correspondences. To address this issue, we propose a novel framework named EADReg for efficient and robust registration of LiDAR point clouds based on autoregressive diffusion models. EADReg follows a coarse-to-fine registration paradigm. In the coarse stage, we employ a Bi-directional Gaussian Mixture Model (BGMM) to reject outlier points and obtain purified point cloud pairs. BGMM establishes correspondences between the Gaussian Mixture Models (GMMs) from the source and target frames, enabling reliable coarse registration based on filtered features and geometric information. In the fine stage, we treat diffusion-based PCR as an autoregressive process to generate robust point correspondences, which are then iteratively refined on upper layers. Despite common criticisms of diffusion-based methods regarding inference speed, EADReg achieves runtime comparable to convolutional-based methods. Extensive experiments on the KITTI and NuScenes benchmark datasets highlight the state-of-the-art performance of our proposed method. Codes will be released upon publication.

Title: Foundation Cures Personalization: Recovering Facial Personalized Models' Prompt Consistency

Authors: Yiyang Cai, Zhengkai Jiang, Yulong Liu, Chunyang Jiang, Wei Xue, Wenhan Luo, Yike Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15277
Pdf URL: https://arxiv.org/pdf/2411.15277
Copy Paste: [[2411.15277]] Foundation Cures Personalization: Recovering Facial Personalized Models' Prompt Consistency(https://arxiv.org/abs/2411.15277)
Keywords: foundation model
Abstract: Facial personalization represents a crucial downstream task in the domain of text-to-image generation. To preserve identity fidelity while ensuring alignment with user-defined prompts, current mainstream frameworks for facial personalization predominantly employ identity embedding mechanisms to associate identity information with textual embeddings. However, our experiments show that identity embeddings compromise the effectiveness of other tokens within the prompt, thereby hindering high prompt consistency, particularly when prompts involve multiple facial attributes. Moreover, previous works overlook the fact that their corresponding foundation models hold great potential to generate faces aligning to prompts well and can be easily leveraged to cure these ill-aligned attributes in personalized models. Building upon these insights, we propose FreeCure, a training-free framework that harnesses the intrinsic knowledge from the foundation models themselves to improve the prompt consistency of personalization models. First, by extracting cross-attention and semantic maps from the denoising process of foundation models, we identify easily localized attributes (e.g., hair, accessories, etc). Second, we enhance multiple attributes in the outputs of personalization models through a novel noise-blending strategy coupled with an inversion-based process. Our approach offers several advantages: it eliminates the need for training; it effectively facilitates the enhancement for a wide array of facial attributes in a non-intrusive manner; and it can be seamlessly integrated into existing popular personalization models. FreeCure has demonstrated significant improvements in prompt consistency across a diverse set of state-of-the-art facial personalization models while maintaining the integrity of original identity fidelity.

Title: There is no SAMantics! Exploring SAM as a Backbone for Visual Understanding Tasks

Authors: Miguel Espinosa, Chenhongyi Yang, Linus Ericsson, Steven McDonagh, Elliot J. Crowley
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15288
Pdf URL: https://arxiv.org/pdf/2411.15288
Copy Paste: [[2411.15288]] There is no SAMantics! Exploring SAM as a Backbone for Visual Understanding Tasks(https://arxiv.org/abs/2411.15288)
Keywords: in-context
Abstract: The Segment Anything Model (SAM) was originally designed for label-agnostic mask generation. Does this model also possess inherent semantic understanding, of value to broader visual tasks? In this work we follow a multi-staged approach towards exploring this question. We firstly quantify SAM's semantic capabilities by comparing base image encoder efficacy under classification tasks, in comparison with established models (CLIP and DINOv2). Our findings reveal a significant lack of semantic discriminability in SAM feature representations, limiting potential for tasks that require class differentiation. This initial result motivates our exploratory study that attempts to enable semantic information via in-context learning with lightweight fine-tuning where we observe that generalisability to unseen classes remains limited. Our observations culminate in the proposal of a training-free approach that leverages DINOv2 features, towards better endowing SAM with semantic understanding and achieving instance-level class differentiation through feature-based similarity. Our study suggests that incorporation of external semantic sources provides a promising direction for the enhancement of SAM's utility with respect to complex visual tasks that require semantic understanding.

Title: PPLqa: An Unsupervised Information-Theoretic Quality Metric for Comparing Generative Large Language Models

Authors: Gerald Friedland, Xin Huang, Yueying Cui, Vishaal Kapoor, Ashish Khetan, Sanjiv Das
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15320
Pdf URL: https://arxiv.org/pdf/2411.15320
Copy Paste: [[2411.15320]] PPLqa: An Unsupervised Information-Theoretic Quality Metric for Comparing Generative Large Language Models(https://arxiv.org/abs/2411.15320)
Keywords: generative
Abstract: We propose PPLqa, an easy to compute, language independent, information-theoretic metric to measure the quality of responses of generative Large Language Models (LLMs) in an unsupervised way, without requiring ground truth annotations or human supervision. The method and metric enables users to rank generative language models for quality of responses, so as to make a selection of the best model for a given task. Our single metric assesses LLMs with an approach that subsumes, but is not explicitly based on, coherence and fluency (quality of writing) and relevance and consistency (appropriateness of response) to the query. PPLqa performs as well as other related metrics, and works better with long-form Q\&A. Thus, PPLqa enables bypassing the lengthy annotation process required for ground truth evaluations, and it also correlates well with human and LLM rankings.

Title: Zero-Shot Coreset Selection: Efficient Pruning for Unlabeled Data

Authors: Brent A. Griffin, Jacob Marks, Jason J. Corso
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15349
Pdf URL: https://arxiv.org/pdf/2411.15349
Copy Paste: [[2411.15349]] Zero-Shot Coreset Selection: Efficient Pruning for Unlabeled Data(https://arxiv.org/abs/2411.15349)
Keywords: foundation model
Abstract: Deep learning increasingly relies on massive data with substantial costs for storage, annotation, and model training. To reduce these costs, coreset selection aims to find a representative subset of data to train models while ideally performing on par with the full data training. State-of-the-art coreset methods use carefully-designed criteria to quantify the importance of each data example via ground truth labels and dataset-specific training, then select examples whose scores lie in a certain range to construct a coreset. These methods work well in their respective settings, however, they cannot select data that are unlabeled, which is the majority of real-world data. To that end, this paper motivates and formalizes the problem of unlabeled coreset selection to enable greater scale and reduce annotation costs for deep learning. As a solution, we develop Zero-Shot Coreset Selection (ZCore), a method that efficiently selects coresets without ground truth labels or training on candidate data. Instead, ZCore uses existing foundation models to generate a zero-shot embedding space for unlabeled data, then quantifies the relative importance of each example based on overall coverage and redundancy within the embedding distribution. We evaluate ZCore on four datasets and outperform several state-of-the-art label-based methods, leading to a strong baseline for future research in unlabeled coreset selection. On ImageNet, ZCore selections achieve a downstream model accuracy of 53.99% with only 10% training data, which outperforms label-based methods while removing annotation requirements for 1.15 million images. Our code is publicly available at this https URL.

Title: Exploiting Watermark-Based Defense Mechanisms in Text-to-Image Diffusion Models for Unauthorized Data Usage

Authors: Soumil Datta, Shih-Chieh Dai, Leo Yu, Guanhong Tao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15367
Pdf URL: https://arxiv.org/pdf/2411.15367
Copy Paste: [[2411.15367]] Exploiting Watermark-Based Defense Mechanisms in Text-to-Image Diffusion Models for Unauthorized Data Usage(https://arxiv.org/abs/2411.15367)
Keywords: diffusion, generative
Abstract: Text-to-image diffusion models, such as Stable Diffusion, have shown exceptional potential in generating high-quality images. However, recent studies highlight concerns over the use of unauthorized data in training these models, which may lead to intellectual property infringement or privacy violations. A promising approach to mitigate these issues is to apply a watermark to images and subsequently check if generative models reproduce similar watermark features. In this paper, we examine the robustness of various watermark-based protection methods applied to text-to-image models. We observe that common image transformations are ineffective at removing the watermark effect. Therefore, we propose \tech{}, that leverages the diffusion process to conduct controlled image generation on the protected input, preserving the high-level features of the input while ignoring the low-level details utilized by watermarks. A small number of generated images are then used to fine-tune protected models. Our experiments on three datasets and 140 text-to-image diffusion models reveal that existing state-of-the-art protections are not robust against RATTAN.

Title: Gradient dynamics for low-rank fine-tuning beyond kernels

Authors: Arif Kerem Dayi, Sitan Chen
Subjects: cs.LG, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2411.15385
Pdf URL: https://arxiv.org/pdf/2411.15385
Copy Paste: [[2411.15385]] Gradient dynamics for low-rank fine-tuning beyond kernels(https://arxiv.org/abs/2411.15385)
Keywords: foundation model
Abstract: LoRA has emerged as one of the de facto methods for fine-tuning foundation models with low computational cost and memory footprint. The idea is to only train a low-rank perturbation to the weights of a pre-trained model, given supervised data for a downstream task. Despite its empirical sucess, from a mathematical perspective it remains poorly understood what learning mechanisms ensure that gradient descent converges to useful low-rank perturbations. In this work we study low-rank fine-tuning in a student-teacher setting. We are given the weights of a two-layer base model $f$, as well as i.i.d. samples $(x,f^*(x))$ where $x$ is Gaussian and $f^*$ is the teacher model given by perturbing the weights of $f$ by a rank-1 matrix. This generalizes the setting of generalized linear model (GLM) regression where the weights of $f$ are zero. When the rank-1 perturbation is comparable in norm to the weight matrix of $f$, the training dynamics are nonlinear. Nevertheless, in this regime we prove under mild assumptions that a student model which is initialized at the base model and trained with online gradient descent will converge to the teacher in $dk^{O(1)}$ iterations, where $k$ is the number of neurons in $f$. Importantly, unlike in the GLM setting, the complexity does not depend on fine-grained properties of the activation's Hermite expansion. We also prove that in our setting, learning the teacher model "from scratch'' can require significantly more iterations.

Title: From Jack of All Trades to Master of One: Specializing LLM-based Autoraters to a Test Set

Authors: Mara Finkelstein, Dan Deutsch, Parker Riley, Juraj Juraska, Geza Kovacs, Markus Freitag
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15387
Pdf URL: https://arxiv.org/pdf/2411.15387
Copy Paste: [[2411.15387]] From Jack of All Trades to Master of One: Specializing LLM-based Autoraters to a Test Set(https://arxiv.org/abs/2411.15387)
Keywords: in-context
Abstract: As LLMs continue to become more powerful and versatile, human evaluation has quickly become intractable at scale and reliance on automatic metrics has become the norm. Recently, it has been shown that LLMs are themselves state-of-the-art evaluators for many tasks. These Autoraters are typically designed so that they generalize to new systems and test sets. In practice, however, evaluation is performed on a small set of fixed, canonical test sets, which are carefully curated to measure certain capabilities of interest and are not changed frequently. In this work, we design a method which specializes a prompted Autorater to a given test set, by leveraging historical ratings on the test set to construct in-context learning (ICL) examples. We evaluate our Specialist method on the task of fine-grained machine translation evaluation, and show that it dramatically outperforms the state-of-the-art XCOMET metric by 54% and 119% on the WMT'23 and WMT'24 test sets, respectively. We perform extensive analyses to understand the representations learned by our Specialist metrics, and how variability in rater behavior affects their performance. We also verify the generalizability and robustness of our Specialist method for designing automatic metrics across different numbers of ICL examples, LLM backbones, systems to evaluate, and evaluation tasks.

Title: Gradient-Free Classifier Guidance for Diffusion Model Sampling

Authors: Rahul Shenoy, Zhihong Pan, Kaushik Balakrishnan, Qisen Cheng, Yongmoon Jeon, Heejune Yang, Jaewon Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15393
Pdf URL: https://arxiv.org/pdf/2411.15393
Copy Paste: [[2411.15393]] Gradient-Free Classifier Guidance for Diffusion Model Sampling(https://arxiv.org/abs/2411.15393)
Keywords: diffusion
Abstract: Image generation using diffusion models have demonstrated outstanding learning capabilities, effectively capturing the full distribution of the training dataset. They are known to generate wide variations in sampled images, albeit with a trade-off in image fidelity. Guided sampling methods, such as classifier guidance (CG) and classifier-free guidance (CFG), focus sampling in well-learned high-probability regions to generate images of high fidelity, but each has its limitations. CG is computationally expensive due to the use of back-propagation for classifier gradient descent, while CFG, being gradient-free, is more efficient but compromises class label alignment compared to CG. In this work, we propose an efficient guidance method that fully utilizes a pre-trained classifier without using gradient descent. By using the classifier solely in inference mode, a time-adaptive reference class label and corresponding guidance scale are determined at each time step for guided sampling. Experiments on both class-conditioned and text-to-image generation diffusion models demonstrate that the proposed Gradient-free Classifier Guidance (GFCG) method consistently improves class prediction accuracy. We also show GFCG to be complementary to other guided sampling methods like CFG. When combined with the state-of-the-art Autoguidance (ATG), without additional computational overhead, it enhances image fidelity while preserving diversity. For ImageNet 512$\times$512, we achieve a record $\text{FD}_{\text{DINOv2}}$ of 23.09, while simultaneously attaining a higher classification Precision (94.3%) compared to ATG (90.2%)

Title: LDM-Morph: Latent diffusion model guided deformable image registration

Authors: Jiong Wu, Kuang Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15426
Pdf URL: https://arxiv.org/pdf/2411.15426
Copy Paste: [[2411.15426]] LDM-Morph: Latent diffusion model guided deformable image registration(https://arxiv.org/abs/2411.15426)
Keywords: diffusion
Abstract: Deformable image registration plays an essential role in various medical image tasks. Existing deep learning-based deformable registration frameworks primarily utilize convolutional neural networks (CNNs) or Transformers to learn features to predict the deformations. However, the lack of semantic information in the learned features limits the registration performance. Furthermore, the similarity metric of the loss function is often evaluated only in the pixel space, which ignores the matching of high-level anatomical features and can lead to deformation folding. To address these issues, in this work, we proposed LDM-Morph, an unsupervised deformable registration algorithm for medical image registration. LDM-Morph integrated features extracted from the latent diffusion model (LDM) to enrich the semantic information. Additionally, a latent and global feature-based cross-attention module (LGCA) was designed to enhance the interaction of semantic information from LDM and global information from multi-head self-attention operations. Finally, a hierarchical metric was proposed to evaluate the similarity of image pairs in both the original pixel space and latent-feature space, enhancing topology preservation while improving registration accuracy. Extensive experiments on four public 2D cardiac image datasets show that the proposed LDM-Morph framework outperformed existing state-of-the-art CNNs- and Transformers-based registration methods regarding accuracy and topology preservation with comparable computational efficiency. Our code is publicly available at this https URL.

Title: ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance

Authors: Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15436
Pdf URL: https://arxiv.org/pdf/2411.15436
Copy Paste: [[2411.15436]] ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance(https://arxiv.org/abs/2411.15436)
Keywords: diffusion
Abstract: Diffusion models have shown impressive potential on talking head generation. While plausible appearance and talking effect are achieved, these methods still suffer from temporal, 3D or expression inconsistency due to the error accumulation and inherent limitation of single-image generation ability. In this paper, we propose ConsistentAvatar, a novel framework for fully consistent and high-fidelity talking avatar generation. Instead of directly employing multi-modal conditions to the diffusion process, our method learns to first model the temporal representation for stability between adjacent frames. Specifically, we propose a Temporally-Sensitive Detail (TSD) map containing high-frequency feature and contours that vary significantly along the time axis. Using a temporal consistent diffusion module, we learn to align TSD of the initial result to that of the video frame ground truth. The final avatar is generated by a fully consistent diffusion module, conditioned on the aligned TSD, rough head normal, and emotion prompt embedding. We find that the aligned TSD, which represents the temporal patterns, constrains the diffusion process to generate temporally stable talking head. Further, its reliable guidance complements the inaccuracy of other conditions, suppressing the accumulated error while improving the consistency on various aspects. Extensive experiments demonstrate that ConsistentAvatar outperforms the state-of-the-art methods on the generated appearance, 3D, expression and temporal consistency. Project page: this https URL

Title: Twin Trigger Generative Networks for Backdoor Attacks against Object Detection

Authors: Zhiying Li, Zhi Liu, Guanggang Geng, Shreyank N Gowda, Shuyuan Lin, Jian Weng, Xiaobo Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15439
Pdf URL: https://arxiv.org/pdf/2411.15439
Copy Paste: [[2411.15439]] Twin Trigger Generative Networks for Backdoor Attacks against Object Detection(https://arxiv.org/abs/2411.15439)
Keywords: generative
Abstract: Object detectors, which are widely used in real-world applications, are vulnerable to backdoor attacks. This vulnerability arises because many users rely on datasets or pre-trained models provided by third parties due to constraints on data and resources. However, most research on backdoor attacks has focused on image classification, with limited investigation into object detection. Furthermore, the triggers for most existing backdoor attacks on object detection are manually generated, requiring prior knowledge and consistent patterns between the training and inference stages. This approach makes the attacks either easy to detect or difficult to adapt to various scenarios. To address these limitations, we propose novel twin trigger generative networks in the frequency domain to generate invisible triggers for implanting stealthy backdoors into models during training, and visible triggers for steady activation during inference, making the attack process difficult to trace. Specifically, for the invisible trigger generative network, we deploy a Gaussian smoothing layer and a high-frequency artifact classifier to enhance the stealthiness of backdoor implantation in object detectors. For the visible trigger generative network, we design a novel alignment loss to optimize the visible triggers so that they differ from the original patterns but still align with the malicious activation behavior of the invisible triggers. Extensive experimental results and analyses prove the possibility of using different triggers in the training stage and the inference stage, and demonstrate the attack effectiveness of our proposed visible trigger and invisible trigger generative networks, significantly reducing the mAP_0.5 of the object detectors by 70.0% and 84.5%, including YOLOv5 and YOLOv7 with different settings, respectively.

Title: Mamba-CL: Optimizing Selective State Space Model in Null Space for Continual Learning

Authors: De Cheng, Yue Lu, Lingfeng He, Shizhou Zhang, Xi Yang, Nannan Wang, Xinbo Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15469
Pdf URL: https://arxiv.org/pdf/2411.15469
Copy Paste: [[2411.15469]] Mamba-CL: Optimizing Selective State Space Model in Null Space for Continual Learning(https://arxiv.org/abs/2411.15469)
Keywords: foundation model
Abstract: Continual Learning (CL) aims to equip AI models with the ability to learn a sequence of tasks over time, without forgetting previously learned knowledge. Recently, State Space Models (SSMs), particularly the Mamba model, have achieved notable success in computer vision. Building on the strengths of SSMs, this study explores leveraging the Mamba model for CL. Therefore, we introduce Mamba-CL, a framework that continuously fine-tunes the core SSMs of the large-scale Mamba foundation model by updating parameters orthogonal to the feature subspace of previous tasks. This approach theoretically guarantees the consistency objective aiming to preserves consistent output for each SSM module across both previous and current tasks, so as to overcome catastrophic forgetting issue. Specifically, we achieve this goal by deducing the overall consistency constraints on four key time-invariant parameters in the Mamba model, streamlining its recurrent state-space structure and non-linear discretization process in SSM. In practice, we apply the null-space projection to efficiently implement the orthogonality within Mamba model. Extensive experiments on four class-incremental benchmarks demonstrate the effectiveness of Mamba-CL for anti-forgetting, achieving superior performances to state-of-the-art methods. Code is available in the supplementary materials.

Title: SplatFlow: Self-Supervised Dynamic Gaussian Splatting in Neural Motion Flow Field for Autonomous Driving

Authors: Su Sun, Cheng Zhao, Zhuoyang Sun, Yingjie Victor Chen, Mei Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15482
Pdf URL: https://arxiv.org/pdf/2411.15482
Copy Paste: [[2411.15482]] SplatFlow: Self-Supervised Dynamic Gaussian Splatting in Neural Motion Flow Field for Autonomous Driving(https://arxiv.org/abs/2411.15482)
Keywords: self-supervised
Abstract: Most existing Dynamic Gaussian Splatting methods for complex dynamic urban scenarios rely on accurate object-level supervision from expensive manual labeling, limiting their scalability in real-world applications. In this paper, we introduce SplatFlow, a Self-Supervised Dynamic Gaussian Splatting within Neural Motion Flow Fields (NMFF) to learn 4D space-time representations without requiring tracked 3D bounding boxes, enabling accurate dynamic scene reconstruction and novel view RGB, depth and flow synthesis. SplatFlow designs a unified framework to seamlessly integrate time-dependent 4D Gaussian representation within NMFF, where NMFF is a set of implicit functions to model temporal motions of both LiDAR points and Gaussians as continuous motion flow fields. Leveraging NMFF, SplatFlow effectively decomposes static background and dynamic objects, representing them with 3D and 4D Gaussian primitives, respectively. NMFF also models the status correspondences of each 4D Gaussian across time, which aggregates temporal features to enhance cross-view consistency of dynamic components. SplatFlow further improves dynamic scene identification by distilling features from 2D foundational models into 4D space-time representation. Comprehensive evaluations conducted on the Waymo Open Dataset and KITTI Dataset validate SplatFlow's state-of-the-art (SOTA) performance for both image reconstruction and novel view synthesis in dynamic urban scenarios.

Title: Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark

Authors: Rong-Cheng Tu, Zi-Ao Ma, Tian Lan, Yuehao Zhao, Heyan Huang, Xian-Ling Mao
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2411.15488
Pdf URL: https://arxiv.org/pdf/2411.15488
Copy Paste: [[2411.15488]] Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark(https://arxiv.org/abs/2411.15488)
Keywords: diffusion
Abstract: Driven by the remarkable progress in diffusion models, text-to-image generation has made significant strides, creating a pressing demand for automatic quality evaluation of generated images. Current state-of-the-art automatic evaluation methods heavily rely on Multi-modal Large Language Models (MLLMs), particularly powerful commercial models like GPT-4o. While these models are highly effective, their substantial costs limit scalability in large-scale evaluations. Adopting open-source MLLMs is an alternative; however, their performance falls short due to significant limitations in processing multi-modal data compared to commercial MLLMs. To tackle these problems, we first propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset, where the complex evaluation task is decoupled into simpler sub-tasks, effectively reducing the learning complexity. Based on this dataset, we design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6. Furthermore, to reliably and comprehensively assess prior works and our proposed model, we manually annotate a meta-evaluation benchmark that includes chain-of-thought explanations alongside quality scores for generated images. Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline, VIEScore, with over 4.6\% improvement in Spearman and Kendall correlations with human judgments.

Title: Improving Factuality of 3D Brain MRI Report Generation with Paired Image-domain Retrieval and Text-domain Augmentation

Authors: Junhyeok Lee, Yujin Oh, Dahyoun Lee, Hyon Keun Joh, Chul-Ho Sohn, Sung Hyun Baik, Cheol Kyu Jung, Jung Hyun Park, Kyu Sung Choi, Byung-Hoon Kim, Jong Chul Ye
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.15490
Pdf URL: https://arxiv.org/pdf/2411.15490
Copy Paste: [[2411.15490]] Improving Factuality of 3D Brain MRI Report Generation with Paired Image-domain Retrieval and Text-domain Augmentation(https://arxiv.org/abs/2411.15490)
Keywords: diffusion
Abstract: Acute ischemic stroke (AIS) requires time-critical management, with hours of delayed intervention leading to an irreversible disability of the patient. Since diffusion weighted imaging (DWI) using the magnetic resonance image (MRI) plays a crucial role in the detection of AIS, automated prediction of AIS from DWI has been a research topic of clinical importance. While text radiology reports contain the most relevant clinical information from the image findings, the difficulty of mapping across different modalities has limited the factuality of conventional direct DWI-to-report generation methods. Here, we propose paired image-domain retrieval and text-domain augmentation (PIRTA), a cross-modal retrieval-augmented generation (RAG) framework for providing clinician-interpretative AIS radiology reports with improved factuality. PIRTA mitigates the need for learning cross-modal mapping, which poses difficulty in image-to-text generation, by casting the cross-modal mapping problem as an in-domain retrieval of similar DWI images that have paired ground-truth text radiology reports. By exploiting the retrieved radiology reports to augment the report generation process of the query image, we show by experiments with extensive in-house and public datasets that PIRTA can accurately retrieve relevant reports from 3D DWI images. This approach enables the generation of radiology reports with significantly higher accuracy compared to direct image-to-text generation using state-of-the-art multimodal language models.

Title: AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation

Authors: Datao Tang, Xiangyong Cao, Xuan Wu, Jialin Li, Jing Yao, Xueru Bai, Dongsheng Jiang, Yin Li, Deyu Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15497
Pdf URL: https://arxiv.org/pdf/2411.15497
Copy Paste: [[2411.15497]] AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation(https://arxiv.org/abs/2411.15497)
Keywords: diffusion, generative
Abstract: Remote sensing image object detection (RSIOD) aims to identify and locate specific objects within satellite or aerial imagery. However, there is a scarcity of labeled data in current RSIOD datasets, which significantly limits the performance of current detection algorithms. Although existing techniques, e.g., data augmentation and semi-supervised learning, can mitigate this scarcity issue to some extent, they are heavily dependent on high-quality labeled data and perform worse in rare object classes. To address this issue, this paper proposes a layout-controllable diffusion generative model (i.e. AeroGen) tailored for RSIOD. To our knowledge, AeroGen is the first model to simultaneously support horizontal and rotated bounding box condition generation, thus enabling the generation of high-quality synthetic images that meet specific layout and object category requirements. Additionally, we propose an end-to-end data augmentation framework that integrates a diversity-conditioned generator and a filtering mechanism to enhance both the diversity and quality of generated data. Experimental results demonstrate that the synthetic data produced by our method are of high quality and diversity. Furthermore, the synthetic RSIOD data can significantly improve the detection performance of existing RSIOD models, i.e., the mAP metrics on DIOR, DIOR-R, and HRSC datasets are improved by 3.7\%, 4.3\%, and 2.43\%, respectively. The code is available at this https URL.

Title: Interactive Visual Assessment for Text-to-Image Generation Models

Authors: Xiaoyue Mi, Fan Tang, Juan Cao, Qiang Sheng, Ziyao Huang, Peng Li, Yang Liu, Tong-Yee Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15509
Pdf URL: https://arxiv.org/pdf/2411.15509
Copy Paste: [[2411.15509]] Interactive Visual Assessment for Text-to-Image Generation Models(https://arxiv.org/abs/2411.15509)
Keywords: generative
Abstract: Visual generation models have achieved remarkable progress in computer graphics applications but still face significant challenges in real-world deployment. Current assessment approaches for visual generation tasks typically follow an isolated three-phase framework: test input collection, model output generation, and user assessment. These fashions suffer from fixed coverage, evolving difficulty, and data leakage risks, limiting their effectiveness in comprehensively evaluating increasingly complex generation models. To address these limitations, we propose DyEval, an LLM-powered dynamic interactive visual assessment framework that facilitates collaborative evaluation between humans and generative models for text-to-image systems. DyEval features an intuitive visual interface that enables users to interactively explore and analyze model behaviors, while adaptively generating hierarchical, fine-grained, and diverse textual inputs to continuously probe the capability boundaries of the models based on their feedback. Additionally, to provide interpretable analysis for users to further improve tested models, we develop a contextual reflection module that mines failure triggers of test inputs and reflects model potential failure patterns supporting in-depth analysis using the logical reasoning ability of LLM. Qualitative and quantitative experiments demonstrate that DyEval can effectively help users identify max up to 2.56 times generation failures than conventional methods, and uncover complex and rare failure patterns, such as issues with pronoun generation and specific cultural context generation. Our framework provides valuable insights for improving generative models and has broad implications for advancing the reliability and capabilities of visual generation systems across various domains.

Title: MUNBa: Machine Unlearning via Nash Bargaining

Authors: Jing Wu, Mehrtash Harandi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15537
Pdf URL: https://arxiv.org/pdf/2411.15537
Copy Paste: [[2411.15537]] MUNBa: Machine Unlearning via Nash Bargaining(https://arxiv.org/abs/2411.15537)
Keywords: diffusion
Abstract: Machine Unlearning (MU) aims to selectively erase harmful behaviors from models while retaining the overall utility of the model. As a multi-task learning problem, MU involves balancing objectives related to forgetting specific concepts/data and preserving general performance. A naive integration of these forgetting and preserving objectives can lead to gradient conflicts, impeding MU algorithms from reaching optimal solutions. To address the gradient conflict issue, we reformulate MU as a two-player cooperative game, where the two players, namely, the forgetting player and the preservation player, contribute via their gradient proposals to maximize their overall gain. To this end, inspired by the Nash bargaining theory, we derive a closed-form solution to guide the model toward the Pareto front, effectively avoiding the gradient conflicts. Our formulation of MU guarantees an equilibrium solution, where any deviation from the final state would lead to a reduction in the overall objectives for both players, ensuring optimality in each objective. We evaluate our algorithm's effectiveness on a diverse set of tasks across image classification and image generation. Extensive experiments with ResNet, vision-language model CLIP, and text-to-image diffusion models demonstrate that our method outperforms state-of-the-art MU algorithms, achieving superior performance on several benchmarks. For example, in the challenging scenario of sample-wise forgetting, our algorithm approaches the gold standard retrain baseline. Our results also highlight improvements in forgetting precision, preservation of generalization, and robustness against adversarial attacks.

Title: Optical-Flow Guided Prompt Optimization for Coherent Video Generation

Authors: Hyelin Nam, Jaemin Kim, Dohun Lee, Jong Chul Ye
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.15540
Pdf URL: https://arxiv.org/pdf/2411.15540
Copy Paste: [[2411.15540]] Optical-Flow Guided Prompt Optimization for Coherent Video Generation(https://arxiv.org/abs/2411.15540)
Keywords: diffusion
Abstract: While text-to-video diffusion models have made significant strides, many still face challenges in generating videos with temporal consistency. Within diffusion frameworks, guidance techniques have proven effective in enhancing output quality during inference; however, applying these methods to video diffusion models introduces additional complexity of handling computations across entire sequences. To address this, we propose a novel framework called MotionPrompt that guides the video generation process via optical flow. Specifically, we train a discriminator to distinguish optical flow between random pairs of frames from real videos and generated ones. Given that prompts can influence the entire video, we optimize learnable token embeddings during reverse sampling steps by using gradients from a trained discriminator applied to random frame pairs. This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content. We demonstrate the effectiveness of our approach across various models.

Title: NeRF Inpainting with Geometric Diffusion Prior and Balanced Score Distillation

Authors: Menglin Zhang, Xin Luo, Yunwei Lan, Chang Liu, Rui Li, Kaidong Zhang, Ganlin Yang, Dong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15551
Pdf URL: https://arxiv.org/pdf/2411.15551
Copy Paste: [[2411.15551]] NeRF Inpainting with Geometric Diffusion Prior and Balanced Score Distillation(https://arxiv.org/abs/2411.15551)
Keywords: diffusion
Abstract: Recent advances in NeRF inpainting have leveraged pretrained diffusion models to enhance performance. However, these methods often yield suboptimal results due to their ineffective utilization of 2D diffusion priors. The limitations manifest in two critical aspects: the inadequate capture of geometric information by pretrained diffusion models and the suboptimal guidance provided by existing Score Distillation Sampling (SDS) methods. To address these problems, we introduce GB-NeRF, a novel framework that enhances NeRF inpainting through improved utilization of 2D diffusion priors. Our approach incorporates two key innovations: a fine-tuning strategy that simultaneously learns appearance and geometric priors and a specialized normal distillation loss that integrates these geometric priors into NeRF inpainting. We propose a technique called Balanced Score Distillation (BSD) that surpasses existing methods such as Score Distillation (SDS) and the improved version, Conditional Score Distillation (CSD). BSD offers improved inpainting quality in appearance and geometric aspects. Extensive experiments show that our method provides superior appearance fidelity and geometric consistency compared to existing approaches.

Title: From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive Grammars

Authors: Albert Kornilov, Tatiana Shavrina
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15577
Pdf URL: https://arxiv.org/pdf/2411.15577
Copy Paste: [[2411.15577]] From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive Grammars(https://arxiv.org/abs/2411.15577)
Keywords: in-context
Abstract: Recent advances in language modeling have demonstrated significant improvements in zero-shot capabilities, including in-context learning, instruction following, and machine translation for extremely under-resourced languages (Tanzer et al., 2024). However, many languages with limited written resources rely primarily on formal descriptions of grammar and vocabulary. In this paper, we introduce a set of benchmarks to evaluate how well models can extract and classify information from the complex descriptions found in linguistic grammars. We present a Retrieval-Augmented Generation (RAG)-based approach that leverages these descriptions for downstream tasks such as machine translation. Our benchmarks encompass linguistic descriptions for 248 languages across 142 language families, focusing on typological features from WALS and Grambank. This set of benchmarks offers the first comprehensive evaluation of language models' in-context ability to accurately interpret and extract linguistic features, providing a critical resource for scaling NLP to low-resource languages. The code and data are publicly available at \url{this https URL}.

Title: TKG-DM: Training-free Chroma Key Content Generation Diffusion Model

Authors: Ryugo Morita, Stanislav Frolov, Brian Bernhard Moser, Takahiro Shirakawa, Ko Watanabe, Andreas Dengel, Jinjia Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15580
Pdf URL: https://arxiv.org/pdf/2411.15580
Copy Paste: [[2411.15580]] TKG-DM: Training-free Chroma Key Content Generation Diffusion Model(https://arxiv.org/abs/2411.15580)
Keywords: diffusion, generative
Abstract: Diffusion models have enabled the generation of high-quality images with a strong focus on realism and textual fidelity. Yet, large-scale text-to-image models, such as Stable Diffusion, struggle to generate images where foreground objects are placed over a chroma key background, limiting their ability to separate foreground and background elements without fine-tuning. To address this limitation, we present a novel Training-Free Chroma Key Content Generation Diffusion Model (TKG-DM), which optimizes the initial random noise to produce images with foreground objects on a specifiable color background. Our proposed method is the first to explore the manipulation of the color aspects in initial noise for controlled background generation, enabling precise separation of foreground and background without fine-tuning. Extensive experiments demonstrate that our training-free method outperforms existing methods in both qualitative and quantitative evaluations, matching or surpassing fine-tuned models. Finally, we successfully extend it to other tasks (e.g., consistency models and text-to-video), highlighting its transformative potential across various generative applications where independent control of foreground and background is crucial.

Title: EMD: Explicit Motion Modeling for High-Quality Street Gaussian Splatting

Authors: Xiaobao Wei, Qingpo Wuwu, Zhongyu Zhao, Zhuangzhe Wu, Nan Huang, Ming Lu, Ningning MA, Shanghang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15582
Pdf URL: https://arxiv.org/pdf/2411.15582
Copy Paste: [[2411.15582]] EMD: Explicit Motion Modeling for High-Quality Street Gaussian Splatting(https://arxiv.org/abs/2411.15582)
Keywords: self-supervised
Abstract: Photorealistic reconstruction of street scenes is essential for developing real-world simulators in autonomous driving. While recent methods based on 3D/4D Gaussian Splatting (GS) have demonstrated promising results, they still encounter challenges in complex street scenes due to the unpredictable motion of dynamic objects. Current methods typically decompose street scenes into static and dynamic objects, learning the Gaussians in either a supervised manner (e.g., w/ 3D bounding-box) or a self-supervised manner (e.g., w/o 3D bounding-box). However, these approaches do not effectively model the motions of dynamic objects (e.g., the motion speed of pedestrians is clearly different from that of vehicles), resulting in suboptimal scene decomposition. To address this, we propose Explicit Motion Decomposition (EMD), which models the motions of dynamic objects by introducing learnable motion embeddings to the Gaussians, enhancing the decomposition in street scenes. The proposed EMD is a plug-and-play approach applicable to various baseline methods. We also propose tailored training strategies to apply EMD to both supervised and self-supervised baselines. Through comprehensive experimentation, we illustrate the effectiveness of our approach with various established baselines. The code will be released at: this https URL.

Title: FLD+: Data-efficient Evaluation Metric for Generative Models

Authors: Pranav Jeevan, Neeraj Nixon, Amit Sethi
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2411.15584
Pdf URL: https://arxiv.org/pdf/2411.15584
Copy Paste: [[2411.15584]] FLD+: Data-efficient Evaluation Metric for Generative Models(https://arxiv.org/abs/2411.15584)
Keywords: diffusion, generative
Abstract: We introduce a new metric to assess the quality of generated images that is more reliable, data-efficient, compute-efficient, and adaptable to new domains than the previous metrics, such as Fréchet Inception Distance (FID). The proposed metric is based on normalizing flows, which allows for the computation of density (exact log-likelihood) of images from any domain. Thus, unlike FID, the proposed Flow-based Likelihood Distance Plus (FLD+) metric exhibits strongly monotonic behavior with respect to different types of image degradations, including noise, occlusion, diffusion steps, and generative model size. Additionally, because normalizing flow can be trained stably and efficiently, FLD+ achieves stable results with two orders of magnitude fewer images than FID (which requires more images to reliably compute Fréchet distance between features of large samples of real and generated images). We made FLD+ computationally even more efficient by applying normalizing flows to features extracted in a lower-dimensional latent space instead of using a pre-trained network. We also show that FLD+ can easily be retrained on new domains, such as medical images, unlike the networks behind previous metrics -- such as InceptionNetV3 pre-trained on ImageNet.

Title: An adversarial feature learning based semantic communication method for Human 3D Reconstruction

Authors: Shaojiang Liu, Jiajun Zou, Zhendan Liu, Meixia Dong, Zhiping Wan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15595
Pdf URL: https://arxiv.org/pdf/2411.15595
Copy Paste: [[2411.15595]] An adversarial feature learning based semantic communication method for Human 3D Reconstruction(https://arxiv.org/abs/2411.15595)
Keywords: diffusion
Abstract: With the widespread application of human body 3D reconstruction technology across various fields, the demands for data transmission and processing efficiency continue to rise, particularly in scenarios where network bandwidth is limited and low latency is required. This paper introduces an Adversarial Feature Learning-based Semantic Communication method (AFLSC) for human body 3D reconstruction, which focuses on extracting and transmitting semantic information crucial for the 3D reconstruction task, thereby significantly optimizing data flow and alleviating bandwidth pressure. At the sender's end, we propose a multitask learning-based feature extraction method to capture the spatial layout, keypoints, posture, and depth information from 2D human images, and design a semantic encoding technique based on adversarial feature learning to encode these feature information into semantic data. We also develop a dynamic compression technique to efficiently transmit this semantic data, greatly enhancing transmission efficiency and reducing latency. At the receiver's end, we design an efficient multi-level semantic feature decoding method to convert semantic data back into key image features. Finally, an improved ViT-diffusion model is employed for 3D reconstruction, producing human body 3D mesh models. Experimental results validate the advantages of our method in terms of data transmission efficiency and reconstruction quality, demonstrating its excellent potential for application in bandwidth-limited environments.

Title: Fine-Grained Open-Vocabulary Object Recognition via User-Guided Segmentation

Authors: Jinwoo Ahn, Hyeokjoon Kwon, Hwiyeon Yoo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15620
Pdf URL: https://arxiv.org/pdf/2411.15620
Copy Paste: [[2411.15620]] Fine-Grained Open-Vocabulary Object Recognition via User-Guided Segmentation(https://arxiv.org/abs/2411.15620)
Keywords: foundation model
Abstract: Recent advent of vision-based foundation models has enabled efficient and high-quality object detection at ease. Despite the success of previous studies, object detection models face limitations on capturing small components from holistic objects and taking user intention into account. To address these challenges, we propose a novel foundation model-based detection method called FOCUS: Fine-grained Open-Vocabulary Object ReCognition via User-Guided Segmentation. FOCUS merges the capabilities of vision foundation models to automate open-vocabulary object detection at flexible granularity and allow users to directly guide the detection process via natural language. It not only excels at identifying and locating granular constituent elements but also minimizes unnecessary user intervention yet grants them significant control. With FOCUS, users can make explainable requests to actively guide the detection process in the intended direction. Our results show that FOCUS effectively enhances the detection capabilities of baseline models and shows consistent performance across varying object types.

Title: Multi-label Sequential Sentence Classification via Large Language Model

Authors: Mengfei Lan, Lecheng Zheng, Shufan Ming, Halil Kilicoglu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.15623
Pdf URL: https://arxiv.org/pdf/2411.15623
Copy Paste: [[2411.15623]] Multi-label Sequential Sentence Classification via Large Language Model(https://arxiv.org/abs/2411.15623)
Keywords: in-context
Abstract: Sequential sentence classification (SSC) in scientific publications is crucial for supporting downstream tasks such as fine-grained information retrieval and extractive summarization. However, current SSC methods are constrained by model size, sequence length, and single-label setting. To address these limitations, this paper proposes LLM-SSC, a large language model (LLM)-based framework for both single- and multi-label SSC tasks. Unlike previous approaches that employ small- or medium-sized language models, the proposed framework utilizes LLMs to generate SSC labels through designed prompts, which enhance task understanding by incorporating demonstrations and a query to describe the prediction target. We also present a multi-label contrastive learning loss with auto-weighting scheme, enabling the multi-label classification task. To support our multi-label SSC analysis, we introduce and release a new dataset, biorc800, which mainly contains unstructured abstracts in the biomedical domain with manual annotations. Experiments demonstrate LLM-SSC's strong performance in SSC under both in-context learning and task-specific tuning settings. We release biorc800 and our code at: this https URL.

Title: Effort: Efficient Orthogonal Modeling for Generalizable AI-Generated Image Detection

Authors: Zhiyuan Yan, Jiangming Wang, Zhendong Wang, Peng Jin, Ke-Yue Zhang, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, Li Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15633
Pdf URL: https://arxiv.org/pdf/2411.15633
Copy Paste: [[2411.15633]] Effort: Efficient Orthogonal Modeling for Generalizable AI-Generated Image Detection(https://arxiv.org/abs/2411.15633)
Keywords: foundation model
Abstract: Existing AI-generated image (AIGI) detection methods often suffer from limited generalization performance. In this paper, we identify a crucial yet previously overlooked asymmetry phenomenon in AIGI detection: during training, models tend to quickly overfit to specific fake patterns in the training set, while other information is not adequately captured, leading to poor generalization when faced with new fake methods. A key insight is to incorporate the rich semantic knowledge embedded within large-scale vision foundation models (VFMs) to expand the previous discriminative space (based on forgery patterns only), such that the discrimination is decided by both forgery and semantic cues, thereby reducing the overfitting to specific forgery patterns. A straightforward solution is to fully fine-tune VFMs, but it risks distorting the well-learned semantic knowledge, pushing the model back toward overfitting. To this end, we design a novel approach called Effort: Efficient orthogonal modeling for generalizable AIGI detection. Specifically, we employ Singular Value Decomposition (SVD) to construct the orthogonal semantic and forgery subspaces. By freezing the principal components and adapting the residual components ($\sim$0.19M parameters), we preserve the original semantic subspace and use its orthogonal subspace for learning forgeries. Extensive experiments on AIGI detection benchmarks demonstrate the superior effectiveness of our approach.

Title: Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment

Authors: Alvi Md Ishmam, Christopher Thomas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15673
Pdf URL: https://arxiv.org/pdf/2411.15673
Copy Paste: [[2411.15673]] Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment(https://arxiv.org/abs/2411.15673)
Keywords: self-supervised
Abstract: In recent years there has been enormous interest in vision-language models trained using self-supervised objectives. However, the use of large-scale datasets scraped from the web for training also makes these models vulnerable to potential security threats, such as backdooring and poisoning attacks. In this paper, we propose a method for mitigating such attacks on contrastively trained vision-language models. Our approach leverages external knowledge extracted from a language model to prevent models from learning correlations between image regions which lack strong alignment with external knowledge. We do this by imposing constraints to enforce that attention paid by the model to visual regions is proportional to the alignment of those regions with external knowledge. We conduct extensive experiments using a variety of recent backdooring and poisoning attacks on multiple datasets and architectures. Our results clearly demonstrate that our proposed approach is highly effective at defending against such attacks across multiple settings, while maintaining model utility and without requiring any changes at inference time

Title: Can a Large Language Model Learn Matrix Functions In Context?

Authors: Paimon Goulart, Evangelos E. Papalexakis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.15675
Pdf URL: https://arxiv.org/pdf/2411.15675
Copy Paste: [[2411.15675]] Can a Large Language Model Learn Matrix Functions In Context?(https://arxiv.org/abs/2411.15675)
Keywords: in-context
Abstract: Large Language Models (LLMs) have demonstrated the ability to solve complex tasks through In-Context Learning (ICL), where models learn from a few input-output pairs without explicit fine-tuning. In this paper, we explore the capacity of LLMs to solve non-linear numerical computations, with specific emphasis on functions of the Singular Value Decomposition. Our experiments show that while LLMs perform comparably to traditional models such as Stochastic Gradient Descent (SGD) based Linear Regression and Neural Networks (NN) for simpler tasks, they outperform these models on more complex tasks, particularly in the case of top-k Singular Values. Furthermore, LLMs demonstrate strong scalability, maintaining high accuracy even as the matrix size increases. Additionally, we found that LLMs can achieve high accuracy with minimal prior examples, converging quickly and avoiding the overfitting seen in classical models. These results suggest that LLMs could provide an efficient alternative to classical methods for solving high-dimensional problems. Future work will focus on extending these findings to larger matrices and more complex matrix operations while exploring the effect of using different numerical representations in ICL.

Title: Fixing the Perspective: A Critical Examination of Zero-1-to-3

Authors: Jack Yu, Xueying Jia, Charlie Sun, Prince Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15706
Pdf URL: https://arxiv.org/pdf/2411.15706
Copy Paste: [[2411.15706]] Fixing the Perspective: A Critical Examination of Zero-1-to-3(https://arxiv.org/abs/2411.15706)
Keywords: diffusion
Abstract: Novel view synthesis is a fundamental challenge in image-to-3D generation, requiring the generation of target view images from a set of conditioning images and their relative poses. While recent approaches like Zero-1-to-3 have demonstrated promising results using conditional latent diffusion models, they face significant challenges in generating consistent and accurate novel views, particularly when handling multiple conditioning images. In this work, we conduct a thorough investigation of Zero-1-to-3's cross-attention mechanism within the Spatial Transformer of the diffusion 2D-conditional UNet. Our analysis reveals a critical discrepancy between Zero-1-to-3's theoretical framework and its implementation, specifically in the processing of image-conditional context. We propose two significant improvements: (1) a corrected implementation that enables effective utilization of the cross-attention mechanism, and (2) an enhanced architecture that can leverage multiple conditional views simultaneously. Our theoretical analysis and preliminary results suggest potential improvements in novel view synthesis consistency and accuracy.

Title: ROOT: VLM based System for Indoor Scene Understanding and Beyond

Authors: Yonghui Wang, Shi-Yong Chen, Zhenxing Zhou, Siyi Li, Haoran Li, Wengang Zhou, Houqiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15714
Pdf URL: https://arxiv.org/pdf/2411.15714
Copy Paste: [[2411.15714]] ROOT: VLM based System for Indoor Scene Understanding and Beyond(https://arxiv.org/abs/2411.15714)
Keywords: foundation model
Abstract: Recently, Vision Language Models (VLMs) have experienced significant advancements, yet these models still face challenges in spatial hierarchical reasoning within indoor scenes. In this study, we introduce ROOT, a VLM-based system designed to enhance the analysis of indoor scenes. Specifically, we first develop an iterative object perception algorithm using GPT-4V to detect object entities within indoor scenes. This is followed by employing vision foundation models to acquire additional meta-information about the scene, such as bounding boxes. Building on this foundational data, we propose a specialized VLM, SceneVLM, which is capable of generating spatial hierarchical scene graphs and providing distance information for objects within indoor environments. This information enhances our understanding of the spatial arrangement of indoor scenes. To train our SceneVLM, we collect over 610,000 images from various public indoor datasets and implement a scene data generation pipeline with a semi-automated technique to establish relationships and estimate distances among indoor objects. By utilizing this enriched data, we conduct various training recipes and finish SceneVLM. Our experiments demonstrate that \rootname facilitates indoor scene understanding and proves effective in diverse downstream applications, such as 3D scene generation and embodied AI. The code will be released at \url{this https URL}.

Title: AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea

Authors: Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, Yueting Zhuang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15738
Pdf URL: https://arxiv.org/pdf/2411.15738
Copy Paste: [[2411.15738]] AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea(https://arxiv.org/abs/2411.15738)
Keywords: diffusion
Abstract: Instruction-based image editing aims to modify specific image elements with natural language instructions. However, current models in this domain often struggle to accurately execute complex user instructions, as they are trained on low-quality data with limited editing types. We present AnyEdit, a comprehensive multi-modal instruction editing dataset, comprising 2.5 million high-quality editing pairs spanning over 20 editing types and five domains. We ensure the diversity and quality of the AnyEdit collection through three aspects: initial data diversity, adaptive editing process, and automated selection of editing results. Using the dataset, we further train a novel AnyEdit Stable Diffusion with task-aware routing and learnable task embedding for unified image editing. Comprehensive experiments on three benchmark datasets show that AnyEdit consistently boosts the performance of diffusion-based editing models. This presents prospects for developing instruction-driven image editing models that support human creativity.

Title: Beyond Data Scarcity: A Frequency-Driven Framework for Zero-Shot Forecasting

Authors: Liran Nochumsohn, Michal Moshkovitz, Orly Avner, Dotan Di Castro, Omri Azencot
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15743
Pdf URL: https://arxiv.org/pdf/2411.15743
Copy Paste: [[2411.15743]] Beyond Data Scarcity: A Frequency-Driven Framework for Zero-Shot Forecasting(https://arxiv.org/abs/2411.15743)
Keywords: foundation model
Abstract: Time series forecasting is critical in numerous real-world applications, requiring accurate predictions of future values based on observed patterns. While traditional forecasting techniques work well in in-domain scenarios with ample data, they struggle when data is scarce or not available at all, motivating the emergence of zero-shot and few-shot learning settings. Recent advancements often leverage large-scale foundation models for such tasks, but these methods require extensive data and compute resources, and their performance may be hindered by ineffective learning from the available training set. This raises a fundamental question: What factors influence effective learning from data in time series forecasting? Toward addressing this, we propose using Fourier analysis to investigate how models learn from synthetic and real-world time series data. Our findings reveal that forecasters commonly suffer from poor learning from data with multiple frequencies and poor generalization to unseen frequencies, which impedes their predictive performance. To alleviate these issues, we present a novel synthetic data generation framework, designed to enhance real data or replace it completely by creating task-specific frequency information, requiring only the sampling rate of the target data. Our approach, Freq-Synth, improves the robustness of both foundation as well as nonfoundation forecast models in zero-shot and few-shot settings, facilitating more reliable time series forecasting under limited data scenarios.

Title: ZeroGS: Training 3D Gaussian Splatting from Unposed Images

Authors: Yu Chen, Rolandos Alexandros Potamias, Evangelos Ververas, Jifei Song, Jiankang Deng, Gim Hee Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15779
Pdf URL: https://arxiv.org/pdf/2411.15779
Copy Paste: [[2411.15779]] ZeroGS: Training 3D Gaussian Splatting from Unposed Images(https://arxiv.org/abs/2411.15779)
Keywords: foundation model
Abstract: Neural radiance fields (NeRF) and 3D Gaussian Splatting (3DGS) are popular techniques to reconstruct and render photo-realistic images. However, the pre-requisite of running Structure-from-Motion (SfM) to get camera poses limits their completeness. While previous methods can reconstruct from a few unposed images, they are not applicable when images are unordered or densely captured. In this work, we propose ZeroGS to train 3DGS from hundreds of unposed and unordered images. Our method leverages a pretrained foundation model as the neural scene representation. Since the accuracy of the predicted pointmaps does not suffice for accurate image registration and high-fidelity image rendering, we propose to mitigate the issue by initializing and finetuning the pretrained model from a seed image. Images are then progressively registered and added to the training buffer, which is further used to train the model. We also propose to refine the camera poses and pointmaps by minimizing a point-to-camera ray consistency loss across multiple views. Experiments on the LLFF dataset, the MipNeRF360 dataset, and the Tanks-and-Temples dataset show that our method recovers more accurate camera poses than state-of-the-art pose-free NeRF/3DGS methods, and even renders higher quality images than 3DGS with COLMAP poses. Our project page is available at this https URL.

Title: Multi-Token Enhancing for Vision Representation Learning

Authors: Zhong-Yu Li, Yu-Song Hu, Bo-Wen Yin, Ming-Ming Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15787
Pdf URL: https://arxiv.org/pdf/2411.15787
Copy Paste: [[2411.15787]] Multi-Token Enhancing for Vision Representation Learning(https://arxiv.org/abs/2411.15787)
Keywords: self-supervised
Abstract: Vision representation learning, especially self-supervised learning, is pivotal for various vision applications. Ensemble learning has also succeeded in enhancing the performance and robustness of the vision models. However, traditional ensemble strategies are impractical for representation learning, especially self-supervised representation learning that requires large-scale datasets and long schedules. This is because they require k times more training and inference computation costs for an ensemble of k models. Differently, we introduce Multi-Token Enhancing (MTE) that extracts multiple auxiliary tokens simultaneously from a single model to enhance representation learning, while incurring minimal additional training costs and no additional inference costs. These auxiliary tokens, including auxiliary CLS tokens and adaptively pooled tokens, capture complementary information due to their differences. Meanwhile, to address the increase in inference costs, we distill the knowledge acquired by the auxiliary tokens into a global token during pre-training. Consequently, we can discard the auxiliary tokens during inference without incurring additional costs. Our MTE is compatible with various self-supervised loss functions and architectures, consistently improving performances across different downstream tasks. Our source code will be made publicly available.

Title: Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing

Authors: Pengcheng Xu, Boyuan Jiang, Xiaobin Hu, Donghao Luo, Qingdong He, Jiangning Zhang, Chengjie Wang, Yunsheng Wu, Charles Ling, Boyu Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.15843
Pdf URL: https://arxiv.org/pdf/2411.15843
Copy Paste: [[2411.15843]] Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing(https://arxiv.org/abs/2411.15843)
Keywords: diffusion, generative
Abstract: Leveraging the large generative prior of the flow transformer for tuning-free image editing requires authentic inversion to project the image into the model's domain and a flexible invariance control mechanism to preserve non-target contents. However, the prevailing diffusion inversion performs deficiently in flow-based models, and the invariance control cannot reconcile diverse rigid and non-rigid editing tasks. To address these, we systematically analyze the \textbf{inversion and invariance} control based on the flow transformer. Specifically, we unveil that the Euler inversion shares a similar structure to DDIM yet is more susceptible to the approximation error. Thus, we propose a two-stage inversion to first refine the velocity estimation and then compensate for the leftover error, which pivots closely to the model prior and benefits editing. Meanwhile, we propose the invariance control that manipulates the text features within the adaptive layer normalization, connecting the changes in the text prompt to image semantics. This mechanism can simultaneously preserve the non-target contents while allowing rigid and non-rigid manipulation, enabling a wide range of editing types such as visual text, quantity, facial expression, etc. Experiments on versatile scenarios validate that our framework achieves flexible and accurate editing, unlocking the potential of the flow transformer for versatile image editing.

Title: Generalizable Single-view Object Pose Estimation by Two-side Generating and Matching

Authors: Yujing Sun, Caiyi Sun, Yuan Liu, Yuexin Ma, Siu Ming Yiu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15860
Pdf URL: https://arxiv.org/pdf/2411.15860
Copy Paste: [[2411.15860]] Generalizable Single-view Object Pose Estimation by Two-side Generating and Matching(https://arxiv.org/abs/2411.15860)
Keywords: diffusion
Abstract: In this paper, we present a novel generalizable object pose estimation method to determine the object pose using only one RGB image. Unlike traditional approaches that rely on instance-level object pose estimation and necessitate extensive training data, our method offers generalization to unseen objects without extensive training, operates with a single reference image of the object, and eliminates the need for 3D object models or multiple views of the object. These characteristics are achieved by utilizing a diffusion model to generate novel-view images and conducting a two-sided matching on these generated images. Quantitative experiments demonstrate the superiority of our method over existing pose estimation techniques across both synthetic and real-world datasets. Remarkably, our approach maintains strong performance even in scenarios with significant viewpoint changes, highlighting its robustness and versatility in challenging conditions. The code will be re leased at this https URL.

Title: PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs

Authors: Teng Zhou, Xiaoyu Zhang, Yongchuan Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15867
Pdf URL: https://arxiv.org/pdf/2411.15867
Copy Paste: [[2411.15867]] PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs(https://arxiv.org/abs/2411.15867)
Keywords: diffusion
Abstract: Panoramic Image Generation has emerged as an important task in image generation, driven by growing demands for large-scale visuals in creative and technical applications. While diffusion models have dominated this field, they face inherent limitations, including the multilevel-coherence challenge and implementation complexity, leading to suboptimal outcomes. In this paper, we introduce PanoLlama, a novel framework that redefines panoramic image generation as a next-token prediction task. Building on the pre-trained LlamaGen architecture, we generate images in an autoregressive manner and develop an expansion strategy to handle size limitations. This method aligns with the image token structure in a crop-wise and training-free manner, resulting in high-quality panoramas with minimal seams and maximum scalability. PanoLlama demonstrates its effectiveness and versatility in our experiments, achieving the best overall performance while offering flexibility for multi-scale, multi-layout, and multi-guidance generation. It overcomes the challenges that diffusion-based methods fail to address, setting a new paradigm for panoramic image generation tasks. Code is available at this https URL.

Title: Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation

Authors: Sule Bai, Yong Liu, Yifei Han, Haoji Zhang, Yansong Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15869
Pdf URL: https://arxiv.org/pdf/2411.15869
Copy Paste: [[2411.15869]] Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation(https://arxiv.org/abs/2411.15869)
Keywords: anomaly
Abstract: Recent advancements in pre-trained vision-language models like CLIP, have enabled the task of open-vocabulary segmentation. CLIP demonstrates impressive zero-shot capabilities in various downstream tasks that require holistic image understanding. However, due to its image-level pre-training, CLIP struggles to capture local details, resulting in poor performance in segmentation tasks. Our analysis reveals that anomaly tokens emerge during the forward pass, drawing excessive attention from normal patch tokens, thereby diminishing spatial awareness. To address this issue, we propose Self-Calibrated CLIP (SC-CLIP), a training-free method that calibrates CLIP to produce finer-grained representations while preserving its original generalization ability, without introducing new parameters or relying on additional backbones. Specifically, we first identify and resolve the anomaly tokens to mitigate their negative impact. Next, we enhance feature discriminability and attention correlation by leveraging the semantic consistency found in CLIP's intermediate features. Furthermore, we employ multi-level feature fusion to enrich details. Collectively, these strategies enhance CLIP's feature representation with greater granularity and coherence. Experimental results demonstrate the effectiveness of SC-CLIP, achieving state-of-the-art results across eight semantic segmentation datasets and surpassing previous methods by 9.5%. Notably, SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times. Our source code is available at this https URL.

Title: A Tunable Despeckling Neural Network Stabilized via Diffusion Equation

Authors: Yi Ran, Zhichang Guo, Jia Li, Yao Li, Martin Burger, Boying Wu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2411.15921
Pdf URL: https://arxiv.org/pdf/2411.15921
Copy Paste: [[2411.15921]] A Tunable Despeckling Neural Network Stabilized via Diffusion Equation(https://arxiv.org/abs/2411.15921)
Keywords: diffusion
Abstract: Multiplicative Gamma noise remove is a critical research area in the application of synthetic aperture radar (SAR) imaging, where neural networks serve as a potent tool. However, real-world data often diverges from theoretical models, exhibiting various disturbances, which makes the neural network less effective. Adversarial attacks work by finding perturbations that significantly disrupt functionality of neural networks, as the inherent instability of neural networks makes them highly susceptible. A network designed to withstand such extreme cases can more effectively mitigate general disturbances in real SAR data. In this work, the dissipative nature of diffusion equations is employed to underpin a novel approach for countering adversarial attacks and improve the resistance of real noise disturbance. We propose a tunable, regularized neural network that unrolls a denoising unit and a regularization unit into a single network for end-to-end training. In the network, the denoising unit and the regularization unit are composed of the denoising network and the simplest linear diffusion equation respectively. The regularization unit enhances network stability, allowing post-training time step adjustments to effectively mitigate the adverse impacts of adversarial attacks. The stability and convergence of our model are theoretically proven, and in the experiments, we compare our model with several state-of-the-art denoising methods on simulated images, adversarial samples, and real SAR images, yielding superior results in both quantitative and visual evaluations.

Title: Making Images from Images: Interleaving Denoising and Transformation

Authors: Shumeet Baluja, David Marwood, Ashwin Baluja
Subjects: cs.CV, cs.AI, cs.GR, cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2411.15925
Pdf URL: https://arxiv.org/pdf/2411.15925
Copy Paste: [[2411.15925]] Making Images from Images: Interleaving Denoising and Transformation(https://arxiv.org/abs/2411.15925)
Keywords: diffusion
Abstract: Simply by rearranging the regions of an image, we can create a new image of any subject matter. The definition of regions is user definable, ranging from regularly and irregularly-shaped blocks, concentric rings, or even individual pixels. Our method extends and improves recent work in the generation of optical illusions by simultaneously learning not only the content of the images, but also the parameterized transformations required to transform the desired images into each other. By learning the image transforms, we allow any source image to be pre-specified; any existing image (e.g. the Mona Lisa) can be transformed to a novel subject. We formulate this process as a constrained optimization problem and address it through interleaving the steps of image diffusion with an energy minimization step. Unlike previous methods, increasing the number of regions actually makes the problem easier and improves results. We demonstrate our approach in both pixel and latent spaces. Creative extensions, such as using infinite copies of the source image and employing multiple source images, are also given.

Title: Generative Context Distillation

Authors: Haebin Shin, Lei Ji, Yeyun Gong, Sungdong Kim, Eunbi Choi, Minjoon Seo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.15927
Pdf URL: https://arxiv.org/pdf/2411.15927
Copy Paste: [[2411.15927]] Generative Context Distillation(https://arxiv.org/abs/2411.15927)
Keywords: generative
Abstract: Prompts used in recent large language model based applications are often fixed and lengthy, leading to significant computational overhead. To address this challenge, we propose Generative Context Distillation (GCD), a lightweight prompt internalization method that employs a joint training approach. This method not only replicates the behavior of models with prompt inputs but also generates the content of the prompt along with reasons for why the model's behavior should change accordingly. We demonstrate that our approach effectively internalizes complex prompts across various agent-based application scenarios. For effective training without interactions with the dedicated environments, we introduce a data synthesis technique that autonomously collects conversational datasets by swapping the roles of the agent and environment. This method is especially useful in scenarios where only a predefined prompt is available without a corresponding training dataset. By internalizing complex prompts, Generative Context Distillation enables high-performance and efficient inference without the need for explicit prompts.

Title: Improving Pre-Trained Self-Supervised Embeddings Through Effective Entropy Maximization

Authors: Deep Chakraborty, Yann LeCun, Tim G. J. Rudner, Erik Learned-Miller
Subjects: cs.LG, cs.CV, cs.IT, stat.AP, stat.ML
Abstract URL: https://arxiv.org/abs/2411.15931
Pdf URL: https://arxiv.org/pdf/2411.15931
Copy Paste: [[2411.15931]] Improving Pre-Trained Self-Supervised Embeddings Through Effective Entropy Maximization(https://arxiv.org/abs/2411.15931)
Keywords: self-supervised
Abstract: A number of different architectures and loss functions have been applied to the problem of self-supervised learning (SSL), with the goal of developing embeddings that provide the best possible pre-training for as-yet-unknown, lightly supervised downstream tasks. One of these SSL criteria is to maximize the entropy of a set of embeddings in some compact space. But the goal of maximizing the embedding entropy often depends--whether explicitly or implicitly--upon high dimensional entropy estimates, which typically perform poorly in more than a few dimensions. In this paper, we motivate an effective entropy maximization criterion (E2MC), defined in terms of easy-to-estimate, low-dimensional constraints. We demonstrate that using it to continue training an already-trained SSL model for only a handful of epochs leads to a consistent and, in some cases, significant improvement in downstream performance. We perform careful ablation studies to show that the improved performance is due to the proposed add-on criterion. We also show that continued pre-training with alternative criteria does not lead to notable improvements, and in some cases, even degrades performance.

Title: Gaussian Scenes: Pose-Free Sparse-View Scene Reconstruction using Depth-Enhanced Diffusion Priors

Authors: Soumava Paul, Prakhar Kaushik, Alan Yuille
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.15966
Pdf URL: https://arxiv.org/pdf/2411.15966
Copy Paste: [[2411.15966]] Gaussian Scenes: Pose-Free Sparse-View Scene Reconstruction using Depth-Enhanced Diffusion Priors(https://arxiv.org/abs/2411.15966)
Keywords: diffusion, generative
Abstract: In this work, we introduce a generative approach for pose-free reconstruction of $360^{\circ}$ scenes from a limited number of uncalibrated 2D images. Pose-free scene reconstruction from incomplete, unposed observations is usually regularized with depth estimation or 3D foundational priors. While recent advances have enabled sparse-view reconstruction of unbounded scenes with known camera poses using diffusion priors, these methods rely on explicit camera embeddings for extrapolating unobserved regions. This reliance limits their application in pose-free settings, where view-specific data is only implicitly available. To address this, we propose an instruction-following RGBD diffusion model designed to inpaint missing details and remove artifacts in novel view renders and depth maps of a 3D scene. We also propose a novel confidence measure for Gaussian representations to allow for better detection of these artifacts. By progressively integrating these novel views in a Gaussian-SLAM-inspired process, we achieve a multi-view-consistent Gaussian representation. Evaluations on the MipNeRF360 dataset demonstrate that our method surpasses existing pose-free techniques and performs competitively with state-of-the-art posed reconstruction methods in complex $360^{\circ}$ scenes.

Title: ROADS: Robust Prompt-driven Multi-Class Anomaly Detection under Domain Shift

Authors: Hossein Kashiani, Niloufar Alipour Talemi, Fatemeh Afghah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16049
Pdf URL: https://arxiv.org/pdf/2411.16049
Copy Paste: [[2411.16049]] ROADS: Robust Prompt-driven Multi-Class Anomaly Detection under Domain Shift(https://arxiv.org/abs/2411.16049)
Keywords: anomaly
Abstract: Recent advancements in anomaly detection have shifted focus towards Multi-class Unified Anomaly Detection (MUAD), offering more scalable and practical alternatives compared to traditional one-class-one-model approaches. However, existing MUAD methods often suffer from inter-class interference and are highly susceptible to domain shifts, leading to substantial performance degradation in real-world applications. In this paper, we propose a novel robust prompt-driven MUAD framework, called ROADS, to address these challenges. ROADS employs a hierarchical class-aware prompt integration mechanism that dynamically encodes class-specific information into our anomaly detector to mitigate interference among anomaly classes. Additionally, ROADS incorporates a domain adapter to enhance robustness against domain shifts by learning domain-invariant representations. Extensive experiments on MVTec-AD and VISA datasets demonstrate that ROADS surpasses state-of-the-art methods in both anomaly detection and localization, with notable improvements in out-of-distribution settings.

Title: VICON: Vision In-Context Operator Networks for Multi-Physics Fluid Dynamics Prediction

Authors: Yadi Cao, Yuxuan Liu, Liu Yang, Rose Yu, Hayden Schaeffer, Stanley Osher
Subjects: cs.LG, math.NA, physics.flu-dyn
Abstract URL: https://arxiv.org/abs/2411.16063
Pdf URL: https://arxiv.org/pdf/2411.16063
Copy Paste: [[2411.16063]] VICON: Vision In-Context Operator Networks for Multi-Physics Fluid Dynamics Prediction(https://arxiv.org/abs/2411.16063)
Keywords: in-context
Abstract: In-Context Operator Networks (ICONs) are models that learn operators across different types of PDEs using a few-shot, in-context approach. Although they show successful generalization to various PDEs, existing methods treat each data point as a single token, and suffer from computational inefficiency when processing dense data, limiting their application in higher spatial dimensions. In this work, we propose Vision In-Context Operator Networks (VICON), incorporating a vision transformer architecture that efficiently processes 2D functions through patch-wise operations. We evaluated our method on three fluid dynamics datasets, demonstrating both superior performance (reducing scaled $L^2$ error by $40\%$ and $61.6\%$ for two benchmark datasets for compressible flows, respectively) and computational efficiency (requiring only one-third of the inference time per frame) in long-term rollout predictions compared to the current state-of-the-art sequence-to-sequence model with fixed timestep prediction: Multiple Physics Pretraining (MPP). Compared to MPP, our method preserves the benefits of in-context operator learning, enabling flexible context formation when dealing with insufficient frame counts or varying timestep values.

Title: Geometry Distributions

Authors: Biao Zhang, Jing Ren, Peter Wonka
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.16076
Pdf URL: https://arxiv.org/pdf/2411.16076
Copy Paste: [[2411.16076]] Geometry Distributions(https://arxiv.org/abs/2411.16076)
Keywords: diffusion
Abstract: Neural representations of 3D data have been widely adopted across various applications, particularly in recent work leveraging coordinate-based networks to model scalar or vector fields. However, these approaches face inherent challenges, such as handling thin structures and non-watertight geometries, which limit their flexibility and accuracy. In contrast, we propose a novel geometric data representation that models geometry as distributions-a powerful representation that makes no assumptions about surface genus, connectivity, or boundary conditions. Our approach uses diffusion models with a novel network architecture to learn surface point distributions, capturing fine-grained geometric details. We evaluate our representation qualitatively and quantitatively across various object types, demonstrating its effectiveness in achieving high geometric fidelity. Additionally, we explore applications using our representation, such as textured mesh representation, neural surface compression, dynamic object modeling, and rendering, highlighting its potential to advance 3D geometric learning.

Title: Debiasing Classifiers by Amplifying Bias with Latent Diffusion and Large Language Models

Authors: Donggeun Ko, Dongjun Lee, Namjun Park, Wonkyeong Shim, Jaekwang Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16079
Pdf URL: https://arxiv.org/pdf/2411.16079
Copy Paste: [[2411.16079]] Debiasing Classifiers by Amplifying Bias with Latent Diffusion and Large Language Models(https://arxiv.org/abs/2411.16079)
Keywords: diffusion, generative
Abstract: Neural networks struggle with image classification when biases are learned and misleads correlations, affecting their generalization and performance. Previous methods require attribute labels (e.g. background, color) or utilizes Generative Adversarial Networks (GANs) to mitigate biases. We introduce DiffuBias, a novel pipeline for text-to-image generation that enhances classifier robustness by generating bias-conflict samples, without requiring training during the generation phase. Utilizing pretrained diffusion and image captioning models, DiffuBias generates images that challenge the biases of classifiers, using the top-$K$ losses from a biased classifier ($f_B$) to create more representative data samples. This method not only debiases effectively but also boosts classifier generalization capabilities. To the best of our knowledge, DiffuBias is the first approach leveraging a stable diffusion model to generate bias-conflict samples in debiasing tasks. Our comprehensive experimental evaluations demonstrate that DiffuBias achieves state-of-the-art performance on benchmark datasets. We also conduct a comparative analysis of various generative models in terms of carbon emissions and energy consumption to highlight the significance of computational efficiency.

Title: Boosting 3D Object Generation through PBR Materials

Authors: Yitong Wang, Xudong Xu, Li Ma, Haoran Wang, Bo Dai
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16080
Pdf URL: https://arxiv.org/pdf/2411.16080
Copy Paste: [[2411.16080]] Boosting 3D Object Generation through PBR Materials(https://arxiv.org/abs/2411.16080)
Keywords: diffusion
Abstract: Automatic 3D content creation has gained increasing attention recently, due to its potential in various applications such as video games, film industry, and AR/VR. Recent advancements in diffusion models and multimodal models have notably improved the quality and efficiency of 3D object generation given a single RGB image. However, 3D objects generated even by state-of-the-art methods are still unsatisfactory compared to human-created assets. Considering only textures instead of materials makes these methods encounter challenges in photo-realistic rendering, relighting, and flexible appearance editing. And they also suffer from severe misalignment between geometry and high-frequency texture details. In this work, we propose a novel approach to boost the quality of generated 3D objects from the perspective of Physics-Based Rendering (PBR) materials. By analyzing the components of PBR materials, we choose to consider albedo, roughness, metalness, and bump maps. For albedo and bump maps, we leverage Stable Diffusion fine-tuned on synthetic data to extract these values, with novel usages of these fine-tuned models to obtain 3D consistent albedo UV and bump UV for generated objects. In terms of roughness and metalness maps, we adopt a semi-automatic process to provide room for interactive adjustment, which we believe is more practical. Extensive experiments demonstrate that our model is generally beneficial for various state-of-the-art generation methods, significantly boosting the quality and realism of their generated 3D objects, with natural relighting effects and substantially improved geometry.

Title: AI-Generated Image Quality Assessment Based on Task-Specific Prompt and Multi-Granularity Similarity

Authors: Jili Xia, Lihuo He, Fei Gao, Kaifan Zhang, Leida Li, Xinbo Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16087
Pdf URL: https://arxiv.org/pdf/2411.16087
Copy Paste: [[2411.16087]] AI-Generated Image Quality Assessment Based on Task-Specific Prompt and Multi-Granularity Similarity(https://arxiv.org/abs/2411.16087)
Keywords: generative
Abstract: Recently, AI-generated images (AIGIs) created by given prompts (initial prompts) have garnered widespread attention. Nevertheless, due to technical nonproficiency, they often suffer from poor perception quality and Text-to-Image misalignment. Therefore, assessing the perception quality and alignment quality of AIGIs is crucial to improving the generative model's performance. Existing assessment methods overly rely on the initial prompts in the task prompt design and use the same prompts to guide both perceptual and alignment quality evaluation, overlooking the distinctions between the two tasks. To address this limitation, we propose a novel quality assessment method for AIGIs named TSP-MGS, which designs task-specific prompts and measures multi-granularity similarity between AIGIs and the prompts. Specifically, task-specific prompts are first constructed to describe perception and alignment quality degrees separately, and the initial prompt is introduced for detailed quality perception. Then, the coarse-grained similarity between AIGIs and task-specific prompts is calculated, which facilitates holistic quality awareness. In addition, to improve the understanding of AIGI details, the fine-grained similarity between the image and the initial prompt is measured. Finally, precise quality prediction is acquired by integrating the multi-granularity similarities. Experiments on the commonly used AGIQA-1K and AGIQA-3K benchmarks demonstrate the superiority of the proposed TSP-MGS.

Title: FUN-AD: Fully Unsupervised Learning for Anomaly Detection with Noisy Training Data

Authors: Jiin Im, Yongho Son, Je Hyeong Hong
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2411.16110
Pdf URL: https://arxiv.org/pdf/2411.16110
Copy Paste: [[2411.16110]] FUN-AD: Fully Unsupervised Learning for Anomaly Detection with Noisy Training Data(https://arxiv.org/abs/2411.16110)
Keywords: anomaly
Abstract: While the mainstream research in anomaly detection has mainly followed the one-class classification, practical industrial environments often incur noisy training data due to annotation errors or lack of labels for new or refurbished products. To address these issues, we propose a novel learning-based approach for fully unsupervised anomaly detection with unlabeled and potentially contaminated training data. Our method is motivated by two observations, that i) the pairwise feature distances between the normal samples are on average likely to be smaller than those between the anomaly samples or heterogeneous samples and ii) pairs of features mutually closest to each other are likely to be homogeneous pairs, which hold if the normal data has smaller variance than the anomaly data. Building on the first observation that nearest-neighbor distances can distinguish between confident normal samples and anomalies, we propose a pseudo-labeling strategy using an iteratively reconstructed memory bank (IRMB). The second observation is utilized as a new loss function to promote class-homogeneity between mutually closest pairs thereby reducing the ill-posedness of the task. Experimental results on two public industrial anomaly benchmarks and semantic anomaly examples validate the effectiveness of FUN-AD across different scenarios and anomaly-to-normal ratios. Our code is available at this https URL.

Title: Med-PerSAM: One-Shot Visual Prompt Tuning for Personalized Segment Anything Model in Medical Domain

Authors: Hangyul Yoon, Doohyuk Jang, Jungeun Kim, Eunho Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16123
Pdf URL: https://arxiv.org/pdf/2411.16123
Copy Paste: [[2411.16123]] Med-PerSAM: One-Shot Visual Prompt Tuning for Personalized Segment Anything Model in Medical Domain(https://arxiv.org/abs/2411.16123)
Keywords: in-context
Abstract: Leveraging pre-trained models with tailored prompts for in-context learning has proven highly effective in NLP tasks. Building on this success, recent studies have applied a similar approach to the Segment Anything Model (SAM) within a ``one-shot" framework, where only a single reference image and its label are employed. However, these methods face limitations in the medical domain, primarily due to SAM's essential requirement for visual prompts and the over-reliance on pixel similarity for generating them. This dependency may lead to (1) inaccurate prompt generation and (2) clustering of point prompts, resulting in suboptimal outcomes. To address these challenges, we introduce \textbf{Med-PerSAM}, a novel and straightforward one-shot framework designed for the medical domain. Med-PerSAM uses only visual prompt engineering and eliminates the need for additional training of the pretrained SAM or human intervention, owing to our novel automated prompt generation process. By integrating our lightweight warping-based prompt tuning model with SAM, we enable the extraction and iterative refinement of visual prompts, enhancing the performance of the pre-trained SAM. This advancement is particularly meaningful in the medical domain, where creating visual prompts poses notable challenges for individuals lacking medical expertise. Our model outperforms various foundational models and previous SAM-based approaches across diverse 2D medical imaging datasets.

Title: CIA: Controllable Image Augmentation Framework Based on Stable Diffusion

Authors: Mohamed Benkedadra, Dany Rimez, Tiffanie Godelaine, Natarajan Chidambaram, Hamed Razavi Khosroshahi, Horacio Tellez, Matei Mancas, Benoit Macq, Sidi Ahmed Mahmoudi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16128
Pdf URL: https://arxiv.org/pdf/2411.16128
Copy Paste: [[2411.16128]] CIA: Controllable Image Augmentation Framework Based on Stable Diffusion(https://arxiv.org/abs/2411.16128)
Keywords: diffusion
Abstract: Computer vision tasks such as object detection and segmentation rely on the availability of extensive, accurately annotated datasets. In this work, We present CIA, a modular pipeline, for (1) generating synthetic images for dataset augmentation using Stable Diffusion, (2) filtering out low quality samples using defined quality metrics, (3) forcing the existence of specific patterns in generated images using accurate prompting and ControlNet. In order to show how CIA can be used to search for an optimal augmentation pipeline of training data, we study human object detection in a data constrained scenario, using YOLOv8n on COCO and Flickr30k datasets. We have recorded significant improvement using CIA-generated images, approaching the performances obtained when doubling the amount of real images in the dataset. Our findings suggest that our modular framework can significantly enhance object detection systems, and make it possible for future research to be done on data-constrained scenarios. The framework is available at: this http URL.

Title: DeDe: Detecting Backdoor Samples for SSL Encoders via Decoders

Authors: Sizai Hou, Songze Li, Duanyi Yao
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2411.16154
Pdf URL: https://arxiv.org/pdf/2411.16154
Copy Paste: [[2411.16154]] DeDe: Detecting Backdoor Samples for SSL Encoders via Decoders(https://arxiv.org/abs/2411.16154)
Keywords: self-supervised
Abstract: Self-supervised learning (SSL) is pervasively exploited in training high-quality upstream encoders with a large amount of unlabeled data. However, it is found to be susceptible to backdoor attacks merely via polluting a small portion of training data. The victim encoders mismatch triggered inputs with target embeddings, e.g., match the triggered cat input to an airplane embedding, such that the downstream tasks are affected to misbehave when the trigger is activated. Emerging backdoor attacks have shown great threats in different SSL paradigms such as contrastive learning and CLIP, while few research is devoted to defending against such attacks. Besides, the existing ones fall short in detecting advanced stealthy backdoors. To address the limitations, we propose a novel detection mechanism, DeDe, which detects the activation of the backdoor mapping with the cooccurrence of victim encoder and trigger inputs. Specifically, DeDe trains a decoder for the SSL encoder on an auxiliary dataset (can be out-of-distribution or even slightly poisoned), such that for any triggered input that misleads to the target embedding, the decoder outputs an image significantly different from the input. We empirically evaluate DeDe on both contrastive learning and CLIP models against various types of backdoor attacks, and demonstrate its superior performance over SOTA detection methods in both upstream detection performance and ability of preventing backdoors in downstream tasks.

Title: Graph Adapter of EEG Foundation Models for Parameter Efficient Fine Tuning

Authors: Toyotaro Suzumura, Hiroki Kanezashi, Shotaro Akahori
Subjects: cs.LG, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/2411.16155
Pdf URL: https://arxiv.org/pdf/2411.16155
Copy Paste: [[2411.16155]] Graph Adapter of EEG Foundation Models for Parameter Efficient Fine Tuning(https://arxiv.org/abs/2411.16155)
Keywords: foundation model
Abstract: In diagnosing mental diseases from electroencephalography (EEG) data, neural network models such as Transformers have been employed to capture temporal dynamics. Additionally, it is crucial to learn the spatial relationships between EEG sensors, for which Graph Neural Networks (GNNs) are commonly used. However, fine-tuning large-scale complex neural network models simultaneously to capture both temporal and spatial features increases computational costs due to the more significant number of trainable parameters. It causes the limited availability of EEG datasets for downstream tasks, making it challenging to fine-tune large models effectively. We propose EEG-GraphAdapter (EGA), a parameter-efficient fine-tuning (PEFT) approach to address these challenges. EGA is integrated into pre-trained temporal backbone models as a GNN-based module and fine-tuned itself alone while keeping the backbone model parameters frozen. This enables the acquisition of spatial representations of EEG signals for downstream tasks, significantly reducing computational overhead and data requirements. Experimental evaluations on healthcare-related downstream tasks of Major Depressive Disorder and Abnormality Detection demonstrate that our EGA improves performance by up to 16.1% in the F1-score compared with the backbone BENDR model.

Title: MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model

Authors: Chenjie Cao, Chaohui Yu, Shang Liu, Fan Wang, Xiangyang Xue, Yanwei Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16157
Pdf URL: https://arxiv.org/pdf/2411.16157
Copy Paste: [[2411.16157]] MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model(https://arxiv.org/abs/2411.16157)
Keywords: diffusion
Abstract: We introduce MVGenMaster, a multi-view diffusion model enhanced with 3D priors to address versatile Novel View Synthesis (NVS) tasks. MVGenMaster leverages 3D priors that are warped using metric depth and camera poses, significantly enhancing both generalization and 3D consistency in NVS. Our model features a simple yet effective pipeline that can generate up to 100 novel views conditioned on arbitrary reference views and camera poses with a single forward process. Additionally, we have developed a comprehensive large-scale multi-view image dataset comprising up to 1.2 million scenes, equipped with well-aligned metric depth. Moreover, we present several training and model modifications to strengthen the model with scaled-up datasets. Extensive evaluations across in- and out-of-domain benchmarks demonstrate the effectiveness of our proposed method and data formulation. Models and codes will be released at this https URL.

Title: Text-to-Image Synthesis: A Decade Survey

Authors: Nonghai Zhang, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16164
Pdf URL: https://arxiv.org/pdf/2411.16164
Copy Paste: [[2411.16164]] Text-to-Image Synthesis: A Decade Survey(https://arxiv.org/abs/2411.16164)
Keywords: diffusion, foundation model, generative
Abstract: When humans read a specific text, they often visualize the corresponding images, and we hope that computers can do the same. Text-to-image synthesis (T2I), which focuses on generating high-quality images from textual descriptions, has become a significant aspect of Artificial Intelligence Generated Content (AIGC) and a transformative direction in artificial intelligence research. Foundation models play a crucial role in T2I. In this survey, we review over 440 recent works on T2I. We start by briefly introducing how GANs, autoregressive models, and diffusion models have been used for image generation. Building on this foundation, we discuss the development of these models for T2I, focusing on their generative capabilities and diversity when conditioned on text. We also explore cutting-edge research on various aspects of T2I, including performance, controllability, personalized generation, safety concerns, and consistency in content and spatial relationships. Furthermore, we summarize the datasets and evaluation metrics commonly used in T2I research. Finally, we discuss the potential applications of T2I within AIGC, along with the challenges and future research opportunities in this field.

Title: BadSFL: Backdoor Attack against Scaffold Federated Learning

Authors: Xingshuo Han, Xiang Lan, Haozhao Wang, Shengmin Xu, Shen Ren, Jason Zeng, Ming Wu, Michael Heinrich, Tianwei Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.16167
Pdf URL: https://arxiv.org/pdf/2411.16167
Copy Paste: [[2411.16167]] BadSFL: Backdoor Attack against Scaffold Federated Learning(https://arxiv.org/abs/2411.16167)
Keywords: generative
Abstract: Federated learning (FL) enables the training of deep learning models on distributed clients to preserve data privacy. However, this learning paradigm is vulnerable to backdoor attacks, where malicious clients can upload poisoned local models to embed backdoors into the global model, leading to attacker-desired predictions. Existing backdoor attacks mainly focus on FL with independently and identically distributed (IID) scenarios, while real-world FL training data are typically non-IID. Current strategies for non-IID backdoor attacks suffer from limitations in maintaining effectiveness and durability. To address these challenges, we propose a novel backdoor attack method, \name, specifically designed for the FL framework using the scaffold aggregation algorithm in non-IID settings. \name leverages a Generative Adversarial Network (GAN) based on the global model to complement the training set, achieving high accuracy on both backdoor and benign samples. It utilizes a specific feature as the backdoor trigger to ensure stealthiness, and exploits the Scaffold's control variate to predict the global model's convergence direction, ensuring the backdoor's persistence. Extensive experiments on three benchmark datasets demonstrate the high effectiveness, stealthiness, and durability of \name. Notably, our attack remains effective over 60 rounds in the global model and up to 3 times longer than existing baseline attacks after stopping the injection of malicious updates.

Title: Image Generation Diversity Issues and How to Tame Them

Authors: Mischa Dombrowski, Weitong Zhang, Sarah Cechnicka, Hadrien Reynaud, Bernhard Kainz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16171
Pdf URL: https://arxiv.org/pdf/2411.16171
Copy Paste: [[2411.16171]] Image Generation Diversity Issues and How to Tame Them(https://arxiv.org/abs/2411.16171)
Keywords: diffusion, generative
Abstract: Generative methods now produce outputs nearly indistinguishable from real data but often fail to fully capture the data distribution. Unlike quality issues, diversity limitations in generative models are hard to detect visually, requiring specific metrics for assessment. In this paper, we draw attention to the current lack of diversity in generative models and the inability of common metrics to measure this. We achieve this by framing diversity as an image retrieval problem, where we measure how many real images can be retrieved using synthetic data as queries. This yields the Image Retrieval Score (IRS), an interpretable, hyperparameter-free metric that quantifies the diversity of a generative model's output. IRS requires only a subset of synthetic samples and provides a statistical measure of confidence. Our experiments indicate that current feature extractors commonly used in generative model assessment are inadequate for evaluating diversity effectively. Consequently, we perform an extensive search for the best feature extractors to assess diversity. Evaluation reveals that current diffusion models converge to limited subsets of the real distribution, with no current state-of-the-art models superpassing 77% of the diversity of the training data. To address this limitation, we introduce Diversity-Aware Diffusion Models (DiADM), a novel approach that improves diversity of unconditional diffusion models without loss of image quality. We do this by disentangling diversity from image quality by using a diversity aware module that uses pseudo-unconditional features as input. We provide a Python package offering unified feature extraction and metric computation to further facilitate the evaluation of generative models this https URL.

Title: U2NeRF: Unsupervised Underwater Image Restoration and Neural Radiance Fields

Authors: Vinayak Gupta, Manoj S, Mukund Varma T, Kaushik Mitra
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16172
Pdf URL: https://arxiv.org/pdf/2411.16172
Copy Paste: [[2411.16172]] U2NeRF: Unsupervised Underwater Image Restoration and Neural Radiance Fields(https://arxiv.org/abs/2411.16172)
Keywords: self-supervised
Abstract: Underwater images suffer from colour shifts, low contrast, and haziness due to light absorption, refraction, scattering and restoring these images has warranted much attention. In this work, we present Unsupervised Underwater Neural Radiance Field U2NeRF, a transformer-based architecture that learns to render and restore novel views conditioned on multi-view geometry simultaneously. Due to the absence of supervision, we attempt to implicitly bake restoring capabilities onto the NeRF pipeline and disentangle the predicted color into several components - scene radiance, direct transmission map, backscatter transmission map, and global background light, and when combined reconstruct the underwater image in a self-supervised manner. In addition, we release an Underwater View Synthesis UVS dataset consisting of 12 underwater scenes, containing both synthetically-generated and real-world data. Our experiments demonstrate that when optimized on a single scene, U2NeRF outperforms several baselines by as much LPIPS 11%, UIQM 5%, UCIQE 4% (on average) and showcases improved rendering and restoration capabilities. Code will be made available upon acceptance.

Title: Any3DIS: Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking

Authors: Phuc Nguyen, Minh Luu, Anh Tran, Cuong Pham, Khoi Nguyen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16183
Pdf URL: https://arxiv.org/pdf/2411.16183
Copy Paste: [[2411.16183]] Any3DIS: Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking(https://arxiv.org/abs/2411.16183)
Keywords: foundation model
Abstract: Existing 3D instance segmentation methods frequently encounter issues with over-segmentation, leading to redundant and inaccurate 3D proposals that complicate downstream tasks. This challenge arises from their unsupervised merging approach, where dense 2D instance masks are lifted across frames into point clouds to form 3D candidate proposals without direct supervision. These candidates are then hierarchically merged based on heuristic criteria, often resulting in numerous redundant segments that fail to combine into precise 3D proposals. To overcome these limitations, we propose a 3D-Aware 2D Mask Tracking module that uses robust 3D priors from a 2D mask segmentation and tracking foundation model (SAM-2) to ensure consistent object masks across video frames. Rather than merging all visible superpoints across views to create a 3D mask, our 3D Mask Optimization module leverages a dynamic programming algorithm to select an optimal set of views, refining the superpoints to produce a final 3D proposal for each object. Our approach achieves comprehensive object coverage within the scene while reducing unnecessary proposals, which could otherwise impair downstream applications. Evaluations on ScanNet200 and ScanNet++ confirm the effectiveness of our method, with improvements across Class-Agnostic, Open-Vocabulary, and Open-Ended 3D Instance Segmentation tasks.

Title: Fancy123: One Image to High-Quality 3D Mesh Generation via Plug-and-Play Deformation

Authors: Qiao Yu, Xianzhi Li, Yuan Tang, Xu Han, Long Hu, Yixue Hao, Min Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16185
Pdf URL: https://arxiv.org/pdf/2411.16185
Copy Paste: [[2411.16185]] Fancy123: One Image to High-Quality 3D Mesh Generation via Plug-and-Play Deformation(https://arxiv.org/abs/2411.16185)
Keywords: diffusion
Abstract: Generating 3D meshes from a single image is an important but ill-posed task. Existing methods mainly adopt 2D multiview diffusion models to generate intermediate multiview images, and use the Large Reconstruction Model (LRM) to create the final meshes. However, the multiview images exhibit local inconsistencies, and the meshes often lack fidelity to the input image or look blurry. We propose Fancy123, featuring two enhancement modules and an unprojection operation to address the above three issues, respectively. The appearance enhancement module deforms the 2D multiview images to realign misaligned pixels for better multiview consistency. The fidelity enhancement module deforms the 3D mesh to match the input image. The unprojection of the input image and deformed multiview images onto LRM's generated mesh ensures high clarity, discarding LRM's predicted blurry-looking mesh colors. Extensive qualitative and quantitative experiments verify Fancy123's SoTA performance with significant improvement. Also, the two enhancement modules are plug-and-play and work at inference time, allowing seamless integration into various existing single-image-to-3D methods.

Title: Learn from Foundation Model: Fruit Detection Model without Manual Annotation

Authors: Yanan Wang, Zhenghao Fei, Ruichen Li, Yibin Ying
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16196
Pdf URL: https://arxiv.org/pdf/2411.16196
Copy Paste: [[2411.16196]] Learn from Foundation Model: Fruit Detection Model without Manual Annotation(https://arxiv.org/abs/2411.16196)
Keywords: foundation model
Abstract: Recent breakthroughs in large foundation models have enabled the possibility of transferring knowledge pre-trained on vast datasets to domains with limited data availability. Agriculture is one of the domains that lacks sufficient data. This study proposes a framework to train effective, domain-specific, small models from foundation models without manual annotation. Our approach begins with SDM (Segmentation-Description-Matching), a stage that leverages two foundation models: SAM2 (Segment Anything in Images and Videos) for segmentation and OpenCLIP (Open Contrastive Language-Image Pretraining) for zero-shot open-vocabulary classification. In the second stage, a novel knowledge distillation mechanism is utilized to distill compact, edge-deployable models from SDM, enhancing both inference speed and perception accuracy. The complete method, termed SDM-D (Segmentation-Description-Matching-Distilling), demonstrates strong performance across various fruit detection tasks object detection, semantic segmentation, and instance segmentation) without manual annotation. It nearly matches the performance of models trained with abundant labels. Notably, SDM-D outperforms open-set detection methods such as Grounding SAM and YOLO-World on all tested fruit detection datasets. Additionally, we introduce MegaFruits, a comprehensive fruit segmentation dataset encompassing over 25,000 images, and all code and datasets are made publicly available at this https URL.

Title: Interpreting Object-level Foundation Models via Visual Precision Search

Authors: Ruoyu Chen, Siyuan Liang, Jingzhi Li, Shiming Liu, Maosen Li, Zheng Huang, Hua Zhang, Xiaochun Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16198
Pdf URL: https://arxiv.org/pdf/2411.16198
Copy Paste: [[2411.16198]] Interpreting Object-level Foundation Models via Visual Precision Search(https://arxiv.org/abs/2411.16198)
Keywords: foundation model
Abstract: Advances in multimodal pre-training have propelled object-level foundation models, such as Grounding DINO and Florence-2, in tasks like visual grounding and object detection. However, interpreting these models\' decisions has grown increasingly challenging. Existing interpretable attribution methods for object-level task interpretation have notable limitations: (1) gradient-based methods lack precise localization due to visual-textual fusion in foundation models, and (2) perturbation-based methods produce noisy saliency maps, limiting fine-grained interpretability. To address these, we propose a Visual Precision Search method that generates accurate attribution maps with fewer regions. Our method bypasses internal model parameters to overcome attribution issues from multimodal fusion, dividing inputs into sparse sub-regions and using consistency and collaboration scores to accurately identify critical decision-making regions. We also conducted a theoretical analysis of the boundary guarantees and scope of applicability of our method. Experiments on RefCOCO, MS COCO, and LVIS show our approach enhances object-level task interpretability over SOTA for Grounding DINO and Florence-2 across various evaluation metrics, with faithfulness gains of 23.7\%, 31.6\%, and 20.1\% on MS COCO, LVIS, and RefCOCO for Grounding DINO, and 102.9\% and 66.9\% on MS COCO and RefCOCO for Florence-2. Additionally, our method can interpret failures in visual grounding and object detection tasks, surpassing existing methods across multiple evaluation metrics. The code will be released at \url{this https URL}.

Title: VIRES: Video Instance Repainting with Sketch and Text Guidance

Authors: Shuchen Weng, Haojie Zheng, Peixuan Zhan, Yuchen Hong, Han Jiang, Si Li, Boxin Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16199
Pdf URL: https://arxiv.org/pdf/2411.16199
Copy Paste: [[2411.16199]] VIRES: Video Instance Repainting with Sketch and Text Guidance(https://arxiv.org/abs/2411.16199)
Keywords: diffusion, generative
Abstract: We introduce VIRES, a video instance repainting method with sketch and text guidance, enabling video instance repainting, replacement, generation, and removal. Existing approaches struggle with temporal consistency and accurate alignment with the provided sketch sequence. VIRES leverages the generative priors of text-to-video models to maintain temporal consistency and produce visually pleasing results. We propose the Sequential ControlNet with the standardized self-scaling, which effectively extracts structure layouts and adaptively captures high-contrast sketch details. We further augment the diffusion transformer backbone with the sketch attention to interpret and inject fine-grained sketch semantics. A sketch-aware encoder ensures that repainted results are aligned with the provided sketch sequence. Additionally, we contribute the VireSet, a dataset with detailed annotations tailored for training and evaluating video instance editing methods. Experimental results demonstrate the effectiveness of VIRES, which outperforms state-of-the-art methods in visual quality, temporal consistency, condition alignment, and human ratings.

Title: SMGDiff: Soccer Motion Generation using diffusion probabilistic models

Authors: Hongdi Yang, Chengyang Li, Zhenxuan Wu, Gaozheng Li, Jingya Wang, Jingyi Yu, Zhuo Su, Lan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16216
Pdf URL: https://arxiv.org/pdf/2411.16216
Copy Paste: [[2411.16216]] SMGDiff: Soccer Motion Generation using diffusion probabilistic models(https://arxiv.org/abs/2411.16216)
Keywords: diffusion, generative
Abstract: Soccer is a globally renowned sport with significant applications in video games and VR/AR. However, generating realistic soccer motions remains challenging due to the intricate interactions between the human player and the ball. In this paper, we introduce SMGDiff, a novel two-stage framework for generating real-time and user-controllable soccer motions. Our key idea is to integrate real-time character control with a powerful diffusion-based generative model, ensuring high-quality and diverse output motion. In the first stage, we instantly transform coarse user controls into diverse global trajectories of the character. In the second stage, we employ a transformer-based autoregressive diffusion model to generate soccer motions based on trajectory conditioning. We further incorporate a contact guidance module during inference to optimize the contact details for realistic ball-foot interactions. Moreover, we contribute a large-scale soccer motion dataset consisting of over 1.08 million frames of diverse soccer motions. Extensive experiments demonstrate that our SMGDiff significantly outperforms existing methods in terms of motion quality and condition alignment.

Title: Weakly supervised image segmentation for defect-based grading of fresh produce

Authors: Manuel Knott, Divinefavour Odion, Sameer Sontakke, Anup Karwa, Thijs Defraeye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16219
Pdf URL: https://arxiv.org/pdf/2411.16219
Copy Paste: [[2411.16219]] Weakly supervised image segmentation for defect-based grading of fresh produce(https://arxiv.org/abs/2411.16219)
Keywords: foundation model
Abstract: Implementing image-based machine learning in agriculture is often limited by scarce data and annotations, making it hard to achieve high-quality model predictions. This study tackles the issue of postharvest quality assessment of bananas in decentralized supply chains. We propose a method to detect and segment surface defects in banana images using panoptic segmentation to quantify defect size and number. Instead of time-consuming pixel-level annotations, we use weak supervision with coarse labels. A dataset of 476 smartphone images of bananas was collected under real-world field conditions and annotated for bruises and scars. Using the Segment Anything Model (SAM), a recently published foundation model for image segmentation, we generated dense annotations from coarse bounding boxes to train a segmentation model, significantly reducing manual effort while achieving a panoptic quality score of 77.6%. This demonstrates SAM's potential for low-effort, accurate segmentation in agricultural settings with limited data.

Title: DoubleCCA: Improving Foundation Model Group Robustness with Random Sentence Embeddings

Authors: Hong Liu, Yitong Lu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2411.16236
Pdf URL: https://arxiv.org/pdf/2411.16236
Copy Paste: [[2411.16236]] DoubleCCA: Improving Foundation Model Group Robustness with Random Sentence Embeddings(https://arxiv.org/abs/2411.16236)
Keywords: foundation model
Abstract: This paper presents a novel method to improve the robustness of foundation models to group-based biases. We propose a simple yet effective method, called DoubleCCA, that leverages random sentences and Canonical Correlation Analysis (CCA) to enrich the text embeddings of the foundation model. First, we generate various random sentences that augment the original prompts, which extends the original prompts with random words or character sequences. Second, we use an additional sentence embedding model to generate different text embeddings with respect to these random sentences. We then use CCA double twice to align the representations and reconstruct them back to the original representation space. We demonstrate the effectiveness of our method on a variety of tasks and datasets, showing that it outperforms existing methods in terms of both performance and robustness. Our method is simple to implement and can be easily integrated into existing models, making it a practical solution for improving the robustness of foundation models to group-based biases.

Title: Transparent Neighborhood Approximation for Text Classifier Explanation

Authors: Yi Cai, Arthur Zimek, Eirini Ntoutsi, Gerhard Wunder
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16251
Pdf URL: https://arxiv.org/pdf/2411.16251
Copy Paste: [[2411.16251]] Transparent Neighborhood Approximation for Text Classifier Explanation(https://arxiv.org/abs/2411.16251)
Keywords: generative
Abstract: Recent literature highlights the critical role of neighborhood construction in deriving model-agnostic explanations, with a growing trend toward deploying generative models to improve synthetic instance quality, especially for explaining text classifiers. These approaches overcome the challenges in neighborhood construction posed by the unstructured nature of texts, thereby improving the quality of explanations. However, the deployed generators are usually implemented via neural networks and lack inherent explainability, sparking arguments over the transparency of the explanation process itself. To address this limitation while preserving neighborhood quality, this paper introduces a probability-based editing method as an alternative to black-box text generators. This approach generates neighboring texts by implementing manipulations based on in-text contexts. Substituting the generator-based construction process with recursive probability-based editing, the resultant explanation method, XPROB (explainer with probability-based editing), exhibits competitive performance according to the evaluation conducted on two real-world datasets. Additionally, XPROB's fully transparent and more controllable construction process leads to superior stability compared to the generator-based explainers.

Title: BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment

Authors: Shaolei Zhang, Kehao Zhang, Qingkai Fang, Shoutao Guo, Yan Zhou, Xiaodong Liu, Yang Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16300
Pdf URL: https://arxiv.org/pdf/2411.16300
Copy Paste: [[2411.16300]] BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment(https://arxiv.org/abs/2411.16300)
Keywords: foundation model, generative
Abstract: Large language models (LLMs), with their powerful generative capabilities and vast knowledge, empower various tasks in everyday life. However, these abilities are primarily concentrated in high-resource languages, leaving low-resource languages with weaker generative capabilities and relatively limited knowledge. Enhancing the multilingual capabilities of LLMs is therefore crucial for serving over 100 linguistic communities worldwide. An intuitive approach to enhance the multilingual capabilities would be to construct instruction data for various languages, but constructing instruction data for over 100 languages is prohibitively costly. In this paper, we introduce BayLing 2, which efficiently transfers generative capabilities and knowledge from high-resource languages to low-resource languages through language alignment. To achieve this, we constructed a dataset of 3.2 million instructions, comprising high-resource language instructions (Chinese and English) and cross-lingual instructions for 100+ languages and performed instruction tuning based on the dataset to facilitate the capability transfer between languages. Using Llama as the foundation model, we developed BayLing-2-7B, BayLing-2-13B, and BayLing-3-8B, and conducted a comprehensive evaluation of BayLing. For multilingual translation across 100+ languages, BayLing shows superior performance compared to open-source models of similar scale. For multilingual knowledge and understanding benchmarks, BayLing achieves significant improvements across over 20 low-resource languages, demonstrating its capability of effective knowledge transfer from high-resource to low-resource languages. Furthermore, results on English benchmarks indicate that BayLing maintains high performance in highresource languages while enhancing the performance in low-resource languages. Demo, homepage, code and models of BayLing are available.

Title: DiffDesign: Controllable Diffusion with Meta Prior for Efficient Interior Design Generation

Authors: Yuxuan Yang, Jingyao Wang, Tao Geng, Wenwen Qiang, Changwen Zheng, Fuchun Sun
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16301
Pdf URL: https://arxiv.org/pdf/2411.16301
Copy Paste: [[2411.16301]] DiffDesign: Controllable Diffusion with Meta Prior for Efficient Interior Design Generation(https://arxiv.org/abs/2411.16301)
Keywords: diffusion, generative
Abstract: Interior design is a complex and creative discipline involving aesthetics, functionality, ergonomics, and materials science. Effective solutions must meet diverse requirements, typically producing multiple deliverables such as renderings and design drawings from various perspectives. Consequently, interior design processes are often inefficient and demand significant creativity. With advances in machine learning, generative models have emerged as a promising means of improving efficiency by creating designs from text descriptions or sketches. However, few generative works focus on interior design, leading to substantial discrepancies between outputs and practical needs, such as differences in size, spatial scope, and the lack of controllable generation quality. To address these challenges, we propose DiffDesign, a controllable diffusion model with meta priors for efficient interior design generation. Specifically, we utilize the generative priors of a 2D diffusion model pre-trained on a large image dataset as our rendering backbone. We further guide the denoising process by disentangling cross-attention control over design attributes, such as appearance, pose, and size, and introduce an optimal transfer-based alignment module to enforce view consistency. Simultaneously, we construct an interior design-specific dataset, DesignHelper, consisting of over 400 solutions across more than 15 spatial types and 15 design styles. This dataset helps fine-tune DiffDesign. Extensive experiments conducted on various benchmark datasets demonstrate the effectiveness and robustness of DiffDesign.

Title: An End-to-End Robust Point Cloud Semantic Segmentation Network with Single-Step Conditional Diffusion Models

Authors: Wentao Qu, Jing Wang, YongShun Gong, Xiaoshui Huang, Liang Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16308
Pdf URL: https://arxiv.org/pdf/2411.16308
Copy Paste: [[2411.16308]] An End-to-End Robust Point Cloud Semantic Segmentation Network with Single-Step Conditional Diffusion Models(https://arxiv.org/abs/2411.16308)
Keywords: diffusion
Abstract: Existing conditional Denoising Diffusion Probabilistic Models (DDPMs) with a Noise-Conditional Framework (NCF) remain challenging for 3D scene understanding tasks, as the complex geometric details in scenes increase the difficulty of fitting the gradients of the data distribution (the scores) from semantic labels. This also results in longer training and inference time for DDPMs compared to non-DDPMs. From a different perspective, we delve deeply into the model paradigm dominated by the Conditional Network. In this paper, we propose an end-to-end robust semantic \textbf{Seg}mentation \textbf{Net}work based on a \textbf{C}onditional-Noise Framework (CNF) of D\textbf{D}PMs, named \textbf{CDSegNet}. Specifically, CDSegNet models the Noise Network (NN) as a learnable noise-feature generator. This enables the Conditional Network (CN) to understand 3D scene semantics under multi-level feature perturbations, enhancing the generalization in unseen scenes. Meanwhile, benefiting from the noise system of DDPMs, CDSegNet exhibits strong noise and sparsity robustness in experiments. Moreover, thanks to CNF, CDSegNet can generate the semantic labels in a single-step inference like non-DDPMs, due to avoiding directly fitting the scores from semantic labels in the dominant network of CDSegNet. On public indoor and outdoor benchmarks, CDSegNet significantly outperforms existing methods, achieving state-of-the-art performance.

Title: One Diffusion to Generate Them All

Authors: Duong H. Le, Tuan Pham, Sangho Lee, Christopher Clark, Aniruddha Kembhavi, Stephan Mandt, Ranjay Krishna, Jiasen Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16318
Pdf URL: https://arxiv.org/pdf/2411.16318
Copy Paste: [[2411.16318]] One Diffusion to Generate Them All(https://arxiv.org/abs/2411.16318)
Keywords: diffusion
Abstract: We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps, while also handling tasks like image deblurring, upscaling, and reverse processes such as depth estimation and segmentation. Additionally, OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs. Our model takes a straightforward yet effective approach by treating all tasks as frame sequences with varying noise scales during training, allowing any frame to act as a conditioning image at inference time. Our unified training framework removes the need for specialized architectures, supports scalable multi-task training, and adapts smoothly to any resolution, enhancing both generalization and scalability. Experimental results demonstrate competitive performance across tasks in both generation and prediction such as text-to-image, multiview generation, ID preservation, depth estimation and camera pose estimation despite relatively small training dataset. Our code and checkpoint are freely available at this https URL

Title: Cluster-based human-in-the-loop strategy for improving machine learning-based circulating tumor cell detection in liquid biopsy

Authors: Hümeyra Husseini-Wüsthoff, Sabine Riethdorf, Andreas Schneeweiss, Andreas Trumpp, Klaus Pantel, Harriet Wikman, Maximilian Nielsen, René Werner
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16332
Pdf URL: https://arxiv.org/pdf/2411.16332
Copy Paste: [[2411.16332]] Cluster-based human-in-the-loop strategy for improving machine learning-based circulating tumor cell detection in liquid biopsy(https://arxiv.org/abs/2411.16332)
Keywords: self-supervised
Abstract: Detection and differentiation of circulating tumor cells (CTCs) and non-CTCs in blood draws of cancer patients pose multiple challenges. While the gold standard relies on tedious manual evaluation of an automatically generated selection of images, machine learning (ML) techniques offer the potential to automate these processes. However, human assessment remains indispensable when the ML system arrives at uncertain or wrong decisions due to an insufficient set of labeled training data. This study introduces a human-in-the-loop (HiL) strategy for improving ML-based CTC detection. We combine self-supervised deep learning and a conventional ML-based classifier and propose iterative targeted sampling and labeling of new unlabeled training samples by human experts. The sampling strategy is based on the classification performance of local latent space clusters. The advantages of the proposed approach compared to naive random sampling are demonstrated for liquid biopsy data from patients with metastatic breast cancer.

Title: Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring

Authors: Kathrin Seßler, Maurice Fürstenberg, Babette Bühler, Enkelejda Kasneci
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2411.16337
Pdf URL: https://arxiv.org/pdf/2411.16337
Copy Paste: [[2411.16337]] Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring(https://arxiv.org/abs/2411.16337)
Keywords: generative
Abstract: The manual assessment and grading of student writing is a time-consuming yet critical task for teachers. Recent developments in generative AI, such as large language models, offer potential solutions to facilitate essay-scoring tasks for teachers. In our study, we evaluate the performance and reliability of both open-source and closed-source LLMs in assessing German student essays, comparing their evaluations to those of 37 teachers across 10 pre-defined criteria (i.e., plot logic, expression). A corpus of 20 real-world essays from Year 7 and 8 students was analyzed using five LLMs: GPT-3.5, GPT-4, o1, LLaMA 3-70B, and Mixtral 8x7B, aiming to provide in-depth insights into LLMs' scoring capabilities. Closed-source GPT models outperform open-source models in both internal consistency and alignment with human ratings, particularly excelling in language-related criteria. The novel o1 model outperforms all other LLMs, achieving Spearman's $r = .74$ with human assessments in the overall score, and an internal consistency of $ICC=.80$. These findings indicate that LLM-based assessment can be a useful tool to reduce teacher workload by supporting the evaluation of essays, especially with regard to language-related criteria. However, due to their tendency for higher scores, the models require further refinement to better capture aspects of content quality.

Title: Towards Foundation Models for Critical Care Time Series

Authors: Manuel Burger, Fedor Sergeev, Malte Londschien, Daphné Chopard, Hugo Yèche, Eike Gerdes, Polina Leshetkina, Alexander Morgenroth, Zeynep Babür, Jasmina Bogojeska, Martin Faltys, Rita Kuznetsova, Gunnar Rätsch
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2411.16346
Pdf URL: https://arxiv.org/pdf/2411.16346
Copy Paste: [[2411.16346]] Towards Foundation Models for Critical Care Time Series(https://arxiv.org/abs/2411.16346)
Keywords: foundation model
Abstract: Notable progress has been made in generalist medical large language models across various healthcare areas. However, large-scale modeling of in-hospital time series data - such as vital signs, lab results, and treatments in critical care - remains underexplored. Existing datasets are relatively small, but combining them can enhance patient diversity and improve model robustness. To effectively utilize these combined datasets for large-scale modeling, it is essential to address the distribution shifts caused by varying treatment policies, necessitating the harmonization of treatment variables across the different datasets. This work aims to establish a foundation for training large-scale multi-variate time series models on critical care data and to provide a benchmark for machine learning models in transfer learning across hospitals to study and address distribution shift challenges. We introduce a harmonized dataset for sequence modeling and transfer learning research, representing the first large-scale collection to include core treatment variables. Future plans involve expanding this dataset to support further advancements in transfer learning and the development of scalable, generalizable models for critical healthcare applications.

Title: Multi-modal Retrieval Augmented Multi-modal Generation: A Benchmark, Evaluate Metrics and Strong Baselines

Authors: Zi-Ao Ma, Tian Lan, Rong-Cheng Tu, Yong Hu, Heyan Huang, Xian-Ling Mao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2411.16365
Pdf URL: https://arxiv.org/pdf/2411.16365
Copy Paste: [[2411.16365]] Multi-modal Retrieval Augmented Multi-modal Generation: A Benchmark, Evaluate Metrics and Strong Baselines(https://arxiv.org/abs/2411.16365)
Keywords: foundation model
Abstract: This paper investigates an intriguing task of Multi-modal Retrieval Augmented Multi-modal Generation (M$^2$RAG). This task requires foundation models to browse multi-modal web pages, with mixed text and images, and generate multi-modal responses for solving user queries, which exhibits better information density and readability. Given the early researching stage of M$^2$RAG task, there is a lack of systematic studies and analysis. To fill this gap, we construct a benchmark for M$^2$RAG task, equipped with a suite of text-modal metrics and multi-modal metrics to analyze the capabilities of existing foundation models. Besides, we also propose several effective methods for foundation models to accomplish this task, based on the comprehensive evaluation results on our benchmark. Extensive experimental results reveal several intriguing phenomena worth further research.

Title: Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing

Authors: Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao, Long Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16375
Pdf URL: https://arxiv.org/pdf/2411.16375
Copy Paste: [[2411.16375]] Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing(https://arxiv.org/abs/2411.16375)
Keywords: diffusion
Abstract: With the advance of diffusion models, today's video generation has achieved impressive quality. To extend the generation length and facilitate real-world applications, a majority of video diffusion models (VDMs) generate videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frame(s) of the previous clip. However, existing autoregressive VDMs are highly inefficient and redundant: The model must re-compute all the conditional frames that are overlapped between adjacent clips. This issue is exacerbated when the conditional frames are extended autoregressively to provide the model with long-term context. In such cases, the computational demands increase significantly (i.e., with a quadratic complexity w.r.t. the autoregression step). In this paper, we propose Ca2-VDM, an efficient autoregressive VDM with Causal generation and Cache sharing. For causal generation, it introduces unidirectional feature computation, which ensures that the cache of conditional frames can be precomputed in previous autoregression steps and reused in every subsequent step, eliminating redundant computations. For cache sharing, it shares the cache across all denoising steps to avoid the huge cache storage cost. Extensive experiments demonstrated that our Ca2-VDM achieves state-of-the-art quantitative and qualitative video generation results and significantly improves the generation speed. Code is available at this https URL

Title: Human-Calibrated Automated Testing and Validation of Generative Language Models

Authors: Agus Sudjianto, Aijun Zhang, Srinivas Neppalli, Tarun Joshi, Michal Malohlava
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16391
Pdf URL: https://arxiv.org/pdf/2411.16391
Copy Paste: [[2411.16391]] Human-Calibrated Automated Testing and Validation of Generative Language Models(https://arxiv.org/abs/2411.16391)
Keywords: generative
Abstract: This paper introduces a comprehensive framework for the evaluation and validation of generative language models (GLMs), with a focus on Retrieval-Augmented Generation (RAG) systems deployed in high-stakes domains such as banking. GLM evaluation is challenging due to open-ended outputs and subjective quality assessments. Leveraging the structured nature of RAG systems, where generated responses are grounded in a predefined document collection, we propose the Human-Calibrated Automated Testing (HCAT) framework. HCAT integrates a) automated test generation using stratified sampling, b) embedding-based metrics for explainable assessment of functionality, risk and safety attributes, and c) a two-stage calibration approach that aligns machine-generated evaluations with human judgments through probability calibration and conformal prediction. In addition, the framework includes robustness testing to evaluate model performance against adversarial, out-of-distribution, and varied input conditions, as well as targeted weakness identification using marginal and bivariate analysis to pinpoint specific areas for improvement. This human-calibrated, multi-layered evaluation framework offers a scalable, transparent, and interpretable approach to GLM assessment, providing a practical and reliable solution for deploying GLMs in applications where accuracy, transparency, and regulatory compliance are paramount.

Title: Synthesising Handwritten Music with GANs: A Comprehensive Evaluation of CycleWGAN, ProGAN, and DCGAN

Authors: Elona Shatri, Kalikidhar Palavala, George Fazekas
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2411.16405
Pdf URL: https://arxiv.org/pdf/2411.16405
Copy Paste: [[2411.16405]] Synthesising Handwritten Music with GANs: A Comprehensive Evaluation of CycleWGAN, ProGAN, and DCGAN(https://arxiv.org/abs/2411.16405)
Keywords: generative
Abstract: The generation of handwritten music sheets is a crucial step toward enhancing Optical Music Recognition (OMR) systems, which rely on large and diverse datasets for optimal performance. However, handwritten music sheets, often found in archives, present challenges for digitisation due to their fragility, varied handwriting styles, and image quality. This paper addresses the data scarcity problem by applying Generative Adversarial Networks (GANs) to synthesise realistic handwritten music sheets. We provide a comprehensive evaluation of three GAN models - DCGAN, ProGAN, and CycleWGAN - comparing their ability to generate diverse and high-quality handwritten music images. The proposed CycleWGAN model, which enhances style transfer and training stability, significantly outperforms DCGAN and ProGAN in both qualitative and quantitative evaluations. CycleWGAN achieves superior performance, with an FID score of 41.87, an IS of 2.29, and a KID of 0.05, making it a promising solution for improving OMR systems.

Title: Machine Learning for the Digital Typhoon Dataset: Extensions to Multiple Basins and New Developments in Representations and Tasks

Authors: Asanobu Kitamoto, Erwan Dzik, Gaspar Faure
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16421
Pdf URL: https://arxiv.org/pdf/2411.16421
Copy Paste: [[2411.16421]] Machine Learning for the Digital Typhoon Dataset: Extensions to Multiple Basins and New Developments in Representations and Tasks(https://arxiv.org/abs/2411.16421)
Keywords: self-supervised
Abstract: This paper presents the Digital Typhoon Dataset V2, a new version of the longest typhoon satellite image dataset for 40+ years aimed at benchmarking machine learning models for long-term spatio-temporal data. The new addition in Dataset V2 is tropical cyclone data from the southern hemisphere, in addition to the northern hemisphere data in Dataset V1. Having data from two hemispheres allows us to ask new research questions about regional differences across basins and hemispheres. We also discuss new developments in representations and tasks of the dataset. We first introduce a self-supervised learning framework for representation learning. Combined with the LSTM model, we discuss performance on intensity forecasting and extra-tropical transition forecasting tasks. We then propose new tasks, such as the typhoon center estimation task. We show that an object detection-based model performs better for stronger typhoons. Finally, we study how machine learning models can generalize across basins and hemispheres, by training the model on the northern hemisphere data and testing it on the southern hemisphere data. The dataset is publicly available at \url{this http URL} and \url{this https URL}.

Title: Unsupervised Event Outlier Detection in Continuous Time

Authors: Somjit Nath, Yik Chau Lui, Siqi Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16427
Pdf URL: https://arxiv.org/pdf/2411.16427
Copy Paste: [[2411.16427]] Unsupervised Event Outlier Detection in Continuous Time(https://arxiv.org/abs/2411.16427)
Keywords: generative, anomaly
Abstract: Event sequence data record the occurrences of events in continuous time. Event sequence forecasting based on temporal point processes (TPPs) has been extensively studied, but outlier or anomaly detection, especially without any supervision from humans, is still underexplored. In this work, we develop, to the best our knowledge, the first unsupervised outlier detection approach to detecting abnormal events. Our novel unsupervised outlier detection framework is based on ideas from generative adversarial networks (GANs) and reinforcement learning (RL). We train a 'generator' that corrects outliers in the data with a 'discriminator' that learns to discriminate the corrected data from the real data, which may contain outliers. A key insight is that if the generator made a mistake in the correction, it would generate anomalies that are different from the anomalies in the real data, so it serves as data augmentation for the discriminator learning. Different from typical GAN-based outlier detection approaches, our method employs the generator to detect outliers in an online manner. The experimental results show that our method can detect event outliers more accurately than the state-of-the-art approaches.

Title: Privacy Protection in Personalized Diffusion Models via Targeted Cross-Attention Adversarial Attack

Authors: Xide Xu, Muhammad Atif Butt, Sandesh Kamath, Bogdan Raducanu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16437
Pdf URL: https://arxiv.org/pdf/2411.16437
Copy Paste: [[2411.16437]] Privacy Protection in Personalized Diffusion Models via Targeted Cross-Attention Adversarial Attack(https://arxiv.org/abs/2411.16437)
Keywords: diffusion
Abstract: The growing demand for customized visual content has led to the rise of personalized text-to-image (T2I) diffusion models. Despite their remarkable potential, they pose significant privacy risk when misused for malicious purposes. In this paper, we propose a novel and efficient adversarial attack method, Concept Protection by Selective Attention Manipulation (CoPSAM) which targets only the cross-attention layers of a T2I diffusion model. For this purpose, we carefully construct an imperceptible noise to be added to clean samples to get their adversarial counterparts. This is obtained during the fine-tuning process by maximizing the discrepancy between the corresponding cross-attention maps of the user-specific token and the class-specific token, respectively. Experimental validation on a subset of CelebA-HQ face images dataset demonstrates that our approach outperforms existing methods. Besides this, our method presents two important advantages derived from the qualitative evaluation: (i) we obtain better protection results for lower noise levels than our competitors; and (ii) we protect the content from unauthorized use thereby protecting the individual's identity from potential misuse.

Title: Multi-Resolution Generative Modeling of Human Motion from Limited Data

Authors: David Eduardo Moreno-Villamarín, Anna Hilsmann, Peter Eisert
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16498
Pdf URL: https://arxiv.org/pdf/2411.16498
Copy Paste: [[2411.16498]] Multi-Resolution Generative Modeling of Human Motion from Limited Data(https://arxiv.org/abs/2411.16498)
Keywords: generative
Abstract: We present a generative model that learns to synthesize human motion from limited training sequences. Our framework provides conditional generation and blending across multiple temporal resolutions. The model adeptly captures human motion patterns by integrating skeletal convolution layers and a multi-scale architecture. Our model contains a set of generative and adversarial networks, along with embedding modules, each tailored for generating motions at specific frame rates while exerting control over their content and details. Notably, our approach also extends to the synthesis of co-speech gestures, demonstrating its ability to generate synchronized gestures from speech inputs, even with limited paired data. Through direct synthesis of SMPL pose parameters, our approach avoids test-time adjustments to fit human body meshes. Experimental results showcase our model's ability to achieve extensive coverage of training examples, while generating diverse motions, as indicated by local and global diversity metrics.

Title: Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis

Authors: Boming Miao, Chunxiao Li, Xiaoxiao Wang, Andi Zhang, Rui Sun, Zizhe Wang, Yao Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16503
Pdf URL: https://arxiv.org/pdf/2411.16503
Copy Paste: [[2411.16503]] Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis(https://arxiv.org/abs/2411.16503)
Keywords: diffusion
Abstract: Diffusion models have achieved impressive success in generating photorealistic images, but challenges remain in ensuring precise semantic alignment with input prompts. Optimizing the initial noisy latent offers a more efficient alternative to modifying model architectures or prompt engineering for improving semantic alignment. A latest approach, InitNo, refines the initial noisy latent by leveraging attention maps; however, these maps capture only limited information, and the effectiveness of InitNo is highly dependent on the initial starting point, as it tends to converge on a local optimum near this point. To this end, this paper proposes leveraging the language comprehension capabilities of large vision-language models (LVLMs) to guide the optimization of the initial noisy latent, and introduces the Noise Diffusion process, which updates the noisy latent to generate semantically faithful images while preserving distribution consistency. Furthermore, we provide a theoretical analysis of the condition under which the update improves semantic faithfulness. Experimental results demonstrate the effectiveness and adaptability of our framework, consistently enhancing semantic alignment across various diffusion models. The code is available at this https URL.

Title: LaB-RAG: Label Boosted Retrieval Augmented Generation for Radiology Report Generation

Authors: Steven Song, Anirudh Subramanyam, Irene Madejski, Robert L. Grossman
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2411.16523
Pdf URL: https://arxiv.org/pdf/2411.16523
Copy Paste: [[2411.16523]] LaB-RAG: Label Boosted Retrieval Augmented Generation for Radiology Report Generation(https://arxiv.org/abs/2411.16523)
Keywords: generative
Abstract: In the current paradigm of image captioning, deep learning models are trained to generate text from image embeddings of latent features. We challenge the assumption that these latent features ought to be high-dimensional vectors which require model fine tuning to handle. Here we propose Label Boosted Retrieval Augmented Generation (LaB-RAG), a text-based approach to image captioning that leverages image descriptors in the form of categorical labels to boost standard retrieval augmented generation (RAG) with pretrained large language models (LLMs). We study our method in the context of radiology report generation (RRG), where the task is to generate a clinician's report detailing their observations from a set of radiological images, such as X-rays. We argue that simple linear classifiers over extracted image embeddings can effectively transform X-rays into text-space as radiology-specific labels. In combination with standard RAG, we show that these derived text labels can be used with general-domain LLMs to generate radiology reports. Without ever training our generative language model or image feature encoder models, and without ever directly "showing" the LLM an X-ray, we demonstrate that LaB-RAG achieves better results across natural language and radiology language metrics compared with other retrieval-based RRG methods, while attaining competitive results compared to other fine-tuned vision-language RRG models. We further present results of our experiments with various components of LaB-RAG to better understand our method. Finally, we critique the use of a popular RRG metric, arguing it is possible to artificially inflate its results without true data-leakage.

Title: Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency

Authors: Jerry Yao-Chieh Hu, Wei-Po Wang, Ammar Gilani, Chenyang Li, Zhao Song, Han Liu
Subjects: cs.LG, cs.AI, cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2411.16525
Pdf URL: https://arxiv.org/pdf/2411.16525
Copy Paste: [[2411.16525]] Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency(https://arxiv.org/abs/2411.16525)
Keywords: foundation model
Abstract: We investigate the statistical and computational limits of prompt tuning for transformer-based foundation models. Our key contributions are prompt tuning on \textit{single-head} transformers with only a \textit{single} self-attention layer: (i) is universal, and (ii) supports efficient (even almost-linear time) algorithms under the Strong Exponential Time Hypothesis (SETH). Statistically, we prove that prompt tuning on such simplest possible transformers are universal approximators for sequence-to-sequence Lipschitz functions. In addition, we provide an exponential-in-$dL$ and -in-$(1/\epsilon)$ lower bound on the required soft-prompt tokens for prompt tuning to memorize any dataset with 1-layer, 1-head transformers. Computationally, we identify a phase transition in the efficiency of prompt tuning, determined by the norm of the \textit{soft-prompt-induced} keys and queries, and provide an upper bound criterion. Beyond this criterion, no sub-quadratic (efficient) algorithm for prompt tuning exists under SETH. Within this criterion, we showcase our theory by proving the existence of almost-linear time prompt tuning inference algorithms. These fundamental limits provide important necessary conditions for designing expressive and efficient prompt tuning methods for practitioners.

Title: Continual Deep Reinforcement Learning with Task-Agnostic Policy Distillation

Authors: Muhammad Burhan Hafez, Kerim Erekmen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.16532
Pdf URL: https://arxiv.org/pdf/2411.16532
Copy Paste: [[2411.16532]] Continual Deep Reinforcement Learning with Task-Agnostic Policy Distillation(https://arxiv.org/abs/2411.16532)
Keywords: self-supervised
Abstract: Central to the development of universal learning systems is the ability to solve multiple tasks without retraining from scratch when new data arrives. This is crucial because each task requires significant training time. Addressing the problem of continual learning necessitates various methods due to the complexity of the problem space. This problem space includes: (1) addressing catastrophic forgetting to retain previously learned tasks, (2) demonstrating positive forward transfer for faster learning, (3) ensuring scalability across numerous tasks, and (4) facilitating learning without requiring task labels, even in the absence of clear task boundaries. In this paper, the Task-Agnostic Policy Distillation (TAPD) framework is introduced. This framework alleviates problems (1)-(4) by incorporating a task-agnostic phase, where an agent explores its environment without any external goal and maximizes only its intrinsic motivation. The knowledge gained during this phase is later distilled for further exploration. Therefore, the agent acts in a self-supervised manner by systematically seeking novel states. By utilizing task-agnostic distilled knowledge, the agent can solve downstream tasks more efficiently, leading to improved sample efficiency. Our code is available at the repository: this https URL.

Title: Transformers are Deep Optimizers: Provable In-Context Learning for Deep Model Training

Authors: Weimin Wu, Maojiang Su, Jerry Yao-Chieh Hu, Zhao Song, Han Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.16549
Pdf URL: https://arxiv.org/pdf/2411.16549
Copy Paste: [[2411.16549]] Transformers are Deep Optimizers: Provable In-Context Learning for Deep Model Training(https://arxiv.org/abs/2411.16549)
Keywords: in-context
Abstract: We investigate the transformer's capability for in-context learning (ICL) to simulate the training process of deep models. Our key contribution is providing a positive example of using a transformer to train a deep neural network by gradient descent in an implicit fashion via ICL. Specifically, we provide an explicit construction of a $(2N+4)L$-layer transformer capable of simulating $L$ gradient descent steps of an $N$-layer ReLU network through ICL. We also give the theoretical guarantees for the approximation within any given error and the convergence of the ICL gradient descent. Additionally, we extend our analysis to the more practical setting using Softmax-based transformers. We validate our findings on synthetic datasets for 3-layer, 4-layer, and 6-layer neural networks. The results show that ICL performance matches that of direct training.

Title: Representation Collapsing Problems in Vector Quantization

Authors: Wenhao Zhao, Qiran Zou, Rushi Shah, Dianbo Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2411.16550
Pdf URL: https://arxiv.org/pdf/2411.16550
Copy Paste: [[2411.16550]] Representation Collapsing Problems in Vector Quantization(https://arxiv.org/abs/2411.16550)
Keywords: diffusion, generative
Abstract: Vector quantization is a technique in machine learning that discretizes continuous representations into a set of discrete vectors. It is widely employed in tokenizing data representations for large language models, diffusion models, and other generative models. Despite its prevalence, the characteristics and behaviors of vector quantization in generative models remain largely underexplored. In this study, we investigate representation collapse in vector quantization - a critical degradation where codebook tokens or latent embeddings lose their discriminative power by converging to a limited subset of values. This collapse fundamentally compromises the model's ability to capture diverse data patterns. By leveraging both synthetic and real datasets, we identify the severity of each type of collapses and triggering conditions. Our analysis reveals that restricted initialization and limited encoder capacity result in tokens collapse and embeddings collapse. Building on these findings, we propose potential solutions aimed at mitigating each collapse. To the best of our knowledge, this is the first comprehensive study examining representation collapsing problems in vector quantization.

Title: Enhancing Few-Shot Learning with Integrated Data and GAN Model Approaches

Authors: Yinqiu Feng, Aoran Shen, Jiacheng Hu, Yingbin Liang, Shiru Wang, Junliang Du
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2411.16567
Pdf URL: https://arxiv.org/pdf/2411.16567
Copy Paste: [[2411.16567]] Enhancing Few-Shot Learning with Integrated Data and GAN Model Approaches(https://arxiv.org/abs/2411.16567)
Keywords: generative
Abstract: This paper presents an innovative approach to enhancing few-shot learning by integrating data augmentation with model fine-tuning in a framework designed to tackle the challenges posed by small-sample data. Recognizing the critical limitations of traditional machine learning models that require large datasets-especially in fields such as drug discovery, target recognition, and malicious traffic detection-this study proposes a novel strategy that leverages Generative Adversarial Networks (GANs) and advanced optimization techniques to improve model performance with limited data. Specifically, the paper addresses the noise and bias issues introduced by data augmentation methods, contrasting them with model-based approaches, such as fine-tuning and metric learning, which rely heavily on related datasets. By combining Markov Chain Monte Carlo (MCMC) sampling and discriminative model ensemble strategies within a GAN framework, the proposed model adjusts generative and discriminative distributions to simulate a broader range of relevant data. Furthermore, it employs MHLoss and a reparameterized GAN ensemble to enhance stability and accelerate convergence, ultimately leading to improved classification performance on small-sample images and structured datasets. Results confirm that the MhERGAN algorithm developed in this research is highly effective for few-shot learning, offering a practical solution that bridges data scarcity with high-performing model adaptability and generalization.

Title: Rethinking Diffusion for Text-Driven Human Motion Generation

Authors: Zichong Meng, Yiming Xie, Xiaogang Peng, Zeyu Han, Huaizu Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16575
Pdf URL: https://arxiv.org/pdf/2411.16575
Copy Paste: [[2411.16575]] Rethinking Diffusion for Text-Driven Human Motion Generation(https://arxiv.org/abs/2411.16575)
Keywords: diffusion
Abstract: Since 2023, Vector Quantization (VQ)-based discrete generation methods have rapidly dominated human motion generation, primarily surpassing diffusion-based continuous generation methods in standard performance metrics. However, VQ-based methods have inherent limitations. Representing continuous motion data as limited discrete tokens leads to inevitable information loss, reduces the diversity of generated motions, and restricts their ability to function effectively as motion priors or generation guidance. In contrast, the continuous space generation nature of diffusion-based methods makes them well-suited to address these limitations and with even potential for model scalability. In this work, we systematically investigate why current VQ-based methods perform well and explore the limitations of existing diffusion-based methods from the perspective of motion data representation and distribution. Drawing on these insights, we preserve the inherent strengths of a diffusion-based human motion generation model and gradually optimize it with inspiration from VQ-based approaches. Our approach introduces a human motion diffusion model enabled to perform bidirectional masked autoregression, optimized with a reformed data representation and distribution. Additionally, we also propose more robust evaluation methods to fairly assess different-based methods. Extensive experiments on benchmark human motion generation datasets demonstrate that our method excels previous methods and achieves state-of-the-art performances.

Title: Unlocking The Potential of Adaptive Attacks on Diffusion-Based Purification

Authors: Andre Kassis, Urs Hengartner, Yaoliang Yu
Subjects: cs.CR, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2411.16598
Pdf URL: https://arxiv.org/pdf/2411.16598
Copy Paste: [[2411.16598]] Unlocking The Potential of Adaptive Attacks on Diffusion-Based Purification(https://arxiv.org/abs/2411.16598)
Keywords: diffusion
Abstract: Diffusion-based purification (DBP) is a defense against adversarial examples (AEs), amassing popularity for its ability to protect classifiers in an attack-oblivious manner and resistance to strong adversaries with access to the defense. Its robustness has been claimed to ensue from the reliance on diffusion models (DMs) that project the AEs onto the natural distribution. We revisit this claim, focusing on gradient-based strategies that back-propagate the loss gradients through the defense, commonly referred to as ``adaptive attacks". Analytically, we show that such an optimization method invalidates DBP's core foundations, effectively targeting the DM rather than the classifier and restricting the purified outputs to a distribution over malicious samples instead. Thus, we reassess the reported empirical robustness, uncovering implementation flaws in the gradient back-propagation techniques used thus far for DBP. We fix these issues, providing the first reliable gradient library for DBP and demonstrating how adaptive attacks drastically degrade its robustness. We then study a less efficient yet stricter majority-vote setting where the classifier evaluates multiple purified copies of the input to make its decision. Here, DBP's stochasticity enables it to remain partially robust against traditional norm-bounded AEs. We propose a novel adaptation of a recent optimization method against deepfake watermarking that crafts systemic malicious perturbations while ensuring imperceptibility. When integrated with the adaptive attack, it completely defeats DBP, even in the majority-vote setup. Our findings prove that DBP, in its current state, is not a viable defense against AEs.

Title: Chat2SVG: Vector Graphics Generation with Large Language Models and Image Diffusion Models

Authors: Ronghuan Wu, Wanchao Su, Jing Liao
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2411.16602
Pdf URL: https://arxiv.org/pdf/2411.16602
Copy Paste: [[2411.16602]] Chat2SVG: Vector Graphics Generation with Large Language Models and Image Diffusion Models(https://arxiv.org/abs/2411.16602)
Keywords: diffusion
Abstract: Scalable Vector Graphics (SVG) has become the de facto standard for vector graphics in digital design, offering resolution independence and precise control over individual elements. Despite their advantages, creating high-quality SVG content remains challenging, as it demands technical expertise with professional editing software and a considerable time investment to craft complex shapes. Recent text-to-SVG generation methods aim to make vector graphics creation more accessible, but they still encounter limitations in shape regularity, generalization ability, and expressiveness. To address these challenges, we introduce Chat2SVG, a hybrid framework that combines the strengths of Large Language Models (LLMs) and image diffusion models for text-to-SVG generation. Our approach first uses an LLM to generate semantically meaningful SVG templates from basic geometric primitives. Guided by image diffusion models, a dual-stage optimization pipeline refines paths in latent space and adjusts point coordinates to enhance geometric complexity. Extensive experiments show that Chat2SVG outperforms existing methods in visual fidelity, path regularity, and semantic alignment. Additionally, our system enables intuitive editing through natural language instructions, making professional vector graphics creation accessible to all users.

Title: Exploring Discrete Flow Matching for 3D De Novo Molecule Generation

Authors: Ian Dunn, David R. Koes
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2411.16644
Pdf URL: https://arxiv.org/pdf/2411.16644
Copy Paste: [[2411.16644]] Exploring Discrete Flow Matching for 3D De Novo Molecule Generation(https://arxiv.org/abs/2411.16644)
Keywords: generative
Abstract: Deep generative models that produce novel molecular structures have the potential to facilitate chemical discovery. Flow matching is a recently proposed generative modeling framework that has achieved impressive performance on a variety of tasks including those on biomolecular structures. The seminal flow matching framework was developed only for continuous data. However, de novo molecular design tasks require generating discrete data such as atomic elements or sequences of amino acid residues. Several discrete flow matching methods have been proposed recently to address this gap. In this work we benchmark the performance of existing discrete flow matching methods for 3D de novo small molecule generation and provide explanations of their differing behavior. As a result we present FlowMol-CTMC, an open-source model that achieves state of the art performance for 3D de novo design with fewer learnable parameters than existing methods. Additionally, we propose the use of metrics that capture molecule quality beyond local chemical valency constraints and towards higher-order structural motifs. These metrics show that even though basic constraints are satisfied, the models tend to produce unusual and potentially problematic functional groups outside of the training data distribution. Code and trained models for reproducing this work are available at \url{this https URL}.

Title: Diffusion Features for Zero-Shot 6DoF Object Pose Estimation

Authors: Bernd Von Gimborn, Philipp Ausserlechner, Markus Vincze, Stefan Thalhammer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16668
Pdf URL: https://arxiv.org/pdf/2411.16668
Copy Paste: [[2411.16668]] Diffusion Features for Zero-Shot 6DoF Object Pose Estimation(https://arxiv.org/abs/2411.16668)
Keywords: diffusion, self-supervised, foundation model
Abstract: Zero-shot object pose estimation enables the retrieval of object poses from images without necessitating object-specific training. In recent approaches this is facilitated by vision foundation models (VFM), which are pre-trained models that are effectively general-purpose feature extractors. The characteristics exhibited by these VFMs vary depending on the training data, network architecture, and training paradigm. The prevailing choice in this field are self-supervised Vision Transformers (ViT). This study assesses the influence of Latent Diffusion Model (LDM) backbones on zero-shot pose estimation. In order to facilitate a comparison between the two families of models on a common ground we adopt and modify a recent approach. Therefore, a template-based multi-staged method for estimating poses in a zero-shot fashion using LDMs is presented. The efficacy of the proposed approach is empirically evaluated on three standard datasets for object-specific 6DoF pose estimation. The experiments demonstrate an Average Recall improvement of up to 27% over the ViT baseline. The source code is available at: this https URL.

Title: Generative Omnimatte: Learning to Decompose Video into Layers

Authors: Yao-Chih Lee, Erika Lu, Sarah Rumbley, Michal Geyer, Jia-Bin Huang, Tali Dekel, Forrester Cole
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2411.16683
Pdf URL: https://arxiv.org/pdf/2411.16683
Copy Paste: [[2411.16683]] Generative Omnimatte: Learning to Decompose Video into Layers(https://arxiv.org/abs/2411.16683)
Keywords: diffusion, generative
Abstract: Given a video and a set of input object masks, an omnimatte method aims to decompose the video into semantically meaningful layers containing individual objects along with their associated effects, such as shadows and reflections. Existing omnimatte methods assume a static background or accurate pose and depth estimation and produce poor decompositions when these assumptions are violated. Furthermore, due to the lack of generative prior on natural videos, existing methods cannot complete dynamic occluded regions. We present a novel generative layered video decomposition framework to address the omnimatte problem. Our method does not assume a stationary scene or require camera pose or depth information and produces clean, complete layers, including convincing completions of occluded dynamic regions. Our core idea is to train a video diffusion model to identify and remove scene effects caused by a specific object. We show that this model can be finetuned from an existing video inpainting model with a small, carefully curated dataset, and demonstrate high-quality decompositions and editing results for a wide range of casually captured videos containing soft shadows, glossy reflections, splashing water, and more.