2024-03-26

Title: Distilling Named Entity Recognition Models for Endangered Species from Large Language Models

Authors: Jesse Atuhurra, Seiveright Cargill Dujohn, Hidetaka Kamigaito, Hiroyuki Shindo, Taro Watanabe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.15430
Pdf URL: https://arxiv.org/pdf/2403.15430
Copy Paste: [[2403.15430]] Distilling Named Entity Recognition Models for Endangered Species from Large Language Models(https://arxiv.org/abs/2403.15430)
Keywords: in-context
Abstract: Natural language processing (NLP) practitioners are leveraging large language models (LLM) to create structured datasets from semi-structured and unstructured data sources such as patents, papers, and theses, without having domain-specific knowledge. At the same time, ecological experts are searching for a variety of means to preserve biodiversity. To contribute to these efforts, we focused on endangered species and through in-context learning, we distilled knowledge from GPT-4. In effect, we created datasets for both named entity recognition (NER) and relation extraction (RE) via a two-stage process: 1) we generated synthetic data from GPT-4 of four classes of endangered species, 2) humans verified the factual accuracy of the synthetic data, resulting in gold data. Eventually, our novel dataset contains a total of 3.6K sentences, evenly divided between 1.8K NER and 1.8K RE sentences. The constructed dataset was then used to fine-tune both general BERT and domain-specific BERT variants, completing the knowledge distillation process from GPT-4 to BERT, because GPT-4 is resource intensive. Experiments show that our knowledge transfer approach is effective at creating a NER model suitable for detecting endangered species from texts.

Title: Loops On Retrieval Augmented Generation (LoRAG)

Authors: Ayush Thakur, Rashmi Vashisth
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2403.15450
Pdf URL: https://arxiv.org/pdf/2403.15450
Copy Paste: [[2403.15450]] Loops On Retrieval Augmented Generation (LoRAG)(https://arxiv.org/abs/2403.15450)
Keywords: generative
Abstract: This paper presents Loops On Retrieval Augmented Generation (LoRAG), a new framework designed to enhance the quality of retrieval-augmented text generation through the incorporation of an iterative loop mechanism. The architecture integrates a generative model, a retrieval mechanism, and a dynamic loop module, allowing for iterative refinement of the generated text through interactions with relevant information retrieved from the input context. Experimental evaluations on benchmark datasets demonstrate that LoRAG surpasses existing state-of-the-art models in terms of BLEU score, ROUGE score, and perplexity, showcasing its effectiveness in achieving both coherence and relevance in generated text. The qualitative assessment further illustrates LoRAG's capability to produce contextually rich and coherent outputs. This research contributes valuable insights into the potential of iterative loops in mitigating challenges in text generation, positioning LoRAG as a promising advancement in the field.

Title: Unveiling the Anomalies in an Ever-Changing World: A Benchmark for Pixel-Level Anomaly Detection in Continual Learning

Authors: Nikola Bugarin, Jovana Bugaric, Manuel Barusco, Davide Dalle Pezze, Gian Antonio Susto
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.15463
Pdf URL: https://arxiv.org/pdf/2403.15463
Copy Paste: [[2403.15463]] Unveiling the Anomalies in an Ever-Changing World: A Benchmark for Pixel-Level Anomaly Detection in Continual Learning(https://arxiv.org/abs/2403.15463)
Keywords: anomaly
Abstract: Anomaly Detection is a relevant problem in numerous real-world applications, especially when dealing with images. However, little attention has been paid to the issue of changes over time in the input data distribution, which may cause a significant decrease in performance. In this study, we investigate the problem of Pixel-Level Anomaly Detection in the Continual Learning setting, where new data arrives over time and the goal is to perform well on new and old data. We implement several state-of-the-art techniques to solve the Anomaly Detection problem in the classic setting and adapt them to work in the Continual Learning setting. To validate the approaches, we use a real-world dataset of images with pixel-based anomalies to provide a reliable benchmark and serve as a foundation for further advancements in the field. We provide a comprehensive analysis, discussing which Anomaly Detection methods and which families of approaches seem more suitable for the Continual Learning setting.

Title: Learning to Infer Generative Template Programs for Visual Concepts

Authors: R. Kenny Jones, Siddhartha Chaudhuri, Daniel Ritchie
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2403.15476
Pdf URL: https://arxiv.org/pdf/2403.15476
Copy Paste: [[2403.15476]] Learning to Infer Generative Template Programs for Visual Concepts(https://arxiv.org/abs/2403.15476)
Keywords: generative
Abstract: People grasp flexible visual concepts from a few examples. We explore a neurosymbolic system that learns how to infer programs that capture visual concepts in a domain-general fashion. We introduce Template Programs: programmatic expressions from a domain-specific language that specify structural and parametric patterns common to an input concept. Our framework supports multiple concept-related tasks, including few-shot generation and co-segmentation through parsing. We develop a learning paradigm that allows us to train networks that infer Template Programs directly from visual datasets that contain concept groupings. We run experiments across multiple visual domains: 2D layouts, Omniglot characters, and 3D shapes. We find that our method outperforms task-specific alternatives, and performs competitively against domain-specific approaches for the limited domains where they exist.

Title: Integrating Supervised Extractive and Generative Language Models for Suicide Risk Evidence Summarization

Authors: Rika Tanaka, Yusuke Fukazawa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.15478
Pdf URL: https://arxiv.org/pdf/2403.15478
Copy Paste: [[2403.15478]] Integrating Supervised Extractive and Generative Language Models for Suicide Risk Evidence Summarization(https://arxiv.org/abs/2403.15478)
Keywords: generative
Abstract: We propose a method that integrates supervised extractive and generative language models for providing supporting evidence of suicide risk in the CLPsych 2024 shared task. Our approach comprises three steps. Initially, we construct a BERT-based model for estimating sentence-level suicide risk and negative sentiment. Next, we precisely identify high suicide risk sentences by emphasizing elevated probabilities of both suicide risk and negative sentiment. Finally, we integrate generative summaries using the MentaLLaMa framework and extractive summaries from identified high suicide risk sentences and a specialized dictionary of suicidal risk words. SophiaADS, our team, achieved 1st place for highlight extraction and ranked 10th for summary generation, both based on recall and consistency metrics, respectively.

Title: RakutenAI-7B: Extending Large Language Models for Japanese

Authors: Rakuten Group Inc., Aaron Levine, Connie Huang, Chenguang Wang, Eduardo Batista, Ewa Szymanska, Hongyi Ding, Hou Wei Chou, Jean-François Pessiot, Johanes Effendi, Justin Chiu, Kai Torben Ohlhus, Karan Chopra, Keiji Shinzato, Koji Murakami, Lee Xiong, Lei Chen, Maki Kubota, Maksim Tkachenko, Miroku Lee, Naoki Takahashi, Prathyusha Jwalapuram, Ryutaro Tatsushima, Saurabh Jain, Sunil Kumar Yadav, Ting Cai, Wei-Te Chen, Yandi Xia, Yuki Nakayama, Yutaka Higashiyama
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.15484
Pdf URL: https://arxiv.org/pdf/2403.15484
Copy Paste: [[2403.15484]] RakutenAI-7B: Extending Large Language Models for Japanese(https://arxiv.org/abs/2403.15484)
Keywords: foundation model
Abstract: We introduce RakutenAI-7B, a suite of Japanese-oriented large language models that achieve the best performance on the Japanese LM Harness benchmarks among the open 7B models. Along with the foundation model, we release instruction- and chat-tuned models, RakutenAI-7B-instruct and RakutenAI-7B-chat respectively, under the Apache 2.0 license.

Title: Sequence-to-Sequence Language Models for Character and Emotion Detection in Dream Narratives

Authors: Gustave Cortal (ENS Paris Saclay, LISN)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.15486
Pdf URL: https://arxiv.org/pdf/2403.15486
Copy Paste: [[2403.15486]] Sequence-to-Sequence Language Models for Character and Emotion Detection in Dream Narratives(https://arxiv.org/abs/2403.15486)
Keywords: in-context
Abstract: The study of dreams has been central to understanding human (un)consciousness, cognition, and culture for centuries. Analyzing dreams quantitatively depends on labor-intensive, manual annotation of dream narratives. We automate this process through a natural language sequence-to-sequence generation framework. This paper presents the first study on character and emotion detection in the English portion of the open DreamBank corpus of dream narratives. Our results show that language models can effectively address this complex task. To get insight into prediction performance, we evaluate the impact of model size, prediction order of characters, and the consideration of proper names and character traits. We compare our approach with a large language model using in-context learning. Our supervised models perform better while having 28 times fewer parameters. Our model and its generated annotations are made publicly available.

Title: GTC: GNN-Transformer Co-contrastive Learning for Self-supervised Heterogeneous Graph Representation

Authors: Yundong Sun, Dongjie Zhu, Yansong Wang, Zhaoshuo Tian
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2403.15520
Pdf URL: https://arxiv.org/pdf/2403.15520
Copy Paste: [[2403.15520]] GTC: GNN-Transformer Co-contrastive Learning for Self-supervised Heterogeneous Graph Representation(https://arxiv.org/abs/2403.15520)
Keywords: self-supervised
Abstract: Graph Neural Networks (GNNs) have emerged as the most powerful weapon for various graph tasks due to the message-passing mechanism's great local information aggregation ability. However, over-smoothing has always hindered GNNs from going deeper and capturing multi-hop neighbors. Unlike GNNs, Transformers can model global information and multi-hop interactions via multi-head self-attention and a proper Transformer structure can show more immunity to the over-smoothing problem. So, can we propose a novel framework to combine GNN and Transformer, integrating both GNN's local information aggregation and Transformer's global information modeling ability to eliminate the over-smoothing problem? To realize this, this paper proposes a collaborative learning scheme for GNN-Transformer and constructs GTC architecture. GTC leverages the GNN and Transformer branch to encode node information from different views respectively, and establishes contrastive learning tasks based on the encoded cross-view information to realize self-supervised heterogeneous graph representation. For the Transformer branch, we propose Metapath-aware Hop2Token and CG-Hetphormer, which can cooperate with GNN to attentively encode neighborhood information from different levels. As far as we know, this is the first attempt in the field of graph representation learning to utilize both GNN and Transformer to collaboratively capture different view information and conduct cross-view contrastive learning. The experiments on real datasets show that GTC exhibits superior performance compared with state-of-the-art methods. Codes can be available at https://github.com/PHD-lanyu/GTC.

Title: An Optimization Framework to Enforce Multi-View Consistency for Texturing 3D Meshes Using Pre-Trained Text-to-Image Models

Authors: Zhengyi Zhao, Chen Song, Xiaodong Gu, Yuan Dong, Qi Zuo, Weihao Yuan, Zilong Dong, Liefeng Bo, Qixing Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.15559
Pdf URL: https://arxiv.org/pdf/2403.15559
Copy Paste: [[2403.15559]] An Optimization Framework to Enforce Multi-View Consistency for Texturing 3D Meshes Using Pre-Trained Text-to-Image Models(https://arxiv.org/abs/2403.15559)
Keywords: diffusion
Abstract: A fundamental problem in the texturing of 3D meshes using pre-trained text-to-image models is to ensure multi-view consistency. State-of-the-art approaches typically use diffusion models to aggregate multi-view inputs, where common issues are the blurriness caused by the averaging operation in the aggregation step or inconsistencies in local features. This paper introduces an optimization framework that proceeds in four stages to achieve multi-view consistency. Specifically, the first stage generates an over-complete set of 2D textures from a predefined set of viewpoints using an MV-consistent diffusion process. The second stage selects a subset of views that are mutually consistent while covering the underlying 3D model. We show how to achieve this goal by solving semi-definite programs. The third stage performs non-rigid alignment to align the selected views across overlapping regions. The fourth stage solves an MRF problem to associate each mesh face with a selected view. In particular, the third and fourth stages are iterated, with the cuts obtained in the fourth stage encouraging non-rigid alignment in the third stage to focus on regions close to the cuts. Experimental results show that our approach significantly outperforms baseline approaches both qualitatively and quantitatively.

Title: MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis

Authors: Mai A. Shaaban, Adnan Khan, Mohammad Yaqub
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.15585
Pdf URL: https://arxiv.org/pdf/2403.15585
Copy Paste: [[2403.15585]] MedPromptX: Grounded Multimodal Prompting for Chest X-ray Diagnosis(https://arxiv.org/abs/2403.15585)
Keywords: in-context
Abstract: Chest X-ray images are commonly used for predicting acute and chronic cardiopulmonary conditions, but efforts to integrate them with structured clinical data face challenges due to incomplete electronic health records (EHR). This paper introduces \textbf{MedPromptX}, the first model to integrate multimodal large language models (MLLMs), few-shot prompting (FP) and visual grounding (VG) to combine imagery with EHR data for chest X-ray diagnosis. A pre-trained MLLM is utilized to complement the missing EHR information, providing a comprehensive understanding of patients' medical history. Additionally, FP reduces the necessity for extensive training of MLLMs while effectively tackling the issue of hallucination. Nevertheless, the process of determining the optimal number of few-shot examples and selecting high-quality candidates can be burdensome, yet it profoundly influences model performance. Hence, we propose a new technique that dynamically refines few-shot data for real-time adjustment to new patient scenarios. Moreover, VG aids in focusing the model's attention on relevant regions of interest in X-ray images, enhancing the identification of abnormalities. We release MedPromptX-VQA, a new in-context visual question answering dataset encompassing interleaved image and EHR data derived from MIMIC-IV and MIMIC-CXR databases. Results demonstrate the SOTA performance of MedPromptX, achieving an 11% improvement in F1-score compared to the baselines. Code and data are available at \url{https://github.com/BioMedIA-MBZUAI/MedPromptX}.

Title: EAGLE: A Domain Generalization Framework for AI-generated Text Detection

Authors: Amrita Bhattacharjee, Raha Moraffah, Joshua Garland, Huan Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.15690
Pdf URL: https://arxiv.org/pdf/2403.15690
Copy Paste: [[2403.15690]] EAGLE: A Domain Generalization Framework for AI-generated Text Detection(https://arxiv.org/abs/2403.15690)
Keywords: self-supervised
Abstract: With the advancement in capabilities of Large Language Models (LLMs), one major step in the responsible and safe use of such LLMs is to be able to detect text generated by these models. While supervised AI-generated text detectors perform well on text generated by older LLMs, with the frequent release of new LLMs, building supervised detectors for identifying text from such new models would require new labeled training data, which is infeasible in practice. In this work, we tackle this problem and propose a domain generalization framework for the detection of AI-generated text from unseen target generators. Our proposed framework, EAGLE, leverages the labeled data that is available so far from older language models and learns features invariant across these generators, in order to detect text generated by an unknown target generator. EAGLE learns such domain-invariant features by combining the representational power of self-supervised contrastive learning with domain adversarial training. Through our experiments we demonstrate how EAGLE effectively achieves impressive performance in detecting text generated by unseen target generators, including recent state-of-the-art ones such as GPT-4 and Claude, reaching detection scores of within 4.7% of a fully supervised detector.

Title: Technical Report: Masked Skeleton Sequence Modeling for Learning Larval Zebrafish Behavior Latent Embeddings

Authors: Lanxin Xu, Shuo Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.15693
Pdf URL: https://arxiv.org/pdf/2403.15693
Copy Paste: [[2403.15693]] Technical Report: Masked Skeleton Sequence Modeling for Learning Larval Zebrafish Behavior Latent Embeddings(https://arxiv.org/abs/2403.15693)
Keywords: self-supervised, generative
Abstract: In this report, we introduce a novel self-supervised learning method for extracting latent embeddings from behaviors of larval zebrafish. Drawing inspiration from Masked Modeling techniquesutilized in image processing with Masked Autoencoders (MAE) \cite{he2022masked} and in natural language processing with Generative Pre-trained Transformer (GPT) \cite{radford2018improving}, we treat behavior sequences as a blend of images and language. For the skeletal sequences of swimming zebrafish, we propose a pioneering Transformer-CNN architecture, the Sequence Spatial-Temporal Transformer (SSTFormer), designed to capture the inter-frame correlation of different joints. This correlation is particularly valuable, as it reflects the coordinated movement of various parts of the fish body across adjacent frames. To handle the high frame rate, we segment the skeleton sequence into distinct time slices, analogous to "words" in a sentence, and employ self-attention transformer layers to encode the consecutive frames within each slice, capturing the spatial correlation among different joints. Furthermore, we incorporate a CNN-based attention module to enhance the representations outputted by the transformer layers. Lastly, we introduce a temporal feature aggregation operation between time slices to improve the discrimination of similar behaviors.

Title: SceneX:Procedural Controllable Large-scale Scene Generation via Large-language Models

Authors: Mengqi Zhou, Jun Hou, Chuanchen Luo, Yuxi Wang, Zhaoxiang Zhang, Junran Peng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.15698
Pdf URL: https://arxiv.org/pdf/2403.15698
Copy Paste: [[2403.15698]] SceneX:Procedural Controllable Large-scale Scene Generation via Large-language Models(https://arxiv.org/abs/2403.15698)
Keywords: generative
Abstract: Due to its great application potential, large-scale scene generation has drawn extensive attention in academia and industry. Recent research employs powerful generative models to create desired scenes and achieves promising results. However, most of these methods represent the scene using 3D primitives (e.g. point cloud or radiance field) incompatible with the industrial pipeline, which leads to a substantial gap between academic research and industrial deployment. Procedural Controllable Generation (PCG) is an efficient technique for creating scalable and high-quality assets, but it is unfriendly for ordinary users as it demands profound domain expertise. To address these issues, we resort to using the large language model (LLM) to drive the procedural modeling. In this paper, we introduce a large-scale scene generation framework, SceneX, which can automatically produce high-quality procedural models according to designers' textual descriptions.Specifically, the proposed method comprises two components, PCGBench and PCGPlanner. The former encompasses an extensive collection of accessible procedural assets and thousands of hand-craft API documents. The latter aims to generate executable actions for Blender to produce controllable and precise 3D assets guided by the user's instructions. Our SceneX can generate a city spanning 2.5 km times 2.5 km with delicate layout and geometric structures, drastically reducing the time cost from several weeks for professional PCG engineers to just a few hours for an ordinary user. Extensive experiments demonstrated the capability of our method in controllable large-scale scene generation and editing, including asset placement and season translation.

Title: Convection-Diffusion Equation: A Theoretically Certified Framework for Neural Networks

Authors: Tangjun Wang, Chenglong Bao, Zuoqiang Shi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.15726
Pdf URL: https://arxiv.org/pdf/2403.15726
Copy Paste: [[2403.15726]] Convection-Diffusion Equation: A Theoretically Certified Framework for Neural Networks(https://arxiv.org/abs/2403.15726)
Keywords: diffusion
Abstract: In this paper, we study the partial differential equation models of neural networks. Neural network can be viewed as a map from a simple base model to a complicate function. Based on solid analysis, we show that this map can be formulated by a convection-diffusion equation. This theoretically certified framework gives mathematical foundation and more understanding of neural networks. Moreover, based on the convection-diffusion equation model, we design a novel network structure, which incorporates diffusion mechanism into network architecture. Extensive experiments on both benchmark datasets and real-world applications validate the performance of the proposed model.

Title: BEND: Bagging Deep Learning Training Based on Efficient Neural Network Diffusion

Authors: Jia Wei, Xingjun Zhang, Witold Pedrycz
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.15766
Pdf URL: https://arxiv.org/pdf/2403.15766
Copy Paste: [[2403.15766]] BEND: Bagging Deep Learning Training Based on Efficient Neural Network Diffusion(https://arxiv.org/abs/2403.15766)
Keywords: diffusion
Abstract: Bagging has achieved great success in the field of machine learning by integrating multiple base classifiers to build a single strong classifier to reduce model variance. The performance improvement of bagging mainly relies on the number and diversity of base classifiers. However, traditional deep learning model training methods are expensive to train individually and difficult to train multiple models with low similarity in a restricted dataset. Recently, diffusion models, which have been tremendously successful in the fields of imaging and vision, have been found to be effective in generating neural network model weights and biases with diversity. We creatively propose a Bagging deep learning training algorithm based on Efficient Neural network Diffusion (BEND). The originality of BEND comes from the first use of a neural network diffusion model to efficiently build base classifiers for bagging. Our approach is simple but effective, first using multiple trained model weights and biases as inputs to train autoencoder and latent diffusion model to realize a diffusion model from noise to valid neural network parameters. Subsequently, we generate several base classifiers using the trained diffusion model. Finally, we integrate these ba se classifiers for various inference tasks using the Bagging method. Resulting experiments on multiple models and datasets show that our proposed BEND algorithm can consistently outperform the mean and median accuracies of both the original trained model and the diffused model. At the same time, new models diffused using the diffusion model have higher diversity and lower cost than multiple models trained using traditional methods. The BEND approach successfully introduces diffusion models into the new deep learning training domain and provides a new paradigm for future deep learning training and inference.

Title: In-Context Matting

Authors: He Guo, Zixuan Ye, Zhiguo Cao, Hao Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.15789
Pdf URL: https://arxiv.org/pdf/2403.15789
Copy Paste: [[2403.15789]] In-Context Matting(https://arxiv.org/abs/2403.15789)
Keywords: diffusion, in-context
Abstract: We introduce in-context matting, a novel task setting of image matting. Given a reference image of a certain foreground and guided priors such as points, scribbles, and masks, in-context matting enables automatic alpha estimation on a batch of target images of the same foreground category, without additional auxiliary input. This setting marries good performance in auxiliary input-based matting and ease of use in automatic matting, which finds a good trade-off between customization and automation. To overcome the key challenge of accurate foreground matching, we introduce IconMatting, an in-context matting model built upon a pre-trained text-to-image diffusion model. Conditioned on inter- and intra-similarity matching, IconMatting can make full use of reference context to generate accurate target alpha mattes. To benchmark the task, we also introduce a novel testing dataset ICM-$57$, covering 57 groups of real-world images. Quantitative and qualitative results on the ICM-57 testing set show that IconMatting rivals the accuracy of trimap-based matting while retaining the automation level akin to automatic matting. Code is available at https://github.com/tiny-smart/in-context-matting

Title: Boarding for ISS: Imbalanced Self-Supervised: Discovery of a Scaled Autoencoder for Mixed Tabular Datasets

Authors: Samuel Stocksieker, Denys Pommeret, Arthur Charpentier
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2403.15790
Pdf URL: https://arxiv.org/pdf/2403.15790
Copy Paste: [[2403.15790]] Boarding for ISS: Imbalanced Self-Supervised: Discovery of a Scaled Autoencoder for Mixed Tabular Datasets(https://arxiv.org/abs/2403.15790)
Keywords: self-supervised, generative
Abstract: The field of imbalanced self-supervised learning, especially in the context of tabular data, has not been extensively studied. Existing research has predominantly focused on image datasets. This paper aims to fill this gap by examining the specific challenges posed by data imbalance in self-supervised learning in the domain of tabular data, with a primary focus on autoencoders. Autoencoders are widely employed for learning and constructing a new representation of a dataset, particularly for dimensionality reduction. They are also often used for generative model learning, as seen in variational autoencoders. When dealing with mixed tabular data, qualitative variables are often encoded using a one-hot encoder with a standard loss function (MSE or Cross Entropy). In this paper, we analyze the drawbacks of this approach, especially when categorical variables are imbalanced. We propose a novel metric to balance learning: a Multi-Supervised Balanced MSE. This approach reduces the reconstruction error by balancing the influence of variables. Finally, we empirically demonstrate that this new metric, compared to the standard MSE: i) outperforms when the dataset is imbalanced, especially when the learning process is insufficient, and ii) provides similar results in the opposite case.

Title: Diffusion-based Aesthetic QR Code Generation via Scanning-Robust Perceptual Guidance

Authors: Jia-Wei Liao, Winston Wang, Tzu-Sian Wang, Li-Xuan Peng, Cheng-Fu Chou, Jun-Cheng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.15878
Pdf URL: https://arxiv.org/pdf/2403.15878
Copy Paste: [[2403.15878]] Diffusion-based Aesthetic QR Code Generation via Scanning-Robust Perceptual Guidance(https://arxiv.org/abs/2403.15878)
Keywords: diffusion
Abstract: QR codes, prevalent in daily applications, lack visual appeal due to their conventional black-and-white design. Integrating aesthetics while maintaining scannability poses a challenge. In this paper, we introduce a novel diffusion-model-based aesthetic QR code generation pipeline, utilizing pre-trained ControlNet and guided iterative refinement via a novel classifier guidance (SRG) based on the proposed Scanning-Robust Loss (SRL) tailored with QR code mechanisms, which ensures both aesthetics and scannability. To further improve the scannability while preserving aesthetics, we propose a two-stage pipeline with Scanning-Robust Perceptual Guidance (SRPG). Moreover, we can further enhance the scannability of the generated QR code by post-processing it through the proposed Scanning-Robust Projected Gradient Descent (SRPGD) post-processing technique based on SRL with proven convergence. With extensive quantitative, qualitative, and subjective experiments, the results demonstrate that the proposed approach can generate diverse aesthetic QR codes with flexibility in detail. In addition, our pipelines outperforming existing models in terms of Scanning Success Rate (SSR) 86.67% (+40%) with comparable aesthetic scores. The pipeline combined with SRPGD further achieves 96.67% (+50%). Our code will be available https://github.com/jwliao1209/DiffQRCode.

Title: X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention

Authors: You Xie, Hongyi Xu, Guoxian Song, Chao Wang, Yichun Shi, Linjie Luo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.15931
Pdf URL: https://arxiv.org/pdf/2403.15931
Copy Paste: [[2403.15931]] X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention(https://arxiv.org/abs/2403.15931)
Keywords: diffusion, generative
Abstract: We propose X-Portrait, an innovative conditional diffusion model tailored for generating expressive and temporally coherent portrait animation. Specifically, given a single portrait as appearance reference, we aim to animate it with motion derived from a driving video, capturing both highly dynamic and subtle facial expressions along with wide-range head movements. As its core, we leverage the generative prior of a pre-trained diffusion model as the rendering backbone, while achieve fine-grained head pose and expression control with novel controlling signals within the framework of ControlNet. In contrast to conventional coarse explicit controls such as facial landmarks, our motion control module is learned to interpret the dynamics directly from the original driving RGB inputs. The motion accuracy is further enhanced with a patch-based local control module that effectively enhance the motion attention to small-scale nuances like eyeball positions. Notably, to mitigate the identity leakage from the driving signals, we train our motion control modules with scaling-augmented cross-identity images, ensuring maximized disentanglement from the appearance reference modules. Experimental results demonstrate the universal effectiveness of X-Portrait across a diverse range of facial portraits and expressive driving sequences, and showcase its proficiency in generating captivating portrait animations with consistently maintained identity characteristics.

Title: Feature Manipulation for DDPM based Change Detection

Authors: Zhenglin Li, Yangchen Huang, Mengran Zhu, Jingyu Zhang, JingHao Chang, Houze Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.15943
Pdf URL: https://arxiv.org/pdf/2403.15943
Copy Paste: [[2403.15943]] Feature Manipulation for DDPM based Change Detection(https://arxiv.org/abs/2403.15943)
Keywords: diffusion
Abstract: Change Detection is a classic task of computer vision that receives a bi-temporal image pair as input and separates the semantically changed and unchanged regions of it. The diffusion model is used in image synthesis and as a feature extractor and has been applied to various downstream tasks. Using this, a feature map is extracted from the pre-trained diffusion model from the large-scale data set, and changes are detected through the additional network. On the one hand, the current diffusion-based change detection approach focuses only on extracting a good feature map using the diffusion model. It obtains and uses differences without further adjustment to the created feature map. Our method focuses on manipulating the feature map extracted from the Diffusion Model to be more semantically useful, and for this, we propose two methods: Feature Attention and FDAF. Our model with Feature Attention achieved a state-of-the-art F1 score (90.18) and IoU (83.86) on the LEVIR-CD dataset.

Title: IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models

Authors: Haz Sameen Shahgir, Khondker Salman Sayeed, Abhik Bhattacharjee, Wasi Uddin Ahmad, Yue Dong, Rifat Shahriyar
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2403.15952
Pdf URL: https://arxiv.org/pdf/2403.15952
Copy Paste: [[2403.15952]] IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models(https://arxiv.org/abs/2403.15952)
Keywords: in-context
Abstract: The advent of Vision Language Models (VLM) has allowed researchers to investigate the visual understanding of a neural network using natural language. Beyond object classification and detection, VLMs are capable of visual comprehension and common-sense reasoning. This naturally led to the question: How do VLMs respond when the image itself is inherently unreasonable? To this end, we present IllusionVQA: a diverse dataset of challenging optical illusions and hard-to-interpret scenes to test the capability of VLMs in two distinct multiple-choice VQA tasks - comprehension and soft localization. GPT4V, the best-performing VLM, achieves 62.99% accuracy (4-shot) on the comprehension task and 49.7% on the localization task (4-shot and Chain-of-Thought). Human evaluation reveals that humans achieve 91.03% and 100% accuracy in comprehension and localization. We discover that In-Context Learning (ICL) and Chain-of-Thought reasoning substantially degrade the performance of GeminiPro on the localization task. Tangentially, we discover a potential weakness in the ICL capabilities of VLMs: they fail to locate optical illusions even when the correct answer is in the context window as a few-shot example.

Title: Fill in the ____ (a Diffusion-based Image Inpainting Pipeline)

Authors: Eyoel Gebre, Krishna Saxena, Timothy Tran
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.16016
Pdf URL: https://arxiv.org/pdf/2403.16016
Copy Paste: [[2403.16016]] Fill in the ____ (a Diffusion-based Image Inpainting Pipeline)(https://arxiv.org/abs/2403.16016)
Keywords: diffusion
Abstract: Image inpainting is the process of taking an image and generating lost or intentionally occluded portions. Inpainting has countless applications including restoring previously damaged pictures, restoring the quality of images that have been degraded due to compression, and removing unwanted objects/text. Modern inpainting techniques have shown remarkable ability in generating sensible completions for images with mask occlusions. In our paper, an overview of the progress of inpainting techniques will be provided, along with identifying current leading approaches, focusing on their strengths and weaknesses. A critical gap in these existing models will be addressed, focusing on the ability to prompt and control what exactly is generated. We will additionally justify why we think this is the natural next progressive step that inpainting models must take, and provide multiple approaches to implementing this functionality. Finally, we will evaluate the results of our approaches by qualitatively checking whether they generate high-quality images that correctly inpaint regions with the objects that they are instructed to produce.

Title: A Unified Module for Accelerating STABLE-DIFFUSION: LCM-LORA

Authors: Ayush Thakur, Rashmi Vashisth
Subjects: cs.LG, cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2403.16024
Pdf URL: https://arxiv.org/pdf/2403.16024
Copy Paste: [[2403.16024]] A Unified Module for Accelerating STABLE-DIFFUSION: LCM-LORA(https://arxiv.org/abs/2403.16024)
Keywords: diffusion
Abstract: This paper presents a comprehensive study on the unified module for accelerating stable-diffusion processes, specifically focusing on the lcm-lora module. Stable-diffusion processes play a crucial role in various scientific and engineering domains, and their acceleration is of paramount importance for efficient computational performance. The standard iterative procedures for solving fixed-source discrete ordinates problems often exhibit slow convergence, particularly in optically thick scenarios. To address this challenge, unconditionally stable diffusion-acceleration methods have been developed, aiming to enhance the computational efficiency of transport equations and discrete ordinates problems. This study delves into the theoretical foundations and numerical results of unconditionally stable diffusion synthetic acceleration methods, providing insights into their stability and performance for model discrete ordinates problems. Furthermore, the paper explores recent advancements in diffusion model acceleration, including on device acceleration of large diffusion models via gpu aware optimizations, highlighting the potential for significantly improved inference latency. The results and analyses in this study provide important insights into stable diffusion processes and have important ramifications for the creation and application of acceleration methods specifically, the lcm-lora module in a variety of computing environments.

Title: Robust Diffusion Models for Adversarial Purification

Authors: Guang Lin, Zerui Tao, Jianhai Zhang, Toshihisa Tanaka, Qibin Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.16067
Pdf URL: https://arxiv.org/pdf/2403.16067
Copy Paste: [[2403.16067]] Robust Diffusion Models for Adversarial Purification(https://arxiv.org/abs/2403.16067)
Keywords: diffusion
Abstract: Diffusion models (DMs) based adversarial purification (AP) has shown to be the most powerful alternative to adversarial training (AT). However, these methods neglect the fact that pre-trained diffusion models themselves are not robust to adversarial attacks as well. Additionally, the diffusion process can easily destroy semantic information and generate a high quality image but totally different from the original input image after the reverse process, leading to degraded standard accuracy. To overcome these issues, a natural idea is to harness adversarial training strategy to retrain or fine-tune the pre-trained diffusion model, which is computationally prohibitive. We propose a novel robust reverse process with adversarial guidance, which is independent of given pre-trained DMs and avoids retraining or fine-tuning the DMs. This robust guidance can not only ensure to generate purified examples retaining more semantic content but also mitigate the accuracy-robustness trade-off of DMs for the first time, which also provides DM-based AP an efficient adaptive ability to new attacks. Extensive experiments are conducted to demonstrate that our method achieves the state-of-the-art results and exhibits generalization against different attacks.

Title: EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing

Authors: Xiangpeng Yang, Linchao Zhu, Hehe Fan, Yi Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.16111
Pdf URL: https://arxiv.org/pdf/2403.16111
Copy Paste: [[2403.16111]] EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing(https://arxiv.org/abs/2403.16111)
Keywords: diffusion
Abstract: Current diffusion-based video editing primarily focuses on local editing (\textit{e.g.,} object/background editing) or global style editing by utilizing various dense correspondences. However, these methods often fail to accurately edit the foreground and background simultaneously while preserving the original layout. We find that the crux of the issue stems from the imprecise distribution of attention weights across designated regions, including inaccurate text-to-attribute control and attention leakage. To tackle this issue, we introduce EVA, a \textbf{zero-shot} and \textbf{multi-attribute} video editing framework tailored for human-centric videos with complex motions. We incorporate a Spatial-Temporal Layout-Guided Attention mechanism that leverages the intrinsic positive and negative correspondences of cross-frame diffusion features. To avoid attention leakage, we utilize these correspondences to boost the attention scores of tokens within the same attribute across all video frames while limiting interactions between tokens of different attributes in the self-attention layer. For precise text-to-attribute manipulation, we use discrete text embeddings focused on specific layout areas within the cross-attention layer. Benefiting from the precise attention weight distribution, EVA can be easily generalized to multi-object editing scenarios and achieves accurate identity mapping. Extensive experiments demonstrate EVA achieves state-of-the-art results in real-world scenarios. Full results are provided at https://knightyxp.github.io/EVA/

Title: Self-Supervised Multi-Frame Neural Scene Flow

Authors: Dongrui Liu, Daqi Liu, Xueqian Li, Sihao Lin, Hongwei xie, Bing Wang, Xiaojun Chang, Lei Chu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.16116
Pdf URL: https://arxiv.org/pdf/2403.16116
Copy Paste: [[2403.16116]] Self-Supervised Multi-Frame Neural Scene Flow(https://arxiv.org/abs/2403.16116)
Keywords: self-supervised
Abstract: Neural Scene Flow Prior (NSFP) and Fast Neural Scene Flow (FNSF) have shown remarkable adaptability in the context of large out-of-distribution autonomous driving. Despite their success, the underlying reasons for their astonishing generalization capabilities remain unclear. Our research addresses this gap by examining the generalization capabilities of NSFP through the lens of uniform stability, revealing that its performance is inversely proportional to the number of input point clouds. This finding sheds light on NSFP's effectiveness in handling large-scale point cloud scene flow estimation tasks. Motivated by such theoretical insights, we further explore the improvement of scene flow estimation by leveraging historical point clouds across multiple frames, which inherently increases the number of point clouds. Consequently, we propose a simple and effective method for multi-frame point cloud scene flow estimation, along with a theoretical evaluation of its generalization abilities. Our analysis confirms that the proposed method maintains a limited generalization error, suggesting that adding multiple frames to the scene flow optimization process does not detract from its generalizability. Extensive experimental results on large-scale autonomous driving Waymo Open and Argoverse lidar datasets demonstrate that the proposed method achieves state-of-the-art performance.

Title: A Survey on Self-Supervised Pre-Training of Graph Foundation Models: A Knowledge-Based Perspective

Authors: Ziwen Zhao, Yuhua Li, Yixiong Zou, Ruixuan Li, Rui Zhang
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2403.16137
Pdf URL: https://arxiv.org/pdf/2403.16137
Copy Paste: [[2403.16137]] A Survey on Self-Supervised Pre-Training of Graph Foundation Models: A Knowledge-Based Perspective(https://arxiv.org/abs/2403.16137)
Keywords: self-supervised, foundation model
Abstract: Graph self-supervised learning is now a go-to method for pre-training graph foundation models, including graph neural networks, graph transformers, and more recent large language model (LLM)-based graph models. There is a wide variety of knowledge patterns embedded in the structure and properties of graphs which may be used for pre-training, but we lack a systematic overview of self-supervised pre-training tasks from the perspective of graph knowledge. In this paper, we comprehensively survey and analyze the pre-training tasks of graph foundation models from a knowledge-based perspective, consisting of microscopic (nodes, links, etc) and macroscopic knowledge (clusters, global structure, etc). It covers a total of 9 knowledge categories and 25 pre-training tasks, as well as various downstream task adaptation strategies. Furthermore, an extensive list of the related papers with detailed metadata is provided at https://github.com/Newiz430/Pretext.

Title: One Masked Model is All You Need for Sensor Fault Detection, Isolation and Accommodation

Authors: Yiwei Fu, Weizhong Yan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.16153
Pdf URL: https://arxiv.org/pdf/2403.16153
Copy Paste: [[2403.16153]] One Masked Model is All You Need for Sensor Fault Detection, Isolation and Accommodation(https://arxiv.org/abs/2403.16153)
Keywords: self-supervised
Abstract: Accurate and reliable sensor measurements are critical for ensuring the safety and longevity of complex engineering systems such as wind turbines. In this paper, we propose a novel framework for sensor fault detection, isolation, and accommodation (FDIA) using masked models and self-supervised learning. Our proposed approach is a general time series modeling approach that can be applied to any neural network (NN) model capable of sequence modeling, and captures the complex spatio-temporal relationships among different sensors. During training, the proposed masked approach creates a random mask, which acts like a fault, for one or more sensors, making the training and inference task unified: finding the faulty sensors and correcting them. We validate our proposed technique on both a public dataset and a real-world dataset from GE offshore wind turbines, and demonstrate its effectiveness in detecting, diagnosing and correcting sensor faults. The masked model not only simplifies the overall FDIA pipeline, but also outperforms existing approaches. Our proposed technique has the potential to significantly improve the accuracy and reliability of sensor measurements in complex engineering systems in real-time, and could be applied to other types of sensors and engineering systems in the future. We believe that our proposed framework can contribute to the development of more efficient and effective FDIA techniques for a wide range of applications.

Title: Gaze-guided Hand-Object Interaction Synthesis: Benchmark and Method

Authors: Jie Tian, Lingxiao Yang, Ran Ji, Yuexin Ma, Lan Xu, Jingyi Yu, Ye Shi, Jingya Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.16169
Pdf URL: https://arxiv.org/pdf/2403.16169
Copy Paste: [[2403.16169]] Gaze-guided Hand-Object Interaction Synthesis: Benchmark and Method(https://arxiv.org/abs/2403.16169)
Keywords: diffusion
Abstract: Gaze plays a crucial role in revealing human attention and intention, shedding light on the cognitive processes behind human actions. The integration of gaze guidance with the dynamics of hand-object interactions boosts the accuracy of human motion prediction. However, the lack of datasets that capture the intricate relationship and consistency among gaze, hand, and object movements remains a substantial hurdle. In this paper, we introduce the first Gaze-guided Hand-Object Interaction dataset, GazeHOI, and present a novel task for synthesizing gaze-guided hand-object interactions. Our dataset, GazeHOI, features simultaneous 3D modeling of gaze, hand, and object interactions, comprising 479 sequences with an average duration of 19.1 seconds, 812 sub-sequences, and 33 objects of various sizes. We propose a hierarchical framework centered on a gaze-guided hand-object interaction diffusion model, named GHO-Diffusion. In the pre-diffusion phase, we separate gaze conditions into spatial-temporal features and goal pose conditions at different levels of information granularity. During the diffusion phase, two gaze-conditioned diffusion models are stacked to simplify the complex synthesis of hand-object motions. Here, the object motion diffusion model generates sequences of object motions based on gaze conditions, while the hand motion diffusion model produces hand motions based on the generated object motion. To improve fine-grained goal pose alignment, we introduce a Spherical Gaussian constraint to guide the denoising step. In the subsequent post-diffusion phase, we optimize the generated hand motions using contact consistency. Our extensive experiments highlight the uniqueness of our dataset and the effectiveness of our approach.

Title: Pose-Guided Self-Training with Two-Stage Clustering for Unsupervised Landmark Discovery

Authors: Siddharth Tourani, Ahmed Alwheibi, Arif Mahmood, Muhammad Haris Khan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.16194
Pdf URL: https://arxiv.org/pdf/2403.16194
Copy Paste: [[2403.16194]] Pose-Guided Self-Training with Two-Stage Clustering for Unsupervised Landmark Discovery(https://arxiv.org/abs/2403.16194)
Keywords: diffusion, self-supervised
Abstract: Unsupervised landmarks discovery (ULD) for an object category is a challenging computer vision problem. In pursuit of developing a robust ULD framework, we explore the potential of a recent paradigm of self-supervised learning algorithms, known as diffusion models. Some recent works have shown that these models implicitly contain important correspondence cues. Towards harnessing the potential of diffusion models for the ULD task, we make the following core contributions. First, we propose a ZeroShot ULD baseline based on simple clustering of random pixel locations with nearest neighbour matching. It delivers better results than existing ULD methods. Second, motivated by the ZeroShot performance, we develop a ULD algorithm based on diffusion features using self-training and clustering which also outperforms prior methods by notable margins. Third, we introduce a new proxy task based on generating latent pose codes and also propose a two-stage clustering mechanism to facilitate effective pseudo-labeling, resulting in a significant performance improvement. Overall, our approach consistently outperforms state-of-the-art methods on four challenging benchmarks AFLW, MAFL, CatHeads and LS3D by significant margins.

Title: Diffusion Model is a Good Pose Estimator from 3D RF-Vision

Authors: Junqiao Fan, Jianfei Yang, Yuecong Xu, Lihua Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.16198
Pdf URL: https://arxiv.org/pdf/2403.16198
Copy Paste: [[2403.16198]] Diffusion Model is a Good Pose Estimator from 3D RF-Vision(https://arxiv.org/abs/2403.16198)
Keywords: diffusion
Abstract: Human pose estimation (HPE) from Radio Frequency vision (RF-vision) performs human sensing using RF signals that penetrate obstacles without revealing privacy (e.g., facial information). Recently, mmWave radar has emerged as a promising RF-vision sensor, providing radar point clouds by processing RF signals. However, the mmWave radar has a limited resolution with severe noise, leading to inaccurate and inconsistent human pose estimation. This work proposes mmDiff, a novel diffusion-based pose estimator tailored for noisy radar data. Our approach aims to provide reliable guidance as conditions to diffusion models. Two key challenges are addressed by mmDiff: (1) miss-detection of parts of human bodies, which is addressed by a module that isolates feature extraction from different body parts, and (2) signal inconsistency due to environmental interference, which is tackled by incorporating prior knowledge of body structure and motion. Several modules are designed to achieve these goals, whose features work as the conditions for the subsequent diffusion model, eliminating the miss-detection and instability of HPE based on RF-vision. Extensive experiments demonstrate that mmDiff outperforms existing methods significantly, achieving state-of-the-art performances on public datasets.

Title: SQL-Encoder: Improving NL2SQL In-Context Learning Through a Context-Aware Encoder

Authors: Mohammadreza Pourreza, Davood Rafiei, Yuxi Feng, Raymond Li, Zhenan Fan, Weiwei Zhang
Subjects: cs.CL, cs.DB, cs.HC
Abstract URL: https://arxiv.org/abs/2403.16204
Pdf URL: https://arxiv.org/pdf/2403.16204
Copy Paste: [[2403.16204]] SQL-Encoder: Improving NL2SQL In-Context Learning Through a Context-Aware Encoder(https://arxiv.org/abs/2403.16204)
Keywords: in-context
Abstract: Detecting structural similarity between queries is essential for selecting examples in in-context learning models. However, assessing structural similarity based solely on the natural language expressions of queries, without considering SQL queries, presents a significant challenge. This paper explores the significance of this similarity metric and proposes a model for accurately estimating it. To achieve this, we leverage a dataset comprising 170k question pairs, meticulously curated to train a similarity prediction model. Our comprehensive evaluation demonstrates that the proposed model adeptly captures the structural similarity between questions, as evidenced by improvements in Kendall-Tau distance and precision@k metrics. Notably, our model outperforms strong competitive embedding models from OpenAI and Cohere. Furthermore, compared to these competitive models, our proposed encoder enhances the downstream performance of NL2SQL models in 1-shot in-context learning scenarios by 1-2\% for GPT-3.5-turbo, 4-8\% for CodeLlama-7B, and 2-3\% for CodeLlama-13B.

Title: Skull-to-Face: Anatomy-Guided 3D Facial Reconstruction and Editing

Authors: Yongqing Liang, Congyi Zhang, Junli Zhao, Wenping Wang, Xin Li
Subjects: cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2403.16207
Pdf URL: https://arxiv.org/pdf/2403.16207
Copy Paste: [[2403.16207]] Skull-to-Face: Anatomy-Guided 3D Facial Reconstruction and Editing(https://arxiv.org/abs/2403.16207)
Keywords: diffusion
Abstract: Deducing the 3D face from a skull is an essential but challenging task in forensic science and archaeology. Existing methods for automated facial reconstruction yield inaccurate results, suffering from the non-determinative nature of the problem that a skull with a sparse set of tissue depth cannot fully determine the skinned face. Additionally, their texture-less results require further post-processing stages to achieve a photo-realistic appearance. This paper proposes an end-to-end 3D face reconstruction and exploration tool, providing textured 3D faces for reference. With the help of state-of-the-art text-to-image diffusion models and image-based facial reconstruction techniques, we generate an initial reference 3D face, whose biological profile aligns with the given skull. We then adapt these initial faces to meet the statistical expectations of extruded anatomical landmarks on the skull through an optimization process. The joint statistical distribution of tissue depths is learned on a small set of anatomical landmarks on the skull. To support further adjustment, we propose an efficient face adaptation tool to assist users in tuning tissue depths, either globally or at local regions, while observing plausible visual feedback. Experiments conducted on a real skull-face dataset demonstrated the effectiveness of our proposed pipeline in terms of reconstruction accuracy, diversity, and stability.

Title: Frankenstein: Generating Semantic-Compositional 3D Scenes in One Tri-Plane

Authors: Han Yan, Yang Li, Zhennan Wu, Shenzhou Chen, Weixuan Sun, Taizhang Shang, Weizhe Liu, Tian Chen, Xiaqiang Dai, Chao Ma, Hongdong Li, Pan Ji
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2403.16210
Pdf URL: https://arxiv.org/pdf/2403.16210
Copy Paste: [[2403.16210]] Frankenstein: Generating Semantic-Compositional 3D Scenes in One Tri-Plane(https://arxiv.org/abs/2403.16210)
Keywords: diffusion
Abstract: We present Frankenstein, a diffusion-based framework that can generate semantic-compositional 3D scenes in a single pass. Unlike existing methods that output a single, unified 3D shape, Frankenstein simultaneously generates multiple separated shapes, each corresponding to a semantically meaningful part. The 3D scene information is encoded in one single tri-plane tensor, from which multiple Singed Distance Function (SDF) fields can be decoded to represent the compositional shapes. During training, an auto-encoder compresses tri-planes into a latent space, and then the denoising diffusion process is employed to approximate the distribution of the compositional scenes. Frankenstein demonstrates promising results in generating room interiors as well as human avatars with automatically separated parts. The generated scenes facilitate many downstream applications, such as part-wise re-texturing, object rearrangement in the room or avatar cloth re-targeting.

Title: Adversarially Masked Video Consistency for Unsupervised Domain Adaptation

Authors: Xiaoyu Zhu, Junwei Liang, Po-Yao Huang, Alex Hauptmann
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.16242
Pdf URL: https://arxiv.org/pdf/2403.16242
Copy Paste: [[2403.16242]] Adversarially Masked Video Consistency for Unsupervised Domain Adaptation(https://arxiv.org/abs/2403.16242)
Keywords: generative
Abstract: We study the problem of unsupervised domain adaptation for egocentric videos. We propose a transformer-based model to learn class-discriminative and domain-invariant feature representations. It consists of two novel designs. The first module is called Generative Adversarial Domain Alignment Network with the aim of learning domain-invariant representations. It simultaneously learns a mask generator and a domain-invariant encoder in an adversarial way. The domain-invariant encoder is trained to minimize the distance between the source and target domain. The masking generator, conversely, aims at producing challenging masks by maximizing the domain distance. The second is a Masked Consistency Learning module to learn class-discriminative representations. It enforces the prediction consistency between the masked target videos and their full forms. To better evaluate the effectiveness of domain adaptation methods, we construct a more challenging benchmark for egocentric videos, U-Ego4D. Our method achieves state-of-the-art performance on the Epic-Kitchen and the proposed U-Ego4D benchmark.

Title: Connecting the Dots: Inferring Patent Phrase Similarity with Retrieved Phrase Graphs

Authors: Zhuoyi Peng, Yi Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.16265
Pdf URL: https://arxiv.org/pdf/2403.16265
Copy Paste: [[2403.16265]] Connecting the Dots: Inferring Patent Phrase Similarity with Retrieved Phrase Graphs(https://arxiv.org/abs/2403.16265)
Keywords: self-supervised
Abstract: We study the patent phrase similarity inference task, which measures the semantic similarity between two patent phrases. As patent documents employ legal and highly technical language, existing semantic textual similarity methods that use localized contextual information do not perform satisfactorily in inferring patent phrase similarity. To address this, we introduce a graph-augmented approach to amplify the global contextual information of the patent phrases. For each patent phrase, we construct a phrase graph that links to its focal patents and a list of patents that are either cited by or cite these focal patents. The augmented phrase embedding is then derived from combining its localized contextual embedding with its global embedding within the phrase graph. We further propose a self-supervised learning objective that capitalizes on the retrieved topology to refine both the contextualized embedding and the graph parameters in an end-to-end manner. Experimental results from a unique patent phrase similarity dataset demonstrate that our approach significantly enhances the representation of patent phrases, resulting in marked improvements in similarity inference in a self-supervised fashion. Substantial improvements are also observed in the supervised setting, underscoring the potential benefits of leveraging retrieved phrase graph augmentation.

Title: Constricting Normal Latent Space for Anomaly Detection with Normal-only Training Data

Authors: Marcella Astrid, Muhammad Zaigham Zaheer, Seung-Ik Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.16270
Pdf URL: https://arxiv.org/pdf/2403.16270
Copy Paste: [[2403.16270]] Constricting Normal Latent Space for Anomaly Detection with Normal-only Training Data(https://arxiv.org/abs/2403.16270)
Keywords: anomaly
Abstract: In order to devise an anomaly detection model using only normal training data, an autoencoder (AE) is typically trained to reconstruct the data. As a result, the AE can extract normal representations in its latent space. During test time, since AE is not trained using real anomalies, it is expected to poorly reconstruct the anomalous data. However, several researchers have observed that it is not the case. In this work, we propose to limit the reconstruction capability of AE by introducing a novel latent constriction loss, which is added to the existing reconstruction loss. By using our method, no extra computational cost is added to the AE during test time. Evaluations using three video anomaly detection benchmark datasets, i.e., Ped2, Avenue, and ShanghaiTech, demonstrate the effectiveness of our method in limiting the reconstruction capability of AE, which leads to a better anomaly detection model.

Title: Object Detectors in the Open Environment:Challenges, Solutions, and Outlook

Authors: Siyuan Liang, Wei Wang, Ruoyu Chen, Aishan Liu, Boxi Wu, Ee-Chien Chang, Xiaochun Cao, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.16271
Pdf URL: https://arxiv.org/pdf/2403.16271
Copy Paste: [[2403.16271]] Object Detectors in the Open Environment:Challenges, Solutions, and Outlook(https://arxiv.org/abs/2403.16271)
Keywords: foundation model
Abstract: With the emergence of foundation models, deep learning-based object detectors have shown practical usability in closed set scenarios. However, for real-world tasks, object detectors often operate in open environments, where crucial factors (\eg, data distribution, objective) that influence model learning are often changing. The dynamic and intricate nature of the open environment poses novel and formidable challenges to object detectors. Unfortunately, current research on object detectors in open environments lacks a comprehensive analysis of their distinctive characteristics, challenges, and corresponding solutions, which hinders their secure deployment in critical real-world scenarios. This paper aims to bridge this gap by conducting a comprehensive review and analysis of object detectors in open environments. We initially identified limitations of key structural components within the existing detection pipeline and propose the open environment object detector challenge framework that includes four quadrants (\ie, out-of-domain, out-of-category, robust learning, and incremental learning) based on the dimensions of the data / target changes. For each quadrant of challenges in the proposed framework, we present a detailed description and systematic analysis of the overarching goals and core difficulties, systematically review the corresponding solutions, and benchmark their performance over multiple widely adopted datasets. In addition, we engage in a discussion of open problems and potential avenues for future research. This paper aims to provide a fresh, comprehensive, and systematic understanding of the challenges and solutions associated with open-environment object detectors, thus catalyzing the development of more solid applications in real-world scenarios.

Title: L-MAE: Longitudinal masked auto-encoder with time and severity-aware encoding for diabetic retinopathy progression prediction

Authors: Rachid Zeghlache, Pierre-Henri Conze, Mostafa El Habib Daho, Yihao Li, Alireza Rezaei, Hugo Le Boité, Ramin Tadayoni, Pascal Massin, Béatrice Cochener, Ikram Brahim, Gwenolé Quellec, Mathieu Lamard
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.16272
Pdf URL: https://arxiv.org/pdf/2403.16272
Copy Paste: [[2403.16272]] L-MAE: Longitudinal masked auto-encoder with time and severity-aware encoding for diabetic retinopathy progression prediction(https://arxiv.org/abs/2403.16272)
Keywords: self-supervised
Abstract: Pre-training strategies based on self-supervised learning (SSL) have proven to be effective pretext tasks for many downstream tasks in computer vision. Due to the significant disparity between medical and natural images, the application of typical SSL is not straightforward in medical imaging. Additionally, those pretext tasks often lack context, which is critical for computer-aided clinical decision support. In this paper, we developed a longitudinal masked auto-encoder (MAE) based on the well-known Transformer-based MAE. In particular, we explored the importance of time-aware position embedding as well as disease progression-aware masking. Taking into account the time between examinations instead of just scheduling them offers the benefit of capturing temporal changes and trends. The masking strategy, for its part, evolves during follow-up to better capture pathological changes, ensuring a more accurate assessment of disease progression. Using OPHDIAT, a large follow-up screening dataset targeting diabetic retinopathy (DR), we evaluated the pre-trained weights on a longitudinal task, which is to predict the severity label of the next visit within 3 years based on the past time series examinations. Our results demonstrated the relevancy of both time-aware position embedding and masking strategies based on disease progression knowledge. Compared to popular baseline models and standard longitudinal Transformers, these simple yet effective extensions significantly enhance the predictive ability of deep classification models.

Title: latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction

Authors: Christopher Wewer, Kevin Raj, Eddy Ilg, Bernt Schiele, Jan Eric Lenssen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.16292
Pdf URL: https://arxiv.org/pdf/2403.16292
Copy Paste: [[2403.16292]] latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction(https://arxiv.org/abs/2403.16292)
Keywords: generative
Abstract: We present latentSplat, a method to predict semantic Gaussians in a 3D latent space that can be splatted and decoded by a light-weight generative 2D architecture. Existing methods for generalizable 3D reconstruction either do not enable fast inference of high resolution novel views due to slow volume rendering, or are limited to interpolation of close input views, even in simpler settings with a single central object, where 360-degree generalization is possible. In this work, we combine a regression-based approach with a generative model, moving towards both of these capabilities within the same method, trained purely on readily available real video data. The core of our method are variational 3D Gaussians, a representation that efficiently encodes varying uncertainty within a latent space consisting of 3D feature Gaussians. From these Gaussians, specific instances can be sampled and rendered via efficient Gaussian splatting and a fast, generative decoder network. We show that latentSplat outperforms previous works in reconstruction quality and generalization, while being fast and scalable to high-resolution data.

Title: AutoInst: Automatic Instance-Based Segmentation of LiDAR 3D Scans

Authors: Cedric Perauer, Laurenz Adrian Heidrich, Haifan Zhang, Matthias Nießner, Anastasiia Kornilova, Alexey Artemov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.16318
Pdf URL: https://arxiv.org/pdf/2403.16318
Copy Paste: [[2403.16318]] AutoInst: Automatic Instance-Based Segmentation of LiDAR 3D Scans(https://arxiv.org/abs/2403.16318)
Keywords: self-supervised
Abstract: Recently, progress in acquisition equipment such as LiDAR sensors has enabled sensing increasingly spacious outdoor 3D environments. Making sense of such 3D acquisitions requires fine-grained scene understanding, such as constructing instance-based 3D scene segmentations. Commonly, a neural network is trained for this task; however, this requires access to a large, densely annotated dataset, which is widely known to be challenging to obtain. To address this issue, in this work we propose to predict instance segmentations for 3D scenes in an unsupervised way, without relying on ground-truth annotations. To this end, we construct a learning framework consisting of two components: (1) a pseudo-annotation scheme for generating initial unsupervised pseudo-labels; and (2) a self-training algorithm for instance segmentation to fit robust, accurate instances from initial noisy proposals. To enable generating 3D instance mask proposals, we construct a weighted proxy-graph by connecting 3D points with edges integrating multi-modal image- and point-based self-supervised features, and perform graph-cuts to isolate individual pseudo-instances. We then build on a state-of-the-art point-based architecture and train a 3D instance segmentation model, resulting in significant refinement of initial proposals. To scale to arbitrary complexity 3D scenes, we design our algorithm to operate on local 3D point chunks and construct a merging step to generate scene-level instance segmentations. Experiments on the challenging SemanticKITTI benchmark demonstrate the potential of our approach, where it attains 13.3% higher Average Precision and 9.1% higher F1 score compared to the best-performing baseline. The code will be made publicly available at https://github.com/artonson/autoinst.

Title: Generating Potent Poisons and Backdoors from Scratch with Guided Diffusion

Authors: Hossein Souri, Arpit Bansal, Hamid Kazemi, Liam Fowl, Aniruddha Saha, Jonas Geiping, Andrew Gordon Wilson, Rama Chellappa, Tom Goldstein, Micah Goldblum
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2403.16365
Pdf URL: https://arxiv.org/pdf/2403.16365
Copy Paste: [[2403.16365]] Generating Potent Poisons and Backdoors from Scratch with Guided Diffusion(https://arxiv.org/abs/2403.16365)
Keywords: diffusion
Abstract: Modern neural networks are often trained on massive datasets that are web scraped with minimal human inspection. As a result of this insecure curation pipeline, an adversary can poison or backdoor the resulting model by uploading malicious data to the internet and waiting for a victim to scrape and train on it. Existing approaches for creating poisons and backdoors start with randomly sampled clean data, called base samples, and then modify those samples to craft poisons. However, some base samples may be significantly more amenable to poisoning than others. As a result, we may be able to craft more potent poisons by carefully choosing the base samples. In this work, we use guided diffusion to synthesize base samples from scratch that lead to significantly more potent poisons and backdoors than previous state-of-the-art attacks. Our Guided Diffusion Poisoning (GDP) base samples can be combined with any downstream poisoning or backdoor attack to boost its effectiveness. Our implementation code is publicly available at: https://github.com/hsouri/GDP .

Title: FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models

Authors: Lin Zhao, Tianchen Zhao, Zinan Lin, Xuefei Ning, Guohao Dai, Huazhong Yang, Yu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.16379
Pdf URL: https://arxiv.org/pdf/2403.16379
Copy Paste: [[2403.16379]] FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models(https://arxiv.org/abs/2403.16379)
Keywords: diffusion, generative
Abstract: In recent years, there has been significant progress in the development of text-to-image generative models. Evaluating the quality of the generative models is one essential step in the development process. Unfortunately, the evaluation process could consume a significant amount of computational resources, making the required periodic evaluation of model performance (e.g., monitoring training progress) impractical. Therefore, we seek to improve the evaluation efficiency by selecting the representative subset of the text-image dataset. We systematically investigate the design choices, including the selection criteria (textural features or image-based metrics) and the selection granularity (prompt-level or set-level). We find that the insights from prior work on subset selection for training data do not generalize to this problem, and we propose FlashEval, an iterative search algorithm tailored to evaluation data selection. We demonstrate the effectiveness of FlashEval on ranking diffusion models with various configurations, including architectures, quantization levels, and sampler schedules on COCO and DiffusionDB datasets. Our searched 50-item subset could achieve comparable evaluation quality to the randomly sampled 500-item subset for COCO annotations on unseen models, achieving a 10x evaluation speedup. We release the condensed subset of these commonly used datasets to help facilitate diffusion algorithm design and evaluation, and open-source FlashEval as a tool for condensing future datasets, accessible at https://github.com/thu-nics/FlashEval.

Title: Refining Text-to-Image Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation

Authors: Sanyam Lakhanpal, Shivang Chopra, Vinija Jain, Aman Chadha, Man Luo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.16422
Pdf URL: https://arxiv.org/pdf/2403.16422
Copy Paste: [[2403.16422]] Refining Text-to-Image Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation(https://arxiv.org/abs/2403.16422)
Keywords: diffusion
Abstract: Over the past few years, Text-to-Image (T2I) generation approaches based on diffusion models have gained significant attention. However, vanilla diffusion models often suffer from spelling inaccuracies in the text displayed within the generated images. The capability to generate visual text is crucial, offering both academic interest and a wide range of practical applications. To produce accurate visual text images, state-of-the-art techniques adopt a glyph-controlled image generation approach, consisting of a text layout generator followed by an image generator that is conditioned on the generated text layout. Nevertheless, our study reveals that these models still face three primary challenges, prompting us to develop a testbed to facilitate future research. We introduce a benchmark, LenCom-Eval, specifically designed for testing models' capability in generating images with Lengthy and Complex visual text. Subsequently, we introduce a training-free framework to enhance the two-stage generation approaches. We examine the effectiveness of our approach on both LenCom-Eval and MARIO-Eval benchmarks and demonstrate notable improvements across a range of evaluation metrics, including CLIPScore, OCR precision, recall, F1 score, accuracy, and edit distance scores. For instance, our proposed framework improves the backbone model, TextDiffuser, by more than 23\% and 13.5\% in terms of OCR word F1 on LenCom-Eval and MARIO-Eval, respectively. Our work makes a unique contribution to the field by focusing on generating images with long and rare text sequences, a niche previously unexplored by existing literature

Title: PathoTune: Adapting Visual Foundation Model to Pathological Specialists

Authors: Jiaxuan Lu, Fang Yan, Xiaofan Zhang, Yue Gao, Shaoting Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.16497
Pdf URL: https://arxiv.org/pdf/2403.16497
Copy Paste: [[2403.16497]] PathoTune: Adapting Visual Foundation Model to Pathological Specialists(https://arxiv.org/abs/2403.16497)
Keywords: foundation model
Abstract: As natural image understanding moves towards the pretrain-finetune era, research in pathology imaging is concurrently evolving. Despite the predominant focus on pretraining pathological foundation models, how to adapt foundation models to downstream tasks is little explored. For downstream adaptation, we propose the existence of two domain gaps, i.e., the Foundation-Task Gap and the Task-Instance Gap. To mitigate these gaps, we introduce PathoTune, a framework designed to efficiently adapt pathological or even visual foundation models to pathology-specific tasks via multi-modal prompt tuning. The proposed framework leverages Task-specific Visual Prompts and Task-specific Textual Prompts to identify task-relevant features, along with Instance-specific Visual Prompts for encoding single pathological image features. Results across multiple datasets at both patch-level and WSI-level demonstrate its superior performance over single-modality prompt tuning approaches. Significantly, PathoTune facilitates the direct adaptation of natural visual foundation models to pathological tasks, drastically outperforming pathological foundation models with simple linear probing. The code will be available upon acceptance.

Title: Self-Supervised Learning for Medical Image Data with Anatomy-Oriented Imaging Planes

Authors: Tianwei Zhang, Dong Wei, Mengmeng Zhua, Shi Gu, Yefeng Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.16499
Pdf URL: https://arxiv.org/pdf/2403.16499
Copy Paste: [[2403.16499]] Self-Supervised Learning for Medical Image Data with Anatomy-Oriented Imaging Planes(https://arxiv.org/abs/2403.16499)
Keywords: self-supervised
Abstract: Self-supervised learning has emerged as a powerful tool for pretraining deep networks on unlabeled data, prior to transfer learning of target tasks with limited annotation. The relevance between the pretraining pretext and target tasks is crucial to the success of transfer learning. Various pretext tasks have been proposed to utilize properties of medical image data (e.g., three dimensionality), which are more relevant to medical image analysis than generic ones for natural images. However, previous work rarely paid attention to data with anatomy-oriented imaging planes, e.g., standard cardiac magnetic resonance imaging views. As these imaging planes are defined according to the anatomy of the imaged organ, pretext tasks effectively exploiting this information can pretrain the networks to gain knowledge on the organ of interest. In this work, we propose two complementary pretext tasks for this group of medical image data based on the spatial relationship of the imaging planes. The first is to learn the relative orientation between the imaging planes and implemented as regressing their intersecting lines. The second exploits parallel imaging planes to regress their relative slice locations within a stack. Both pretext tasks are conceptually straightforward and easy to implement, and can be combined in multitask learning for better representation learning. Thorough experiments on two anatomical structures (heart and knee) and representative target tasks (semantic segmentation and classification) demonstrate that the proposed pretext tasks are effective in pretraining deep networks for remarkably boosted performance on the target tasks, and superior to other recent approaches.

Title: LARA: Linguistic-Adaptive Retrieval-Augmented LLMs for Multi-Turn Intent Classification

Authors: Liu Junhua, Tan Yong Keat, Fu Bin
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2403.16504
Pdf URL: https://arxiv.org/pdf/2403.16504
Copy Paste: [[2403.16504]] LARA: Linguistic-Adaptive Retrieval-Augmented LLMs for Multi-Turn Intent Classification(https://arxiv.org/abs/2403.16504)
Keywords: in-context
Abstract: Following the significant achievements of large language models (LLMs), researchers have employed in-context learning for text classification tasks. However, these studies focused on monolingual, single-turn classification tasks. In this paper, we introduce LARA (Linguistic-Adaptive Retrieval-Augmented Language Models), designed to enhance accuracy in multi-turn classification tasks across six languages, accommodating numerous intents in chatbot interactions. Multi-turn intent classification is notably challenging due to the complexity and evolving nature of conversational contexts. LARA tackles these issues by combining a fine-tuned smaller model with a retrieval-augmented mechanism, integrated within the architecture of LLMs. This integration allows LARA to dynamically utilize past dialogues and relevant intents, thereby improving the understanding of the context. Furthermore, our adaptive retrieval techniques bolster the cross-lingual capabilities of LLMs without extensive retraining and fine-tune. Comprehensive experiments demonstrate that LARA achieves state-of-the-art performance on multi-turn intent classification tasks, enhancing the average accuracy by 3.67% compared to existing methods.

Title: Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework

Authors: Ziyao Huang, Fan Tang, Yong Zhang, Xiaodong Cun, Juan Cao, Jintao Li, Tong-Yee Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.16510
Pdf URL: https://arxiv.org/pdf/2403.16510
Copy Paste: [[2403.16510]] Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework(https://arxiv.org/abs/2403.16510)
Keywords: diffusion
Abstract: Despite the remarkable process of talking-head-based avatar-creating solutions, directly generating anchor-style videos with full-body motions remains challenging. In this study, we propose Make-Your-Anchor, a novel system necessitating only a one-minute video clip of an individual for training, subsequently enabling the automatic generation of anchor-style videos with precise torso and hand movements. Specifically, we finetune a proposed structure-guided diffusion model on input video to render 3D mesh conditions into human appearances. We adopt a two-stage training strategy for the diffusion model, effectively binding movements with specific appearances. To produce arbitrary long temporal video, we extend the 2D U-Net in the frame-wise diffusion model to a 3D style without additional training cost, and a simple yet effective batch-overlapped temporal denoising module is proposed to bypass the constraints on video length during inference. Finally, a novel identity-specific face enhancement module is introduced to improve the visual quality of facial regions in the output videos. Comparative experiments demonstrate the effectiveness and superiority of the system in terms of visual quality, temporal coherence, and identity preservation, outperforming SOTA diffusion/non-diffusion methods. Project page: \url{https://github.com/ICTMCG/Make-Your-Anchor}.

Title: LLMs Are Few-Shot In-Context Low-Resource Language Learners

Authors: Samuel Cahyawijaya, Holy Lovenia, Pascale Fung
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.16512
Pdf URL: https://arxiv.org/pdf/2403.16512
Copy Paste: [[2403.16512]] LLMs Are Few-Shot In-Context Low-Resource Language Learners(https://arxiv.org/abs/2403.16512)
Keywords: in-context
Abstract: In-context learning (ICL) empowers large language models (LLMs) to perform diverse tasks in underrepresented languages using only short in-context information, offering a crucial avenue for narrowing the gap between high-resource and low-resource languages. Nonetheless, there is only a handful of works explored ICL for low-resource languages with most of them focusing on relatively high-resource languages, such as French and Spanish. In this work, we extensively study ICL and its cross-lingual variation (X-ICL) on 25 low-resource and 7 relatively higher-resource languages. Our study not only assesses the effectiveness of ICL with LLMs in low-resource languages but also identifies the shortcomings of in-context label alignment, and introduces a more effective alternative: query alignment. Moreover, we provide valuable insights into various facets of ICL for low-resource languages. Our study concludes the significance of few-shot in-context information on enhancing the low-resource understanding quality of LLMs through semantically relevant information by closing the language gap in the target language and aligning the semantics between the targeted low-resource and the high-resource language that the model is proficient in. Our work highlights the importance of advancing ICL research, particularly for low-resource languages.

Title: Let Real Images be as a Judger, Spotting Fake Images Synthesized with Generative Models

Authors: Ziyou Liang, Run Wang, Weifeng Liu, Yuyang Zhang, Wenyuan Yang, Lina Wang, Xingkai Wang
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2403.16513
Pdf URL: https://arxiv.org/pdf/2403.16513
Copy Paste: [[2403.16513]] Let Real Images be as a Judger, Spotting Fake Images Synthesized with Generative Models(https://arxiv.org/abs/2403.16513)
Keywords: diffusion, generative
Abstract: In the last few years, generative models have shown their powerful capabilities in synthesizing realistic images in both quality and diversity (i.e., facial images, and natural subjects). Unfortunately, the artifact patterns in fake images synthesized by different generative models are inconsistent, leading to the failure of previous research that relied on spotting subtle differences between real and fake. In our preliminary experiments, we find that the artifacts in fake images always change with the development of the generative model, while natural images exhibit stable statistical properties. In this paper, we employ natural traces shared only by real images as an additional predictive target in the detector. Specifically, the natural traces are learned from the wild real images and we introduce extended supervised contrastive learning to bring them closer to real images and further away from fake ones. This motivates the detector to make decisions based on the proximity of images to the natural traces. To conduct a comprehensive experiment, we built a high-quality and diverse dataset that includes generative models comprising 6 GAN and 6 diffusion models, to evaluate the effectiveness in generalizing unknown forgery techniques and robustness in surviving different transformations. Experimental results show that our proposed method gives 96.1% mAP significantly outperforms the baselines. Extensive experiments conducted on the widely recognized platform Midjourney reveal that our proposed method achieves an accuracy exceeding 78.4%, underscoring its practicality for real-world application deployment. The source code and partial self-built dataset are available in supplementary material.

Title: Visually Guided Generative Text-Layout Pre-training for Document Intelligence

Authors: Zhiming Mao, Haoli Bai, Lu Hou, Jiansheng Wei, Xin Jiang, Qun Liu, Kam-Fai Wong
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2403.16516
Pdf URL: https://arxiv.org/pdf/2403.16516
Copy Paste: [[2403.16516]] Visually Guided Generative Text-Layout Pre-training for Document Intelligence(https://arxiv.org/abs/2403.16516)
Keywords: generative
Abstract: Prior study shows that pre-training techniques can boost the performance of visual document understanding (VDU), which typically requires models to gain abilities to perceive and reason both document texts and layouts (e.g., locations of texts and table-cells). To this end, we propose visually guided generative text-layout pre-training, named ViTLP. Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence. In addition, to address the limitation of processing long documents by Transformers, we introduce a straightforward yet effective multi-segment generative pre-training scheme, facilitating ViTLP to process word-intensive documents of any length. ViTLP can function as a native OCR model to localize and recognize texts of document images. Besides, ViTLP can be effectively applied to various downstream VDU tasks. Extensive experiments show that ViTLP achieves competitive performance over existing baselines on benchmark VDU tasks, including information extraction, document classification, and document question answering.

Title: CMViM: Contrastive Masked Vim Autoencoder for 3D Multi-modal Representation Learning for AD classification

Authors: Guangqian Yang, Kangrui Du, Zhihan Yang, Ye Du, Yongping Zheng, Shujun Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.16520
Pdf URL: https://arxiv.org/pdf/2403.16520
Copy Paste: [[2403.16520]] CMViM: Contrastive Masked Vim Autoencoder for 3D Multi-modal Representation Learning for AD classification(https://arxiv.org/abs/2403.16520)
Keywords: generative
Abstract: Alzheimer's disease (AD) is an incurable neurodegenerative condition leading to cognitive and functional deterioration. Given the lack of a cure, prompt and precise AD diagnosis is vital, a complex process dependent on multiple factors and multi-modal data. While successful efforts have been made to integrate multi-modal representation learning into medical datasets, scant attention has been given to 3D medical images. In this paper, we propose Contrastive Masked Vim Autoencoder (CMViM), the first efficient representation learning method tailored for 3D multi-modal data. Our proposed framework is built on a masked Vim autoencoder to learn a unified multi-modal representation and long-dependencies contained in 3D medical images. We also introduce an intra-modal contrastive learning module to enhance the capability of the multi-modal Vim encoder for modeling the discriminative features in the same modality, and an inter-modal contrastive learning module to alleviate misaligned representation among modalities. Our framework consists of two main steps: 1) incorporate the Vision Mamba (Vim) into the mask autoencoder to reconstruct 3D masked multi-modal data efficiently. 2) align the multi-modal representations with contrastive learning mechanisms from both intra-modal and inter-modal aspects. Our framework is pre-trained and validated ADNI2 dataset and validated on the downstream task for AD classification. The proposed CMViM yields 2.7\% AUC performance improvement compared with other state-of-the-art methods.

Title: An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models

Authors: Zizhao Hu, Shaochong Jia, Mohammad Rostami
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.16530
Pdf URL: https://arxiv.org/pdf/2403.16530
Copy Paste: [[2403.16530]] An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models(https://arxiv.org/abs/2403.16530)
Keywords: diffusion
Abstract: Diffusion models have been widely used for conditional data cross-modal generation tasks such as text-to-image and text-to-video. However, state-of-the-art models still fail to align the generated visual concepts with high-level semantics in a language such as object count, spatial relationship, etc. We approach this problem from a multimodal data fusion perspective and investigate how different fusion strategies can affect vision-language alignment. We discover that compared to the widely used early fusion of conditioning text in a pretrained image feature space, a specially designed intermediate fusion can: (i) boost text-to-image alignment with improved generation quality and (ii) improve training and inference efficiency by reducing low-rank text-to-image attention calculations. We perform experiments using a text-to-image generation task on the MS-COCO dataset. We compare our intermediate fusion mechanism with the classic early fusion mechanism on two common conditioning methods on a U-shaped ViT backbone. Our intermediate fusion model achieves a higher CLIP Score and lower FID, with 20% reduced FLOPs, and 50% increased training speed compared to a strong U-ViT baseline with an early fusion.

Title: SegICL: A Universal In-context Learning Framework for Enhanced Segmentation in Medical Imaging

Authors: Lingdong Shen, Fangxin Shang, Yehui Yang, Xiaoshuang Huang, Shining Xiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.16578
Pdf URL: https://arxiv.org/pdf/2403.16578
Copy Paste: [[2403.16578]] SegICL: A Universal In-context Learning Framework for Enhanced Segmentation in Medical Imaging(https://arxiv.org/abs/2403.16578)
Keywords: in-context
Abstract: Medical image segmentation models adapting to new tasks in a training-free manner through in-context learning is an exciting advancement. Universal segmentation models aim to generalize across the diverse modality of medical images, yet their effectiveness often diminishes when applied to out-of-distribution (OOD) data modalities and tasks, requiring intricate fine-tuning of model for optimal performance. For addressing this challenge, we introduce SegICL, a novel approach leveraging In-Context Learning (ICL) for image segmentation. Unlike existing methods, SegICL has the capability to employ text-guided segmentation and conduct in-context learning with a small set of image-mask pairs, eliminating the need for training the model from scratch or fine-tuning for OOD tasks (including OOD modality and dataset). Extensive experimental validation of SegICL demonstrates a positive correlation between the number of prompt samples and segmentation performance on OOD modalities and tasks. This indicates that SegICL effectively address new segmentation tasks based on contextual information. Additionally, SegICL also exhibits comparable segmentation performance to mainstream models on OOD and in-distribution tasks. Our code will be released soon.

Title: SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation

Authors: Aysim Toker, Marvin Eisenberger, Daniel Cremers, Laura Leal-Taixé
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.16605
Pdf URL: https://arxiv.org/pdf/2403.16605
Copy Paste: [[2403.16605]] SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation(https://arxiv.org/abs/2403.16605)
Keywords: diffusion, generative
Abstract: In recent years, semantic segmentation has become a pivotal tool in processing and interpreting satellite imagery. Yet, a prevalent limitation of supervised learning techniques remains the need for extensive manual annotations by experts. In this work, we explore the potential of generative image diffusion to address the scarcity of annotated data in earth observation tasks. The main idea is to learn the joint data manifold of images and labels, leveraging recent advancements in denoising diffusion probabilistic models. To the best of our knowledge, we are the first to generate both images and corresponding masks for satellite segmentation. We find that the obtained pairs not only display high quality in fine-scale features but also ensure a wide sampling diversity. Both aspects are crucial for earth observation data, where semantic classes can vary severely in scale and occurrence frequency. We employ the novel data instances for downstream segmentation, as a form of data augmentation. In our experiments, we provide comparisons to prior works based on discriminative diffusion models or GANs. We demonstrate that integrating generated samples yields significant quantitative improvements for satellite semantic segmentation -- both compared to baselines and when training only on the original data.

Title: SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions

Authors: Yuda Song, Zehao Sun, Xuanwu Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.16627
Pdf URL: https://arxiv.org/pdf/2403.16627
Copy Paste: [[2403.16627]] SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions(https://arxiv.org/abs/2403.16627)
Keywords: diffusion
Abstract: Recent advancements in diffusion models have positioned them at the forefront of image generation. Despite their superior performance, diffusion models are not without drawbacks; they are characterized by complex architectures and substantial computational demands, resulting in significant latency due to their iterative sampling process. To mitigate these limitations, we introduce a dual approach involving model miniaturization and a reduction in sampling steps, aimed at significantly decreasing model latency. Our methodology leverages knowledge distillation to streamline the U-Net and image decoder architectures, and introduces an innovative one-step DM training technique that utilizes feature matching and score distillation. We present two models, SDXS-512 and SDXS-1024, achieving inference speeds of approximately 100 FPS (30x faster than SD v1.5) and 30 FP (60x faster than SDXL) on a single GPU, respectively. Moreover, our training approach offers promising applications in image-conditioned control, facilitating efficient image-to-image translation.

Title: AI-Generated Video Detection via Spatio-Temporal Anomaly Learning

Authors: Jianfa Bai, Man Lin, Gang Cao
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2403.16638
Pdf URL: https://arxiv.org/pdf/2403.16638
Copy Paste: [[2403.16638]] AI-Generated Video Detection via Spatio-Temporal Anomaly Learning(https://arxiv.org/abs/2403.16638)
Keywords: anomaly
Abstract: The advancement of generation models has led to the emergence of highly realistic artificial intelligence (AI)-generated videos. Malicious users can easily create non-existent videos to spread false information. This letter proposes an effective AI-generated video detection (AIGVDet) scheme by capturing the forensic traces with a two-branch spatio-temporal convolutional neural network (CNN). Specifically, two ResNet sub-detectors are learned separately for identifying the anomalies in spatical and optical flow domains, respectively. Results of such sub-detectors are fused to further enhance the discrimination ability. A large-scale generated video dataset (GVD) is constructed as a benchmark for model training and evaluation. Extensive experimental results verify the high generalization and robustness of our AIGVDet scheme. Code and dataset will be available at https://github.com/multimediaFor/AIGVDet.

Title: Graph Augmentation for Recommendation

Authors: Qianru Zhang, Lianghao Xia, Xuheng Cai, Siuming Yiu, Chao Huang, Christian S. Jensen
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2403.16656
Pdf URL: https://arxiv.org/pdf/2403.16656
Copy Paste: [[2403.16656]] Graph Augmentation for Recommendation(https://arxiv.org/abs/2403.16656)
Keywords: self-supervised
Abstract: Graph augmentation with contrastive learning has gained significant attention in the field of recommendation systems due to its ability to learn expressive user representations, even when labeled data is limited. However, directly applying existing GCL models to real-world recommendation environments poses challenges. There are two primary issues to address. Firstly, the lack of consideration for data noise in contrastive learning can result in noisy self-supervised signals, leading to degraded performance. Secondly, many existing GCL approaches rely on graph neural network (GNN) architectures, which can suffer from over-smoothing problems due to non-adaptive message passing. To address these challenges, we propose a principled framework called GraphAug. This framework introduces a robust data augmentor that generates denoised self-supervised signals, enhancing recommender systems. The GraphAug framework incorporates a graph information bottleneck (GIB)-regularized augmentation paradigm, which automatically distills informative self-supervision information and adaptively adjusts contrastive view generation. Through rigorous experimentation on real-world datasets, we thoroughly assessed the performance of our novel GraphAug model. The outcomes consistently unveil its superiority over existing baseline methods. The source code for our model is publicly available at: https://github.com/HKUDS/GraphAug.

Title: Iso-Diffusion: Improving Diffusion Probabilistic Models Using the Isotropy of the Additive Gaussian Noise

Authors: Dilum Fernando, Dhananjaya jayasundara, Roshan Godaliyadda, Chaminda Bandara, Parakrama Ekanayake, Vijitha Herath
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.16790
Pdf URL: https://arxiv.org/pdf/2403.16790
Copy Paste: [[2403.16790]] Iso-Diffusion: Improving Diffusion Probabilistic Models Using the Isotropy of the Additive Gaussian Noise(https://arxiv.org/abs/2403.16790)
Keywords: diffusion, generative
Abstract: Denoising Diffusion Probabilistic Models (DDPMs) have accomplished much in the realm of generative AI. Despite their high performance, there is room for improvement, especially in terms of sample fidelity by utilizing statistical properties that impose structural integrity, such as isotropy. Minimizing the mean squared error between the additive and predicted noise alone does not impose constraints on the predicted noise to be isotropic. Thus, we were motivated to utilize the isotropy of the additive noise as a constraint on the objective function to enhance the fidelity of DDPMs. Our approach is simple and can be applied to any DDPM variant. We validate our approach by presenting experiments conducted on four synthetic 2D datasets as well as on unconditional image generation. As demonstrated by the results, the incorporation of this constraint improves the fidelity metrics, Precision and Density for the 2D datasets as well as for the unconditional image generation.

Title: Multiple-Source Localization from a Single-Snapshot Observation Using Graph Bayesian Optimization

Authors: Zonghan Zhang, Zijian Zhang, Zhiqian Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.16818
Pdf URL: https://arxiv.org/pdf/2403.16818
Copy Paste: [[2403.16818]] Multiple-Source Localization from a Single-Snapshot Observation Using Graph Bayesian Optimization(https://arxiv.org/abs/2403.16818)
Keywords: diffusion
Abstract: Due to the significance of its various applications, source localization has garnered considerable attention as one of the most important means to confront diffusion hazards. Multi-source localization from a single-snapshot observation is especially relevant due to its prevalence. However, the inherent complexities of this problem, such as limited information, interactions among sources, and dependence on diffusion models, pose challenges to resolution. Current methods typically utilize heuristics and greedy selection, and they are usually bonded with one diffusion model. Consequently, their effectiveness is constrained. To address these limitations, we propose a simulation-based method termed BOSouL. Bayesian optimization (BO) is adopted to approximate the results for its sample efficiency. A surrogate function models uncertainty from the limited information. It takes sets of nodes as the input instead of individual nodes. BOSouL can incorporate any diffusion model in the data acquisition process through simulations. Empirical studies demonstrate that its performance is robust across graph structures and diffusion models. The code is available at https://github.com/XGraph-Team/BOSouL.

Title: Convergence of a model-free entropy-regularized inverse reinforcement learning algorithm

Authors: Titouan Renard, Andreas Schlaginhaufen, Tingting Ni, Maryam Kamgarpour
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.16829
Pdf URL: https://arxiv.org/pdf/2403.16829
Copy Paste: [[2403.16829]] Convergence of a model-free entropy-regularized inverse reinforcement learning algorithm(https://arxiv.org/abs/2403.16829)
Keywords: generative
Abstract: Given a dataset of expert demonstrations, inverse reinforcement learning (IRL) aims to recover a reward for which the expert is optimal. This work proposes a model-free algorithm to solve entropy-regularized IRL problem. In particular, we employ a stochastic gradient descent update for the reward and a stochastic soft policy iteration update for the policy. Assuming access to a generative model, we prove that our algorithm is guaranteed to recover a reward for which the expert is $\varepsilon$-optimal using $\mathcal{O}(1/\varepsilon^{2})$ samples of the Markov decision process (MDP). Furthermore, with $\mathcal{O}(1/\varepsilon^{4})$ samples we prove that the optimal policy corresponding to the recovered reward is $\varepsilon$-close to the expert policy in total variation distance.

Title: Multiple Object Tracking as ID Prediction

Authors: Ruopeng Gao, Yijun Zhang, Limin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.16848
Pdf URL: https://arxiv.org/pdf/2403.16848
Copy Paste: [[2403.16848]] Multiple Object Tracking as ID Prediction(https://arxiv.org/abs/2403.16848)
Keywords: in-context
Abstract: In Multiple Object Tracking (MOT), tracking-by-detection methods have stood the test for a long time, which split the process into two parts according to the definition: object detection and association. They leverage robust single-frame detectors and treat object association as a post-processing step through hand-crafted heuristic algorithms and surrogate tasks. However, the nature of heuristic techniques prevents end-to-end exploitation of training data, leading to increasingly cumbersome and challenging manual modification while facing complicated or novel scenarios. In this paper, we regard this object association task as an End-to-End in-context ID prediction problem and propose a streamlined baseline called MOTIP. Specifically, we form the target embeddings into historical trajectory information while considering the corresponding IDs as in-context prompts, then directly predict the ID labels for the objects in the current frame. Thanks to this end-to-end process, MOTIP can learn tracking capabilities straight from training data, freeing itself from burdensome hand-crafted algorithms. Without bells and whistles, our method achieves impressive state-of-the-art performance in complex scenarios like DanceTrack and SportsMOT, and it performs competitively with other transformer-based methods on MOT17. We believe that MOTIP demonstrates remarkable potential and can serve as a starting point for future research. The code is available at https://github.com/MCG-NJU/MOTIP.

Title: Encoding of lexical tone in self-supervised models of spoken language

Authors: Gaofei Shen, Michaela Watkins, Afra Alishahi, Arianna Bisazza, Grzegorz Chrupała
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2403.16865
Pdf URL: https://arxiv.org/pdf/2403.16865
Copy Paste: [[2403.16865]] Encoding of lexical tone in self-supervised models of spoken language(https://arxiv.org/abs/2403.16865)
Keywords: self-supervised
Abstract: Interpretability research has shown that self-supervised Spoken Language Models (SLMs) encode a wide variety of features in human speech from the acoustic, phonetic, phonological, syntactic and semantic levels, to speaker characteristics. The bulk of prior research on representations of phonology has focused on segmental features such as phonemes; the encoding of suprasegmental phonology (such as tone and stress patterns) in SLMs is not yet well understood. Tone is a suprasegmental feature that is present in more than half of the world's languages. This paper aims to analyze the tone encoding capabilities of SLMs, using Mandarin and Vietnamese as case studies. We show that SLMs encode lexical tone to a significant degree even when they are trained on data from non-tonal languages. We further find that SLMs behave similarly to native and non-native human participants in tone and consonant perception studies, but they do not follow the same developmental trajectory.

Title: Discrete Latent Graph Generative Modeling with Diffusion Bridges

Authors: Van Khoa Nguyen, Yoann Boget, Frantzeska Lavda, Alexandros Kalousis
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2403.16883
Pdf URL: https://arxiv.org/pdf/2403.16883
Copy Paste: [[2403.16883]] Discrete Latent Graph Generative Modeling with Diffusion Bridges(https://arxiv.org/abs/2403.16883)
Keywords: diffusion, generative
Abstract: Learning graph generative models over latent spaces has received less attention compared to models that operate on the original data space and has so far demonstrated lacklustre performance. We present GLAD a latent space graph generative model. Unlike most previous latent space graph generative models, GLAD operates on a discrete latent space that preserves to a significant extent the discrete nature of the graph structures making no unnatural assumptions such as latent space continuity. We learn the prior of our discrete latent space by adapting diffusion bridges to its structure. By operating over an appropriately constructed latent space we avoid relying on decompositions that are often used in models that operate in the original data space. We present experiments on a series of graph benchmark datasets which clearly show the superiority of the discrete latent space and obtain state of the art graph generative performance, making GLAD the first latent space graph generative model with competitive performance. Our source code is published at: \url{https://github.com/v18nguye/GLAD}.

Title: FLIGAN: Enhancing Federated Learning with Incomplete Data using GAN

Authors: Paul Joe Maliakel, Shashikant Ilager, Ivona Brandic
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.16930
Pdf URL: https://arxiv.org/pdf/2403.16930
Copy Paste: [[2403.16930]] FLIGAN: Enhancing Federated Learning with Incomplete Data using GAN(https://arxiv.org/abs/2403.16930)
Keywords: generative
Abstract: Federated Learning (FL) provides a privacy-preserving mechanism for distributed training of machine learning models on networked devices (e.g., mobile devices, IoT edge nodes). It enables Artificial Intelligence (AI) at the edge by creating models without sharing the actual data across the network. Existing research works typically focus on generic aspects of non-IID data and heterogeneity in client's system characteristics, but they often neglect the issue of insufficient data for model development, which can arise from uneven class label distribution and highly variable data volumes across edge nodes. In this work, we propose FLIGAN, a novel approach to address the issue of data incompleteness in FL. First, we leverage Generative Adversarial Networks (GANs) to adeptly capture complex data distributions and generate synthetic data that closely resemble the real-world data. Then, we use synthetic data to enhance the robustness and completeness of datasets across nodes. Our methodology adheres to FL's privacy requirements by generating synthetic data in a federated manner without sharing the actual data in the process. We incorporate techniques such as classwise sampling and node grouping, designed to improve the federated GAN's performance, enabling the creation of high-quality synthetic datasets and facilitating efficient FL training. Empirical results from our experiments demonstrate that FLIGAN significantly improves the model accuracy, especially in scenarios with high class imbalances, achieving up to a 20% increase in model accuracy over traditional FL baselines.

Title: SPACE-IDEAS: A Dataset for Salient Information Detection in Space Innovation

Authors: Andrés García-Silva, Cristian Berrío, José Manuel Gómez-Pérez
Subjects: cs.CL, cs.AI, cs.DL
Abstract URL: https://arxiv.org/abs/2403.16941
Pdf URL: https://arxiv.org/pdf/2403.16941
Copy Paste: [[2403.16941]] SPACE-IDEAS: A Dataset for Salient Information Detection in Space Innovation(https://arxiv.org/abs/2403.16941)
Keywords: generative
Abstract: Detecting salient parts in text using natural language processing has been widely used to mitigate the effects of information overflow. Nevertheless, most of the datasets available for this task are derived mainly from academic publications. We introduce SPACE-IDEAS, a dataset for salient information detection from innovation ideas related to the Space domain. The text in SPACE-IDEAS varies greatly and includes informal, technical, academic and business-oriented writing styles. In addition to a manually annotated dataset we release an extended version that is annotated using a large generative language model. We train different sentence and sequential sentence classifiers, and show that the automatically annotated dataset can be leveraged using multitask learning to train better classifiers.

Title: Isolated Diffusion: Optimizing Multi-Concept Text-to-Image Generation Training-Freely with Isolated Diffusion Guidance

Authors: Jingyuan Zhu, Huimin Ma, Jiansheng Chen, Jian Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.16954
Pdf URL: https://arxiv.org/pdf/2403.16954
Copy Paste: [[2403.16954]] Isolated Diffusion: Optimizing Multi-Concept Text-to-Image Generation Training-Freely with Isolated Diffusion Guidance(https://arxiv.org/abs/2403.16954)
Keywords: diffusion
Abstract: Large-scale text-to-image diffusion models have achieved great success in synthesizing high-quality and diverse images given target text prompts. Despite the revolutionary image generation ability, current state-of-the-art models still struggle to deal with multi-concept generation accurately in many cases. This phenomenon is known as ``concept bleeding" and displays as the unexpected overlapping or merging of various concepts. This paper presents a general approach for text-to-image diffusion models to address the mutual interference between different subjects and their attachments in complex scenes, pursuing better text-image consistency. The core idea is to isolate the synthesizing processes of different concepts. We propose to bind each attachment to corresponding subjects separately with split text prompts. Besides, we introduce a revision method to fix the concept bleeding problem in multi-subject synthesis. We first depend on pre-trained object detection and segmentation models to obtain the layouts of subjects. Then we isolate and resynthesize each subject individually with corresponding text prompts to avoid mutual interference. Overall, we achieve a training-free strategy, named Isolated Diffusion, to optimize multi-concept text-to-image synthesis. It is compatible with the latest Stable Diffusion XL (SDXL) and prior Stable Diffusion (SD) models. We compare our approach with alternative methods using a variety of multi-concept text prompts and demonstrate its effectiveness with clear advantages in text-image consistency and user study.

Title: Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation

Authors: Omer Dahary, Or Patashnik, Kfir Aberman, Daniel Cohen-Or
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2403.16990
Pdf URL: https://arxiv.org/pdf/2403.16990
Copy Paste: [[2403.16990]] Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation(https://arxiv.org/abs/2403.16990)
Keywords: diffusion
Abstract: Text-to-image diffusion models have an unprecedented ability to generate diverse and high-quality images. However, they often struggle to faithfully capture the intended semantics of complex input prompts that include multiple subjects. Recently, numerous layout-to-image extensions have been introduced to improve user control, aiming to localize subjects represented by specific tokens. Yet, these methods often produce semantically inaccurate images, especially when dealing with multiple semantically or visually similar subjects. In this work, we study and analyze the causes of these limitations. Our exploration reveals that the primary issue stems from inadvertent semantic leakage between subjects in the denoising process. This leakage is attributed to the diffusion model's attention layers, which tend to blend the visual features of different subjects. To address these issues, we introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process. Bounded Attention prevents detrimental leakage among subjects and enables guiding the generation to promote each subject's individuality, even with complex multi-subject conditioning. Through extensive experimentation, we demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.

Title: Comp4D: LLM-Guided Compositional 4D Scene Generation

Authors: Dejia Xu, Hanwen Liang, Neel P. Bhatt, Hezhen Hu, Hanxue Liang, Konstantinos N. Plataniotis, Zhangyang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.16993
Pdf URL: https://arxiv.org/pdf/2403.16993
Copy Paste: [[2403.16993]] Comp4D: LLM-Guided Compositional 4D Scene Generation(https://arxiv.org/abs/2403.16993)
Keywords: diffusion
Abstract: Recent advancements in diffusion models for 2D and 3D content creation have sparked a surge of interest in generating 4D content. However, the scarcity of 3D scene datasets constrains current methodologies to primarily object-centric generation. To overcome this limitation, we present Comp4D, a novel framework for Compositional 4D Generation. Unlike conventional methods that generate a singular 4D representation of the entire scene, Comp4D innovatively constructs each 4D object within the scene separately. Utilizing Large Language Models (LLMs), the framework begins by decomposing an input text prompt into distinct entities and maps out their trajectories. It then constructs the compositional 4D scene by accurately positioning these objects along their designated paths. To refine the scene, our method employs a compositional score distillation technique guided by the pre-defined trajectories, utilizing pre-trained diffusion models across text-to-image, text-to-video, and text-to-3D domains. Extensive experiments demonstrate our outstanding 4D content creation capability compared to prior arts, showcasing superior visual quality, motion fidelity, and enhanced object interactions.

Title: Language Rectified Flow: Advancing Diffusion Language Generation with Probabilistic Flows

Authors: Shujian Zhang, Lemeng Wu, Chengyue Gong, Xingchao Liu
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2403.16995
Pdf URL: https://arxiv.org/pdf/2403.16995
Copy Paste: [[2403.16995]] Language Rectified Flow: Advancing Diffusion Language Generation with Probabilistic Flows(https://arxiv.org/abs/2403.16995)
Keywords: diffusion, generative
Abstract: Recent works have demonstrated success in controlling sentence attributes ($e.g.$, sentiment) and structure ($e.g.$, syntactic structure) based on the diffusion language model. A key component that drives theimpressive performance for generating high-quality samples from noise is iteratively denoise for thousands of steps. While beneficial, the complexity of starting from the noise and the learning steps has limited its implementation to many NLP real-world applications. This paper proposes Language Rectified Flow ({\ours}). Our method is based on the reformulation of the standard probabilistic flow models. Language rectified flow learns (neural) ordinary differential equation models to transport between the source distribution and the target distribution, hence providing a unified and effective solution to generative modeling and domain transfer. From the source distribution, our language rectified flow yields fast simulation and effectively decreases the inference time. Experiments on three challenging fine-grained control tasks and multiple high-quality text editing show that our method consistently outperforms its baselines. Extensive experiments and ablation studies demonstrate that our method can be general, effective, and beneficial for many NLP tasks.

Title: Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution

Authors: Zhikai Chen, Fuchen Long, Zhaofan Qiu, Ting Yao, Wengang Zhou, Jiebo Luo, Tao Mei
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2403.17000
Pdf URL: https://arxiv.org/pdf/2403.17000
Copy Paste: [[2403.17000]] Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution(https://arxiv.org/abs/2403.17000)
Keywords: diffusion
Abstract: Diffusion models are just at a tipping point for image super-resolution task. Nevertheless, it is not trivial to capitalize on diffusion models for video super-resolution which necessitates not only the preservation of visual appearance from low-resolution to high-resolution videos, but also the temporal consistency across video frames. In this paper, we propose a novel approach, pursuing Spatial Adaptation and Temporal Coherence (SATeCo), for video super-resolution. SATeCo pivots on learning spatial-temporal guidance from low-resolution videos to calibrate both latent-space high-resolution video denoising and pixel-space video reconstruction. Technically, SATeCo freezes all the parameters of the pre-trained UNet and VAE, and only optimizes two deliberately-designed spatial feature adaptation (SFA) and temporal feature alignment (TFA) modules, in the decoder of UNet and VAE. SFA modulates frame features via adaptively estimating affine parameters for each pixel, guaranteeing pixel-wise guidance for high-resolution frame synthesis. TFA delves into feature interaction within a 3D local window (tubelet) through self-attention, and executes cross-attention between tubelet and its low-resolution counterpart to guide temporal feature alignment. Extensive experiments conducted on the REDS4 and Vid4 datasets demonstrate the effectiveness of our approach.

Title: VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation

Authors: Yang Chen, Yingwei Pan, Haibo Yang, Ting Yao, Tao Mei
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2403.17001
Pdf URL: https://arxiv.org/pdf/2403.17001
Copy Paste: [[2403.17001]] VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation(https://arxiv.org/abs/2403.17001)
Keywords: diffusion
Abstract: Recent innovations on text-to-3D generation have featured Score Distillation Sampling (SDS), which enables the zero-shot learning of implicit 3D models (NeRF) by directly distilling prior knowledge from 2D diffusion models. However, current SDS-based models still struggle with intricate text prompts and commonly result in distorted 3D models with unrealistic textures or cross-view inconsistency issues. In this work, we introduce a novel Visual Prompt-guided text-to-3D diffusion model (VP3D) that explicitly unleashes the visual appearance knowledge in 2D visual prompt to boost text-to-3D generation. Instead of solely supervising SDS with text prompt, VP3D first capitalizes on 2D diffusion model to generate a high-quality image from input text, which subsequently acts as visual prompt to strengthen SDS optimization with explicit visual appearance. Meanwhile, we couple the SDS optimization with additional differentiable reward function that encourages rendering images of 3D models to better visually align with 2D visual prompt and semantically match with text prompt. Through extensive experiments, we show that the 2D Visual Prompt in our VP3D significantly eases the learning of visual appearance of 3D models and thus leads to higher visual fidelity with more detailed textures. It is also appealing in view that when replacing the self-generating visual prompt with a given reference image, VP3D is able to trigger a new task of stylized text-to-3D generation. Our project page is available at https://vp3d-cvpr24.github.io.

Title: SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer

Authors: Rui Zhu, Yingwei Pan, Yehao Li, Ting Yao, Zhenglong Sun, Tao Mei, Chang Wen Chen
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2403.17004
Pdf URL: https://arxiv.org/pdf/2403.17004
Copy Paste: [[2403.17004]] SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer(https://arxiv.org/abs/2403.17004)
Keywords: diffusion, self-supervised, generative
Abstract: Diffusion Transformer (DiT) has emerged as the new trend of generative diffusion models on image generation. In view of extremely slow convergence in typical DiT, recent breakthroughs have been driven by mask strategy that significantly improves the training efficiency of DiT with additional intra-image contextual learning. Despite this progress, mask strategy still suffers from two inherent limitations: (a) training-inference discrepancy and (b) fuzzy relations between mask reconstruction & generative diffusion process, resulting in sub-optimal training of DiT. In this work, we address these limitations by novelly unleashing the self-supervised discrimination knowledge to boost DiT training. Technically, we frame our DiT in a teacher-student manner. The teacher-student discriminative pairs are built on the diffusion noises along the same Probability Flow Ordinary Differential Equation (PF-ODE). Instead of applying mask reconstruction loss over both DiT encoder and decoder, we decouple DiT encoder and decoder to separately tackle discriminative and generative objectives. In particular, by encoding discriminative pairs with student and teacher DiT encoders, a new discriminative loss is designed to encourage the inter-image alignment in the self-supervised embedding space. After that, student samples are fed into student DiT decoder to perform the typical generative diffusion task. Extensive experiments are conducted on ImageNet dataset, and our method achieves a competitive balance between training cost and generative capacity.

Title: TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models

Authors: Zhongwei Zhang, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Ting Yao, Yang Cao, Tao Mei
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2403.17005
Pdf URL: https://arxiv.org/pdf/2403.17005
Copy Paste: [[2403.17005]] TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models(https://arxiv.org/abs/2403.17005)
Keywords: diffusion
Abstract: Recent advances in text-to-video generation have demonstrated the utility of powerful diffusion models. Nevertheless, the problem is not trivial when shaping diffusion models to animate static image (i.e., image-to-video generation). The difficulty originates from the aspect that the diffusion process of subsequent animated frames should not only preserve the faithful alignment with the given image but also pursue temporal coherence among adjacent frames. To alleviate this, we present TRIP, a new recipe of image-to-video diffusion paradigm that pivots on image noise prior derived from static image to jointly trigger inter-frame relational reasoning and ease the coherent temporal modeling via temporal residual learning. Technically, the image noise prior is first attained through one-step backward diffusion process based on both static image and noised video latent codes. Next, TRIP executes a residual-like dual-path scheme for noise prediction: 1) a shortcut path that directly takes image noise prior as the reference noise of each frame to amplify the alignment between the first frame and subsequent frames; 2) a residual path that employs 3D-UNet over noised video and static image latent codes to enable inter-frame relational reasoning, thereby easing the learning of the residual noise for each frame. Furthermore, both reference and residual noise of each frame are dynamically merged via attention mechanism for final video generation. Extensive experiments on WebVid-10M, DTDB and MSR-VTT datasets demonstrate the effectiveness of our TRIP for image-to-video generation. Please see our project page at https://trip-i2v.github.io/TRIP/.

Title: Invertible Diffusion Models for Compressed Sensing

Authors: Bin Chen, Zhenyu Zhang, Weiqi Li, Chen Zhao, Jiwen Yu, Shijie Zhao, Jie Chen, Jian Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.17006
Pdf URL: https://arxiv.org/pdf/2403.17006
Copy Paste: [[2403.17006]] Invertible Diffusion Models for Compressed Sensing(https://arxiv.org/abs/2403.17006)
Keywords: diffusion
Abstract: While deep neural networks (NN) significantly advance image compressed sensing (CS) by improving reconstruction quality, the necessity of training current CS NNs from scratch constrains their effectiveness and hampers rapid deployment. Although recent methods utilize pre-trained diffusion models for image reconstruction, they struggle with slow inference and restricted adaptability to CS. To tackle these challenges, this paper proposes Invertible Diffusion Models (IDM), a novel efficient, end-to-end diffusion-based CS method. IDM repurposes a large-scale diffusion sampling process as a reconstruction model, and finetunes it end-to-end to recover original images directly from CS measurements, moving beyond the traditional paradigm of one-step noise estimation learning. To enable such memory-intensive end-to-end finetuning, we propose a novel two-level invertible design to transform both (1) the multi-step sampling process and (2) the noise estimation U-Net in each step into invertible networks. As a result, most intermediate features are cleared during training to reduce up to 93.8% GPU memory. In addition, we develop a set of lightweight modules to inject measurements into noise estimator to further facilitate reconstruction. Experiments demonstrate that IDM outperforms existing state-of-the-art CS networks by up to 2.64dB in PSNR. Compared to the recent diffusion model-based approach DDNM, our IDM achieves up to 10.09dB PSNR gain and 14.54 times faster inference.

Title: DreamLIP: Language-Image Pre-training with Long Captions

Authors: Kecheng Zheng, Yifei Zhang, Wei Wu, Fan Lu, Shuailei Ma, Xin Jin, Wei Chen, Yujun Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.17007
Pdf URL: https://arxiv.org/pdf/2403.17007
Copy Paste: [[2403.17007]] DreamLIP: Language-Image Pre-training with Long Captions(https://arxiv.org/abs/2403.17007)
Keywords: self-supervised
Abstract: Language-image pre-training largely relies on how precisely and thoroughly a text describes its paired image. In practice, however, the contents of an image can be so rich that well describing them requires lengthy captions (e.g., with 10 sentences), which are usually missing in existing datasets. Consequently, there are currently no clear evidences on whether and how language-image pre-training could benefit from long captions. To figure this out, we first re-caption 30M images with detailed descriptions using a pre-trained Multi-modality Large Language Model (MLLM), and then study the usage of the resulting captions under a contrastive learning framework. We observe that, each sentence within a long caption is very likely to describe the image partially (e.g., an object). Motivated by this, we propose to dynamically sample sub-captions from the text label to construct multiple positive pairs, and introduce a grouping loss to match the embeddings of each sub-caption with its corresponding local image patches in a self-supervised manner. Experimental results on a wide rage of downstream tasks demonstrate the consistent superiority of our method, termed DreamLIP, over previous alternatives, highlighting its fine-grained representational capacity. It is noteworthy that, on the tasks of image-text retrieval and semantic segmentation, our model trained with 30M image-text pairs achieves on par or even better performance than CLIP trained with 400M pairs. Project page is available at https://zyf0619sjtu.github.io/dream-lip.