2024-04-26

Title: A Survey on Generative AI and LLM for Video Generation, Understanding, and Streaming

Authors: Pengyuan Zhou, Lin Wang, Zhi Liu, Yanbin Hao, Pan Hui, Sasu Tarkoma, Jussi Kangasharju
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2404.16038
Pdf URL: https://arxiv.org/pdf/2404.16038
Copy Paste: [[2404.16038]] A Survey on Generative AI and LLM for Video Generation, Understanding, and Streaming(https://arxiv.org/abs/2404.16038)
Keywords: generative
Abstract: This paper offers an insightful examination of how currently top-trending AI technologies, i.e., generative artificial intelligence (Generative AI) and large language models (LLMs), are reshaping the field of video technology, including video generation, understanding, and streaming. It highlights the innovative use of these technologies in producing highly realistic videos, a significant leap in bridging the gap between real-world dynamics and digital creation. The study also delves into the advanced capabilities of LLMs in video understanding, demonstrating their effectiveness in extracting meaningful information from visual content, thereby enhancing our interaction with videos. In the realm of video streaming, the paper discusses how LLMs contribute to more efficient and user-centric streaming experiences, adapting content delivery to individual viewer preferences. This comprehensive review navigates through the current achievements, ongoing challenges, and future possibilities of applying Generative AI and LLMs to video-related tasks, underscoring the immense potential these technologies hold for advancing the field of video technology related to multimedia, networking, and AI communities.

Title: Quantitative Characterization of Retinal Features in Translated OCTA

Authors: Rashadul Hasan Badhon, Atalie Carina Thompson, Jennifer I. Lim, Theodore Leng, Minhaj Nur Alam
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2404.16133
Pdf URL: https://arxiv.org/pdf/2404.16133
Copy Paste: [[2404.16133]] Quantitative Characterization of Retinal Features in Translated OCTA(https://arxiv.org/abs/2404.16133)
Keywords: generative
Abstract: Purpose: This study explores the feasibility of using generative machine learning (ML) to translate Optical Coherence Tomography (OCT) images into Optical Coherence Tomography Angiography (OCTA) images, potentially bypassing the need for specialized OCTA hardware. Methods: The method involved implementing a generative adversarial network framework that includes a 2D vascular segmentation model and a 2D OCTA image translation model. The study utilizes a public dataset of 500 patients, divided into subsets based on resolution and disease status, to validate the quality of the translated OCTA (TR-OCTA) images. The validation employs several quality and quantitative metrics to compare the translated images with ground truth OCTAs (GT-OCTA). We then quantitatively characterize vascular features generated in TR-OCTAs with GT-OCTAs to assess the feasibility of using TR-OCTA for objective disease diagnosis. Result: TR-OCTAs showed high image quality in both 3 and 6 mm datasets (high-resolution, moderate structural similarity and contrast quality compared to GT-OCTAs). There were slight discrepancies in vascular metrics, especially in diseased patients. Blood vessel features like tortuosity and vessel perimeter index showed a better trend compared to density features, which are affected by local vascular distortions. Conclusion: This study presents a promising solution to the limitations of OCTA adoption in clinical practice by using vascular features from TR-OCTA for disease detection. Translation relevance: This study has the potential to significantly enhance the diagnostic process for retinal diseases by making detailed vascular imaging more widely available and reducing dependency on costly OCTA equipment.

Title: Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall

Authors: Jiaqing Yuan, Lin Pan, Chung-Wei Hang, Jiang Guo, Jiarong Jiang, Bonan Min, Patrick Ng, Zhiguo Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2404.16164
Pdf URL: https://arxiv.org/pdf/2404.16164
Copy Paste: [[2404.16164]] Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall(https://arxiv.org/abs/2404.16164)
Keywords: in-context
Abstract: Large language models (LLMs) have shown remarkable performance on a variety of NLP tasks, and are being rapidly adopted in a wide range of use cases. It is therefore of vital importance to holistically evaluate the factuality of their generated outputs, as hallucinations remain a challenging issue. In this work, we focus on assessing LLMs' ability to recall factual knowledge learned from pretraining, and the factors that affect this ability. To that end, we construct FACT-BENCH, a representative benchmark covering 20 domains, 134 property types, 3 answer types, and different knowledge popularity levels. We benchmark 31 models from 10 model families and provide a holistic assessment of their strengths and weaknesses. We observe that instruction-tuning hurts knowledge recall, as pretraining-only models consistently outperform their instruction-tuned counterparts, and that model scaling has positive effects, as larger models outperform smaller ones for all model families. However, the best performance from GPT-4 still leaves a large gap to the upper bound. We additionally study the role of in-context exemplars using counterfactual demonstrations, which lead to significant degradation of factual knowledge recall for large models. By further decoupling model known and unknown knowledge, we find the degradation is attributed to exemplars that contradict a model's known knowledge, as well as the number of such exemplars. Lastly, we fine-tune LLaMA-7B in different settings of known and unknown knowledge. In particular, fine-tuning on a model's known knowledge is beneficial, and consistently outperforms fine-tuning on unknown and mixed knowledge. We will make our benchmark publicly available.

Title: S2DEVFMAP: Self-Supervised Learning Framework with Dual Ensemble Voting Fusion for Maximizing Anomaly Prediction in Timeseries

Authors: Sarala Naidu, Ning Xiong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2404.16179
Pdf URL: https://arxiv.org/pdf/2404.16179
Copy Paste: [[2404.16179]] S2DEVFMAP: Self-Supervised Learning Framework with Dual Ensemble Voting Fusion for Maximizing Anomaly Prediction in Timeseries(https://arxiv.org/abs/2404.16179)
Keywords: self-supervised, anomaly
Abstract: Anomaly detection plays a crucial role in industrial settings, particularly in maintaining the reliability and optimal performance of cooling systems. Traditional anomaly detection methods often face challenges in handling diverse data characteristics and variations in noise levels, resulting in limited effectiveness. Moreover, traditional anomaly detection often relies on a single model. This work proposes a novel, robust approach using five heterogeneous independent models combined with a dual ensemble fusion of voting techniques. Diverse models capture various system behaviors, while the fusion strategy maximizes detection effectiveness and minimizes false alarms. Each base autoencoder model learns a unique representation of the data, leveraging their complementary strengths to improve anomaly detection performance. To increase the effectiveness and reliability of the final anomaly prediction, a dual ensemble technique is applied. This approach excels at maximizing the coverage of identified anomalies. Experimental results on a real-world dataset of industrial cooling system data demonstrate the effectiveness of the proposed approach. This approach can be extended to other industrial applications where anomaly detection is critical for ensuring system reliability and preventing potential malfunctions.
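
Illustrative note: the abstract does not spell out how the two voting stages are fused, so the following is only a minimal sketch of one plausible reading. It assumes each of the five base models emits a per-sample anomaly score, that one stage is a hard majority vote over per-model flags, that the other is a soft vote on the mean score, and that the two are combined with a logical OR to favor coverage. The function names, thresholds, and the OR fusion rule are all hypothetical and are not taken from the paper.

```python
def hard_vote(flags):
    """Majority vote over per-model boolean anomaly flags."""
    return sum(flags) > len(flags) / 2


def soft_vote(scores, fused_threshold):
    """Flag an anomaly when the mean model score exceeds a threshold."""
    return sum(scores) / len(scores) > fused_threshold


def dual_vote(scores, model_thresholds, fused_threshold):
    """Combine hard and soft voting (assumed fusion rule: logical OR)."""
    flags = [s > t for s, t in zip(scores, model_thresholds)]
    return hard_vote(flags) or soft_vote(scores, fused_threshold)


# Hypothetical scores from five base models for three samples.
thresholds = [0.5] * 5
majority_case = dual_vote([0.9, 0.8, 0.7, 0.1, 0.2], thresholds, 0.6)
single_strong_case = dual_vote([0.1, 0.1, 0.1, 0.1, 0.95], thresholds, 0.25)
normal_case = dual_vote([0.1, 0.1, 0.1, 0.1, 0.1], thresholds, 0.25)
```

In this sketch the second sample is caught only by the soft vote, which shows why combining two voting rules can widen coverage; a stricter AND fusion would instead trade coverage for fewer false alarms.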

Title: ABCD: Trust enhanced Attention based Convolutional Autoencoder for Risk Assessment

Authors: Sarala Naidu, Ning Xiong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2404.16183
Pdf URL: https://arxiv.org/pdf/2404.16183
Copy Paste: [[2404.16183]] ABCD: Trust enhanced Attention based Convolutional Autoencoder for Risk Assessment(https://arxiv.org/abs/2404.16183)
Keywords: anomaly
Abstract: Anomaly detection in industrial systems is crucial for preventing equipment failures, ensuring risk identification, and maintaining overall system efficiency. Traditional monitoring methods often rely on fixed thresholds and empirical rules, which may not be sensitive enough to detect subtle changes in system health and predict impending failures. To address this limitation, this paper proposes a novel attention-based convolutional autoencoder (ABCD) for risk detection and maps the derived risk value to maintenance planning. ABCD learns the normal behavior of conductivity from historical data of a real-world industrial cooling system and reconstructs the input data, identifying anomalies that deviate from the expected patterns. The framework also employs calibration techniques to ensure the reliability of its predictions. Evaluation results demonstrate that the attention mechanism in ABCD yields a 57.4% increase in performance and a 9.37% reduction in false alarms compared with the same model without attention. The approach can effectively detect risks and map the risk priority rank to maintenance, providing valuable insights for cooling system designers and service personnel. A calibration error of 0.03% indicates that the model is well-calibrated and enhances the model's trustworthiness, enabling informed decisions about maintenance strategies.

Title: Towards Efficient Patient Recruitment for Clinical Trials: Application of a Prompt-Based Learning Model

Authors: Mojdeh Rahmanian, Seyed Mostafa Fakhrahmad, Seyedeh Zahra Mousavi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.16198
Pdf URL: https://arxiv.org/pdf/2404.16198
Copy Paste: [[2404.16198]] Towards Efficient Patient Recruitment for Clinical Trials: Application of a Prompt-Based Learning Model(https://arxiv.org/abs/2404.16198)
Keywords: generative
Abstract: Objective: Clinical trials are essential for advancing pharmaceutical interventions, but they face a bottleneck in selecting eligible participants. Although leveraging electronic health records (EHR) for recruitment has gained popularity, the complex nature of unstructured medical texts presents challenges in efficiently identifying participants. Natural Language Processing (NLP) techniques have emerged as a solution with a recent focus on transformer models. In this study, we aimed to evaluate the performance of a prompt-based large language model for the cohort selection task from unstructured medical notes collected in the EHR. Methods: To process the medical records, we selected the sentences of the records most related to the eligibility criteria needed for the trial. The SNOMED CT concepts related to each eligibility criterion were collected. Medical records were also annotated with MedCAT based on the SNOMED CT ontology. Annotated sentences including concepts matched with the criteria-relevant terms were extracted. A prompt-based large language model (Generative Pre-trained Transformer (GPT) in this study) was then used with the extracted sentences as the training set. To assess its effectiveness, we evaluated the model's performance using the dataset from the 2018 n2c2 challenge, which aimed to classify medical records of 311 patients based on 13 eligibility criteria through NLP techniques. Results: Our proposed model achieved overall micro and macro F measures of 0.9061 and 0.8060, which were among the highest scores achieved by the experiments performed with this dataset. Conclusion: The application of a prompt-based large language model in this study to classify patients based on eligibility criteria achieved promising scores. Besides, we proposed a method of extractive summarization with the aid of SNOMED CT ontology that can also be applied to other medical texts.
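
Illustrative note: the gap between the micro and macro F measures reported above is typical when some criteria are rare. The sketch below shows the standard definitions only: micro-averaging pools true positives, false positives, and false negatives across criteria before computing F1, while macro-averaging computes F1 per criterion and takes the unweighted mean. The counts are made up for illustration, and the exact averaging used by the n2c2 evaluation is not described in the abstract.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts: list of (tp, fp, fn) tuples, one per eligibility criterion."""
    # Macro: average the per-criterion F1 scores, each criterion weighted equally.
    macro = sum(f1(*c) for c in counts) / len(counts)
    # Micro: pool the counts first, so frequent criteria dominate.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro = f1(tp, fp, fn)
    return micro, macro


# Hypothetical counts: one frequent, well-classified criterion and one rare, hard one.
example_counts = [(90, 5, 5), (2, 3, 5)]
micro, macro = micro_macro_f1(example_counts)
```

With these made-up counts the rare criterion pulls the macro score well below the micro score, the same direction as the 0.9061 versus 0.8060 gap in the abstract.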

Title: An Analysis of Recent Advances in Deepfake Image Detection in an Evolving Threat Landscape

Authors: Sifat Muhammad Abdullah, Aravind Cheruvu, Shravya Kanchi, Taejoong Chung, Peng Gao, Murtuza Jadliwala, Bimal Viswanath
Subjects: cs.CR, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2404.16212
Pdf URL: https://arxiv.org/pdf/2404.16212
Copy Paste: [[2404.16212]] An Analysis of Recent Advances in Deepfake Image Detection in an Evolving Threat Landscape(https://arxiv.org/abs/2404.16212)
Keywords: foundation model, generative
Abstract: Deepfake or synthetic images produced using deep generative models pose serious risks to online platforms. This has triggered several research efforts to accurately detect deepfake images, achieving excellent performance on publicly available deepfake datasets. In this work, we study 8 state-of-the-art detectors and argue that they are far from being ready for deployment due to two recent developments. First, the emergence of lightweight methods to customize large generative models can enable an attacker to create many customized generators (to create deepfakes), thereby substantially increasing the threat surface. We show that existing defenses fail to generalize well to such user-customized generative models that are publicly available today. We discuss new machine learning approaches based on content-agnostic features, and ensemble modeling to improve generalization performance against user-customized models. Second, the emergence of vision foundation models -- machine learning models trained on broad data that can be easily adapted to several downstream tasks -- can be misused by attackers to craft adversarial deepfakes that can evade existing defenses. We propose a simple adversarial attack that leverages existing foundation models to craft adversarial samples without adding any adversarial noise, through careful semantic manipulation of the image content. We highlight the vulnerabilities of several defenses against our attack, and explore directions leveraging advanced foundation models and adversarial training to defend against this new threat.

Title: AutoGluon-Multimodal (AutoMM): Supercharging Multimodal AutoML with Foundation Models

Authors: Zhiqiang Tang, Haoyang Fang, Su Zhou, Taojiannan Yang, Zihan Zhong, Tony Hu, Katrin Kirchhoff, George Karypis
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2404.16233
Pdf URL: https://arxiv.org/pdf/2404.16233
Copy Paste: [[2404.16233]] AutoGluon-Multimodal (AutoMM): Supercharging Multimodal AutoML with Foundation Models(https://arxiv.org/abs/2404.16233)
Keywords: foundation model
Abstract: AutoGluon-Multimodal (AutoMM) is introduced as an open-source AutoML library designed specifically for multimodal learning. Distinguished by its exceptional ease of use, AutoMM enables fine-tuning of foundational models with just three lines of code. Supporting various modalities including image, text, and tabular data, both independently and in combination, the library offers a comprehensive suite of functionalities spanning classification, regression, object detection, semantic matching, and image segmentation. Experiments across diverse datasets and tasks showcase AutoMM's superior performance in basic classification and regression tasks compared to existing AutoML tools, while also demonstrating competitive results in advanced tasks, aligning with specialized toolboxes designed for such purposes.

Title: Reinforcement Learning with Generative Models for Compact Support Sets

Authors: Nico Schiavone, Xingyu Li
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2404.16300
Pdf URL: https://arxiv.org/pdf/2404.16300
Copy Paste: [[2404.16300]] Reinforcement Learning with Generative Models for Compact Support Sets(https://arxiv.org/abs/2404.16300)
Keywords: foundation model, generative
Abstract: Foundation models contain a wealth of information from their vast number of training samples. However, most prior art fails to extract this information in a precise and efficient way for small sample sizes. In this work, we propose a framework utilizing reinforcement learning as a control for foundation models, allowing for the granular generation of small, focused synthetic support sets to augment the performance of neural network models on real data classification tasks. We first allow a reinforcement learning agent access to a novel context-based dictionary; the agent then uses this dictionary with a novel prompt structure to form and optimize prompts as inputs to generative models, receiving feedback based on a reward function combining the change in validation accuracy and entropy. A support set is formed this way over several exploration steps. Our framework produced excellent results, increasing classification accuracy by significant margins for no additional labelling or data cost.

Title: CFMW: Cross-modality Fusion Mamba for Multispectral Object Detection under Adverse Weather Conditions

Authors: Haoyuan Li, Qi Hu, You Yao, Kailun Yang, Peng Chen
Subjects: cs.CV, cs.MM, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2404.16302
Pdf URL: https://arxiv.org/pdf/2404.16302
Copy Paste: [[2404.16302]] CFMW: Cross-modality Fusion Mamba for Multispectral Object Detection under Adverse Weather Conditions(https://arxiv.org/abs/2404.16302)
Keywords: diffusion
Abstract: Cross-modality images that integrate visible-infrared spectra cues can provide richer complementary information for object detection. Despite this, existing visible-infrared object detection methods degrade severely in severe weather conditions. This failure stems from the pronounced sensitivity of visible images to environmental perturbations, such as rain, haze, and snow, which frequently cause false negatives and false positives in detection. To address this issue, we introduce a novel and challenging task, termed visible-infrared object detection under adverse weather conditions. To foster this task, we have constructed a new Severe Weather Visible-Infrared Dataset (SWVID) with diverse severe weather scenes. Furthermore, we introduce the Cross-modality Fusion Mamba with Weather-removal (CFMW) to augment detection accuracy in adverse weather conditions. Thanks to the proposed Weather Removal Diffusion Model (WRDM) and Cross-modality Fusion Mamba (CFM) modules, CFMW is able to mine more essential information of pedestrian features in cross-modality fusion, and can thus transfer to other, rarer scenarios with high efficiency while remaining usable on platforms with low computing power. To the best of our knowledge, this is the first study that targets this improvement and integrates both Diffusion and Mamba modules in cross-modality object detection, successfully expanding the practical application of this type of model with its higher accuracy and more advanced architecture. Extensive experiments on both well-recognized and self-created datasets conclusively demonstrate that our CFMW achieves state-of-the-art detection performance, surpassing existing benchmarks. The dataset and source code will be made publicly available at https://github.com/lhy-zjut/CFMW.

Title: TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

Authors: Haomiao Ni, Bernhard Egger, Suhas Lohit, Anoop Cherian, Ye Wang, Toshiaki Koike-Akino, Sharon X. Huang, Tim K. Marks
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.16306
Pdf URL: https://arxiv.org/pdf/2404.16306
Copy Paste: [[2404.16306]] TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models(https://arxiv.org/abs/2404.16306)
Keywords: diffusion, foundation model, generative
Abstract: Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water."). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or introducing external modules. Our approach leverages a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image. To ensure temporal continuity, we employ a DDPM inversion strategy to initialize Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks such as video infilling and prediction when provided with more images. Its autoregressive design also supports long video generation.

Title: Semantic Segmentation Refiner for Ultrasound Applications with Zero-Shot Foundation Models

Authors: Hedda Cohen Indelman, Elay Dahan, Angeles M. Perez-Agosto, Carmit Shiran, Doron Shaked, Nati Daniel
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2404.16325
Pdf URL: https://arxiv.org/pdf/2404.16325
Copy Paste: [[2404.16325]] Semantic Segmentation Refiner for Ultrasound Applications with Zero-Shot Foundation Models(https://arxiv.org/abs/2404.16325)
Keywords: foundation model
Abstract: Despite the remarkable success of deep learning in medical imaging analysis, medical image segmentation remains challenging due to the scarcity of high-quality labeled images for supervision. Further, the significant domain gap between natural and medical images in general and ultrasound images in particular hinders fine-tuning models trained on natural images to the task at hand. In this work, we address the performance degradation of segmentation models in low-data regimes and propose a prompt-less segmentation method harnessing the ability of segmentation foundation models to segment abstract shapes. We do that via our novel prompt point generation algorithm which uses coarse semantic segmentation masks as input and a zero-shot prompt-able foundation model as an optimization target. We demonstrate our method on a segmentation findings task (pathologic anomalies) in ultrasound images. Our method's advantages are brought to light in varying degrees of low-data regime experiments on a small-scale musculoskeletal ultrasound images dataset, yielding a larger performance gain as the training set size decreases.

Title: FedStyle: Style-Based Federated Learning Crowdsourcing Framework for Art Commissions

Authors: Changjuan Ran, Yeting Guo, Fang Liu, Shenglan Cui, Yunfan Ye
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2404.16336
Pdf URL: https://arxiv.org/pdf/2404.16336
Copy Paste: [[2404.16336]] FedStyle: Style-Based Federated Learning Crowdsourcing Framework for Art Commissions(https://arxiv.org/abs/2404.16336)
Keywords: generative
Abstract: The unique artistic style is crucial to artists' occupational competitiveness, yet prevailing Art Commission Platforms rarely support style-based retrieval. Meanwhile, the fast-growing generative AI techniques aggravate artists' concerns about releasing personal artworks to public platforms. To achieve artistic style-based retrieval without exposing personal artworks, we propose FedStyle, a style-based federated learning crowdsourcing framework. It allows artists to train local style models and share model parameters rather than artworks for collaboration. However, most artists possess a unique artistic style, resulting in severe model drift among them. FedStyle addresses such extreme data heterogeneity by having artists learn their abstract style representations and align with the server, rather than merely aggregating model parameters lacking semantics. Besides, we introduce contrastive learning to meticulously construct the style representation space, pulling artworks with similar styles closer and keeping different ones apart in the embedding space. Extensive experiments on the proposed datasets demonstrate the superiority of FedStyle.

Title: Guarding Graph Neural Networks for Unsupervised Graph Anomaly Detection

Authors: Yuanchen Bei, Sheng Zhou, Jinke Shi, Yao Ma, Haishuai Wang, Jiajun Bu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2404.16366
Pdf URL: https://arxiv.org/pdf/2404.16366
Copy Paste: [[2404.16366]] Guarding Graph Neural Networks for Unsupervised Graph Anomaly Detection(https://arxiv.org/abs/2404.16366)
Keywords: anomaly
Abstract: Unsupervised graph anomaly detection aims at identifying rare patterns that deviate from the majority in a graph without the aid of labels, which is important for a variety of real-world applications. Recent advances have utilized Graph Neural Networks (GNNs) to learn effective node representations by aggregating information from neighborhoods. This is motivated by the hypothesis that nodes in the graph tend to exhibit consistent behaviors with their neighborhoods. However, such consistency can be disrupted by graph anomalies in multiple ways. Most existing methods directly employ GNNs to learn representations, disregarding the negative impact of graph anomalies on GNNs, resulting in sub-optimal node representations and anomaly detection performance. While a few recent approaches have redesigned GNNs for graph anomaly detection under semi-supervised label guidance, how to address the adverse effects of graph anomalies on GNNs in unsupervised scenarios and learn effective representations for anomaly detection remain under-explored. To bridge this gap, in this paper, we propose a simple yet effective framework for Guarding Graph Neural Networks for Unsupervised Graph Anomaly Detection (G3AD). Specifically, G3AD introduces two auxiliary networks along with correlation constraints to guard the GNNs from inconsistent information encoding. Furthermore, G3AD introduces an adaptive caching module to guard the GNNs from solely reconstructing the observed data that contains anomalies. Extensive experiments demonstrate that our proposed G3AD can outperform seventeen state-of-the-art methods on both synthetic and real-world datasets.

Title: U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF

Authors: Xingchen Song, Di Wu, Binbin Zhang, Dinghao Zhou, Zhendong Peng, Bo Dang, Fuping Pan, Chao Yang
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2404.16407
Pdf URL: https://arxiv.org/pdf/2404.16407
Copy Paste: [[2404.16407]] U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF(https://arxiv.org/abs/2404.16407)
Keywords: foundation model
Abstract: Scale has opened new frontiers in natural language processing, but at a high cost. In response, by learning to only activate a subset of parameters in training and inference, Mixture-of-Experts (MoE) models have been proposed as an energy-efficient path to even larger and more capable language models, and this shift towards a new generation of foundation models is gaining momentum, particularly within the field of Automatic Speech Recognition (ASR). Recent works that incorporate MoE into ASR models have complex designs such as routing frames via a supplementary embedding network, improving multilingual ability for the experts, and utilizing dedicated auxiliary losses for either expert load balancing or specific language handling. We found that delicate designs are not necessary, while an embarrassingly simple substitution of MoE layers for all Feed-Forward Network (FFN) layers is competent for the ASR task. To be more specific, we benchmark our proposed model on a large scale inner-source dataset (160k hours); the results show that we can scale our baseline Conformer (Dense-225M) to its MoE counterparts (MoE-1B) and achieve Dense-1B level Word Error Rate (WER) while maintaining a Dense-225M level Real Time Factor (RTF). Furthermore, by applying the Unified 2-pass framework with bidirectional attention decoders (U2++), we achieve the streaming and non-streaming decoding modes in a single MoE based model, which we call U2++ MoE. We hope that our study can facilitate the research on scaling speech foundation models without sacrificing deployment efficiency.

Title: Asking and Answering Questions to Extract Event-Argument Structures

Authors: Md Nayem Uddin, Enfa Rose George, Eduardo Blanco, Steven Corman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.16413
Pdf URL: https://arxiv.org/pdf/2404.16413
Copy Paste: [[2404.16413]] Asking and Answering Questions to Extract Event-Argument Structures(https://arxiv.org/abs/2404.16413)
Keywords: generative
Abstract: This paper presents a question-answering approach to extract document-level event-argument structures. We automatically ask and answer questions for each argument type an event may have. Questions are generated using manually defined templates and generative transformers. Template-based questions are generated using predefined role-specific wh-words and event triggers from the context document. Transformer-based questions are generated using large language models trained to formulate questions based on a passage and the expected answer. Additionally, we develop novel data augmentation strategies specialized in inter-sentential event-argument relations. We use a simple span-swapping technique, coreference resolution, and large language models to augment the training instances. Our approach enables transfer learning without any corpora-specific modifications and yields competitive results on the RAMS dataset. It outperforms previous work, and it is especially beneficial to extract arguments that appear in different sentences than the event trigger. We also present detailed quantitative and qualitative analyses shedding light on the most common errors made by our best model.

Title: SynCellFactory: Generative Data Augmentation for Cell Tracking

Authors: Moritz Sturm, Lorenzo Cerrone, Fred A. Hamprecht
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.16421
Pdf URL: https://arxiv.org/pdf/2404.16421
Copy Paste: [[2404.16421]] SynCellFactory: Generative Data Augmentation for Cell Tracking(https://arxiv.org/abs/2404.16421)
Keywords: generative
Abstract: Cell tracking remains a pivotal yet challenging task in biomedical research. The full potential of deep learning for this purpose is often untapped due to the limited availability of comprehensive and varied training data sets. In this paper, we present SynCellFactory, a generative method for cell video augmentation. At the heart of SynCellFactory lies the ControlNet architecture, which has been fine-tuned to synthesize cell imagery with photorealistic accuracy in style and motion patterns. This technique enables the creation of synthetic yet realistic cell videos that mirror the complexity of authentic microscopy time-lapses. Our experiments demonstrate that SynCellFactory boosts the performance of well-established deep learning models for cell tracking, particularly when original training data is sparse.

Title: Point-JEPA: A Joint Embedding Predictive Architecture for Self-Supervised Learning on Point Cloud

Authors: Ayumu Saito, Jiju Poovvancheri
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.16432
Pdf URL: https://arxiv.org/pdf/2404.16432
Copy Paste: [[2404.16432]] Point-JEPA: A Joint Embedding Predictive Architecture for Self-Supervised Learning on Point Cloud(https://arxiv.org/abs/2404.16432)
Keywords: self-supervised
Abstract: Recent advancements in self-supervised learning in the point cloud domain have demonstrated significant potential. However, these methods often suffer from drawbacks, including lengthy pre-training time, the necessity of reconstruction in the input space, or the necessity of additional modalities. In order to address these issues, we introduce Point-JEPA, a joint embedding predictive architecture designed specifically for point cloud data. To this end, we introduce a sequencer that orders point cloud tokens to efficiently compute and utilize token proximity based on their indices during target and context selection. The sequencer also allows shared computation of token proximity between context and target selection, further improving efficiency. Experimentally, our method achieves competitive results with state-of-the-art methods while avoiding reconstruction in the input space and additional modalities.

Title: DiffSeg: A Segmentation Model for Skin Lesions Based on Diffusion Difference

Authors: Zhihao Shuai, Yinan Chen, Shunqiang Mao, Yihan Zho, Xiaohong Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2404.16474
Pdf URL: https://arxiv.org/pdf/2404.16474
Copy Paste: [[2404.16474]] DiffSeg: A Segmentation Model for Skin Lesions Based on Diffusion Difference(https://arxiv.org/abs/2404.16474)
Keywords: diffusion, generative
Abstract: Weakly supervised medical image segmentation (MIS) using generative models is crucial for clinical diagnosis. However, the accuracy of the segmentation results is often limited by insufficient supervision and the complex nature of medical imaging. Existing models also only provide a single outcome, which does not allow for the measurement of uncertainty. In this paper, we introduce DiffSeg, a segmentation model for skin lesions based on diffusion difference which exploits diffusion model principles to extract noise-based features from images with diverse semantic information. By discerning differences between these noise features, the model identifies diseased areas. Moreover, its multi-output capability mimics doctors' annotation behavior, facilitating the visualization of segmentation result consistency and ambiguity. Additionally, it quantifies output uncertainty using Generalized Energy Distance (GED), aiding interpretability and decision-making for physicians. Finally, the model integrates outputs through the Dense Conditional Random Field (DenseCRF) algorithm to refine the segmentation boundaries by considering inter-pixel correlations, which improves the accuracy and optimizes the segmentation results. We demonstrate the effectiveness of DiffSeg on the ISIC 2018 Challenge dataset, outperforming state-of-the-art U-Net-based methods.
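
The GED measure named in the abstract can be illustrated with a minimal sketch. It assumes the commonly used squared form, 2·E[d(S, Y)] − E[d(S, S')] − E[d(Y, Y')], with d = 1 − IoU; the abstract does not state which distance DiffSeg uses, so the distance, the set-of-pixels mask representation, and all function names here are assumptions for illustration only.

```python
from itertools import product


def iou_distance(a, b):
    """Return 1 - IoU for two binary masks given as sets of foreground pixels."""
    union = a | b
    if not union:
        # Assumption: two empty masks are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_distance(xs, ys):
    """Average IoU distance over all pairs drawn from xs and ys."""
    pairs = list(product(xs, ys))
    return sum(iou_distance(x, y) for x, y in pairs) / len(pairs)


def generalized_energy_distance(samples, annotations):
    """Squared GED between model samples and reference annotations."""
    return (
        2.0 * mean_distance(samples, annotations)
        - mean_distance(samples, samples)
        - mean_distance(annotations, annotations)
    )
```

A value near zero suggests that the spread of sampled segmentations resembles the spread of the reference annotations; larger values suggest a mismatch.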

Title: 3D Face Modeling via Weakly-supervised Disentanglement Network joint Identity-consistency Prior

Authors: Guohao Li, Hongyu Yang, Di Huang, Yunhong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.16536
Pdf URL: https://arxiv.org/pdf/2404.16536
Copy Paste: [[2404.16536]] 3D Face Modeling via Weakly-supervised Disentanglement Network joint Identity-consistency Prior(https://arxiv.org/abs/2404.16536)
Keywords: generative
Abstract: Generative 3D face models featuring disentangled controlling factors hold immense potential for diverse applications in computer vision and computer graphics. However, previous 3D face modeling methods face a challenge as they demand specific labels to effectively disentangle these factors. This becomes particularly problematic when integrating multiple 3D face datasets to improve the generalization of the model. Addressing this issue, this paper introduces a Weakly-Supervised Disentanglement Framework, denoted as WSDF, to facilitate the training of controllable 3D face models without an overly stringent labeling requirement. Adhering to the paradigm of Variational Autoencoders (VAEs), the proposed model achieves disentanglement of identity and expression controlling factors through a two-branch encoder equipped with dedicated identity-consistency prior. It then faithfully re-entangles these factors via a tensor-based combination mechanism. Notably, the introduction of the Neutral Bank allows precise acquisition of subject-specific information using only identity labels, thereby averting degeneration due to insufficient supervision. Additionally, the framework incorporates a label-free second-order loss function for the expression factor to regulate deformation space and eliminate extraneous information, resulting in enhanced disentanglement. Extensive experiments have been conducted to substantiate the superior performance of WSDF. Our code is available at https://github.com/liguohao96/WSDF.

Title: Conditional Distribution Modelling for Few-Shot Image Synthesis with Diffusion Models

Authors: Parul Gupta, Munawar Hayat, Abhinav Dhall, Thanh-Toan Do
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.16556
Pdf URL: https://arxiv.org/pdf/2404.16556
Copy Paste: [[2404.16556]] Conditional Distribution Modelling for Few-Shot Image Synthesis with Diffusion Models(https://arxiv.org/abs/2404.16556)
Keywords: diffusion
Abstract: Few-shot image synthesis entails generating diverse and realistic images of novel categories using only a few example images. While multiple recent efforts in this direction have achieved impressive results, the existing approaches are dependent only upon the few novel samples available at test time in order to generate new images, which restricts the diversity of the generated images. To overcome this limitation, we propose Conditional Distribution Modelling (CDM) -- a framework which effectively utilizes Diffusion models for few-shot image generation. By modelling the distribution of the latent space used to condition a Diffusion process, CDM leverages the learnt statistics of the training data to get a better approximation of the unseen class distribution, thereby removing the bias arising due to limited number of few shot samples. Simultaneously, we devise a novel inversion based optimization strategy that further improves the approximated unseen class distribution, and ensures the fidelity of the generated samples to the unseen class. The experimental results on four benchmark datasets demonstrate the effectiveness of our proposed CDM for few-shot generation.

Title: MonoPCC: Photometric-invariant Cycle Constraint for Monocular Depth Estimation of Endoscopic Images

Authors: Zhiwei Wang, Ying Zhou, Shiquan He, Ting Li, Yitong Zhang, Xinxia Feng, Mei Liu, Qiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.16571
Pdf URL: https://arxiv.org/pdf/2404.16571
Copy Paste: [[2404.16571]] MonoPCC: Photometric-invariant Cycle Constraint for Monocular Depth Estimation of Endoscopic Images(https://arxiv.org/abs/2404.16571)
Keywords: self-supervised
Abstract: Photometric constraint is indispensable for self-supervised monocular depth estimation. It involves warping a source image onto a target view using estimated depth and pose, and then minimizing the difference between the warped and target images. However, the endoscopic built-in light causes significant brightness fluctuations, and thus makes the photometric constraint unreliable. Previous efforts only mitigate this by relying on extra models to calibrate image brightness. In this paper, we propose MonoPCC to address the brightness inconsistency radically by reshaping the photometric constraint into a cycle form. Instead of only warping the source image, MonoPCC constructs a closed loop consisting of two opposite forward-backward warping paths: from target to source and then back to target. Thus, the target image finally receives an image cycle-warped from itself, which naturally makes the constraint invariant to brightness changes. Moreover, MonoPCC transplants the source image's phase-frequency into the intermediate warped image to avoid structure loss, and also stabilizes the training via an exponential moving average (EMA) strategy to avoid frequent changes in the forward warping. The comprehensive and extensive experimental results on three datasets demonstrate that our proposed MonoPCC shows great robustness to the brightness inconsistency, and exceeds other state-of-the-art methods by reducing the absolute relative error by at least 7.27%.
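
The EMA strategy mentioned in the abstract can be sketched generically as follows. This is only the standard exponential moving average of model weights; how MonoPCC applies it to the forward warping path, and which decay it uses, is not given in the abstract, so `ema_update` and `decay` are hypothetical names and values.

```python
def ema_update(ema_weights, weights, decay=0.99):
    """Blend the current weights into a slowly moving average.

    A decay close to 1 makes the averaged copy change slowly, which is the
    usual reason for using EMA to stabilize a training target.
    """
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```

In a typical training loop, a call such as `ema = ema_update(ema, current)` would follow each optimizer step.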

Title: MuseumMaker: Continual Style Customization without Catastrophic Forgetting

Authors: Chenxi Liu, Gan Sun, Wenqi Liang, Jiahua Dong, Can Qin, Yang Cong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.16612
Pdf URL: https://arxiv.org/pdf/2404.16612
Copy Paste: [[2404.16612]] MuseumMaker: Continual Style Customization without Catastrophic Forgetting(https://arxiv.org/abs/2404.16612)
Keywords: diffusion
Abstract: Pre-trained large text-to-image (T2I) models with an appropriate text prompt have attracted growing interest in the field of customized image generation. However, the catastrophic forgetting issue makes it hard to continually synthesize new user-provided styles while retaining satisfying results on learned styles. In this paper, we propose MuseumMaker, a method that enables the synthesis of images by following a set of customized styles in a never-ending manner, and gradually accumulates these creative artistic works as a Museum. When faced with a new customization style, we develop a style distillation loss module to transfer the style of the whole dataset into the generation of images. It can minimize the learning biases caused by the content of images, and address the catastrophic overfitting issue induced by few-shot images. To deal with catastrophic forgetting among past learned styles, we devise a dual regularization for the shared-LoRA module to optimize the direction of model updates, which regularizes the diffusion model from both the weight and feature aspects. Meanwhile, a unique token embedding corresponding to this new style is learned by a task-wise token learning module, which could preserve historical knowledge from past styles with the limitation of LoRA parameter quantity. As any new user-provided style comes, our MuseumMaker can capture the nuances of the new styles while maintaining the details of learned styles. Experimental results on diverse style datasets validate the effectiveness of our proposed MuseumMaker method, showcasing its robustness and versatility across various scenarios.

Title: Denoising: from classical methods to deep CNNs

Authors: Jean-Eric Campagne
Subjects: cs.CV, math.HO
Abstract URL: https://arxiv.org/abs/2404.16617
Pdf URL: https://arxiv.org/pdf/2404.16617
Copy Paste: [[2404.16617]] Denoising: from classical methods to deep CNNs(https://arxiv.org/abs/2404.16617)
Keywords: diffusion
Abstract: This paper aims to explore the evolution of image denoising in a pedagogical way. We briefly review classical methods such as Fourier analysis and wavelet bases, highlighting the challenges they faced until the emergence of neural networks, notably the U-Net, in the 2010s. The remarkable performance of these networks has been demonstrated in studies such as Kadkhodaie et al. (2024). They exhibit adaptability to various image types, including those with fixed regularity, facial images, and bedroom scenes, achieving optimal results while being biased towards a geometry-adaptive harmonic basis. The introduction of score diffusion has played a crucial role in image generation. In this context, denoising becomes essential as it facilitates the estimation of probability density scores. We discuss the prerequisites for genuine learning of probability densities, offering insights that extend from mathematical research to the implications of universal structures.

Title: Zero-Shot Distillation for Image Encoders: How to Make Effective Use of Synthetic Data

Authors: Niclas Popp, Jan Hendrik Metzen, Matthias Hein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.16637
Pdf URL: https://arxiv.org/pdf/2404.16637
Copy Paste: [[2404.16637]] Zero-Shot Distillation for Image Encoders: How to Make Effective Use of Synthetic Data(https://arxiv.org/abs/2404.16637)
Keywords: foundation model
Abstract: Multi-modal foundation models such as CLIP have showcased impressive zero-shot capabilities. However, their applicability in resource-constrained environments is limited due to their large number of parameters and high inference time. While existing approaches have scaled down the entire CLIP architecture, we focus on training smaller variants of the image encoder, which suffices for efficient zero-shot classification. The use of synthetic data has shown promise in distilling representations from larger teachers, resulting in strong few-shot and linear probe performance. However, we find that this approach surprisingly fails in true zero-shot settings when using contrastive losses. We identify the exploitation of spurious features as being responsible for poor generalization between synthetic and real data. However, by using the image feature-based L2 distillation loss, we mitigate these problems and train students whose zero-shot performance on four domain-specific datasets is on par with a ViT-B/32 teacher model trained on DataCompXL, while featuring up to 92% fewer parameters.
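
A feature-based L2 distillation loss of the kind the abstract refers to might look like the sketch below. It assumes student and teacher image features share the same dimensionality and are compared directly as plain lists of floats; whether the paper normalizes features or averages differently is not stated in the abstract, and the function name is hypothetical.

```python
def l2_distillation_loss(student_feats, teacher_feats):
    """Mean squared L2 distance between paired student and teacher features.

    Each argument is a batch: a list of equal-length feature vectors.
    """
    total = 0.0
    for s, t in zip(student_feats, teacher_feats):
        total += sum((si - ti) ** 2 for si, ti in zip(s, t))
    return total / len(student_feats)
```

Unlike a contrastive loss, this objective uses no negatives and no text embeddings; it only pulls each student feature towards the teacher feature of the same image.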

Title: Tele-FLM Technical Report

Authors: Xiang Li, Yiqun Yao, Xin Jiang, Xuezhi Fang, Chao Wang, Xinzhang Liu, Zihan Wang, Yu Zhao, Xin Wang, Yuyao Huang, Shuangyong Song, Yongxiang Li, Zheng Zhang, Bo Zhao, Aixin Sun, Yequan Wang, Zhongjiang He, Zhongyuan Wang, Xuelong Li, Tiejun Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.16645
Pdf URL: https://arxiv.org/pdf/2404.16645
Copy Paste: [[2404.16645]] Tele-FLM Technical Report(https://arxiv.org/abs/2404.16645)
Keywords: foundation model
Abstract: Large language models (LLMs) have showcased profound capabilities in language understanding and generation, facilitating a wide array of applications. However, there is a notable paucity of detailed, open-sourced methodologies on efficiently scaling LLMs beyond 50 billion parameters with minimum trial-and-error cost and computational resources. In this report, we introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model that features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities. Tele-FLM demonstrates superior multilingual language modeling abilities, measured by BPB on textual corpus. Besides, in both English and Chinese foundation model evaluation, it is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B. In addition to the model weights, we share the core designs, engineering practices, and training details, which we expect to benefit both the academic and industrial communities.
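
The BPB (bits-per-byte) metric mentioned in the abstract can be computed from per-token losses roughly as follows. This sketch assumes the losses are negative log-likelihoods in nats and that bytes are counted on the UTF-8 encoding of the evaluated text; the report's exact evaluation protocol is not described in the abstract, and the function name is hypothetical.

```python
import math


def bits_per_byte(token_nll_nats, text):
    """Convert summed token NLL (in nats) into bits per UTF-8 byte of text."""
    n_bytes = len(text.encode("utf-8"))
    total_bits = sum(token_nll_nats) / math.log(2)
    return total_bits / n_bytes
```

Because the normalizer is the byte count rather than the token count, BPB stays comparable across models with different tokenizers.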

Title: Formal Specification, Assessment, and Enforcement of Fairness for Generative AIs

Authors: Chih-Hong Cheng, Changshun Wu, Harald Ruess, Xingyu Zhao, Saddek Bensalem
Subjects: cs.LG, cs.AI, cs.CY, cs.LO, cs.SE
Abstract URL: https://arxiv.org/abs/2404.16663
Pdf URL: https://arxiv.org/pdf/2404.16663
Copy Paste: [[2404.16663]] Formal Specification, Assessment, and Enforcement of Fairness for Generative AIs(https://arxiv.org/abs/2404.16663)
Keywords: generative
Abstract: The risk of reinforcing or exacerbating societal biases and inequalities is growing as generative AI increasingly produces content that resembles human output, from text to images and beyond. Here we formally characterize the notion of fairness for generative AI as a basis for monitoring and enforcing fairness. We define two levels of fairness utilizing the concept of infinite words. The first is the fairness demonstrated on the generated sequences, which is only evaluated on the outputs while agnostic to the prompts/models used. The second is the inherent fairness of the generative AI model, which requires that fairness be manifested when input prompts are neutral, that is, they do not explicitly instruct the generative AI to produce a particular type of output. We also study relative intersectional fairness to counteract the combinatorial explosion of fairness when considering multiple categories together with lazy fairness enforcement. Our implemented specification monitoring and enforcement tool shows interesting results when tested against several generative AI models.

Title: Multimodal Semantic-Aware Automatic Colorization with Diffusion Prior

Authors: Han Wang, Xinning Chai, Yiwen Wang, Yuhong Zhang, Rong Xie, Li Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.16678
Pdf URL: https://arxiv.org/pdf/2404.16678
Copy Paste: [[2404.16678]] Multimodal Semantic-Aware Automatic Colorization with Diffusion Prior(https://arxiv.org/abs/2404.16678)
Keywords: diffusion, generative
Abstract: Colorizing grayscale images offers an engaging visual experience. Existing automatic colorization methods often fail to generate satisfactory results due to incorrect semantic colors and unsaturated colors. In this work, we propose an automatic colorization pipeline to overcome these challenges. We leverage the extraordinary generative ability of the diffusion prior to synthesize color with plausible semantics. To overcome the artifacts introduced by the diffusion prior, we apply luminance conditional guidance. Moreover, we adopt multimodal high-level semantic priors to help the model understand the image content and deliver saturated colors. Besides, a luminance-aware decoder is designed to restore details and enhance overall visual quality. The proposed pipeline synthesizes saturated colors while maintaining plausible semantics. Experiments indicate that our proposed method considers both diversity and fidelity, surpassing previous methods in terms of perceptual realism and gaining the most human preference.

Title: NTIRE 2024 Quality Assessment of AI-Generated Content Challenge

Authors: Xiaohong Liu, Xiongkuo Min, Guangtao Zhai, Chunyi Li, Tengchuan Kou, Wei Sun, Haoning Wu, Yixuan Gao, Yuqin Cao, Zicheng Zhang, Xiele Wu, Radu Timofte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.16687
Pdf URL: https://arxiv.org/pdf/2404.16687
Copy Paste: [[2404.16687]] NTIRE 2024 Quality Assessment of AI-Generated Content Challenge(https://arxiv.org/abs/2404.16687)
Keywords: generative
Abstract: This paper reports on the NTIRE 2024 Quality Assessment of AI-Generated Content Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2024. This challenge is to address a major challenge in the field of image and video processing, namely, Image Quality Assessment (IQA) and Video Quality Assessment (VQA) for AI-Generated Content (AIGC). The challenge is divided into the image track and the video track. The image track uses the AIGIQA-20K, which contains 20,000 AI-Generated Images (AIGIs) generated by 15 popular generative models. The image track has a total of 318 registered participants. A total of 1,646 submissions are received in the development phase, and 221 submissions are received in the test phase. Finally, 16 participating teams submitted their models and fact sheets. The video track uses the T2VQA-DB, which contains 10,000 AI-Generated Videos (AIGVs) generated by 9 popular Text-to-Video (T2V) models. A total of 196 participants have registered in the video track. A total of 991 submissions are received in the development phase, and 185 submissions are received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. Some methods have achieved better results than baseline methods, and the winning methods in both tracks have demonstrated superior prediction performance on AIGC.

Title: Cooperate or Collapse: Emergence of Sustainability Behaviors in a Society of LLM Agents

Authors: Giorgio Piatti, Zhijing Jin, Max Kleiman-Weiner, Bernhard Schölkopf, Mrinmaya Sachan, Rada Mihalcea
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.16698
Pdf URL: https://arxiv.org/pdf/2404.16698
Copy Paste: [[2404.16698]] Cooperate or Collapse: Emergence of Sustainability Behaviors in a Society of LLM Agents(https://arxiv.org/abs/2404.16698)
Keywords: generative
Abstract: In the rapidly evolving field of artificial intelligence, ensuring safe decision-making of Large Language Models (LLMs) is a significant challenge. This paper introduces Governance of the Commons Simulation (GovSim), a simulation platform designed to study strategic interactions and cooperative decision-making in LLMs. Through this simulation environment, we explore the dynamics of resource sharing among AI agents, highlighting the importance of ethical considerations, strategic planning, and negotiation skills. GovSim is versatile and supports any text-based agent, including LLM agents. Using the Generative Agent framework, we create a standard agent that facilitates the integration of different LLMs. Our findings reveal that within GovSim, only two out of 15 tested LLMs managed to achieve a sustainable outcome, indicating a significant gap in the ability of models to manage shared resources. Furthermore, we find that when agents lose the ability to communicate, they overuse the shared resource, highlighting the importance of communication for cooperation. Interestingly, most LLMs lack the ability to make universalized hypotheses, which highlights a significant weakness in their reasoning skills. We open source the full suite of our research results, including the simulation environment, agent prompts, and a comprehensive web interface.

Title: RadGenome-Chest CT: A Grounded Vision-Language Dataset for Chest CT Analysis

Authors: Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Jiayu Lei, Ya Zhang, Yanfeng Wang, Weidi Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.16754
Pdf URL: https://arxiv.org/pdf/2404.16754
Copy Paste: [[2404.16754]] RadGenome-Chest CT: A Grounded Vision-Language Dataset for Chest CT Analysis(https://arxiv.org/abs/2404.16754)
Keywords: foundation model
Abstract: Developing generalist foundation models has recently attracted tremendous attention among researchers in the field of AI for Medicine (AI4Medicine). A pivotal insight in developing these models is their reliance on dataset scaling, which emphasizes the need to develop open-source medical image datasets that incorporate diverse supervision signals across various imaging modalities. In this paper, we introduce RadGenome-Chest CT, a comprehensive, large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE. Specifically, we leverage the latest powerful universal segmentation and large language models to extend the original datasets (over 25,692 non-contrast 3D chest CT volumes and reports from 20,000 patients) in the following aspects: (i) organ-level segmentation masks covering 197 categories, which provide intermediate reasoning visual clues for interpretation; (ii) 665 K multi-granularity grounded reports, where each sentence of the report is linked to the corresponding anatomical region of the CT volume in the form of a segmentation mask; (iii) 1.3 M grounded VQA pairs, where questions and answers are all linked with reference segmentation masks, enabling models to associate visual evidence with textual explanations. All grounded reports and VQA pairs in the validation set have gone through manual verification to ensure dataset quality. We believe that RadGenome-Chest CT can significantly advance the development of multimodal medical foundation models, by training to generate texts based on given segmentation regions, which is unattainable with previous relevant datasets. We will release all segmentation masks, grounded reports, and VQA pairs to facilitate further research and development in this field.

Title: REBEL: Reinforcement Learning via Regressing Relative Rewards

Authors: Zhaolin Gao, Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J. Andrew Bagnell, Jason D. Lee, Wen Sun
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2404.16767
Pdf URL: https://arxiv.org/pdf/2404.16767
Copy Paste: [[2404.16767]] REBEL: Reinforcement Learning via Regressing Relative Rewards(https://arxiv.org/abs/2404.16767)
Keywords: generative
Abstract: While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) applications including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g. value networks, clipping) and is notorious for its sensitivity to the precise implementation of these components. In response, we take a step back and ask what a minimalist RL algorithm for the era of generative models would look like. We propose REBEL, an algorithm that cleanly reduces the problem of policy optimization to regressing the relative rewards via a direct policy parameterization between two completions to a prompt, enabling strikingly lightweight implementation. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which allows us to match the strongest known theoretical guarantees in terms of convergence and sample complexity in the RL literature. REBEL can also cleanly incorporate offline data and handle the intransitive preferences we frequently see in practice. Empirically, we find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO, all while being simpler to implement and more computationally tractable than PPO.
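
One reading of "regressing the relative rewards ... between two completions to a prompt" is a squared-error loss on a single pair of completions, sketched below. The exact form (difference of log-probability ratios against the previous policy, scaled by 1/eta) is an assumption based on the usual statement of the REBEL objective and is not spelled out in the abstract; the function and parameter names are hypothetical.

```python
def rebel_pair_loss(logp_new_a, logp_old_a, logp_new_b, logp_old_b,
                    reward_a, reward_b, eta=1.0):
    """Squared error between a policy-implied and an observed reward gap.

    logp_new_* : log-probability of a completion under the policy being trained
    logp_old_* : log-probability under the previous (data-collecting) policy
    reward_*   : scalar reward of each completion for the same prompt
    """
    predicted_gap = ((logp_new_a - logp_old_a) - (logp_new_b - logp_old_b)) / eta
    observed_gap = reward_a - reward_b
    return (predicted_gap - observed_gap) ** 2
```

Under this reading, no value network or clipping is involved: training reduces to least-squares regression over sampled completion pairs.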

Title: ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving

Authors: Jiehui Huang, Xiao Dong, Wenhui Song, Hanhui Li, Jun Zhou, Yuhao Cheng, Shutao Liao, Long Chen, Yiqiang Yan, Shengcai Liao, Xiaodan Liang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2404.16771
Pdf URL: https://arxiv.org/pdf/2404.16771
Copy Paste: [[2404.16771]] ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving(https://arxiv.org/abs/2404.16771)
Keywords: diffusion
Abstract: Diffusion-based technologies have made significant strides, particularly in personalized and customized facial generation. However, existing methods face challenges in achieving high-fidelity and detailed identity (ID) consistency, primarily due to insufficient fine-grained control over facial areas and the lack of a comprehensive strategy for ID preservation by fully considering intricate facial details and the overall face. To address these limitations, we introduce ConsistentID, an innovative method crafted for diverse identity-preserving portrait generation under fine-grained multimodal facial prompts, utilizing only a single reference image. ConsistentID comprises two key components: a multimodal facial prompt generator that combines facial features, corresponding facial descriptions and the overall facial context to enhance precision in facial details, and an ID-preservation network optimized through the facial attention localization strategy, aimed at preserving ID consistency in facial regions. Together, these components significantly enhance the accuracy of ID preservation by introducing fine-grained multimodal ID information from facial regions. To facilitate training of ConsistentID, we present a fine-grained portrait dataset, FGID, with over 500,000 facial images, offering greater diversity and comprehensiveness than existing public facial datasets. Experimental results substantiate that our ConsistentID achieves exceptional precision and diversity in personalized facial generation, surpassing existing methods in the MyStyle dataset. Furthermore, while ConsistentID introduces more multimodal ID information, it maintains a fast inference speed during generation.

Title: ConKeD++ -- Improving descriptor learning for retinal image registration: A comprehensive study of contrastive losses

Authors: David Rivas-Villar, Álvaro S. Hervella, José Rouco, Jorge Novo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.16773
Pdf URL: https://arxiv.org/pdf/2404.16773
Copy Paste: [[2404.16773]] ConKeD++ -- Improving descriptor learning for retinal image registration: A comprehensive study of contrastive losses(https://arxiv.org/abs/2404.16773)
Keywords: self-supervised
Abstract: Self-supervised contrastive learning has emerged as one of the most successful deep learning paradigms. In this regard, it has seen extensive use in image registration and, more recently, in the particular field of medical image registration. In this work, we propose to test, extend, and improve a state-of-the-art framework for color fundus image registration, ConKeD. Using the ConKeD framework, we test multiple loss functions, adapting them to the framework and the application domain. Furthermore, we evaluate our models using the standardized benchmark dataset FIRE as well as several datasets that have never been used before for color fundus registration, for which we are releasing the pairing data as well as a standardized evaluation approach. Our work demonstrates state-of-the-art performance across all datasets and metrics, demonstrating several advantages over current SOTA color fundus registration methods.

Title: In-Context Freeze-Thaw Bayesian Optimization for Hyperparameter Optimization

Authors: Herilalaina Rakotoarison, Steven Adriaensen, Neeratyoy Mallik, Samir Garibov, Edward Bergman, Frank Hutter
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2404.16795
Pdf URL: https://arxiv.org/pdf/2404.16795
Copy Paste: [[2404.16795]] In-Context Freeze-Thaw Bayesian Optimization for Hyperparameter Optimization(https://arxiv.org/abs/2404.16795)
Keywords: in-context
Abstract: With the increasing computational costs associated with deep learning, automated hyperparameter optimization methods, strongly relying on black-box Bayesian optimization (BO), face limitations. Freeze-thaw BO offers a promising grey-box alternative, strategically allocating scarce resources incrementally to different configurations. However, the frequent surrogate model updates inherent to this approach pose challenges for existing methods, requiring retraining or fine-tuning their neural network surrogates online, introducing overhead, instability, and hyper-hyperparameters. In this work, we propose FT-PFN, a novel surrogate for Freeze-thaw style BO. FT-PFN is a prior-data fitted network (PFN) that leverages the transformers' in-context learning ability to efficiently and reliably do Bayesian learning curve extrapolation in a single forward pass. Our empirical analysis across three benchmark suites shows that the predictions made by FT-PFN are more accurate and 10-100 times faster than those of the deep Gaussian process and deep ensemble surrogates used in previous work. Furthermore, we show that, when combined with our novel acquisition mechanism (MFPI-random), the resulting in-context freeze-thaw BO method (ifBO), yields new state-of-the-art performance in the same three families of deep learning HPO benchmarks considered in prior work.

Title: Improving Diversity of Commonsense Generation by Large Language Models via In-Context Learning

Authors: Tianhui Zhang, Bei Peng, Danushka Bollegala
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.16807
Pdf URL: https://arxiv.org/pdf/2404.16807
Copy Paste: [[2404.16807]] Improving Diversity of Commonsense Generation by Large Language Models via In-Context Learning(https://arxiv.org/abs/2404.16807)
Keywords: generative, in-context
Abstract: Generative Commonsense Reasoning (GCR) requires a model to reason about a situation using commonsense knowledge, while generating coherent sentences. Although the quality of the generated sentences is crucial, the diversity of the generation is equally important because it reflects the model's ability to use a range of commonsense knowledge facts. Large Language Models (LLMs) have shown proficiency in enhancing the generation quality across various tasks through in-context learning (ICL) using given examples without the need for any fine-tuning. However, the diversity aspect in LLM outputs has not been systematically studied before. To address this, we propose a simple method that diversifies the LLM generations, while preserving their quality. Experimental results on three benchmark GCR datasets show that our method achieves an ideal balance between the quality and diversity. Moreover, the sentences generated by our proposed method can be used as training data to improve diversity in existing commonsense generators.

Title: Boosting Unsupervised Semantic Segmentation with Principal Mask Proposals

Authors: Oliver Hahn, Nikita Araslanov, Simone Schaub-Meyer, Stefan Roth
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.16818
Pdf URL: https://arxiv.org/pdf/2404.16818
Copy Paste: [[2404.16818]] Boosting Unsupervised Semantic Segmentation with Principal Mask Proposals(https://arxiv.org/abs/2404.16818)
Keywords: self-supervised
Abstract: Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global categories within an image corpus without any form of annotation. Building upon recent advances in self-supervised representation learning, we focus on how to leverage these large pre-trained models for the downstream task of unsupervised segmentation. We present PriMaPs - Principal Mask Proposals - decomposing images into semantically meaningful masks based on their feature representation. This allows us to realize unsupervised semantic segmentation by fitting class prototypes to PriMaPs with a stochastic expectation-maximization algorithm, PriMaPs-EM. Despite its conceptual simplicity, PriMaPs-EM leads to competitive results across various pre-trained backbone models, including DINO and DINOv2, and across datasets, such as Cityscapes, COCO-Stuff, and Potsdam-3. Importantly, PriMaPs-EM is able to boost results when applied orthogonally to current state-of-the-art unsupervised semantic segmentation pipelines.

Title: Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings

Authors: Olivia Wiles, Chuhan Zhang, Isabela Albuquerque, Ivana Kajić, Su Wang, Emanuele Bugliarello, Yasumasa Onoe, Chris Knutsen, Cyrus Rashtchian, Jordi Pont-Tuset, Aida Nematzadeh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.16820
Pdf URL: https://arxiv.org/pdf/2404.16820
Copy Paste: [[2404.16820]] Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings(https://arxiv.org/abs/2404.16820)
Keywords: generative
Abstract: While text-to-image (T2I) generative models have become ubiquitous, they do not necessarily generate images that align with a given prompt. While previous work has evaluated T2I alignment by proposing metrics, benchmarks, and templates for collecting human judgements, the quality of these components is not systematically measured. Human-rated prompt sets are generally small and the reliability of the ratings -- and thereby the prompt set used to compare models -- is not evaluated. We address this gap by performing an extensive study evaluating auto-eval metrics and human templates. We provide three main contributions: (1) We introduce a comprehensive skills-based benchmark that can discriminate models across different human templates. This skills-based benchmark categorises prompts into sub-skills, allowing a practitioner to pinpoint not only which skills are challenging, but at what level of complexity a skill becomes challenging. (2) We gather human ratings across four templates and four T2I models for a total of >100K annotations. This allows us to understand where differences arise due to inherent ambiguity in the prompt and where they arise due to differences in metric and model quality. (3) Finally, we introduce a new QA-based auto-eval metric that is better correlated with human ratings than existing metrics for our new dataset, across different human templates, and on TIFA160.

Title: How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Authors: Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.16821
Pdf URL: https://arxiv.org/pdf/2404.16821
Copy Paste: [[2404.16821]] How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites(https://arxiv.org/abs/2404.16821)
Keywords: foundation model
Abstract: In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model -- InternViT-6B, boosting its visual understanding capabilities and allowing it to be transferred and reused in different LLMs. (2) Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448$\times$448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images, and annotated them with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at https://github.com/OpenGVLab/InternVL.

Title: Made to Order: Discovering monotonic temporal changes via self-supervised video ordering

Authors: Charig Yang, Weidi Xie, Andrew Zisserman
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2404.16828
Pdf URL: https://arxiv.org/pdf/2404.16828
Copy Paste: [[2404.16828]] Made to Order: Discovering monotonic temporal changes via self-supervised video ordering(https://arxiv.org/abs/2404.16828)
Keywords: self-supervised
Abstract: Our objective is to discover and localize monotonic temporal changes in a sequence of images. To achieve this, we exploit a simple proxy task of ordering a shuffled image sequence, with 'time' serving as a supervisory signal since only changes that are monotonic with time can give rise to the correct ordering. We also introduce a flexible transformer-based model for general-purpose ordering of image sequences of arbitrary length with built-in attribution maps. After training, the model successfully discovers and localizes monotonic changes while ignoring cyclic and stochastic ones. We demonstrate applications of the model in multiple video settings covering different scene and object types, discovering both object-level and environmental changes in unseen sequences. We also demonstrate that the attention-based attribution maps function as effective prompts for segmenting the changing regions, and that the learned representations can be used for downstream applications. Finally, we show that the model achieves the state of the art on standard benchmarks for ordering a set of images.

Title: Make-it-Real: Unleashing Large Multimodal Model's Ability for Painting 3D Objects with Realistic Materials

Authors: Ye Fang, Zeyi Sun, Tong Wu, Jiaqi Wang, Ziwei Liu, Gordon Wetzstein, Dahua Lin
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2404.16829
Pdf URL: https://arxiv.org/pdf/2404.16829
Copy Paste: [[2404.16829]] Make-it-Real: Unleashing Large Multimodal Model's Ability for Painting 3D Objects with Realistic Materials(https://arxiv.org/abs/2404.16829)
Keywords: generative
Abstract: Physically realistic materials are pivotal in augmenting the realism of 3D assets across various applications and lighting conditions. However, existing 3D assets and generative models often lack authentic material properties. Manual assignment of materials using graphic software is a tedious and time-consuming task. In this paper, we exploit advancements in Multimodal Large Language Models (MLLMs), particularly GPT-4V, to present a novel approach, Make-it-Real: 1) We demonstrate that GPT-4V can effectively recognize and describe materials, allowing the construction of a detailed material library. 2) Utilizing a combination of visual cues and hierarchical text prompts, GPT-4V precisely identifies and aligns materials with the corresponding components of 3D objects. 3) The correctly matched materials are then meticulously applied as reference for the new SVBRDF material generation according to the original diffuse map, significantly enhancing their visual authenticity. Make-it-Real offers a streamlined integration into the 3D content creation workflow, showcasing its utility as an essential tool for developers of 3D assets.

Title: The Third Monocular Depth Estimation Challenge

Authors: Jaime Spencer, Fabio Tosi, Matteo Poggi, Ripudaman Singh Arora, Chris Russell, Simon Hadfield, Richard Bowden, GuangYuan Zhou, ZhengXin Li, Qiang Rao, YiPing Bao, Xiao Liu, Dohyeong Kim, Jinseong Kim, Myunghyun Kim, Mykola Lavreniuk, Rui Li, Qing Mao, Jiang Wu, Yu Zhu, Jinqiu Sun, Yanning Zhang, Suraj Patni, Aradhye Agarwal, Chetan Arora, Pihai Sun, Kui Jiang, Gang Wu, Jian Liu, Xianming Liu, Junjun Jiang, Xidan Zhang, Jianing Wei, Fangjun Wang, Zhiming Tan, Jiabao Wang, Albert Luginov, Muhammad Shahzad, Seyed Hosseini, Aleksander Trajcevski, James H. Elder
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.16831
Pdf URL: https://arxiv.org/pdf/2404.16831
Copy Paste: [[2404.16831]] The Third Monocular Depth Estimation Challenge(https://arxiv.org/abs/2404.16831)
Keywords: self-supervised
Abstract: This paper discusses the results of the third edition of the Monocular Depth Estimation Challenge (MDEC). The challenge focuses on zero-shot generalization to the challenging SYNS-Patches dataset, featuring complex scenes in natural and indoor settings. As with the previous edition, methods can use any form of supervision, i.e., supervised or self-supervised. The challenge received a total of 19 submissions outperforming the baseline on the test set; 10 of them submitted a report describing their approach, highlighting the widespread use of foundational models such as Depth Anything at the core of their methods. The challenge winners drastically improved 3D F-Score performance, from 17.51% to 23.72%.