2024-03-14

Title: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

Authors: Wei Ye, Chaoya Jiang, Haiyang Xu, Chenhao Ye, Chenliang Li, Ming Yan, Shikun Zhang, Songhang Huang, Fei Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.07883
Pdf URL: https://arxiv.org/pdf/2403.07883
Copy Paste: [[2403.07883]] Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection(https://arxiv.org/abs/2403.07883)
Keywords: generative
Abstract: Vision Transformers (ViTs) have become increasingly popular in large-scale Vision and Language Pre-training (VLP) models. Although previous VLP research has demonstrated the efficacy of ViTs, these efforts still struggle with computational inefficiencies caused by lengthy visual sequences. To address this challenge, we introduce an efficient VLP approach called TRIPS, which stands for Text-Relevant Image Patch Selection. TRIPS progressively reduces the visual sequence using a text-guided patch-selection layer in the visual backbone, thereby accelerating both training and inference processes. This patch-selection layer dynamically computes text-dependent visual attention, enabling it to identify attentive image tokens with text guidance and fuse inattentive ones in an end-to-end fashion. Importantly, TRIPS does not add any extra parameters and generalizes to most ViT-based VLP models. We incorporate TRIPS into three representative VLP models covering single-stream, dual-stream, and generative paradigms, and conduct extensive experiments on five widely-used multi-modal benchmark datasets. Our experimental results reveal that TRIPS delivers a 40% speedup, while maintaining competitive or superior performance on downstream tasks.
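
The selection step described in this abstract (score image tokens against the text, keep the attentive ones, fuse the rest) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the scaled dot-product scoring against a single pooled text vector, the `keep_ratio` parameter, and the function name `select_and_fuse` are all assumptions made for the example.

```python
import math

def select_and_fuse(patches, text_vec, keep_ratio=0.5):
    """Keep the most text-relevant patch tokens and fuse the rest into one token.

    patches:  list of patch embeddings (each a list of floats)
    text_vec: a single pooled text embedding (assumed, hypothetical input)
    """
    d = len(text_vec)
    # Text-dependent attention: scaled dot product followed by a softmax.
    scores = [sum(p * t for p, t in zip(patch, text_vec)) / math.sqrt(d)
              for patch in patches]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]

    # Rank patches by attention and keep the top fraction (at least one).
    k = max(1, round(len(patches) * keep_ratio))
    order = sorted(range(len(patches)), key=lambda i: -attn[i])
    keep = sorted(order[:k])   # preserve the original spatial order
    drop = order[k:]

    out = [patches[i] for i in keep]
    if drop:
        # Fuse inattentive patches into one attention-weighted average token.
        weight = sum(attn[i] for i in drop)
        fused = [sum(attn[i] * patches[i][j] for i in drop) / weight
                 for j in range(d)]
        out.append(fused)
    return out, keep
```

No learnable weights appear in this sketch, which matches the abstract's statement that the method adds no extra parameters; how the real layer computes its attention is not specified here.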

Title: Cross-modality debiasing: using language to mitigate sub-population shifts in imaging

Authors: Yijiang Pang, Hoang Bao, Jiayu Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.07888
Pdf URL: https://arxiv.org/pdf/2403.07888
Copy Paste: [[2403.07888]] Cross-modality debiasing: using language to mitigate sub-population shifts in imaging(https://arxiv.org/abs/2403.07888)
Keywords: foundation model
Abstract: Sub-population shift is a specific type of domain shift that highlights changes in data distribution within specific sub-groups or populations between training and testing. Sub-population shift accounts for a significant source of algorithmic bias and calls for distributional robustness. Recent studies found inherent distributional robustness in multi-modality foundation models, such as the vision-language model CLIP, yet this robustness is vulnerable through parameter fine-tuning. In this paper, we propose leveraging the connection of robustness among different modalities and reshaping the distributional robustness of one modality with another. Specifically, in the context of the distributional robustness of CLIP, we propose to leverage natural language inputs to debias the image feature representations, to improve worst-case performance on sub-populations. Our extensive empirical studies show that image representations debiased by natural language can achieve significant performance improvement and reduction of performance instability under sub-population shifts.

Title: Merino: Entropy-driven Design for Generative Language Models on IoT Devices

Authors: Youpeng Zhao, Ming Lin, Huadong Tang, Qiang Wu, Jun Wang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2403.07921
Pdf URL: https://arxiv.org/pdf/2403.07921
Copy Paste: [[2403.07921]] Merino: Entropy-driven Design for Generative Language Models on IoT Devices(https://arxiv.org/abs/2403.07921)
Keywords: generative
Abstract: Generative Large Language Models (LLMs) stand as a revolutionary advancement in the modern era of artificial intelligence (AI). However, directly deploying LLMs on resource-constrained hardware, such as Internet-of-Things (IoT) devices, is difficult due to their high computational cost. In this paper, we propose a novel information-entropy framework for designing mobile-friendly generative language models. Our key design paradigm is to maximize the entropy of transformer decoders within the given computational budgets. The whole design procedure involves solving a mathematical programming (MP) problem, which can be done on the CPU within minutes, making it nearly zero-cost. We evaluate our designed models, termed MeRino, across nine NLP downstream tasks, showing their competitive performance against the state-of-the-art autoregressive transformer models under the mobile setting. Notably, MeRino achieves similar or better zero-shot performance compared to the 350M parameter OPT while being 4.9x faster on NVIDIA Jetson Nano with 5.5x reduction in model size. Code will be made available soon.

Title: Sketching the Heat Kernel: Using Gaussian Processes to Embed Data

Authors: Anna C. Gilbert, Kevin O'Neill
Subjects: cs.LG, math.NA, stat.ML
Abstract URL: https://arxiv.org/abs/2403.07929
Pdf URL: https://arxiv.org/pdf/2403.07929
Copy Paste: [[2403.07929]] Sketching the Heat Kernel: Using Gaussian Processes to Embed Data(https://arxiv.org/abs/2403.07929)
Keywords: diffusion
Abstract: This paper introduces a novel, non-deterministic method for embedding data in low-dimensional Euclidean space based on computing realizations of a Gaussian process depending on the geometry of the data. This type of embedding first appeared in (Adler et al., 2018) as a theoretical model for a generic manifold in high dimensions. In particular, we take the covariance function of the Gaussian process to be the heat kernel, and computing the embedding amounts to sketching a matrix representing the heat kernel. The Karhunen-Loève expansion reveals that the straight-line distances in the embedding approximate the diffusion distance in a probabilistic sense, avoiding the need for sharp cutoffs and maintaining some of the smaller-scale structure. Our method demonstrates further advantage in its robustness to outliers. We justify the approach with both theory and experiments.
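
The link between embedding distances and diffusion distance stated in this abstract can be made concrete with a short calculation. The notation below is an assumption for illustration and may differ from the paper's normalization: $(\lambda_l, \varphi_l)$ are eigenpairs of the Laplacian, $K_t$ is the heat kernel, $f_1, \dots, f_k$ are independent centered Gaussian processes with covariance $K_t$, and the diffusion distance is taken as $D_s(x,y)^2 = \sum_l e^{-2 s \lambda_l} (\varphi_l(x) - \varphi_l(y))^2$.

```latex
% Heat kernel and Karhunen-Loeve expansion of each process
K_t(x,y) = \sum_{l} e^{-t\lambda_l}\,\varphi_l(x)\,\varphi_l(y),
\qquad
f_m(x) = \sum_{l} e^{-t\lambda_l/2}\,\xi_{m,l}\,\varphi_l(x),
\quad \xi_{m,l} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1).

% Embedding into R^k
\Phi(x) = \frac{1}{\sqrt{k}}\,\bigl(f_1(x), \dots, f_k(x)\bigr).

% Expected squared straight-line distance
\mathbb{E}\,\lVert \Phi(x) - \Phi(y) \rVert^2
  = K_t(x,x) - 2K_t(x,y) + K_t(y,y)
  = \sum_{l} e^{-t\lambda_l}\,\bigl(\varphi_l(x) - \varphi_l(y)\bigr)^2
  = D_{t/2}(x,y)^2 .
```

Under these conventions the squared embedding distance equals the squared diffusion distance at time $t/2$ in expectation. Because every eigenmode contributes with weight $e^{-t\lambda_l}$ and none is truncated, no sharp spectral cutoff is needed.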

Title: WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image Inputs

Authors: Deshun Yang, Luhui Hu, Yu Tian, Zihao Li, Chris Kelly, Bang Yang, Cindy Yang, Yuexian Zou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.07944
Pdf URL: https://arxiv.org/pdf/2403.07944
Copy Paste: [[2403.07944]] WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image Inputs(https://arxiv.org/abs/2403.07944)
Keywords: diffusion
Abstract: Several text-to-video diffusion models have demonstrated commendable capabilities in synthesizing high-quality video content. However, maintaining temporal consistency and ensuring action smoothness throughout the generated sequences remain formidable challenges. In this paper, we present an innovative video generation AI agent that harnesses the power of Sora-inspired multimodal learning to build a skilled world-model framework based on textual prompts and accompanying images. The framework includes two parts: prompt enhancer and full video translation. The first part employs the capabilities of ChatGPT to meticulously distill and proactively construct precise prompts for each subsequent step, thereby guaranteeing the utmost accuracy in prompt communication and accurate execution in following model operations. The second part is compatible with existing advanced diffusion techniques and uses them to expansively generate and refine the key frame at the conclusion of a video. Then we can expertly harness the power of leading and trailing key frames to craft videos with enhanced temporal consistency and action smoothness. The experimental results confirm that our method has strong effectiveness and novelty in constructing world models from text and image inputs over the other methods.

Title: AesopAgent: Agent-driven Evolutionary System on Story-to-Video Production

Authors: Jiuniu Wang, Zehua Du, Yuyuan Zhao, Bo Yuan, Kexiang Wang, Jian Liang, Yaxi Zhao, Yihen Lu, Gengliang Li, Junlong Gao, Xin Tu, Zhenyu Guo
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2403.07952
Pdf URL: https://arxiv.org/pdf/2403.07952
Copy Paste: [[2403.07952]] AesopAgent: Agent-driven Evolutionary System on Story-to-Video Production(https://arxiv.org/abs/2403.07952)
Keywords: generative
Abstract: The Agent and AIGC (Artificial Intelligence Generated Content) technologies have recently made significant progress. We propose AesopAgent, an Agent-driven Evolutionary System on Story-to-Video Production. AesopAgent is a practical application of agent technology for multimodal content generation. The system integrates multiple generative capabilities within a unified framework, so that individual users can leverage these modules easily. This innovative system would convert user story proposals into scripts, images, and audio, and then integrate these multimodal contents into videos. Additionally, the animating units (e.g., Gen-2 and Sora) could make the videos more infectious. The AesopAgent system could orchestrate task workflow for video generation, ensuring that the generated video is both rich in content and coherent. This system mainly contains two layers, i.e., the Horizontal Layer and the Utility Layer. In the Horizontal Layer, we introduce a novel RAG-based evolutionary system that optimizes the whole video generation workflow and the steps within the workflow. It continuously evolves and iteratively optimizes workflow by accumulating expert experience and professional knowledge, including optimizing the LLM prompts and utilities usage. The Utility Layer provides multiple utilities, leading to consistent image generation that is visually coherent in terms of composition, characters, and style. Meanwhile, it provides audio and special effects, integrating them into expressive and logically arranged videos. Overall, our AesopAgent achieves state-of-the-art performance compared with many previous works in visual storytelling. Our AesopAgent is designed for convenient service for individual users, which is available on the following page: https://aesopai.github.io/.

Title: An Interpretable Generalization Mechanism for Accurately Detecting Anomaly and Identifying Networking Intrusion Techniques

Authors: Hao-Ting Pai, Yu-Hsuan Kang, Wen-Cheng Chung
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2403.07959
Pdf URL: https://arxiv.org/pdf/2403.07959
Copy Paste: [[2403.07959]] An Interpretable Generalization Mechanism for Accurately Detecting Anomaly and Identifying Networking Intrusion Techniques(https://arxiv.org/abs/2403.07959)
Keywords: anomaly
Abstract: Recent advancements in Intrusion Detection Systems (IDS), integrating Explainable AI (XAI) methodologies, have led to notable improvements in system performance via precise feature selection. However, a thorough understanding of cyber-attacks requires inherently explainable decision-making processes within IDS. In this paper, we present the Interpretable Generalization Mechanism (IG), poised to revolutionize IDS capabilities. IG discerns coherent patterns, making it interpretable in distinguishing between normal and anomalous network traffic. Further, the synthesis of coherent patterns sheds light on intricate intrusion pathways, providing essential insights for cybersecurity forensics. By experiments with real-world datasets NSL-KDD, UNSW-NB15, and UKM-IDS20, IG is accurate even at a low ratio of training-to-test. With 10%-to-90%, IG achieves Precision (PRE)=0.93, Recall (REC)=0.94, and Area Under Curve (AUC)=0.94 in NSL-KDD; PRE=0.98, REC=0.99, and AUC=0.99 in UNSW-NB15; and PRE=0.98, REC=0.98, and AUC=0.99 in UKM-IDS20. Notably, in UNSW-NB15, IG achieves REC=1.0 and at least PRE=0.98 since 40%-to-60%; in UKM-IDS20, IG achieves REC=1.0 and at least PRE=0.88 since 20%-to-80%. Importantly, in UKM-IDS20, IG successfully identifies all three anomalous instances without prior exposure, demonstrating its generalization capabilities. These results and inferences are reproducible. In sum, IG showcases superior generalization by consistently performing well across diverse datasets and training-to-test ratios (from 10%-to-90% to 90%-to-10%), and excels in identifying novel anomalies without prior exposure. Its interpretability is enhanced by coherent evidence that accurately distinguishes both normal and anomalous activities, significantly improving detection accuracy and reducing false alarms, thereby strengthening IDS reliability and trustworthiness.

Title: Do Agents Dream of Electric Sheep?: Improving Generalization in Reinforcement Learning through Generative Learning

Authors: Giorgio Franceschelli, Mirco Musolesi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.07979
Pdf URL: https://arxiv.org/pdf/2403.07979
Copy Paste: [[2403.07979]] Do Agents Dream of Electric Sheep?: Improving Generalization in Reinforcement Learning through Generative Learning(https://arxiv.org/abs/2403.07979)
Keywords: generative
Abstract: The Overfitted Brain hypothesis suggests dreams happen to allow generalization in the human brain. Here, we ask if the same is true for reinforcement learning agents as well. Given limited experience in a real environment, we use imagination-based reinforcement learning to train a policy on dream-like episodes, where non-imaginative, predicted trajectories are modified through generative augmentations. Experiments on four ProcGen environments show that, compared to classic imagination and offline training on collected experience, our method can reach a higher level of generalization when dealing with sparsely rewarded environments.

Title: Training Small Multimodal Models to Bridge Biomedical Competency Gap: A Case Study in Radiology Imaging

Authors: Juan Manuel Zambrano Chaves, Shih-Cheng Huang, Yanbo Xu, Hanwen Xu, Naoto Usuyama, Sheng Zhang, Fei Wang, Yujia Xie, Mahmoud Khademi, Ziyi Yang, Hany Awadalla, Julia Gong, Houdong Hu, Jianwei Yang, Chunyuan Li, Jianfeng Gao, Yu Gu, Cliff Wong, Mu Wei, Tristan Naumann, Muhao Chen, Matthew P. Lungren, Serena Yeung-Levy, Curtis P. Langlotz, Sheng Wang, Hoifung Poon
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2403.08002
Pdf URL: https://arxiv.org/pdf/2403.08002
Copy Paste: [[2403.08002]] Training Small Multimodal Models to Bridge Biomedical Competency Gap: A Case Study in Radiology Imaging(https://arxiv.org/abs/2403.08002)
Keywords: foundation model
Abstract: The scaling laws and extraordinary performance of large foundation models motivate the development and utilization of such large models in biomedicine. However, despite early promising results on some biomedical benchmarks, there are still major challenges that need to be addressed before these models can be used in real-world applications. Frontier models such as GPT-4V still have major competency gaps in multimodal capabilities for biomedical applications. Moreover, pragmatic issues such as access, cost, latency, and compliance make it hard for clinicians to use privately-hosted state-of-the-art large models directly on private patient data. In this paper, we explore training open-source small multimodal models (SMMs) to bridge biomedical competency gaps for unmet clinical needs. To maximize data efficiency, we adopt a modular approach by incorporating state-of-the-art pre-trained models for image and text modalities, and focusing on training a lightweight adapter to ground each modality to the text embedding space. We conduct a comprehensive study of this approach on radiology imaging. For training, we assemble a large dataset with over 1 million image-text pairs. For evaluation, we propose a clinically driven novel approach using GPT-4 and demonstrate its parity with expert evaluation. We also study grounding qualitatively using attention. For best practice, we conduct a systematic ablation study on various choices in data engineering and multimodal training. The resulting LLaVA-Rad (7B) model attains state-of-the-art results on radiology tasks such as report generation and cross-modal retrieval, even outperforming much larger models such as GPT-4V and Med-PaLM M (84B). LLaVA-Rad is fast and can be run on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.

Title: Real-time Surgical Instrument Segmentation in Video Using Point Tracking and Segment Anything

Authors: Zijian Wu, Adam Schmidt, Peter Kazanzides, Septimiu E. Salcudean
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08003
Pdf URL: https://arxiv.org/pdf/2403.08003
Copy Paste: [[2403.08003]] Real-time Surgical Instrument Segmentation in Video Using Point Tracking and Segment Anything(https://arxiv.org/abs/2403.08003)
Keywords: foundation model
Abstract: The Segment Anything Model (SAM) is a powerful vision foundation model that is revolutionizing the traditional paradigm of segmentation. Despite this, a reliance on prompting each frame and large computational cost limit its usage in robotically assisted surgery. Applications, such as augmented reality guidance, require little user intervention along with efficient inference to be usable clinically. In this study, we address these limitations by adopting lightweight SAM variants to meet the speed requirement and employing fine-tuning techniques to enhance their generalization in surgical scenes. Recent advancements in Tracking Any Point (TAP) have shown promising results in both accuracy and efficiency, particularly when points are occluded or leave the field of view. Inspired by this progress, we present a novel framework that combines an online point tracker with a lightweight SAM model that is fine-tuned for surgical instrument segmentation. Sparse points within the region of interest are tracked and used to prompt SAM throughout the video sequence, providing temporal consistency. The quantitative results surpass the state-of-the-art semi-supervised video object segmentation method on the EndoVis 2015 dataset, with an over 25 FPS inference speed running on a single GeForce RTX 4060 GPU.

Title: Supervised Time Series Classification for Anomaly Detection in Subsea Engineering

Authors: Ergys Çokaj, Halvor Snersrud Gustad, Andrea Leone, Per Thomas Moe, Lasse Moldestad
Subjects: cs.LG, math.DS
Abstract URL: https://arxiv.org/abs/2403.08013
Pdf URL: https://arxiv.org/pdf/2403.08013
Copy Paste: [[2403.08013]] Supervised Time Series Classification for Anomaly Detection in Subsea Engineering(https://arxiv.org/abs/2403.08013)
Keywords: anomaly
Abstract: Time series classification is of significant importance in monitoring structural systems. In this work, we investigate the use of supervised machine learning classification algorithms on simulated data based on a physical system with two states: Intact and Broken. We provide a comprehensive discussion of the preprocessing of temporal data, using measures of statistical dispersion and dimension reduction techniques. We present an intuitive baseline method and discuss its efficiency. We conclude with a comparison of the various methods based on different performance metrics, showing the advantage of using machine learning techniques as a tool in decision making.

Title: McCatch: Scalable Microcluster Detection in Dimensional and Nondimensional Datasets

Authors: Braulio V. Sánchez Vinces, Robson L. F. Cordeiro, Christos Faloutsos
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.08027
Pdf URL: https://arxiv.org/pdf/2403.08027
Copy Paste: [[2403.08027]] McCatch: Scalable Microcluster Detection in Dimensional and Nondimensional Datasets(https://arxiv.org/abs/2403.08027)
Keywords: anomaly
Abstract: How could we have an outlier detector that works even with nondimensional data, and ranks together both singleton microclusters ('one-off' outliers) and nonsingleton microclusters by their anomaly scores? How to obtain scores that are principled in one scalable and 'hands-off' manner? Microclusters of outliers indicate coalition or repetition in fraud activities, etc.; their identification is thus highly desirable. This paper presents McCatch: a new algorithm that detects microclusters by leveraging our proposed 'Oracle' plot (1NN Distance versus Group 1NN Distance). We study 31 real and synthetic datasets with up to 1M data elements to show that McCatch is the only method that answers both of the questions above; and, it outperforms 11 other methods, especially when the data has nonsingleton microclusters or is nondimensional. We also showcase McCatch's ability to detect meaningful microclusters in graphs, fingerprints, logs of network connections, text data, and satellite imagery. For example, it found a 30-elements microcluster of confirmed 'Denial of Service' attacks in the network logs, taking only ~3 minutes for 222K data elements on a stock desktop.

Title: MicroT: Low-Energy and Adaptive Models for MCUs

Authors: Yushan Huang, Ranya Aloufi, Xavier Cadet, Yuchen Zhao, Payam Barnaghi, Hamed Haddadi
Subjects: cs.LG, cs.AR
Abstract URL: https://arxiv.org/abs/2403.08040
Pdf URL: https://arxiv.org/pdf/2403.08040
Copy Paste: [[2403.08040]] MicroT: Low-Energy and Adaptive Models for MCUs(https://arxiv.org/abs/2403.08040)
Keywords: self-supervised
Abstract: We propose MicroT, a low-energy, multi-task adaptive model framework for resource-constrained MCUs. We divide the original model into a feature extractor and a classifier. The feature extractor is obtained through self-supervised knowledge distillation and further optimized into part and full models through model splitting and joint training. These models are then deployed on MCUs, with classifiers added and trained on local tasks, ultimately performing stage-decision for joint inference. In this process, the part model initially processes the sample, and if the confidence score falls below the set threshold, the full model will resume and continue the inference. We evaluate MicroT on two models, three datasets, and two MCU boards. Our experimental evaluation shows that MicroT effectively improves model performance and reduces energy consumption when dealing with multiple local tasks. Compared to the unoptimized feature extractor, MicroT can improve accuracy by up to 9.87%. On MCUs, compared to the standard full model inference, MicroT can save up to about 29.13% in energy consumption. MicroT also allows users to adaptively adjust the stage-decision ratio as needed, better balancing model performance and energy consumption. Under the standard stage-decision ratio configuration, MicroT can increase accuracy by 5.91% and save about 14.47% of energy consumption.
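
The stage-decision inference described in this abstract can be sketched as an early-exit routine. This is a hedged illustration only: the callables `part_extractor`, `rest_of_extractor`, `part_classifier`, and `full_classifier` are hypothetical stand-ins, and taking the maximum class probability as the confidence score is an assumption, since the abstract does not define the score.

```python
def stage_decision_inference(sample, part_extractor, rest_of_extractor,
                             part_classifier, full_classifier, threshold=0.8):
    """Run the part model first; continue with the full model if unsure.

    Returns (predicted_class_index, stage), where stage is "part" or "full".
    """
    # Stage 1: the part model processes the sample.
    part_features = part_extractor(sample)
    part_probs = part_classifier(part_features)
    confidence = max(part_probs)  # assumed confidence score
    if confidence >= threshold:
        return part_probs.index(confidence), "part"

    # Stage 2: confidence fell below the threshold, so the full model
    # resumes from the intermediate features and finishes the inference.
    full_features = rest_of_extractor(part_features)
    full_probs = full_classifier(full_features)
    return full_probs.index(max(full_probs)), "full"
```

Raising `threshold` sends more samples to the full model, which is one way to read the adjustable stage-decision ratio mentioned in the abstract: higher accuracy at the cost of more energy.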

Title: FluoroSAM: A Language-aligned Foundation Model for X-ray Image Segmentation

Authors: Benjamin D. Killeen, Liam J. Wang, Han Zhang, Mehran Armand, Russell H. Taylor, Greg Osgood, Mathias Unberath
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.08059
Pdf URL: https://arxiv.org/pdf/2403.08059
Copy Paste: [[2403.08059]] FluoroSAM: A Language-aligned Foundation Model for X-ray Image Segmentation(https://arxiv.org/abs/2403.08059)
Keywords: foundation model
Abstract: Automated X-ray image segmentation would accelerate research and development in diagnostic and interventional precision medicine. Prior efforts have contributed task-specific models capable of solving specific image analysis problems, but the utility of these models is restricted to their particular task domain, and expanding to broader use requires additional data, labels, and retraining efforts. Recently, foundation models (FMs) -- machine learning models trained on large amounts of highly variable data thus enabling broad applicability -- have emerged as promising tools for automated image analysis. Existing FMs for medical image analysis focus on scenarios and modalities where objects are clearly defined by visually apparent boundaries, such as surgical tool segmentation in endoscopy. X-ray imaging, by contrast, does not generally offer such clearly delineated boundaries or structure priors. During X-ray image formation, complex 3D structures are projected in transmission onto the imaging plane, resulting in overlapping features of varying opacity and shape. To pave the way toward an FM for comprehensive and automated analysis of arbitrary medical X-ray images, we develop FluoroSAM, a language-aligned variant of the Segment-Anything Model, trained from scratch on 1.6M synthetic X-ray images. FluoroSAM is trained on data including masks for 128 organ types and 464 non-anatomical objects, such as tools and implants. In real X-ray images of cadaveric specimens, FluoroSAM is able to segment bony anatomical structures based on text-only prompting with 0.51 and 0.79 DICE with point-based refinement, outperforming competing SAM variants for all structures. FluoroSAM is also capable of zero-shot generalization to segmenting classes beyond the training set thanks to its language alignment, which we demonstrate for full lung segmentation on real chest X-rays.

Title: Mitigating the Impact of Attribute Editing on Face Recognition

Authors: Sudipta Banerjee, Sai Pranaswi Mullangi, Shruti Wagle, Chinmay Hegde, Nasir Memon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08092
Pdf URL: https://arxiv.org/pdf/2403.08092
Copy Paste: [[2403.08092]] Mitigating the Impact of Attribute Editing on Face Recognition(https://arxiv.org/abs/2403.08092)
Keywords: generative
Abstract: Facial attribute editing using generative models can impair automated face recognition. This degradation persists even with recent identity-preserving models such as InstantID. To mitigate this issue, we propose two techniques that perform local and global attribute editing. Local editing operates on the finer details via a regularization-free method based on ControlNet conditioned on depth maps and auxiliary semantic segmentation masks. Global editing operates on coarser details via a regularization-based method guided by custom loss and regularization set. In this work, we empirically ablate twenty-six facial semantic, demographic and expression-based attributes altered using state-of-the-art generative models and evaluate them using ArcFace and AdaFace matchers on CelebA, CelebAMaskHQ and LFW datasets. Finally, we use LLaVA, a vision-language framework for attribute prediction to validate our editing techniques. Our methods outperform SoTA (BLIP, InstantID) at facial editing while retaining identity.

Title: BAGEL: Bootstrapping Agents by Guiding Exploration with Language

Authors: Shikhar Murty, Christopher Manning, Peter Shaw, Mandar Joshi, Kenton Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.08140
Pdf URL: https://arxiv.org/pdf/2403.08140
Copy Paste: [[2403.08140]] BAGEL: Bootstrapping Agents by Guiding Exploration with Language(https://arxiv.org/abs/2403.08140)
Keywords: in-context
Abstract: Following natural language instructions by executing actions in digital environments (e.g. web-browsers and REST APIs) is a challenging task for language model (LM) agents. Unfortunately, LM agents often fail to generalize to new environments without human demonstrations. This work presents BAGEL, a method for bootstrapping LM agents without human supervision. BAGEL converts a seed set of randomly explored trajectories or synthetic instructions into demonstrations via round-trips between two noisy LM components: an LM labeler, which converts a trajectory into a synthetic instruction, and a zero-shot LM agent, which maps the synthetic instruction into a refined trajectory. By performing these round-trips iteratively, BAGEL quickly converts the initial distribution of trajectories towards those that are well-described by natural language. We use BAGEL demonstrations to adapt a zero-shot LM agent at test time via in-context learning over retrieved demonstrations, and find improvements of over 2-13% absolute on ToolQA and MiniWob++, with up to 13x reduction in execution failures.
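
The round-trip procedure in this abstract can be sketched as a simple loop that alternates between the two components. This is an illustrative sketch under assumptions: `label` and `act` are hypothetical callables standing in for the LM labeler and the zero-shot LM agent, and the fixed number of `rounds` is a placeholder, since the abstract does not give a stopping rule.

```python
def bagel_round_trips(seed_trajectory, label, act, rounds=3):
    """Alternate labeler and agent to turn a seed trajectory into a demonstration.

    label: trajectory -> synthetic instruction   (hypothetical LM labeler)
    act:   instruction -> refined trajectory      (hypothetical zero-shot agent)
    Returns the final (instruction, trajectory) pair.
    """
    trajectory = seed_trajectory
    instruction = None
    for _ in range(rounds):
        instruction = label(trajectory)   # describe what the trajectory did
        trajectory = act(instruction)     # re-execute the description
    return instruction, trajectory
```

With toy stand-ins, for example a labeler that drops no-op steps and an agent that splits the instruction back into steps, the pair settles after one round; real LM components are noisy, which is why the abstract describes repeating the round-trips.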

Title: ShadowRemovalNet: Efficient Real-Time Shadow Removal

Authors: Alzayat Saleh, Alex Olsen, Jake Wood, Bronson Philippa, Mostafa Rahimi Azghadi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08142
Pdf URL: https://arxiv.org/pdf/2403.08142
Copy Paste: [[2403.08142]] ShadowRemovalNet: Efficient Real-Time Shadow Removal(https://arxiv.org/abs/2403.08142)
Keywords: generative
Abstract: Shadows significantly impact computer vision tasks, particularly in outdoor environments. State-of-the-art shadow removal methods are typically too computationally intensive for real-time image processing on edge hardware. We propose ShadowRemovalNet, a novel method designed for real-time image processing on resource-constrained hardware. ShadowRemovalNet achieves significantly higher frame rates compared to existing methods, making it suitable for real-time computer vision pipelines like those used in field robotics. Beyond speed, ShadowRemovalNet offers advantages in efficiency and simplicity, as it does not require a separate shadow mask during inference. ShadowRemovalNet also addresses challenges associated with Generative Adversarial Networks (GANs) for shadow removal, including artefacts, inaccurate mask estimations, and inconsistent supervision between shadow and boundary pixels. To address these limitations, we introduce a novel loss function that substantially reduces shadow removal errors. ShadowRemovalNet's efficiency and straightforwardness make it a robust and effective solution for real-time shadow removal in outdoor robotics and edge computing applications.

Title: LAFS: Landmark-based Facial Self-supervised Learning for Face Recognition

Authors: Zhonglin Sun, Chen Feng, Ioannis Patras, Georgios Tzimiropoulos
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.08161
Pdf URL: https://arxiv.org/pdf/2403.08161
Copy Paste: [[2403.08161]] LAFS: Landmark-based Facial Self-supervised Learning for Face Recognition(https://arxiv.org/abs/2403.08161)
Keywords: self-supervised
Abstract: In this work we focus on learning facial representations that can be adapted to train effective face recognition models, particularly in the absence of labels. Firstly, compared with existing labelled face datasets, a vastly larger magnitude of unlabeled faces exists in the real world. We explore the learning strategy of these unlabeled facial images through self-supervised pretraining to transfer generalized face recognition performance. Moreover, motivated by one recent finding, that is, the face saliency area is critical for face recognition, in contrast to utilizing random cropped blocks of images for constructing augmentations in pretraining, we utilize patches localized by extracted facial landmarks. This enables our method, namely LAndmark-based Facial Self-supervised learning (LAFS), to learn key representations that are more critical for face recognition. We also incorporate two landmark-specific augmentations which introduce more diversity of landmark information to further regularize the learning. With learned landmark-based facial representations, we further adapt the representation for face recognition with regularization mitigating variations in landmark positions. Our method achieves significant improvement over the state-of-the-art on multiple face recognition benchmarks, especially on more challenging few-shot scenarios.

Title: PAGE: Domain-Incremental Adaptation with Past-Agnostic Generative Replay for Smart Healthcare

Authors: Chia-Hao Li, Niraj K. Jha
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.08197
Pdf URL: https://arxiv.org/pdf/2403.08197
Copy Paste: [[2403.08197]] PAGE: Domain-Incremental Adaptation with Past-Agnostic Generative Replay for Smart Healthcare(https://arxiv.org/abs/2403.08197)
Keywords: generative
Abstract: We propose PAGE, a domain-incremental adaptation strategy with past-agnostic generative replay for smart healthcare. PAGE enables generative replay without the aid of any preserved data or information from prior domains. When adapting to a new domain, it exploits real data from the new distribution and the current model to generate synthetic data that retain the learned knowledge of previous domains. By replaying the synthetic data with the new real data during training, PAGE achieves a good balance between domain adaptation and knowledge retention. In addition, we incorporate an extended inductive conformal prediction (EICP) method into PAGE to produce a confidence score and a credibility value for each detection result. This makes the predictions interpretable and provides statistical guarantees for disease detection in smart healthcare applications. We demonstrate PAGE's effectiveness in domain-incremental disease detection with three distinct disease datasets collected from commercially available WMSs. PAGE achieves highly competitive performance against state-of-the-art with superior scalability, data privacy, and feasibility. Furthermore, PAGE can enable up to 75% reduction in clinical workload with the help of EICP.
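
The confidence score and credibility value mentioned in this abstract can be illustrated with standard inductive conformal prediction (ICP). This is a sketch of the textbook procedure, not the extended EICP variant the paper proposes, whose details are not given here; the function name and the input layout (`calibration_scores`, `test_scores_by_label`) are assumptions for the example.

```python
def icp_predict(calibration_scores, test_scores_by_label):
    """Standard inductive conformal prediction for one test sample.

    calibration_scores:   nonconformity scores from a held-out calibration set
    test_scores_by_label: {label: nonconformity score of the test sample
                           if it were assigned that label}
    Returns (predicted_label, confidence, credibility).
    """
    n = len(calibration_scores)
    p_values = {}
    for label, score in test_scores_by_label.items():
        # Fraction of calibration samples at least as nonconforming as the test one.
        at_least = sum(1 for c in calibration_scores if c >= score)
        p_values[label] = (at_least + 1) / (n + 1)

    ranked = sorted(p_values, key=p_values.get, reverse=True)
    best = ranked[0]
    credibility = p_values[best]                        # largest p-value
    second = p_values[ranked[1]] if len(ranked) > 1 else 0.0
    confidence = 1.0 - second                           # 1 - second-largest p-value
    return best, confidence, credibility
```

A low credibility flags a sample that does not resemble the calibration data under any label. Samples like that could be routed to a clinician while confident ones are handled automatically, which is one plausible reading of the workload reduction the abstract reports.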

Title: PaddingFlow: Improving Normalizing Flows with Padding-Dimensional Noise

Authors: Qinglong Meng, Chongkun Xia, Xueqian Wang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2403.08216
Pdf URL: https://arxiv.org/pdf/2403.08216
Copy Paste: [[2403.08216]] PaddingFlow: Improving Normalizing Flows with Padding-Dimensional Noise(https://arxiv.org/abs/2403.08216)
Keywords: generative
Abstract: Normalizing flow is a generative modeling approach with efficient sampling. However, flow-based models suffer from two issues: manifold data and discrete data. If the target distribution is a manifold, which means the dimension of the latent target distribution and the dimension of the data distribution are unmatched, flow-based models might perform badly. Discrete data makes flow-based models collapse into a degenerate mixture of point masses. In this paper, to sidestep these two issues, we propose PaddingFlow, a novel dequantization method, which improves normalizing flows with padding-dimensional noise. PaddingFlow is easy to implement, computationally cheap, widely suitable for various tasks, and generates samples that are unbiased estimations of the data. In particular, our method overcomes the limitation of existing dequantization methods that have to change the data distribution, which might degrade performance. We validate our method on the main benchmarks of unconditional density estimation, including five tabular datasets and four image datasets for VAE models, and on the IK experiments, which are conditional density estimation. The results show that PaddingFlow can provide improvement on all tasks in this paper.
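
One way to read "padding-dimensional noise" is that noise goes into extra coordinates appended to each data point, so the original coordinates stay untouched. The sketch below rests on that assumption; `pad_with_noise`, `strip_padding`, and the Gaussian noise choice are hypothetical names and settings, not the paper's implementation.

```python
import random


def pad_with_noise(x, pad_dims, sigma=1.0, rng=random):
    """Append pad_dims Gaussian noise coordinates to the data point x.

    The original coordinates are copied unchanged; only the padding is noisy.
    """
    return list(x) + [rng.gauss(0.0, sigma) for _ in range(pad_dims)]


def strip_padding(y, data_dims):
    """Drop the padding coordinates, e.g. after sampling from a trained flow."""
    return list(y[:data_dims])
```

Under this reading, a flow would be trained on the padded points, and generated samples would be cut back to the first `data_dims` coordinates.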

Title: Boosting Disfluency Detection with Large Language Model as Disfluency Generator

Authors: Zhenrong Cheng, Jiayan Guo, Hao Sun, Yan Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.08229
Pdf URL: https://arxiv.org/pdf/2403.08229
Copy Paste: [[2403.08229]] Boosting Disfluency Detection with Large Language Model as Disfluency Generator(https://arxiv.org/abs/2403.08229)
Keywords: generative
Abstract: Current disfluency detection methods heavily rely on costly and scarce human-annotated data. To tackle this issue, some approaches employ heuristic or statistical features to generate disfluent sentences, partially improving detection performance. However, these sentences often deviate from real-life scenarios, constraining overall model enhancement. In this study, we propose a lightweight data augmentation approach for disfluency detection, utilizing the superior generative and semantic understanding capabilities of large language model (LLM) to generate disfluent sentences as augmentation data. We leverage LLM to generate diverse and more realistic sentences guided by specific prompts, without the need for fine-tuning the LLM. Subsequently, we apply an uncertainty-aware data filtering approach to improve the quality of the generated sentences, utilized in training a small detection model for improved performance. Experiments using enhanced data yielded state-of-the-art results. The results showed that using a small amount of LLM-generated enhanced data can significantly improve performance, thereby further enhancing cost-effectiveness.

Title: Point Cloud Compression via Constrained Optimal Transport

Authors: Zezeng Li, Weimin Wang, Ziliang Wang, Na Lei
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2403.08236
Pdf URL: https://arxiv.org/pdf/2403.08236
Copy Paste: [[2403.08236]] Point Cloud Compression via Constrained Optimal Transport(https://arxiv.org/abs/2403.08236)
Keywords: generative
Abstract: This paper presents a novel point cloud compression method COT-PCC by formulating the task as a constrained optimal transport (COT) problem. COT-PCC takes the bitrate of compressed features as an extra constraint of optimal transport (OT) which learns the distribution transformation between original and reconstructed points. Specifically, the formulated COT is implemented with a generative adversarial network (GAN) and a bitrate loss for training. The discriminator measures the Wasserstein distance between input and reconstructed points, and a generator calculates the optimal mapping between distributions of input and reconstructed point cloud. Moreover, we introduce a learnable sampling module for downsampling in the compression procedure. Extensive results on both sparse and dense point cloud datasets demonstrate that COT-PCC outperforms state-of-the-art methods in terms of both CD and PSNR metrics. Source codes are available at https://github.com/cognaclee/PCC-COT.
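
The CD metric above presumably refers to the Chamfer distance between the input and reconstructed point clouds. The sketch below uses one common convention (mean squared nearest-neighbour distance, summed over both directions). Other conventions exist, and the abstract does not say which one the paper reports, so treat this as an illustration only.

```python
def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance between two non-empty point clouds.

    Each cloud is a sequence of equal-length coordinate tuples. This brute-force
    version costs O(len(cloud_a) * len(cloud_b)) and suits small clouds only.
    """
    def nearest_sq(point, cloud):
        # Squared Euclidean distance from point to its nearest neighbour in cloud.
        return min(sum((p - q) ** 2 for p, q in zip(point, other)) for other in cloud)

    a_to_b = sum(nearest_sq(p, cloud_b) for p in cloud_a) / len(cloud_a)
    b_to_a = sum(nearest_sq(q, cloud_a) for q in cloud_b) / len(cloud_b)
    return a_to_b + b_to_a
```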

Title: Make Me Happier: Evoking Emotions Through Image Diffusion Models

Authors: Qing Lin, Jingfeng Zhang, Yew Soon Ong, Mengmi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08255
Pdf URL: https://arxiv.org/pdf/2403.08255
Copy Paste: [[2403.08255]] Make Me Happier: Evoking Emotions Through Image Diffusion Models(https://arxiv.org/abs/2403.08255)
Keywords: diffusion
Abstract: Despite the rapid progress in image generation, emotional image editing remains under-explored. The semantics, context, and structure of an image can evoke emotional responses, making emotional image editing techniques valuable for various real-world applications, including treatment of psychological disorders, commercialization of products, and artistic design. For the first time, we present a novel challenge of emotion-evoked image generation, aiming to synthesize images that evoke target emotions while retaining the semantics and structures of the original scenes. To address this challenge, we propose a diffusion model capable of effectively understanding and editing source images to convey desired emotions and sentiments. Moreover, due to the lack of emotion editing datasets, we provide a unique dataset consisting of 340,000 pairs of images and their emotion annotations. Furthermore, we conduct human psychophysics experiments and introduce four new evaluation metrics to systematically benchmark all the methods. Experimental results demonstrate that our method surpasses all competitive baselines. Our diffusion model is capable of identifying emotional cues from original images, editing images that elicit desired emotions, and meanwhile, preserving the semantic structure of the original images. All code, model, and data will be made public.

Title: CoroNetGAN: Controlled Pruning of GANs via Hypernetworks

Authors: Aman Kumar, Khushboo Anand, Shubham Mandloi, Ashutosh Mishra, Avinash Thakur, Neeraj Kasera, Prathosh A P
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2403.08261
Pdf URL: https://arxiv.org/pdf/2403.08261
Copy Paste: [[2403.08261]] CoroNetGAN: Controlled Pruning of GANs via Hypernetworks(https://arxiv.org/abs/2403.08261)
Keywords: generative
Abstract: Generative Adversarial Networks (GANs) have proven to exhibit remarkable performance and are widely used across many generative computer vision applications. However, the unprecedented demand for the deployment of GANs on resource-constrained edge devices still poses a challenge due to the huge number of parameters involved in the generation process. This has led to focused attention on the area of compressing GANs. Most of the existing works use knowledge distillation with the overhead of teacher dependency. Moreover, there is no ability to control the degree of compression in these methods. Hence, we propose CoroNetGAN for compressing GANs using the combined strength of differentiable pruning via hypernetworks. The proposed method provides the advantage of performing controllable compression while training, along with reducing training time by a substantial factor. Experiments have been done on various conditional GAN architectures (Pix2Pix and CycleGAN) to signify the effectiveness of our approach on multiple benchmark datasets such as Edges-to-Shoes, Horse-to-Zebra and Summer-to-Winter. The results obtained illustrate that our approach outperforms the baselines on Zebra-to-Horse and Summer-to-Winter, achieving the best FID scores of 32.3 and 72.3 respectively, yielding high-fidelity images across all the datasets. Additionally, our approach also outperforms the state-of-the-art methods in achieving better inference time on various smart-phone chipsets and data-types, making it a feasible solution for deployment on edge devices.

Title: Sketch2Manga: Shaded Manga Screening from Sketch with Diffusion Models

Authors: Jian Lin, Xueting Liu, Chengze Li, Minshan Xie, Tien-Tsin Wong
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2403.08266
Pdf URL: https://arxiv.org/pdf/2403.08266
Copy Paste: [[2403.08266]] Sketch2Manga: Shaded Manga Screening from Sketch with Diffusion Models(https://arxiv.org/abs/2403.08266)
Keywords: diffusion
Abstract: While manga is a popular entertainment form, creating manga is tedious, especially adding screentones to the created sketch, namely manga screening. Unfortunately, there is no existing method that is tailored for automatic manga screening, probably due to the difficulty of generating high-quality shaded high-frequency screentones. The classic manga screening approaches generally require user input to provide screentone exemplars or a reference manga image. Recent deep learning models enable automatic generation by learning from a large-scale dataset. However, the state-of-the-art models still fail to generate high-quality shaded screentones due to the lack of a tailored model and high-quality manga training data. In this paper, we propose a novel sketch-to-manga framework that first generates a color illustration from the sketch and then generates a screentoned manga based on the intensity guidance. Our method significantly outperforms existing methods in generating high-quality manga with shaded high-frequency screentones.

Title: RECIPE4U: Student-ChatGPT Interaction Dataset in EFL Writing Education

Authors: Jieun Han, Haneul Yoo, Junho Myung, Minsun Kim, Tak Yeon Lee, So-Yeon Ahn, Alice Oh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.08272
Pdf URL: https://arxiv.org/pdf/2403.08272
Copy Paste: [[2403.08272]] RECIPE4U: Student-ChatGPT Interaction Dataset in EFL Writing Education(https://arxiv.org/abs/2403.08272)
Keywords: generative
Abstract: The integration of generative AI in education is expanding, yet empirical analyses of large-scale and real-world interactions between students and AI systems still remain limited. Addressing this gap, we present RECIPE4U (RECIPE for University), a dataset sourced from a semester-long experiment with 212 college students in English as Foreign Language (EFL) writing courses. During the study, students engaged in dialogues with ChatGPT to revise their essays. RECIPE4U includes comprehensive records of these interactions, including conversation logs, students' intent, students' self-rated satisfaction, and students' essay edit histories. In particular, we annotate the students' utterances in RECIPE4U with 13 intention labels based on our coding schemes. We establish baseline results for two subtasks in task-oriented dialogue systems within educational contexts: intent detection and satisfaction estimation. As a foundational step, we explore student-ChatGPT interaction patterns through RECIPE4U and analyze them by focusing on students' dialogue, essay data statistics, and students' essay edits. We further illustrate potential applications of RECIPE4U dataset for enhancing the incorporation of LLMs in educational frameworks. RECIPE4U is publicly available at https://zeunie.github.io/RECIPE4U/.

Title: VIGFace: Virtual Identity Generation Model for Face Image Synthesis

Authors: Minsoo Kim, Min-Cheol Sagong, Gi Pyo Nam, Junghyun Cho, Ig-Jae Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08277
Pdf URL: https://arxiv.org/pdf/2403.08277
Copy Paste: [[2403.08277]] VIGFace: Virtual Identity Generation Model for Face Image Synthesis(https://arxiv.org/abs/2403.08277)
Keywords: diffusion
Abstract: Deep learning-based face recognition continues to face challenges due to its reliance on huge datasets obtained from web crawling, which can be costly to gather and raise significant real-world privacy concerns. To address this issue, we propose VIGFace, a novel framework capable of generating synthetic facial images. Initially, we train the face recognition model using a real face dataset and create a feature space for both real and virtual IDs where virtual prototypes are orthogonal to other prototypes. Subsequently, we generate synthetic images by using the diffusion model based on the feature space. Our proposed framework provides two significant benefits. Firstly, it allows for creating virtual facial images without concerns about portrait rights, guaranteeing that the generated virtual face images are clearly differentiated from existing individuals. Secondly, it serves as an effective augmentation method by incorporating real existing images. Further experiments demonstrate the efficacy of our framework, achieving state-of-the-art results from both perspectives without any external data.

Title: Generative Pretrained Structured Transformers: Unsupervised Syntactic Language Models at Scale

Authors: Xiang Hu, Pengyu Ji, Qingyang Zhu, Wei Wu, Kewei Tu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.08293
Pdf URL: https://arxiv.org/pdf/2403.08293
Copy Paste: [[2403.08293]] Generative Pretrained Structured Transformers: Unsupervised Syntactic Language Models at Scale(https://arxiv.org/abs/2403.08293)
Keywords: generative
Abstract: A syntactic language model (SLM) incrementally generates a sentence with its syntactic tree in a left-to-right manner. We present Generative Pretrained Structured Transformers (GPST), an unsupervised SLM at scale capable of being pre-trained from scratch on raw texts with high parallelism. GPST circumvents the limitations of previous SLMs such as relying on gold trees and sequential training. It consists of two components, a usual SLM supervised by a uni-directional language modeling loss, and an additional composition model, which induces syntactic parse trees and computes constituent representations, supervised by a bi-directional language modeling loss. We propose a representation surrogate to enable joint parallel training of the two models in a hard-EM fashion. We pre-train GPST on OpenWebText, a corpus with 9 billion tokens, and demonstrate the superiority of GPST over GPT-2 with a comparable size in numerous tasks covering both language understanding and language generation. Meanwhile, GPST also significantly outperforms existing unsupervised SLMs on left-to-right grammar induction, while holding a substantial acceleration on training.

Title: Attack Deterministic Conditional Image Generative Models for Diverse and Controllable Generation

Authors: Tianyi Chu, Wei Xing, Jiafu Chen, Zhizhong Wang, Jiakai Sun, Lei Zhao, Haibo Chen, Huaizhong Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08294
Pdf URL: https://arxiv.org/pdf/2403.08294
Copy Paste: [[2403.08294]] Attack Deterministic Conditional Image Generative Models for Diverse and Controllable Generation(https://arxiv.org/abs/2403.08294)
Keywords: generative
Abstract: Existing generative adversarial network (GAN) based conditional image generative models typically produce fixed output for the same conditional input, which is unreasonable for highly subjective tasks, such as large-mask image inpainting or style transfer. On the other hand, GAN-based diverse image generative methods require retraining/fine-tuning the network or designing complex noise injection functions, which makes them computationally expensive, task-specific, or unable to generate high-quality results. Given that many deterministic conditional image generative models have been able to produce high-quality yet fixed results, we raise an intriguing question: is it possible for pre-trained deterministic conditional image generative models to generate diverse results without changing network structures or parameters? To answer this question, we re-examine the conditional image generation tasks from the perspective of adversarial attack and propose a simple and efficient plug-in projected gradient descent (PGD) like method for diverse and controllable image generation. The key idea is attacking the pre-trained deterministic generative models by adding a micro perturbation to the input condition. In this way, diverse results can be generated without any adjustment of network structures or fine-tuning of the pre-trained models. In addition, we can also control the diverse results to be generated by specifying the attack direction according to a reference text or image. Our work opens the door to applying adversarial attack to low-level vision tasks, and experiments on various conditional image generation tasks demonstrate the effectiveness and superiority of the proposed method.

Title: Nonlinear Manifold Learning Determines Microgel Size from Raman Spectroscopy

Authors: Eleni D. Koronaki, Luise F. Kaven, Johannes M. M. Faust, Ioannis G. Kevrekidis, Alexander Mitsos
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2403.08376
Pdf URL: https://arxiv.org/pdf/2403.08376
Copy Paste: [[2403.08376]] Nonlinear Manifold Learning Determines Microgel Size from Raman Spectroscopy(https://arxiv.org/abs/2403.08376)
Keywords: diffusion
Abstract: Polymer particle size constitutes a crucial characteristic of product quality in polymerization. Raman spectroscopy is an established and reliable process analytical technology for in-line concentration monitoring. Recent approaches and some theoretical considerations show a correlation between Raman signals and particle sizes but do not determine polymer size from Raman spectroscopic measurements accurately and reliably. With this in mind, we propose three alternative machine learning workflows to perform this task, all involving diffusion maps, a nonlinear manifold learning technique for dimensionality reduction: (i) directly from diffusion maps, (ii) alternating diffusion maps, and (iii) conformal autoencoder neural networks. We apply the workflows to a data set of Raman spectra with associated size measured via dynamic light scattering of 47 microgel (cross-linked polymer) samples in a diameter range of 208 nm to 483 nm. The conformal autoencoders substantially outperform state-of-the-art methods and result, for the first time, in a promising prediction of polymer size from Raman spectra.

Title: Mitigate Target-level Insensitivity of Infrared Small Target Detection via Posterior Distribution Modeling

Authors: Haoqing Li, Jinfu Yang, Yifei Xu, Runshi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08380
Pdf URL: https://arxiv.org/pdf/2403.08380
Copy Paste: [[2403.08380]] Mitigate Target-level Insensitivity of Infrared Small Target Detection via Posterior Distribution Modeling(https://arxiv.org/abs/2403.08380)
Keywords: diffusion, generative
Abstract: Infrared Small Target Detection (IRSTD) aims to segment small targets from infrared clutter background. Existing methods mainly focus on discriminative approaches, i.e., a pixel-level front-background binary segmentation. Since infrared targets are small and have a low signal-to-clutter ratio, the empirical risk is barely disturbed when some false alarms and missed detections exist, which seriously hinders the further improvement of such methods. Motivated by the dense prediction generative methods, in this paper, we propose a diffusion model framework for Infrared Small Target Detection which compensates pixel-level discriminant with mask posterior distribution modeling. Furthermore, we design a Low-frequency Isolation in the wavelet domain to suppress the interference of intrinsic infrared noise on the diffusion noise estimation. This transition from the discriminative paradigm to the generative one enables us to bypass the target-level insensitivity. Experiments show that the proposed method achieves competitive performance gains over state-of-the-art methods on NUAA-SIRST, IRSTD-1k, and NUDT-SIRST datasets. Code is available at https://github.com/Li-Haoqing/IRSTD-Diff.

Title: Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models

Authors: Pengze Zhang, Hubery Yin, Chen Li, Xiaohua Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08381
Pdf URL: https://arxiv.org/pdf/2403.08381
Copy Paste: [[2403.08381]] Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models(https://arxiv.org/abs/2403.08381)
Keywords: diffusion
Abstract: Most diffusion models assume that the reverse process adheres to a Gaussian distribution. However, this approximation has not been rigorously validated, especially at singularities, where t=0 and t=1. Improperly dealing with such singularities leads to an average brightness issue in applications, and limits the generation of images with extreme brightness or darkness. We primarily focus on tackling singularities from both theoretical and practical perspectives. Initially, we establish the error bounds for the reverse process approximation, and showcase its Gaussian characteristics at singularity time steps. Based on this theoretical insight, we confirm that the singularity at t=1 is conditionally removable while the one at t=0 is an inherent property. Building on these conclusions, we propose a novel plug-and-play method, SingDiffusion, to address the initial singular time step sampling, which not only effectively resolves the average brightness issue for a wide range of diffusion models without extra training efforts, but also enhances their generation capability by achieving notably lower FID scores. Code and models are released at https://github.com/PangzeCheung/SingDiffusion.

Title: Iterative Online Image Synthesis via Diffusion Model for Imbalanced Classification

Authors: Shuhan Li, Yi Lin, Hao Chen, Kwang-Ting Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08407
Pdf URL: https://arxiv.org/pdf/2403.08407
Copy Paste: [[2403.08407]] Iterative Online Image Synthesis via Diffusion Model for Imbalanced Classification(https://arxiv.org/abs/2403.08407)
Keywords: diffusion
Abstract: Accurate and robust classification of diseases is important for proper diagnosis and treatment. However, medical datasets often face challenges related to limited sample sizes and inherent imbalanced distributions, due to difficulties in data collection and variations in disease prevalence across different types. In this paper, we introduce an Iterative Online Image Synthesis (IOIS) framework to address the class imbalance problem in medical image classification. Our framework incorporates two key modules, namely Online Image Synthesis (OIS) and Accuracy Adaptive Sampling (AAS), which collectively target the imbalance classification issue at both the instance level and the class level. The OIS module alleviates the data insufficiency problem by generating representative samples tailored for online training of the classifier. On the other hand, the AAS module dynamically balances the synthesized samples among various classes, targeting those with low training accuracy. To evaluate the effectiveness of our proposed method in addressing imbalanced classification, we conduct experiments on the HAM10000 and APTOS datasets. The results obtained demonstrate the superiority of our approach over state-of-the-art methods as well as the effectiveness of each component. The source code will be released upon acceptance.
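
The idea of steering synthesized samples toward classes with low training accuracy can be sketched as a simple allocation rule. The proportional-to-error rule and the name `synthesis_quota` are assumptions for illustration, since the abstract does not state how the AAS module computes its balance.

```python
def synthesis_quota(class_accuracy, total_samples):
    """Split total_samples across classes in proportion to training error.

    class_accuracy maps a class name to its current training accuracy in [0, 1].
    Quotas are rounded per class, so they may not sum exactly to total_samples.
    """
    errors = {c: max(0.0, 1.0 - acc) for c, acc in class_accuracy.items()}
    total_error = sum(errors.values())
    if total_error == 0:
        # Every class is already perfectly fit: fall back to an even split.
        share = total_samples // len(errors)
        return {c: share for c in errors}
    return {c: round(total_samples * e / total_error) for c, e in errors.items()}
```

In an online setting, the quotas would be recomputed as the classifier's per-class accuracy changes during training.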

Title: Low-Cost and Real-Time Industrial Human Action Recognitions Based on Large-Scale Foundation Models

Authors: Wensheng Liang, Ruiyan Zhuang, Xianwei Shi, Shuai Li, Zhicheng Wang, Xiaoguang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08420
Pdf URL: https://arxiv.org/pdf/2403.08420
Copy Paste: [[2403.08420]] Low-Cost and Real-Time Industrial Human Action Recognitions Based on Large-Scale Foundation Models(https://arxiv.org/abs/2403.08420)
Keywords: foundation model
Abstract: Industrial management, including quality control and cost and safety optimization, relies heavily on high-quality industrial human action recognitions (IHARs), which have been hard to implement in large-scale industrial scenes due to their high costs and poor real-time performance. In this paper, we propose a large-scale foundation model (LSFM)-based IHAR method, wherein various LSFMs and lightweight methods are jointly used, for the first time, to achieve low-cost dataset establishment and real-time IHARs. Comprehensive tests on in-situ large-scale industrial manufacturing lines show that the proposed method achieves a large reduction in employment costs, superior real-time performance, and satisfactory accuracy and generalization capabilities, indicating its great potential as a backbone IHAR method, especially for large-scale industrial applications.

Title: PFStorer: Personalized Face Restoration and Super-Resolution

Authors: Tuomas Varanka, Tapani Toivonen, Soumya Tripathy, Guoying Zhao, Erman Acar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08436
Pdf URL: https://arxiv.org/pdf/2403.08436
Copy Paste: [[2403.08436]] PFStorer: Personalized Face Restoration and Super-Resolution(https://arxiv.org/abs/2403.08436)
Keywords: diffusion, generative
Abstract: Recent developments in face restoration have achieved remarkable results in producing high-quality and lifelike outputs. The stunning results, however, often fail to be faithful with respect to the identity of the person as the models lack necessary context. In this paper, we explore the potential of personalized face restoration with diffusion models. In our approach a restoration model is personalized using a few images of the identity, leading to tailored restoration with respect to the identity while retaining fine-grained details. By using independent trainable blocks for personalization, the rich prior of a base restoration model can be exploited to its fullest. To avoid the model relying on parts of identity left in the conditioning low-quality images, a generative regularizer is employed. With a learnable parameter, the model learns to balance between the details generated based on the input image and the degree of personalization. Moreover, we improve the training pipeline of face restoration models to enable an alignment-free approach. We showcase the robust capabilities of our approach in several real-world scenarios with multiple identities, demonstrating our method's ability to generate fine-grained details with faithful restoration. In the user study we evaluate the perceptual quality and faithfulness of the generated details, with our method being voted best 61% of the time compared to the second best with 25% of the votes.

Title: Towards Dense and Accurate Radar Perception Via Efficient Cross-Modal Diffusion Model

Authors: Ruibin Zhang, Donglai Xue, Yuhan Wang, Ruixu Geng, Fei Gao
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2403.08460
Pdf URL: https://arxiv.org/pdf/2403.08460
Copy Paste: [[2403.08460]] Towards Dense and Accurate Radar Perception Via Efficient Cross-Modal Diffusion Model(https://arxiv.org/abs/2403.08460)
Keywords: diffusion, generative
Abstract: Millimeter wave (mmWave) radars have attracted significant attention from both academia and industry due to their capability to operate in extreme weather conditions. However, they face challenges in terms of sparsity and noise interference, which hinder their application in the field of micro aerial vehicle (MAV) autonomous navigation. To this end, this paper proposes a novel approach to dense and accurate mmWave radar point cloud construction via cross-modal learning. Specifically, we introduce diffusion models, which possess state-of-the-art performance in generative modeling, to predict LiDAR-like point clouds from paired raw radar data. We also incorporate the most recent diffusion model inference accelerating techniques to ensure that the proposed method can be implemented on MAVs with limited computing resources. We validate the proposed method through extensive benchmark comparisons and real-world experiments, demonstrating its superior performance and generalization ability. Code and pretrained models will be available at https://github.com/ZJU-FAST-Lab/Radar-Diffusion.

Title: An Analysis of Human Alignment of Latent Diffusion Models

Authors: Lorenz Linhardt, Marco Morik, Sidney Bender, Naima Elosegui Borras
Subjects: cs.LG, cs.HC
Abstract URL: https://arxiv.org/abs/2403.08469
Pdf URL: https://arxiv.org/pdf/2403.08469
Copy Paste: [[2403.08469]] An Analysis of Human Alignment of Latent Diffusion Models(https://arxiv.org/abs/2403.08469)
Keywords: diffusion
Abstract: Diffusion models, trained on large amounts of data, showed remarkable performance for image synthesis. They have high error consistency with humans and low texture bias when used for classification. Furthermore, prior work demonstrated the decomposability of their bottleneck layer representations into semantic directions. In this work, we analyze how well such representations are aligned to human responses on a triplet odd-one-out task. We find that despite the aforementioned observations: I) The representational alignment with humans is comparable to that of models trained only on ImageNet-1k. II) The most aligned layers of the denoiser U-Net are intermediate layers and not the bottleneck. III) Text conditioning greatly improves alignment at high noise levels, hinting at the importance of abstract textual information, especially in the early stage of generation.
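
A triplet odd-one-out evaluation is commonly scored by letting the model pick the item left out of the most similar pair and checking that choice against the human one. The sketch below follows that convention with cosine similarity. The similarity measure, function names, and inputs are assumptions, not details taken from the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def odd_one_out(triplet):
    """Return the index (0, 1, or 2) of the item outside the most similar pair."""
    a, b, c = triplet
    # Key = index of the item left out; value = similarity of the remaining pair.
    pair_similarity = {2: cosine(a, b), 1: cosine(a, c), 0: cosine(b, c)}
    return max(pair_similarity, key=pair_similarity.get)


def alignment_accuracy(triplets, human_choices):
    """Fraction of triplets where the model's odd-one-out matches the human choice."""
    hits = sum(1 for t, h in zip(triplets, human_choices) if odd_one_out(t) == h)
    return hits / len(triplets)
```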

Title: Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts

Authors: Shengzhuang Chen, Jihoon Tack, Yunqiao Yang, Yee Whye Teh, Jonathan Richard Schwarz, Ying Wei
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.08477
Pdf URL: https://arxiv.org/pdf/2403.08477
Copy Paste: [[2403.08477]] Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts(https://arxiv.org/abs/2403.08477)
Keywords: foundation model
Abstract: Conventional wisdom suggests parameter-efficient fine-tuning of foundation models as the state-of-the-art method for transfer learning in vision, replacing the rich literature of alternatives such as meta-learning. In trying to harness the best of both worlds, meta-tuning introduces a subsequent optimization stage of foundation models but has so far only shown limited success and crucially tends to underperform on out-of-domain (OOD) tasks. In this paper, we introduce Sparse MetA-Tuning (SMAT), a method inspired by sparse mixture-of-experts approaches and trained to isolate subsets of pre-trained parameters automatically for meta-tuning on each task. SMAT successfully overcomes OOD sensitivity and delivers on the promise of enhancing the transfer abilities of vision foundation models beyond parameter-efficient finetuning. We establish new state-of-the-art results on a challenging combination of Meta-Dataset augmented with additional OOD tasks in both zero-shot and gradient-based adaptation settings. In addition, we provide a thorough analysis of the superiority of learned over hand-designed sparsity patterns for sparse expert methods and the pivotal importance of the sparsity level in balancing between in-domain and out-of-domain generalization. Our code is publicly available.

Title: Model Will Tell: Training Membership Inference for Diffusion Models

Authors: Xiaomeng Fu, Xi Wang, Qiao Li, Jin Liu, Jiao Dai, Jizhong Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08487
Pdf URL: https://arxiv.org/pdf/2403.08487
Copy Paste: [[2403.08487]] Model Will Tell: Training Membership Inference for Diffusion Models(https://arxiv.org/abs/2403.08487)
Keywords: diffusion, generative
Abstract: Diffusion models pose risks of privacy breaches and copyright disputes, primarily stemming from the potential utilization of unauthorized data during the training phase. The Training Membership Inference (TMI) task aims to determine whether a specific sample has been used in the training process of a target model, representing a critical tool for privacy violation verification. However, the increased stochasticity inherent in diffusion renders traditional shadow-model-based or metric-based methods ineffective when applied to diffusion models. Moreover, existing methods only yield binary classification labels which lack necessary comprehensibility in practical applications. In this paper, we explore a novel perspective for the TMI task by leveraging the intrinsic generative priors within the diffusion model. Compared with unseen samples, training samples exhibit stronger generative priors within the diffusion model, enabling the successful reconstruction of substantially degraded training images. Consequently, we propose the Degrade Restore Compare (DRC) framework. In this framework, an image undergoes sequential degradation and restoration, and its membership is determined by comparing it with the restored counterpart. Experimental results verify that our approach not only significantly outperforms existing methods in terms of accuracy but also provides comprehensible decision criteria, offering evidence for potential privacy violations.
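
The degrade, restore, compare sequence can be sketched on a toy image held as a flat list of pixel values. Everything concrete here is an assumption for illustration: the masking degradation, the mean-squared-error comparison, the threshold, and the `restore` callable, which stands in for restoration by the target diffusion model.

```python
def degrade(image, keep_every=2):
    """Zero out every pixel except each keep_every-th one (a stand-in degradation)."""
    return [p if i % keep_every == 0 else 0.0 for i, p in enumerate(image)]


def mse(a, b):
    """Mean squared error between two equal-length pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def is_training_member(image, restore, threshold=0.01):
    """Flag the image as a likely training member if restoration recovers it closely.

    restore is a caller-supplied function mapping a degraded image to a restored
    one; the threshold is a hypothetical value that would need calibration.
    """
    restored = restore(degrade(image))
    return mse(image, restored) < threshold
```

The restored image doubles as evidence: a close match on a heavily degraded input is something a reviewer can inspect, unlike a bare binary label.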

Title: Rich Semantic Knowledge Enhanced Large Language Models for Few-shot Chinese Spell Checking

Authors: Ming Dong, Yujing Chen, Miao Zhang, Hao Sun, Tingting He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.08492
Pdf URL: https://arxiv.org/pdf/2403.08492
Copy Paste: [[2403.08492]] Rich Semantic Knowledge Enhanced Large Language Models for Few-shot Chinese Spell Checking(https://arxiv.org/abs/2403.08492)
Keywords: foundation model, in-context
Abstract: Chinese Spell Checking (CSC) is a widely used technology that plays a vital role in speech-to-text (STT) and optical character recognition (OCR). Most existing CSC approaches rely on the BERT architecture and achieve excellent performance. However, limited by the scale of the foundation model, BERT-based methods do not work well in few-shot scenarios, which limits their practical application. In this paper, we explore an in-context learning method named RS-LLM (Rich Semantic based LLMs) that introduces large language models (LLMs) as the foundation model. In addition, we study the impact of introducing various kinds of Chinese rich semantic information into our framework. We find that by introducing a small number of specific Chinese rich semantic structures, LLMs achieve better performance than the BERT-based model on the few-shot CSC task. Furthermore, we conduct experiments on multiple datasets, and the experimental results verify the superiority of our proposed framework.

Title: Masked Generative Story Transformer with Character Guidance and Caption Augmentation

Authors: Christos Papadimitriou, Giorgos Filandrianos, Maria Lymperaiou, Giorgos Stamou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.08502
Pdf URL: https://arxiv.org/pdf/2403.08502
Copy Paste: [[2403.08502]] Masked Generative Story Transformer with Character Guidance and Caption Augmentation(https://arxiv.org/abs/2403.08502)
Keywords: generative
Abstract: Story Visualization (SV) is a challenging generative vision task that requires both visual quality and consistency between different frames in generated image sequences. Previous approaches either employ some kind of memory mechanism to maintain context throughout an auto-regressive generation of the image sequence, or model the generation of the characters and their background separately, to improve the rendering of characters. In contrast, we embrace a completely parallel transformer-based approach, relying exclusively on Cross-Attention with past and future captions to achieve consistency. Additionally, we propose a Character Guidance technique to focus on the generation of characters in an implicit manner, by forming a combination of text-conditional and character-conditional logits in the logit space. We also employ a caption-augmentation technique, carried out by a Large Language Model (LLM), to enhance the robustness of our approach. The combination of these methods culminates in state-of-the-art (SOTA) results over various metrics on the most prominent SV benchmark (Pororo-SV), attained with constrained resources while achieving superior computational complexity compared to prior art. The validity of our quantitative results is supported by a human survey.

Title: Federated Knowledge Graph Unlearning via Diffusion Model

Authors: Bingchen Liu, Yuanyuan Fang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.08554
Pdf URL: https://arxiv.org/pdf/2403.08554
Copy Paste: [[2403.08554]] Federated Knowledge Graph Unlearning via Diffusion Model(https://arxiv.org/abs/2403.08554)
Keywords: diffusion
Abstract: Federated learning (FL) promotes the development and application of artificial intelligence technologies by enabling model sharing and collaboration while safeguarding data privacy. Knowledge graph (KG) embedding representation provides a foundation for knowledge reasoning and applications by mapping entities and relations into vector space. Federated KG embedding enables the utilization of knowledge from diverse client sources while safeguarding the privacy of local data. Demands such as privacy protection and the need to adapt to dynamic data changes have sparked investigations into machine unlearning (MU). However, it is challenging to maintain the performance of KG embedding models while forgetting the influence of specific data on the model. In this paper, we propose FedDM, a novel framework tailored for machine unlearning in federated knowledge graphs. Leveraging diffusion models, we generate noisy data to sensibly mitigate the influence of specific knowledge on FL models while preserving the overall performance concerning the remaining data. We conduct experimental evaluations on benchmark datasets to assess the efficacy of the proposed model. Extensive experiments demonstrate that FedDM yields promising results in knowledge forgetting.

Title: Non-discrimination Criteria for Generative Language Models

Authors: Sara Sterlie, Nina Weng, Aasa Feragen
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2403.08564
Pdf URL: https://arxiv.org/pdf/2403.08564
Copy Paste: [[2403.08564]] Non-discrimination Criteria for Generative Language Models(https://arxiv.org/abs/2403.08564)
Keywords: generative
Abstract: In recent years, generative AI, such as large language models, has undergone rapid development. As these models become increasingly available to the public, concerns arise about perpetuating and amplifying harmful biases in applications. Gender stereotypes can be harmful and limiting for the individuals they target, whether they consist of misrepresentation or discrimination. Recognizing gender bias as a pervasive societal construct, this paper studies how to uncover and quantify the presence of gender biases in generative language models. In particular, we derive generative AI analogues of three well-known non-discrimination criteria from classification, namely independence, separation and sufficiency. To demonstrate these criteria in action, we design prompts for each of the criteria with a focus on occupational gender stereotypes, specifically utilizing the medical test to introduce the ground truth in the generative AI context. Our results address the presence of occupational gender bias within such conversational language models.
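
For orientation, the three criteria named in this abstract are usually stated as follows in the classification setting, with sensitive attribute $A$, true label $Y$ and prediction $\hat{Y}$. This is only the standard classification form; the generative analogues derived in the paper are not reproduced here.

```latex
\begin{align*}
\text{Independence:} \quad & \hat{Y} \perp A \\
\text{Separation:}   \quad & \hat{Y} \perp A \mid Y \\
\text{Sufficiency:}  \quad & Y \perp A \mid \hat{Y}
\end{align*}
```

Separation and sufficiency condition on the true label $Y$, which is why a ground truth is needed before either can be evaluated.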

Title: Caformer: Rethinking Time Series Analysis from Causal Perspective

Authors: Kexuan Zhang, Xiaobei Zou, Yang Tang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.08572
Pdf URL: https://arxiv.org/pdf/2403.08572
Copy Paste: [[2403.08572]] Caformer: Rethinking Time Series Analysis from Causal Perspective(https://arxiv.org/abs/2403.08572)
Keywords: anomaly
Abstract: Time series analysis is a vital task with broad applications in various domains. However, effectively capturing cross-dimension and cross-time dependencies in non-stationary time series poses significant challenges, particularly in the context of environmental factors. The spurious correlation induced by the environment confounds the causal relationships between cross-dimension and cross-time dependencies. In this paper, we introduce a novel framework called Caformer (Causal Transformer) for time series analysis from a causal perspective. Specifically, our framework comprises three components: Dynamic Learner, Environment Learner, and Dependency Learner. The Dynamic Learner unveils dynamic interactions among dimensions, the Environment Learner mitigates spurious correlations caused by the environment with a back-door adjustment, and the Dependency Learner aims to infer robust interactions across both time and dimensions. Our Caformer demonstrates consistent state-of-the-art performance across five mainstream time series analysis tasks, including long- and short-term forecasting, imputation, classification, and anomaly detection, with proper interpretability.

Title: ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos

Authors: Lei Shi, Paul Bürkner, Andreas Bulling
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08591
Pdf URL: https://arxiv.org/pdf/2403.08591
Copy Paste: [[2403.08591]] ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos(https://arxiv.org/abs/2403.08591)
Keywords: diffusion
Abstract: We present ActionDiffusion -- a novel diffusion model for procedure planning in instructional videos that is the first to take temporal inter-dependencies between actions into account in a diffusion model for procedure planning. This approach is in stark contrast to existing methods that fail to exploit the rich information content available in the particular order in which actions are performed. Our method unifies the learning of temporal dependencies between actions and denoising of the action plan in the diffusion process by projecting the action information into the noise space. This is achieved 1) by adding action embeddings to the noise masks in the noise-adding phase and 2) by introducing an attention mechanism in the noise prediction network to learn the correlations between different action steps. We report extensive experiments on three instructional video benchmark datasets (CrossTask, Coin, and NIV) and show that our method outperforms previous state-of-the-art methods on all metrics on CrossTask and NIV and on all metrics except accuracy on the Coin dataset. We show that by adding action embeddings to the noise mask, the diffusion model can better learn temporal dependencies between actions and improve performance on procedure planning.

Title: Data-Efficient Sleep Staging with Synthetic Time Series Pretraining

Authors: Niklas Grieger, Siamak Mehrkanoon, Stephan Bialonski
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2403.08592
Pdf URL: https://arxiv.org/pdf/2403.08592
Copy Paste: [[2403.08592]] Data-Efficient Sleep Staging with Synthetic Time Series Pretraining(https://arxiv.org/abs/2403.08592)
Keywords: self-supervised
Abstract: Analyzing electroencephalographic (EEG) time series can be challenging, especially with deep neural networks, due to the large variability among human subjects and often small datasets. To address these challenges, various strategies, such as self-supervised learning, have been suggested, but they typically rely on extensive empirical datasets. Inspired by recent advances in computer vision, we propose a pretraining task termed "frequency pretraining" to pretrain a neural network for sleep staging by predicting the frequency content of randomly generated synthetic time series. Our experiments demonstrate that our method surpasses fully supervised learning in scenarios with limited data and few subjects, and matches its performance in regimes with many subjects. Furthermore, our results underline the relevance of frequency information for sleep stage scoring, while also demonstrating that deep neural networks utilize information beyond frequencies to enhance sleep staging performance, which is consistent with previous research. We anticipate that our approach will be advantageous across a broad spectrum of applications where EEG data is limited or derived from a small number of subjects, including the domain of brain-computer interfaces.
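
To make the idea of predicting the frequency content of synthetic time series concrete, here is a minimal data-generation sketch. The details are assumptions for illustration only: the number of frequency bins, the 0.5 to 30 Hz band, the sampling rate, the sum-of-sinusoids signal model, and the multi-label target are hypothetical choices, not the authors' configuration.

```python
import math
import random

def make_example(rng, n_bins=5, band_hz=(0.5, 30.0), fs=100.0, seconds=2.0):
    """Return (signal, labels): a random sum of sinusoids and a 0/1 label per
    frequency bin saying whether that bin contributed a component."""
    lo, hi = band_hz
    width = (hi - lo) / n_bins
    labels = [1 if rng.random() < 0.5 else 0 for _ in range(n_bins)]
    n = int(fs * seconds)
    signal = [0.0] * n
    for b, present in enumerate(labels):
        if not present:
            continue
        # Draw one frequency and phase inside bin b (assumed signal model).
        freq = rng.uniform(lo + b * width, lo + (b + 1) * width)
        phase = rng.uniform(0.0, 2.0 * math.pi)
        for t in range(n):
            signal[t] += math.sin(2.0 * math.pi * freq * t / fs + phase)
    return signal, labels

signal, labels = make_example(random.Random(0))
```

A network pretrained to recover `labels` from `signal` needs no recorded EEG at all, which is the appeal of the approach when empirical data is scarce.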

Title: On the Convergence of Locally Adaptive and Scalable Diffusion-Based Sampling Methods for Deep Bayesian Neural Network Posteriors

Authors: Tim Rensmeyer, Oliver Niggemann
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2403.08609
Pdf URL: https://arxiv.org/pdf/2403.08609
Copy Paste: [[2403.08609]] On the Convergence of Locally Adaptive and Scalable Diffusion-Based Sampling Methods for Deep Bayesian Neural Network Posteriors(https://arxiv.org/abs/2403.08609)
Keywords: diffusion
Abstract: Achieving robust uncertainty quantification for deep neural networks represents an important requirement in many real-world applications of deep learning such as medical imaging where it is necessary to assess the reliability of a neural network's prediction. Bayesian neural networks are a promising approach for modeling uncertainties in deep neural networks. Unfortunately, generating samples from the posterior distribution of neural networks is a major challenge. One significant advance in that direction would be the incorporation of adaptive step sizes, similar to modern neural network optimizers, into Markov chain Monte Carlo sampling algorithms without significantly increasing computational demand. Over the past years, several papers have introduced sampling algorithms with claims that they achieve this property. However, do they indeed converge to the correct distribution? In this paper, we demonstrate that these methods can have a substantial bias in the distribution they sample, even in the limit of vanishing step sizes and at full batch size.

Title: Scaling Up Dynamic Human-Scene Interaction Modeling

Authors: Nan Jiang, Zhiyuan Zhang, Hongjie Li, Xiaoxuan Ma, Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, Siyuan Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08629
Pdf URL: https://arxiv.org/pdf/2403.08629
Copy Paste: [[2403.08629]] Scaling Up Dynamic Human-Scene Interaction Modeling(https://arxiv.org/abs/2403.08629)
Keywords: diffusion
Abstract: Confronting the challenges of data scarcity and advanced motion synthesis in human-scene interaction modeling, we introduce the TRUMANS dataset alongside a novel HSI motion synthesis method. TRUMANS stands as the most comprehensive motion-captured HSI dataset currently available, encompassing over 15 hours of human interactions across 100 indoor scenes. It intricately captures whole-body human motions and part-level object dynamics, focusing on the realism of contact. This dataset is further scaled up by transforming physical environments into exact virtual models and applying extensive augmentations to appearance and motion for both humans and objects while maintaining interaction fidelity. Utilizing TRUMANS, we devise a diffusion-based autoregressive model that efficiently generates HSI sequences of any length, taking into account both scene context and intended actions. In experiments, our approach shows remarkable zero-shot generalizability on a range of 3D scene datasets (e.g., PROX, Replica, ScanNet, ScanNet++), producing motions that closely mimic original motion-captured sequences, as confirmed by quantitative experiments and human studies.

Title: Data Augmentation in Human-Centric Vision

Authors: Wentao Jiang, Yige Zhang, Shaozhong Zheng, Si Liu, Shuicheng Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08650
Pdf URL: https://arxiv.org/pdf/2403.08650
Copy Paste: [[2403.08650]] Data Augmentation in Human-Centric Vision(https://arxiv.org/abs/2403.08650)
Keywords: diffusion, generative
Abstract: This survey presents a comprehensive analysis of data augmentation techniques in human-centric vision tasks, a first of its kind in the field. It delves into a wide range of research areas including person ReID, human parsing, human pose estimation, and pedestrian detection, addressing the significant challenges posed by overfitting and limited training data in these domains. Our work categorizes data augmentation methods into two main types: data generation and data perturbation. Data generation covers techniques like graphic engine-based generation, generative model-based generation, and data recombination, while data perturbation is divided into image-level and human-level perturbations. Each method is tailored to the unique requirements of human-centric tasks, with some applicable across multiple areas. Our contributions include an extensive literature review, providing deep insights into the influence of these augmentation techniques in human-centric vision and highlighting the nuances of each method. We also discuss open issues and future directions, such as the integration of advanced generative models like Latent Diffusion Models, for creating more realistic and diverse training data. This survey not only encapsulates the current state of data augmentation in human-centric vision but also charts a course for future research, aiming to develop more robust, accurate, and efficient human-centric vision systems.

Title: Extracting Explanations, Justification, and Uncertainty from Black-Box Deep Neural Networks

Authors: Paul Ardis, Arjuna Flenner
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2403.08652
Pdf URL: https://arxiv.org/pdf/2403.08652
Copy Paste: [[2403.08652]] Extracting Explanations, Justification, and Uncertainty from Black-Box Deep Neural Networks(https://arxiv.org/abs/2403.08652)
Keywords: anomaly
Abstract: Deep Neural Networks (DNNs) do not inherently compute or exhibit empirically-justified task confidence. In mission-critical applications, it is important to understand both the associated DNN reasoning and its supporting evidence. In this paper, we propose a novel Bayesian approach to extract explanations, justifications, and uncertainty estimates from DNNs. Our approach is efficient in terms of both memory and computation, and can be applied to any black-box DNN without any retraining, including applications to anomaly detection and out-of-distribution detection tasks. We validate our approach on the CIFAR-10 dataset, and show that it can significantly improve the interpretability and reliability of DNNs.

Title: Token Alignment via Character Matching for Subword Completion

Authors: Ben Athiwaratkun, Shiqi Wang, Mingyue Shang, Yuchen Tian, Zijian Wang, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Rob Kwiatowski, Ramesh Nallapati, Bing Xiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.08688
Pdf URL: https://arxiv.org/pdf/2403.08688
Copy Paste: [[2403.08688]] Token Alignment via Character Matching for Subword Completion(https://arxiv.org/abs/2403.08688)
Keywords: generative
Abstract: Generative models, widely utilized in various applications, can often struggle with prompts corresponding to partial tokens. This struggle stems from tokenization, where partial tokens fall out of distribution during inference, leading to incorrect or nonsensical outputs. This paper examines a technique to alleviate the tokenization artifact on text completion in generative models, maintaining performance even in regular non-subword cases. The method, termed token alignment, involves backtracking to the last complete tokens and ensuring the model's generation aligns with the prompt. This approach showcases marked improvement across many partial token scenarios, including nuanced cases like space-prefix and partial indentation, with only a minor time increase. The technique and analysis detailed in this paper contribute to the continuous advancement of generative models in handling partial inputs, bearing relevance for applications like code completion and text autocompletion.
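
The backtrack-and-align idea in this abstract can be sketched with a toy tokenizer. The vocabulary `VOCAB`, the greedy longest-match `tokenize`, and the helper `aligned_candidates` are hypothetical illustrations, not the authors' implementation; a real system would apply the same character-level constraint to a trained tokenizer and mask the model's logits accordingly.

```python
# Toy vocabulary (an assumption); single characters keep tokenization total
# for the example prompt below.
VOCAB = ["def", " ", "ret", "return", "re", "turn",
         "r", "e", "t", "u", "n", "d", "f"]

def tokenize(text, vocab=VOCAB):
    """Greedy longest-match tokenizer; raises ValueError on unknown characters."""
    tokens, i = [], 0
    while i < len(text):
        match = max((t for t in vocab if text.startswith(t, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens

def aligned_candidates(prompt, vocab=VOCAB, backtrack=1):
    """Drop the last `backtrack` tokens, then list the tokens whose characters
    are consistent with the dropped text: either the token extends it, or the
    token is a prefix of it."""
    tokens = tokenize(prompt)
    kept, dropped = tokens[:-backtrack], tokens[-backtrack:]
    suffix = "".join(dropped)
    allowed = [t for t in vocab
               if t.startswith(suffix) or suffix.startswith(t)]
    return kept, suffix, allowed

kept, suffix, allowed = aligned_candidates("def ret")
```

For the prompt `def ret`, backtracking re-opens the partial token `ret`, so the full token `return` becomes a legal next step instead of forcing the model to continue from an out-of-distribution fragment.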

Title: Review of Generative AI Methods in Cybersecurity

Authors: Yagmur Yigit, William J Buchanan, Madjid G Tehrani, Leandros Maglaras
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.08701
Pdf URL: https://arxiv.org/pdf/2403.08701
Copy Paste: [[2403.08701]] Review of Generative AI Methods in Cybersecurity(https://arxiv.org/abs/2403.08701)
Keywords: generative
Abstract: Large language models (LLMs) and generative artificial intelligence (GenAI) constitute paradigm shifts in cybersecurity that present hitherto unseen challenges as well as opportunities. In examining the state-of-the-art application of GenAI in cybersecurity, this work highlights how models like Google's Gemini and ChatGPT-4 potentially enhance security protocols, vulnerability assessment, and threat identification. Our research highlights the significance of a novel approach that employs LLMs to identify and eliminate sophisticated cyber threats. This paper presents a thorough assessment of LLMs' ability to produce important security insights, hence broadening the potential applications of AI-driven cybersecurity solutions. Our findings demonstrate the significance of GenAI in improving digital security. It offers recommendations for further investigations into the intricate relationship between cybersecurity requirements and artificial intelligence's potential.

Title: Historical Astronomical Diagrams Decomposition in Geometric Primitives

Authors: Syrine Kalleli, Scott Trigg, Ségolène Albouy, Mathieu Husson, Mathieu Aubry
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08721
Pdf URL: https://arxiv.org/pdf/2403.08721
Copy Paste: [[2403.08721]] Historical Astronomical Diagrams Decomposition in Geometric Primitives(https://arxiv.org/abs/2403.08721)
Keywords: diffusion
Abstract: Automatically extracting the geometric content from the hundreds of thousands of diagrams drawn in historical manuscripts would enable historians to study the diffusion of astronomical knowledge on a global scale. However, state-of-the-art vectorization methods, often designed to tackle modern data, are not adapted to the complexity and diversity of historical astronomical diagrams. Our contribution is thus twofold. First, we introduce a unique dataset of 303 astronomical diagrams from diverse traditions, ranging from the XIIth to the XVIIIth century, annotated with more than 3000 line segments, circles and arcs. Second, we develop a model that builds on DINO-DETR to enable the prediction of multiple geometric primitives. We show that it can be trained solely on synthetic data and accurately predict primitives on our challenging dataset. Our approach substantially improves over the LETR baseline, which is restricted to lines, by introducing a meaningful parametrization for multiple primitives, jointly training for detection and parameter refinement, using deformable attention and training on rich synthetic data. Our dataset and code are available on our webpage.

Title: Ambient Diffusion Posterior Sampling: Solving Inverse Problems with Diffusion Models trained on Corrupted Data

Authors: Asad Aali, Giannis Daras, Brett Levac, Sidharth Kumar, Alexandros G. Dimakis, Jonathan I. Tamir
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.08728
Pdf URL: https://arxiv.org/pdf/2403.08728
Copy Paste: [[2403.08728]] Ambient Diffusion Posterior Sampling: Solving Inverse Problems with Diffusion Models trained on Corrupted Data(https://arxiv.org/abs/2403.08728)
Keywords: diffusion, generative
Abstract: We provide a framework for solving inverse problems with diffusion models learned from linearly corrupted data. Our method, Ambient Diffusion Posterior Sampling (A-DPS), leverages a generative model pre-trained on one type of corruption (e.g. image inpainting) to perform posterior sampling conditioned on measurements from a potentially different forward process (e.g. image blurring). We test the efficacy of our approach on standard natural image datasets (CelebA, FFHQ, and AFHQ) and we show that A-DPS can sometimes outperform models trained on clean data for several image restoration tasks in both speed and performance. We further extend the Ambient Diffusion framework to train MRI models with access only to Fourier subsampled multi-coil MRI measurements at various acceleration factors (R=2, 4, 6, 8). We again observe that models trained on highly subsampled data are better priors for solving inverse problems in the high acceleration regime than models trained on fully sampled data. We open-source our code and the trained Ambient Diffusion MRI models: https://github.com/utcsilab/ambient-diffusion-mri .

Title: GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing

Authors: Jing Wu, Jia-Wang Bian, Xinghui Li, Guangrun Wang, Ian Reid, Philip Torr, Victor Adrian Prisacariu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08733
Pdf URL: https://arxiv.org/pdf/2403.08733
Copy Paste: [[2403.08733]] GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing(https://arxiv.org/abs/2403.08733)
Keywords: diffusion
Abstract: We propose GaussCtrl, a text-driven method to edit a 3D scene reconstructed by 3D Gaussian Splatting (3DGS). Our method first renders a collection of images using the 3DGS and edits them with a pre-trained 2D diffusion model (ControlNet) based on the input prompt; the edited images are then used to optimise the 3D model. Our key contribution is multi-view consistent editing, which enables editing all images together instead of iteratively editing one image while updating the 3D model as in previous works. It leads to faster editing as well as higher visual quality. This is achieved by two components: (a) depth-conditioned editing, which enforces geometric consistency across multi-view images by leveraging naturally consistent depth maps; and (b) attention-based latent code alignment, which unifies the appearance of edited images by conditioning their editing on several reference views through self and cross-view attention between the images' latent representations. Experiments demonstrate that our method achieves faster editing and better visual results than previous state-of-the-art methods.

Title: Improving Acoustic Word Embeddings through Correspondence Training of Self-supervised Speech Representations

Authors: Amit Meghanani, Thomas Hain
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2403.08738
Pdf URL: https://arxiv.org/pdf/2403.08738
Copy Paste: [[2403.08738]] Improving Acoustic Word Embeddings through Correspondence Training of Self-supervised Speech Representations(https://arxiv.org/abs/2403.08738)
Keywords: self-supervised
Abstract: Acoustic word embeddings (AWEs) are vector representations of spoken words. An effective method for obtaining AWEs is the Correspondence Auto-Encoder (CAE). In the past, the CAE method has been associated with traditional MFCC features. Representations obtained from self-supervised learning (SSL)-based speech models such as HuBERT and Wav2vec2 outperform MFCC in many downstream tasks. However, they have not been well studied in the context of learning AWEs. This work explores the effectiveness of CAE with SSL-based speech representations to obtain improved AWEs. Additionally, the capabilities of SSL-based speech models are explored in cross-lingual scenarios for obtaining AWEs. Experiments are conducted on five languages: Polish, Portuguese, Spanish, French, and English. The HuBERT-based CAE model achieves the best results for word discrimination in all languages, despite HuBERT being pre-trained on English only. The HuBERT-based CAE model also works well in cross-lingual settings: when trained on one source language and tested on target languages, it outperforms MFCC-based CAE models trained on the target languages.

Title: Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework

Authors: Jingling Li, Zeyu Tang, Xiaoyu Liu, Peter Spirtes, Kun Zhang, Liu Leqi, Yang Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.08743
Pdf URL: https://arxiv.org/pdf/2403.08743
Copy Paste: [[2403.08743]] Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework(https://arxiv.org/abs/2403.08743)
Keywords: in-context
Abstract: Large language models (LLMs) can easily generate biased and discriminatory responses. As LLMs tap into consequential decision-making (e.g., hiring and healthcare), it is of crucial importance to develop strategies to mitigate these biases. This paper focuses on social bias, tackling the association between demographic information and LLM outputs. We propose a causality-guided debiasing framework that utilizes causal understandings of (1) the data-generating process of the training corpus fed to LLMs, and (2) the internal reasoning process of LLM inference, to guide the design of prompts for debiasing LLM outputs through selection mechanisms. Our framework unifies existing debiasing prompting approaches such as inhibitive instructions and in-context contrastive examples, and sheds light on new ways of debiasing by encouraging bias-free reasoning. Our strong empirical performance on real-world datasets demonstrates that our framework provides principled guidelines on debiasing LLM outputs even with only black-box access.

Title: VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis

Authors: Enric Corona, Andrei Zanfir, Eduard Gabriel Bazavan, Nikos Kolotouros, Thiemo Alldieck, Cristian Sminchisescu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08764
Pdf URL: https://arxiv.org/pdf/2403.08764
Copy Paste: [[2403.08764]] VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis(https://arxiv.org/abs/2403.08764)
Keywords: diffusion, generative
Abstract: We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of 1) a stochastic human-to-3d-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls. This supports the generation of high-quality video of variable length, easily controllable through high-level representations of human faces and bodies. In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g. visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate. We also curate MENTOR, a new and diverse dataset with 3d pose and expression annotations, one order of magnitude larger than previous ones (800,000 identities) and with dynamic gestures, on which we train and ablate our main technical contributions. VLOGGER outperforms state-of-the-art methods on three public benchmarks, considering image quality, identity preservation and temporal consistency while also generating upper-body gestures. We analyze the performance of VLOGGER with respect to multiple diversity metrics, showing that our architectural choices and the use of MENTOR benefit training a fair and unbiased model at scale. Finally, we show applications in video editing and personalization.