2023-12-29

diffusion

Title: Iterative Prompt Relabeling for diffusion model with RLDF. (arXiv:2312.16204v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.16204
Code URL: null
Copy Paste: [[2312.16204]] Iterative Prompt Relabeling for diffusion model with RLDF(http://arxiv.org/abs/2312.16204)
Summary:
Diffusion models have shown impressive performance in many domains, including image generation, time series prediction, and reinforcement learning. The algorithm demonstrates superior performance over the traditional GAN and transformer-based methods. However, the model's capability to follow natural language instructions (e.g., spatial relationships between objects, generating complex scenes) is still unsatisfactory. Enhancing this capability has become an important research area. Prior works adopt reinforcement learning to adjust the behavior of the diffusion models. However, RL methods not only require careful reward design and complex hyperparameter tuning, but also fail to incorporate rich natural language feedback. In this work, we propose iterative prompt relabeling (IP-RLDF), a novel algorithm that aligns images to text through iterative image sampling and prompt relabeling. IP-RLDF first samples a batch of images conditioned on the text, then relabels the text prompts of unmatched text-image pairs with classifier feedback. We conduct thorough experiments on three different models, including SDv2, GLIGEN, and SDXL, testing their capability to generate images following instructions. With IP-RLDF, we improved up to 15.22% (absolute improvement) on the challenging spatial relation VISOR benchmark, demonstrating superior performance compared to previous RL methods.
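
The sample-then-relabel step described in this summary can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `relabel_round`, `toy_sampler`, and `toy_describer` are hypothetical names, images are represented as plain strings, and the fine-tuning that would follow each round is omitted.

```python
def relabel_round(prompts, sample_image, describe_image):
    """One round: sample an image per prompt, relabel mismatched pairs.

    Returns (image, prompt) pairs in which every prompt agrees with its
    image according to the feedback function.
    """
    pairs = []
    for prompt in prompts:
        image = sample_image(prompt)
        observed = describe_image(image)  # classifier feedback, as text
        # Matched pairs are kept as-is; unmatched pairs get the prompt the
        # classifier actually observed, so they remain usable for training.
        pairs.append((image, prompt if observed == prompt else observed))
    return pairs


def toy_sampler(prompt):
    # Hypothetical faulty generator: always renders "left of" as "right of".
    return "IMG<" + prompt.replace("left of", "right of") + ">"


def toy_describer(image):
    # Hypothetical classifier: reads back what the toy image contains.
    return image[4:-1]


pairs = relabel_round(["a cat left of a dog", "a red ball"],
                      toy_sampler, toy_describer)
for image, prompt in pairs:
    print(prompt, "->", image)
```

In an iterative setting, the relabeled pairs would be used to fine-tune the generator before the next sampling round.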

Title: Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks. (arXiv:2312.16218v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.16218
Code URL: null
Copy Paste: [[2312.16218]] Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks(http://arxiv.org/abs/2312.16218)
Summary:
Solving image-to-3D from a single view is an ill-posed problem, and current neural reconstruction methods addressing it through diffusion models still rely on scene-specific optimization, constraining their generalization capability. To overcome the limitations of existing approaches regarding generalization and consistency, we introduce a novel neural rendering technique. Our approach employs the signed distance function as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks. Specifically, our method builds neural encoding volumes from generated multi-view inputs. We adjust the weights of the SDF network conditioned on an input image at test-time to allow model adaptation to novel scenes in a feed-forward manner via HyperNetworks. To mitigate artifacts derived from the synthesized views, we propose the use of a volume transformer module to improve the aggregation of image features instead of processing each viewpoint separately. Through our proposed method, dubbed as Hyper-VolTran, we avoid the bottleneck of scene-specific optimization and maintain consistency across the images generated from multiple viewpoints. Our experiments show the advantages of our proposed approach with consistent results and rapid generation.

Title: Towards Flexible, Scalable, and Adaptive Multi-Modal Conditioned Face Synthesis. (arXiv:2312.16274v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.16274
Code URL: null
Copy Paste: [[2312.16274]] Towards Flexible, Scalable, and Adaptive Multi-Modal Conditioned Face Synthesis(http://arxiv.org/abs/2312.16274)
Summary:
Recent progress in multi-modal conditioned face synthesis has enabled the creation of visually striking and accurately aligned facial images. Yet, current methods still face issues with scalability, limited flexibility, and a one-size-fits-all approach to control strength, not accounting for the differing levels of conditional entropy, a measure of unpredictability in data given some condition, across modalities. To address these challenges, we introduce a novel uni-modal training approach with modal surrogates, coupled with an entropy-aware modal-adaptive modulation, to support flexible, scalable, and adaptive multi-modal conditioned face synthesis. Our uni-modal training leverages only uni-modal data and uses modal surrogates both to decorate each condition with modal-specific characteristics and to serve as a linker for inter-modal collaboration, so that the network fully learns each modality's control over the face synthesis process as well as inter-modal collaboration. The entropy-aware modal-adaptive modulation finely adjusts the diffusion noise according to modal-specific characteristics and the given conditions, enabling well-informed steps along the denoising trajectory and ultimately leading to synthesis results of high fidelity and quality. Our framework improves multi-modal face synthesis under various conditions, surpassing current methods in image quality and fidelity, as demonstrated by our thorough experimental results.

Title: State-of-the-Art in Nudity Classification: A Comparative Analysis. (arXiv:2312.16338v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.16338
Code URL: null
Copy Paste: [[2312.16338]] State-of-the-Art in Nudity Classification: A Comparative Analysis(http://arxiv.org/abs/2312.16338)
Summary:
This paper presents a comparative analysis of existing nudity classification techniques for classifying images based on the presence of nudity, with a focus on their application in content moderation. The evaluation focuses on CNN-based models, vision transformers, and popular open-source safety checkers from Stable Diffusion and Large-scale Artificial Intelligence Open Network (LAION). The study identifies the limitations of current evaluation datasets and highlights the need for more diverse and challenging datasets. The paper discusses the potential implications of these findings for developing more accurate and effective image classification systems on online platforms. Overall, the study emphasizes the importance of continually improving image classification models to ensure the safety and well-being of platform users. The project page, including demonstrations and results, is publicly available at https://github.com/fcakyon/content-moderation-deep-learning.

Title: Natural Adversarial Patch Generation Method Based on Latent Diffusion Model. (arXiv:2312.16401v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.16401
Code URL: null
Copy Paste: [[2312.16401]] Natural Adversarial Patch Generation Method Based on Latent Diffusion Model(http://arxiv.org/abs/2312.16401)
Summary:
Recent research shows that deep neural networks are vulnerable to adversarial attacks: well-trained samples or patches can be used to trick a neural network detector or human visual perception. However, these adversarial patches, with their conspicuous and unusual patterns, lack camouflage and can easily raise suspicion in the real world. To solve this problem, this paper proposes a novel adversarial patch method called the Latent Diffusion Patch (LDP). First, a pretrained encoder is designed to compress natural images into a feature space with key characteristics. Then, the diffusion model is trained using this feature space. Finally, the latent space of the pretrained diffusion model is explored using image denoising technology. The method polishes the patches and images through the powerful natural abilities of diffusion models, making them more acceptable to the human visual system. Experimental results in both the digital and physical worlds show that LDPs achieve a visual subjectivity score of 87.3% while still maintaining effective attack capabilities.

Title: SVGDreamer: Text Guided SVG Generation with Diffusion Model. (arXiv:2312.16476v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.16476
Code URL: null
Copy Paste: [[2312.16476]] SVGDreamer: Text Guided SVG Generation with Diffusion Model(http://arxiv.org/abs/2312.16476)
Summary:
Recently, text-guided scalable vector graphics (SVGs) synthesis has shown promise in domains such as iconography and sketch. However, existing text-to-SVG generation methods lack editability and struggle with visual quality and result diversity. To address these limitations, we propose a novel text-guided vector graphics synthesis method called SVGDreamer. SVGDreamer incorporates a semantic-driven image vectorization (SIVE) process that enables the decomposition of synthesis into foreground objects and background, thereby enhancing editability. Specifically, the SIVE process introduces attention-based primitive control and an attention-mask loss function for effective control and manipulation of individual elements. Additionally, we propose a Vectorized Particle-based Score Distillation (VPSD) approach to tackle the challenges of color over-saturation, vector primitives over-smoothing, and limited result diversity in existing text-to-SVG generation methods. Furthermore, on the basis of VPSD, we introduce Reward Feedback Learning (ReFL) to accelerate VPSD convergence and improve aesthetic appeal. Extensive experiments have been conducted to validate the effectiveness of SVGDreamer, demonstrating its superiority over baseline methods in terms of editability, visual quality, and diversity.

Title: PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion. (arXiv:2312.16486v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.16486
Code URL: null
Copy Paste: [[2312.16486]] PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion(http://arxiv.org/abs/2312.16486)
Summary:
Current large-scale diffusion models represent a giant leap forward in conditional image synthesis, capable of interpreting diverse cues like text, human poses, and edges. However, their reliance on substantial computational resources and extensive data collection remains a bottleneck. On the other hand, the integration of existing diffusion models, each specialized for different controls and operating in unique latent spaces, poses a challenge due to incompatible image resolutions and latent space embedding structures, hindering their joint use. Addressing these constraints, we present "PanGu-Draw", a novel latent diffusion model designed for resource-efficient text-to-image synthesis that adeptly accommodates multiple control signals. We first propose a resource-efficient Time-Decoupling Training Strategy, which splits the monolithic text-to-image model into structure and texture generators. Each generator is trained using a regimen that maximizes data utilization and computational efficiency, cutting data preparation by 48% and reducing training resources by 51%. Secondly, we introduce "Coop-Diffusion", an algorithm that enables the cooperative use of various pre-trained diffusion models with different latent spaces and predefined resolutions within a unified denoising process. This allows for multi-control image synthesis at arbitrary resolutions without the necessity for additional data or retraining. Empirical validations of PanGu-Draw show its exceptional prowess in text-to-image and multi-control image generation, suggesting a promising direction for future model training efficiencies and generation versatility. The largest 5B T2I PanGu-Draw model is released on the Ascend platform. Project page: https://pangu-draw.github.io

Title: Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection. (arXiv:2312.16649v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.16649
Code URL: null
Copy Paste: [[2312.16649]] Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection(http://arxiv.org/abs/2312.16649)
Summary:
In this paper, we study the problem of generalizable synthetic image detection, aiming to detect forged images from diverse generative methods, e.g., GANs and diffusion models. Cutting-edge solutions have started to explore the benefits of pre-trained models, and mainly follow the fixed paradigm of solely training an attached classifier, e.g., combining frozen CLIP-ViT with a learnable linear layer in UniFD. However, our analysis shows that such a fixed paradigm is prone to yield detectors with insufficient learning regarding forgery representations. We attribute the key challenge to the lack of forgery adaptation, and present a novel forgery-aware adaptive transformer approach, namely FatFormer. Based on the pre-trained vision-language spaces of CLIP, FatFormer introduces two core designs for the adaptation to build generalized forgery representations. First, motivated by the fact that both image and frequency analysis are essential for synthetic image detection, we develop a forgery-aware adapter to adapt image features to discern and integrate local forgery traces within image and frequency domains. Second, we find that considering the contrastive objectives between adapted image features and text prompt embeddings, a previously overlooked aspect, results in a nontrivial generalization improvement. Accordingly, we introduce language-guided alignment to supervise the forgery adaptation with image and text prompts in FatFormer. Experiments show that, by coupling these two designs, our approach tuned on 4-class ProGAN data attains a remarkable detection performance, achieving an average of 98% accuracy on unseen GANs, and surprisingly generalizes to unseen diffusion models with 95% accuracy.

Title: I2V-Adapter: A General Image-to-Video Adapter for Video Diffusion Models. (arXiv:2312.16693v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.16693
Code URL: null
Copy Paste: [[2312.16693]] I2V-Adapter: A General Image-to-Video Adapter for Video Diffusion Models(http://arxiv.org/abs/2312.16693)
Summary:
In the rapidly evolving domain of digital content generation, the focus has shifted from text-to-image (T2I) models to more advanced video diffusion models, notably text-to-video (T2V) and image-to-video (I2V). This paper addresses the intricate challenge posed by I2V: converting static images into dynamic, lifelike video sequences while preserving the original image fidelity. Traditional methods typically involve integrating entire images into diffusion processes or using pretrained encoders for cross attention. However, these approaches often necessitate altering the fundamental weights of T2I models, thereby restricting their reusability. We introduce a novel solution, namely I2V-Adapter, designed to overcome such limitations. Our approach preserves the structural integrity of T2I models and their inherent motion modules. The I2V-Adapter operates by processing noised video frames in parallel with the input image, utilizing a lightweight adapter module. This module acts as a bridge, efficiently linking the input to the model's self-attention mechanism, thus maintaining spatial details without requiring structural changes to the T2I model. Moreover, I2V-Adapter requires only a fraction of the parameters of conventional models and ensures compatibility with existing community-driven T2I models and controlling tools. Our experimental results demonstrate I2V-Adapter's capability to produce high-quality video outputs. This performance, coupled with its versatility and reduced need for trainable parameters, represents a substantial advancement in the field of AI-driven video generation, particularly for creative applications.

self-supervised

Title: TEMP3D: Temporally Continuous 3D Human Pose Estimation Under Occlusions. (arXiv:2312.16221v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.16221
Code URL: null
Copy Paste: [[2312.16221]] TEMP3D: Temporally Continuous 3D Human Pose Estimation Under Occlusions(http://arxiv.org/abs/2312.16221)
Summary:
Existing 3D human pose estimation methods perform remarkably well in both monocular and multi-view settings. However, their efficacy diminishes significantly in the presence of heavy occlusions, which limits their practical utility. For video sequences, temporal continuity can help infer accurate poses, especially in heavily occluded frames. In this paper, we aim to leverage this potential of temporal continuity through human motion priors, coupled with large-scale pre-training on 3D poses and self-supervised learning, to enhance 3D pose estimation in a given video sequence. This leads to a temporally continuous 3D pose estimate on unlabelled in-the-wild videos, which may contain occlusions, while exclusively relying on pre-trained 3D pose models. We propose an unsupervised method named TEMP3D that aligns a motion prior model on a given in-the-wild video using existing SOTA single image-based 3D pose estimation methods to give temporally continuous output under occlusions. To evaluate our method, we test it on the Occluded Human3.6M dataset, our custom-built dataset which contains significantly large (up to 100%) human body occlusions incorporated into the Human3.6M dataset. We achieve SOTA results on Occluded Human3.6M and the OcMotion dataset while maintaining competitive performance on non-occluded data. URL: https://sites.google.com/ucr.edu/temp3d

Title: Soft Contrastive Learning for Time Series. (arXiv:2312.16424v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16424
Code URL: https://github.com/seunghan96/softclt
Copy Paste: [[2312.16424]] Soft Contrastive Learning for Time Series(http://arxiv.org/abs/2312.16424)
Summary:
Contrastive learning has been shown to be effective for learning representations from time series in a self-supervised way. However, contrasting similar time series instances or values from adjacent timestamps within a time series ignores their inherent correlations, which degrades the quality of the learned representations. To address this issue, we propose SoftCLT, a simple yet effective soft contrastive learning strategy for time series. This is achieved by introducing instance-wise and temporal contrastive losses with soft assignments ranging from zero to one. Specifically, we define soft assignments for 1) the instance-wise contrastive loss by the distance between time series in the data space, and 2) the temporal contrastive loss by the difference of timestamps. SoftCLT is a plug-and-play method for time series contrastive learning that improves the quality of learned representations without bells and whistles. In experiments, we demonstrate that SoftCLT consistently improves the performance in various downstream tasks including classification, semi-supervised learning, transfer learning, and anomaly detection, showing state-of-the-art performance. Code is available at this repository: https://github.com/seunghan96/softclt.
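
As a rough illustration of a temporal soft assignment that lies between zero and one and depends on the timestamp difference, consider the sketch below. The sigmoid form, the function name `temporal_soft_assignment`, and the sharpness parameter `tau` are assumptions made for this example; the summary only states the range of the assignments and that the temporal ones are defined by timestamp differences.

```python
import math


def temporal_soft_assignment(t1, t2, tau=0.5):
    """Soft assignment in (0, 1] that decays as the timestamp gap grows.

    Computes 2 * sigmoid(-tau * |t1 - t2|): equal timestamps give 1.0 and
    distant timestamps approach 0. `tau` (assumed) controls the sharpness.
    """
    gap = abs(t1 - t2)
    return 2.0 / (1.0 + math.exp(tau * gap))


# Weights of every timestamp in a length-5 window relative to timestamp 2.
weights = [temporal_soft_assignment(2, t) for t in range(5)]
print([round(w, 3) for w in weights])
```

A hard contrastive loss corresponds to the limiting case where only identical timestamps receive weight 1 and all others receive 0.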

Title: Learning to Embed Time Series Patches Independently. (arXiv:2312.16427v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16427
Code URL: https://github.com/seunghan96/pits
Copy Paste: [[2312.16427]] Learning to Embed Time Series Patches Independently(http://arxiv.org/abs/2312.16427)
Summary:
Masked time series modeling has recently gained much attention as a self-supervised representation learning strategy for time series. Inspired by masked image modeling in computer vision, recent works first patchify and partially mask out time series, and then train Transformers to capture the dependencies between patches by predicting masked patches from unmasked patches. However, we argue that capturing such patch dependencies might not be an optimal strategy for time series representation learning; rather, learning to embed patches independently results in better time series representations. Specifically, we propose to use 1) the simple patch reconstruction task, which autoencodes each patch without looking at other patches, and 2) the simple patch-wise MLP that embeds each patch independently. In addition, we introduce complementary contrastive learning to hierarchically capture adjacent time series information efficiently. Our proposed method improves time series forecasting and classification performance compared to state-of-the-art Transformer-based models, while it is more efficient in terms of the number of parameters and training/inference time. Code is available at this repository: https://github.com/seunghan96/pits.
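
The idea of embedding each patch without looking at other patches can be illustrated with a small sketch. It is not the released implementation: the helper names, the single linear layer with ReLU standing in for the patch-wise MLP, and the fixed toy weights are all assumptions for illustration.

```python
def patchify(series, patch_len):
    """Split a series into non-overlapping patches (drops any remainder)."""
    n = len(series) // patch_len
    return [series[i * patch_len:(i + 1) * patch_len] for i in range(n)]


def embed_patch(patch, weights):
    """One linear layer followed by ReLU, applied to a single patch."""
    return [max(0.0, sum(w * x for w, x in zip(row, patch))) for row in weights]


def embed_series(series, patch_len, weights):
    # Each patch is embedded on its own; no patch sees any other patch.
    return [embed_patch(p, weights) for p in patchify(series, patch_len)]


toy_weights = [[1, 0], [0, 1], [1, 1]]  # maps a length-2 patch to 3 features
a = embed_series([1, 2, 3, 4], 2, toy_weights)
b = embed_series([1, 2, -3, 4], 2, toy_weights)  # only the second patch differs
print(a)
print(b)
```

Because patches are processed independently, perturbing one patch changes only that patch's embedding, which is the property contrasted here with attention across patches.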

Title: Mitigating Degree Biases in Message Passing Mechanism by Utilizing Community Structures. (arXiv:2312.16788v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16788
Code URL: https://github.com/nslab-cuk/community-aware-graph-transformer
Copy Paste: [[2312.16788]] Mitigating Degree Biases in Message Passing Mechanism by Utilizing Community Structures(http://arxiv.org/abs/2312.16788)
Summary:
This study utilizes community structures to address node degree biases in message-passing (MP) via learnable graph augmentations and novel graph transformers. Recent augmentation-based methods showed that MP neural networks often perform poorly on low-degree nodes, leading to degree biases due to a lack of messages reaching low-degree nodes. Despite their success, most methods use heuristic or uniform random augmentations, which are non-differentiable and may not always generate valuable edges for learning representations. In this paper, we propose Community-aware Graph Transformers, namely CGT, to learn degree-unbiased representations based on learnable augmentations and graph transformers by extracting within-community structures. We first design a learnable graph augmentation to generate more within-community edges connecting low-degree nodes through edge perturbation. Second, we propose an improved self-attention to learn underlying proximity and the roles of nodes within the community. Third, we propose a self-supervised learning task that could learn the representations to preserve the global graph structure and regularize the graph augmentations. Extensive experiments on various benchmark datasets showed that CGT outperforms state-of-the-art baselines and significantly mitigates node degree biases. The source code is available at https://github.com/NSLab-CUK/Community-aware-Graph-Transformer.

foundation model

Title: Time Travelling Pixels: Bitemporal Features Integration with Foundation Model for Remote Sensing Image Change Detection. (arXiv:2312.16202v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.16202
Code URL: https://github.com/KyanChen/TTP
Copy Paste: [[2312.16202]] Time Travelling Pixels: Bitemporal Features Integration with Foundation Model for Remote Sensing Image Change Detection(http://arxiv.org/abs/2312.16202)
Summary:
Change detection, a prominent research area in remote sensing, is pivotal in observing and analyzing surface transformations. Despite significant advancements achieved through deep learning-based methods, executing high-precision change detection in spatio-temporally complex remote sensing scenarios still presents a substantial challenge. The recent emergence of foundation models, with their powerful universality and generalization capabilities, offers potential solutions. However, bridging the gap of data and tasks remains a significant obstacle. In this paper, we introduce Time Travelling Pixels (TTP), a novel approach that integrates the latent knowledge of the SAM foundation model into change detection. This method effectively addresses the domain shift in general knowledge transfer and the challenge of expressing homogeneous and heterogeneous characteristics of multi-temporal images. The state-of-the-art results obtained on the LEVIR-CD underscore the efficacy of the TTP. The code is available at https://kychen.me/TTP.

Title: DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision. (arXiv:2312.16256v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.16256
Code URL: null
Copy Paste: [[2312.16256]] DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision(http://arxiv.org/abs/2312.16256)
Summary:
We have witnessed significant progress in deep learning-based 3D vision, ranging from neural radiance field (NeRF) based 3D representation learning to applications in novel view synthesis (NVS). However, existing scene-level datasets for deep learning-based 3D vision, limited to either synthetic environments or a narrow selection of real-world scenes, are quite insufficient. This insufficiency not only hinders a comprehensive benchmark of existing methods but also caps what could be explored in deep learning-based 3D analysis. To address this critical gap, we present DL3DV-10K, a large-scale scene dataset, featuring 51.2 million frames from 10,510 videos captured from 65 types of point-of-interest (POI) locations, covering both bounded and unbounded scenes, with different levels of reflection, transparency, and lighting. We conducted a comprehensive benchmark of recent NVS methods on DL3DV-10K, which revealed valuable insights for future research in NVS. In addition, we have obtained encouraging results in a pilot study to learn generalizable NeRF from DL3DV-10K, which manifests the necessity of a large-scale scene-level dataset to forge a path toward a foundation model for learning 3D representation. Our DL3DV-10K dataset, benchmark results, and models will be publicly accessible at https://dl3dv-10k.github.io/DL3DV-10K/.

Title: Segment Change Model (SCM) for Unsupervised Change detection in VHR Remote Sensing Images: a Case Study of Buildings. (arXiv:2312.16410v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.16410
Code URL: https://github.com/StephenApX/UCD-SCM
Copy Paste: [[2312.16410]] Segment Change Model (SCM) for Unsupervised Change detection in VHR Remote Sensing Images: a Case Study of Buildings(http://arxiv.org/abs/2312.16410)
Summary:
The field of Remote Sensing (RS) widely employs Change Detection (CD) on very-high-resolution (VHR) images. A majority of extant deep-learning-based methods hinge on annotated samples to complete the CD process. Recently, the emergence of Vision Foundation Models (VFMs) has enabled zero-shot predictions in particular vision tasks. In this work, we propose an unsupervised CD method named Segment Change Model (SCM), built upon the Segment Anything Model (SAM) and Contrastive Language-Image Pre-training (CLIP). Our method recalibrates features extracted at different scales and integrates them in a top-down manner to enhance discriminative change edges. We further design an innovative Piecewise Semantic Attention (PSA) scheme, which can offer semantic representation without training, thereby minimizing pseudo-change phenomena. In experiments on two public datasets, the proposed SCM increases the mIoU from 46.09% to 53.67% on the LEVIR-CD dataset, and from 47.56% to 52.14% on the WHU-CD dataset. Our code is available at https://github.com/StephenApX/UCD-SCM.

generative

Title: AI Mirage: The Impostor Bias and the Deepfake Detection Challenge in the Era of Artificial Illusions. (arXiv:2312.16220v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.16220
Code URL: null
Copy Paste: [[2312.16220]] AI Mirage: The Impostor Bias and the Deepfake Detection Challenge in the Era of Artificial Illusions(http://arxiv.org/abs/2312.16220)
Summary:
This paper provides a comprehensive analysis of cognitive biases in forensics and digital forensics, examining their implications for decision-making processes in these fields. It explores the various types of cognitive biases that may arise during forensic investigations and digital forensic analyses, such as confirmation bias, expectation bias, overconfidence in errors, contextual bias, and attributional biases. It also evaluates existing methods and techniques used to mitigate cognitive biases in these contexts, assessing the effectiveness of interventions aimed at reducing biases and improving decision-making outcomes. Additionally, this paper introduces a new cognitive bias, called "impostor bias", that may affect the use of generative Artificial Intelligence (AI) tools in forensics and digital forensics. The impostor bias is the tendency to doubt the authenticity or validity of the output generated by AI tools, such as deepfakes, in the form of audio, images, and videos. This bias may lead to erroneous judgments or false accusations, undermining the reliability and credibility of forensic evidence. The paper discusses the potential causes and consequences of the impostor bias, and suggests some strategies to prevent or counteract it. By addressing these topics, this paper seeks to offer valuable insights into understanding cognitive biases in forensic practices and provide recommendations for future research and practical applications to enhance the objectivity and validity of forensic investigations.

Title: MetaScript: Few-Shot Handwritten Chinese Content Generation via Generative Adversarial Networks. (arXiv:2312.16251v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.16251
Code URL: null
Copy Paste: [[2312.16251]] MetaScript: Few-Shot Handwritten Chinese Content Generation via Generative Adversarial Networks(http://arxiv.org/abs/2312.16251)
Summary:
In this work, we propose MetaScript, a novel Chinese content generation system designed to address the diminishing presence of personal handwriting styles in the digital representation of Chinese characters. Our approach harnesses the power of few-shot learning to generate Chinese characters that not only retain the individual's unique handwriting style but also maintain the efficiency of digital typing. Trained on a diverse dataset of handwritten styles, MetaScript is adept at producing high-quality stylistic imitations from minimal style references and standard fonts. Our work demonstrates a practical solution to the challenges of digital typography in preserving the personal touch in written communication, particularly in the context of Chinese script. Notably, our system has demonstrated superior performance in various evaluations, including recognition accuracy, inception score, and Frechet inception distance. At the same time, the training conditions of our model are easy to meet and facilitate generalization to real applications.

Title: Bellman Optimal Step-size Straightening of Flow-Matching Models. (arXiv:2312.16414v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.16414
Code URL: null
Copy Paste: [[2312.16414]] Bellman Optimal Step-size Straightening of Flow-Matching Models(http://arxiv.org/abs/2312.16414)
Summary:
Flow matching is a powerful framework for generating high-quality samples in various applications, especially image synthesis. However, the intensive computational demands of these models, especially during the fine-tuning and sampling processes, pose significant challenges for low-resource scenarios. This paper introduces the Bellman Optimal Step-size Straightening (BOSS) technique for distilling flow-matching generative models: it aims specifically for efficient few-step image sampling while adhering to a computational budget constraint. First, this technique involves a dynamic programming algorithm that optimizes the step sizes of the pretrained network. Then, it refines the velocity network to match the optimal step sizes, aiming to straighten the generation paths. Extensive experimental evaluations across image generation tasks demonstrate the efficacy of BOSS in terms of both resource utilization and image quality. Our results reveal that BOSS achieves substantial gains in efficiency while maintaining competitive sample quality, effectively bridging the gap between low-resource constraints and the demanding requirements of flow-matching generative models. Our paper also fortifies the responsible development of artificial intelligence, offering a more sustainable generative model that reduces computational costs and environmental footprints. Our code can be found at https://anonymous.4open.science/r/DRL-8E88.
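
The first stage, choosing step sizes by dynamic programming under a fixed step budget, can be sketched generically as follows. This is an illustrative sketch, not the BOSS code: the function `optimal_steps`, the discrete time grid, and the toy `cost(i, j)` are assumptions. In practice the per-jump cost would presumably measure the error of taking one large step with the pretrained network.

```python
def optimal_steps(num_points, num_steps, cost):
    """Choose `num_steps` jumps from grid index 0 to num_points - 1 that
    minimize the summed cost(i, j) of the jumps, via dynamic programming.

    Assumes 1 <= num_steps <= num_points - 1 so a schedule exists.
    """
    inf = float("inf")
    last = num_points - 1
    # best[k][j]: minimal cost of reaching grid index j with exactly k jumps.
    best = [[inf] * num_points for _ in range(num_steps + 1)]
    prev = [[None] * num_points for _ in range(num_steps + 1)]
    best[0][0] = 0.0
    for k in range(1, num_steps + 1):
        for j in range(1, num_points):
            for i in range(j):
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j], prev[k][j] = c, i
    # Walk back from the final index to recover the chosen schedule.
    path, j = [last], last
    for k in range(num_steps, 0, -1):
        j = prev[k][j]
        path.append(j)
    return path[::-1], best[num_steps][last]


# Toy cost (assumed): the error of a jump grows with the square of its length.
schedule, total = optimal_steps(9, 4, lambda i, j: (j - i) ** 2)
print(schedule, total)
```

With this uniform toy cost the optimum is an evenly spaced schedule; a cost that penalizes jumps more heavily in some region would concentrate the steps there.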

Title: Disentangled Continual Learning: Separating Memory Edits from Model Updates. (arXiv:2312.16731v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16731
Code URL: null
Copy Paste: [[2312.16731]] Disentangled Continual Learning: Separating Memory Edits from Model Updates(http://arxiv.org/abs/2312.16731)
Summary:
The ability of machine learning systems to learn continually is hindered by catastrophic forgetting, the tendency of neural networks to overwrite existing knowledge when learning a new task. Existing continual learning methods alleviate this problem through regularisation, parameter isolation, or rehearsal, and are typically evaluated on benchmarks consisting of a handful of tasks. We propose a novel conceptual approach to continual classification that aims to disentangle class-specific information that needs to be memorised from the class-agnostic knowledge that encapsulates generalization. We store the former in a buffer that can be easily pruned or updated when new categories arrive, while the latter is represented with a neural network that generalizes across tasks. We show that the class-agnostic network does not suffer from catastrophic forgetting and by leveraging it to perform classification, we improve accuracy on past tasks over time. In addition, our approach supports open-set classification and one-shot generalization. To test our conceptual framework, we introduce Infinite dSprites, a tool for creating continual classification and disentanglement benchmarks of arbitrary length with full control over generative factors. We show that over a sufficiently long time horizon all major types of continual learning methods break down, while our approach enables continual learning over hundreds of tasks with explicit control over memorization and forgetting.
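
Illustrative sketch (not from the paper): the abstract describes a buffer of class-specific information that can be pruned or updated, alongside a class-agnostic network. The toy class below assumes one stored exemplar per class and nearest-exemplar classification in the space produced by a hypothetical `embed` function standing in for that network. `ExemplarBuffer` and its methods are invented names, and the matching rule is an assumption rather than the authors' method.

```python
import math
from typing import Callable, Dict, Hashable, Sequence, Tuple

Vector = Sequence[float]


class ExemplarBuffer:
    """Editable class memory kept separate from a fixed, class-agnostic embedding."""

    def __init__(self, embed: Callable[[Vector], Vector]) -> None:
        self.embed = embed  # hypothetical stand-in for the class-agnostic network
        self.exemplars: Dict[Hashable, Tuple[float, ...]] = {}

    def add_class(self, label: Hashable, example: Vector) -> None:
        # Adding or updating a class touches only the buffer, never the network.
        self.exemplars[label] = tuple(self.embed(example))

    def remove_class(self, label: Hashable) -> None:
        # Forgetting a class is an explicit buffer edit.
        self.exemplars.pop(label, None)

    def classify(self, x: Vector) -> Hashable:
        if not self.exemplars:
            raise ValueError("buffer is empty")
        z = tuple(self.embed(x))
        # Return the label whose stored exemplar is closest in embedding space.
        return min(self.exemplars, key=lambda lbl: math.dist(z, self.exemplars[lbl]))
```

Because classes live only in the buffer, adding or removing one requires no gradient update in this sketch, which is the kind of explicit control over memorization and forgetting the abstract refers to.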

Title: HMP: Hand Motion Priors for Pose and Shape Estimation from Video. (arXiv:2312.16737v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.16737
Code URL: null
Copy Paste: [[2312.16737]] HMP: Hand Motion Priors for Pose and Shape Estimation from Video(http://arxiv.org/abs/2312.16737)
Summary:
Understanding how humans interact with the world necessitates accurate 3D hand pose estimation, a task complicated by the hand's high degree of articulation, frequent occlusions, self-occlusions, and rapid motions. While most existing methods rely on single-image inputs, videos have useful cues to address the aforementioned issues. However, existing video-based 3D hand datasets are insufficient for training feedforward models to generalize to in-the-wild scenarios. On the other hand, we have access to large human motion capture datasets which also include hand motions, e.g. AMASS. Therefore, we develop a generative motion prior specific to hands, trained on the AMASS dataset, which features diverse and high-quality hand motions. This motion prior is then employed for video-based 3D hand motion estimation following a latent optimization approach. Our integration of a robust motion prior significantly enhances performance, especially in occluded scenarios. It produces stable, temporally consistent results that surpass conventional single-frame methods. We demonstrate our method's efficacy via qualitative and quantitative evaluations on the HO3D and DexYCB datasets, with special emphasis on an occlusion-focused subset of HO3D. Code is available at https://hmp.is.tue.mpg.de

Title: Active Third-Person Imitation Learning. (arXiv:2312.16365v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16365
Code URL: null
Copy Paste: [[2312.16365]] Active Third-Person Imitation Learning(http://arxiv.org/abs/2312.16365)
Summary:
We consider the problem of third-person imitation learning with the additional challenge that the learner must select the perspective from which they observe the expert. In our setting, each perspective provides only limited information about the expert's behavior, and the learning agent must carefully select and combine information from different perspectives to achieve competitive performance. This setting is inspired by real-world imitation learning applications, e.g., in robotics, a robot might observe a human demonstrator via camera and receive information from different perspectives depending on the camera's position. We formalize the aforementioned active third-person imitation learning problem, theoretically analyze its characteristics, and propose a generative adversarial network-based active learning approach. Empirically, we demonstrate that our proposed approach can effectively learn from expert demonstrations and explore the importance of different architectural choices for the learner's performance.

anomaly

Title: ReSynthDetect: A Fundus Anomaly Detection Network with Reconstruction and Synthetic Features. (arXiv:2312.16470v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.16470
Code URL: null
Copy Paste: [[2312.16470]] ReSynthDetect: A Fundus Anomaly Detection Network with Reconstruction and Synthetic Features(http://arxiv.org/abs/2312.16470)
Summary:
Detecting anomalies in fundus images through unsupervised methods is a challenging task due to the similarity between normal and abnormal tissues, as well as their indistinct boundaries. The current methods have limitations in accurately detecting subtle anomalies while avoiding false positives. To address these challenges, we propose the ReSynthDetect network which utilizes a reconstruction network for modeling normal images, and an anomaly generator that produces synthetic anomalies consistent with the appearance of fundus images. By combining the features of consistent anomaly generation and image reconstruction, our method is suited for detecting fundus abnormalities. The proposed approach has been extensively tested on benchmark datasets such as EyeQ and IDRiD, demonstrating state-of-the-art performance in both image-level and pixel-level anomaly detection. Our experiments indicate a substantial 9% improvement in AUROC on EyeQ and a significant 17.1% improvement in AUPR on IDRiD.

in-context

Title: How Robust are LLMs to In-Context Majority Label Bias?. (arXiv:2312.16549v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.16549
Code URL: null
Copy Paste: [[2312.16549]] How Robust are LLMs to In-Context Majority Label Bias?(http://arxiv.org/abs/2312.16549)
Summary:
In the In-Context Learning (ICL) setup, various forms of label biases can manifest. One such manifestation is majority label bias, which arises when the distribution of labeled examples in the in-context samples is skewed towards one or more specific classes, making Large Language Models (LLMs) more prone to predicting those labels. Such discrepancies can arise from various factors, including logistical constraints, inherent biases in data collection methods, limited access to diverse data sources, etc., which are unavoidable in a real-world industry setup. In this work, we study the robustness of in-context learning in LLMs to shifts that occur due to majority label bias within the purview of text classification tasks. Prior works have shown that in-context learning with LLMs is susceptible to such biases. In our study, we go one level deeper and show that the robustness boundary varies widely for different models and tasks, with certain LLMs being highly robust (~90%) to majority label bias. Additionally, our findings highlight how model size and the richness of instructional prompts contribute to model robustness. We restrict our study to only publicly available open-source models to ensure transparency and reproducibility.