diffusion

Title: StableDreamer: Taming Noisy Score Distillation Sampling for Text-to-3D. (arXiv:2312.02189v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02189
Code URL: null
Copy Paste: [[2312.02189]] StableDreamer: Taming Noisy Score Distillation Sampling for Text-to-3D(http://arxiv.org/abs/2312.02189)
Summary:
In the realm of text-to-3D generation, utilizing 2D diffusion models through score distillation sampling (SDS) frequently leads to issues such as blurred appearances and multi-faced geometry, primarily due to the intrinsically noisy nature of the SDS loss. Our analysis identifies the core of these challenges as the interaction among noise levels in the 2D diffusion process, the architecture of the diffusion network, and the 3D model representation. To overcome these limitations, we present StableDreamer, a methodology incorporating three advances. First, inspired by InstructNeRF2NeRF, we formalize the equivalence of the SDS generative prior and a simple supervised L2 reconstruction loss. This finding provides a novel tool to debug SDS, which we use to show the impact of time-annealing noise levels on reducing multi-faced geometries. Second, our analysis shows that while image-space diffusion contributes to geometric precision, latent-space diffusion is crucial for vivid color rendition. Based on this observation, StableDreamer introduces a two-stage training strategy that effectively combines these aspects, resulting in high-fidelity 3D models. Third, we adopt an anisotropic 3D Gaussians representation, replacing Neural Radiance Fields (NeRFs), to enhance the overall quality, reduce memory usage during training, accelerate rendering speeds, and better capture semi-transparent objects. StableDreamer reduces multi-faced geometries, generates fine details, and converges stably.
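
The stated equivalence between the SDS prior and an L2 reconstruction loss can be checked numerically under one common set of assumptions. The sketch below assumes the standard forward process x_t = alpha_t * x + sigma_t * eps and a one-step denoised estimate x_hat = (x_t - sigma_t * eps_hat) / alpha_t; under those assumptions the SDS residual (eps_hat - eps) equals (alpha_t / sigma_t) * (x - x_hat). The function name and the random stand-in for the network prediction are hypothetical, and the paper's exact formulation may differ.

```python
import math
import random


def sds_residual_vs_l2(x, eps, eps_hat, alpha_t, sigma_t):
    """Return the SDS residual and the scaled L2 reconstruction residual."""
    # Forward diffusion of the rendered image x (assumed parameterization).
    x_t = [alpha_t * xi + sigma_t * ei for xi, ei in zip(x, eps)]
    # One-step denoised estimate implied by the predicted noise eps_hat.
    x_hat = [(xti - sigma_t * ehi) / alpha_t for xti, ehi in zip(x_t, eps_hat)]
    sds = [ehi - ei for ehi, ei in zip(eps_hat, eps)]
    l2 = [(alpha_t / sigma_t) * (xi - xhi) for xi, xhi in zip(x, x_hat)]
    return sds, l2


random.seed(0)
n = 5
x = [random.uniform(-1, 1) for _ in range(n)]
eps = [random.gauss(0, 1) for _ in range(n)]
eps_hat = [random.gauss(0, 1) for _ in range(n)]  # stand-in for a network output
alpha_t = 0.8
sigma_t = math.sqrt(1 - alpha_t ** 2)

sds, l2 = sds_residual_vs_l2(x, eps, eps_hat, alpha_t, sigma_t)
max_gap = max(abs(a - b) for a, b in zip(sds, l2))
print("largest difference between the two residuals:", max_gap)
```

The two residuals agree up to floating-point error, so with x_hat held fixed the SDS update direction matches the gradient of 0.5 * ||x - x_hat||^2 up to a timestep-dependent weight.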

Title: Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D. (arXiv:2312.02190v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02190
Code URL: null
Copy Paste: [[2312.02190]] Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D(http://arxiv.org/abs/2312.02190)
Summary:
Diffusion Handles is a novel approach to enabling 3D object edits on diffusion images. We accomplish these edits using existing pre-trained diffusion models and 2D image depth estimation, without any fine-tuning or 3D object retrieval. The edited results remain plausible and photo-real, and they preserve object identity. Diffusion Handles addresses a critically missing facet of generative image-based creative design and significantly advances the state-of-the-art in generative image editing. Our key insight is to lift diffusion activations for an object to 3D using a proxy depth, 3D-transform the depth and associated activations, and project them back to image space. The diffusion process applied to the manipulated activations with identity control produces plausible edited images showing complex 3D occlusion and lighting effects. We evaluate Diffusion Handles quantitatively on a large synthetic data benchmark and qualitatively through a user study, showing our output to be more plausible and better than prior art at both 3D editing and identity control.

Title: Exploiting Diffusion Priors for All-in-One Image Restoration. (arXiv:2312.02197v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02197
Code URL: null
Copy Paste: [[2312.02197]] Exploiting Diffusion Priors for All-in-One Image Restoration(http://arxiv.org/abs/2312.02197)
Summary:
All-in-one image restoration aims to solve various restoration tasks with a single model. To this end, we present a feasible way of exploiting the image priors captured by a pretrained diffusion model by addressing two challenges, i.e., degradation modeling and diffusion guidance. The former aims to simulate the process by which a clean image is degraded, and the latter aims at guiding the diffusion model to generate the corresponding clean image. With these motivations, we propose a zero-shot framework for all-in-one image restoration, termed ZeroAIR, which alternately performs test-time degradation modeling (TDM) and three-stage diffusion guidance (TDG) at each timestep of the reverse sampling. To be specific, TDM exploits the diffusion priors to learn a degradation model from a given degraded image, and TDG divides the timesteps into three stages to take full advantage of the varying diffusion priors. Thanks to their degradation-agnostic property, all-in-one image restoration can be achieved in a zero-shot way by ZeroAIR. Through extensive experiments, we show that ZeroAIR achieves comparable or even better performance than task-specific methods. The code will be available on GitHub.

Title: ImageDream: Image-Prompt Multi-view Diffusion for 3D Generation. (arXiv:2312.02201v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02201
Code URL: null
Copy Paste: [[2312.02201]] ImageDream: Image-Prompt Multi-view Diffusion for 3D Generation(http://arxiv.org/abs/2312.02201)
Summary:
We introduce "ImageDream," an innovative image-prompt, multi-view diffusion model for 3D object generation. ImageDream stands out for its ability to produce 3D models of higher quality compared to existing state-of-the-art, image-conditioned methods. Our approach utilizes a canonical camera coordination for the objects in images, improving visual geometry accuracy. The model is designed with various levels of control at each block inside the diffusion model based on the input image, where global control shapes the overall object layout and local control fine-tunes the image details. The effectiveness of ImageDream is demonstrated through extensive evaluations using a standard prompt list. For more information, visit our project page at https://Image-Dream.github.io.

Title: Portrait Diffusion: Training-free Face Stylization with Chain-of-Painting. (arXiv:2312.02212v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02212
Code URL: https://github.com/liujin112/portraitdiffusion
Copy Paste: [[2312.02212]] Portrait Diffusion: Training-free Face Stylization with Chain-of-Painting(http://arxiv.org/abs/2312.02212)
Summary:
Face stylization refers to the transformation of a face into a specific portrait style. However, current methods rely on example-based adaptation to fine-tune pre-trained generative models, which demands considerable time and storage space and still fails to achieve detailed style transformation. This paper proposes a training-free face stylization framework, named Portrait Diffusion. This framework leverages off-the-shelf text-to-image diffusion models, eliminating the need for fine-tuning on specific examples. Specifically, the content and style images are first inverted into latent codes. Then, during image reconstruction using the corresponding latent code, the content and style features in the attention space are delicately blended through a modified self-attention operation called Style Attention Control. Additionally, a Chain-of-Painting method is proposed for the gradual redrawing of unsatisfactory areas from rough adjustments to fine-tuning. Extensive experiments validate the effectiveness of our Portrait Diffusion method and demonstrate the superiority of Chain-of-Painting in achieving precise face stylization. Code will be released at https://github.com/liujin112/PortraitDiffusion.

Title: Slice3D: Multi-Slice, Occlusion-Revealing, Single View 3D Reconstruction. (arXiv:2312.02221v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02221
Code URL: null
Copy Paste: [[2312.02221]] Slice3D: Multi-Slice, Occlusion-Revealing, Single View 3D Reconstruction(http://arxiv.org/abs/2312.02221)
Summary:
We introduce multi-slice reasoning, a new notion for single-view 3D reconstruction which challenges the current and prevailing belief that multi-view synthesis is the most natural conduit between single-view and 3D. Our key observation is that object slicing is more advantageous than altering views to reveal occluded structures. Specifically, slicing is more occlusion-revealing since it can peel through any occluders without obstruction. In the limit, i.e., with infinitely many slices, it is guaranteed to unveil all hidden object parts. We realize our idea by developing Slice3D, a novel method for single-view 3D reconstruction which first predicts multi-slice images from a single RGB image and then integrates the slices into a 3D model using a coordinate-based transformer network for signed distance prediction. The slice images can be regressed or generated, both through a U-Net based network. For the former, we inject a learnable slice indicator code to designate each decoded image into a spatial slice location, while the slice generator is a denoising diffusion model operating on the entirety of slice images stacked on the input channels. We conduct extensive evaluation against state-of-the-art alternatives to demonstrate superiority of our method, especially in recovering complex and severely occluded shape structures, amid ambiguities. All Slice3D results were produced by networks trained on a single Nvidia A40 GPU, with an inference time less than 20 seconds.

Title: MedXChat: Bridging CXR Modalities with a Unified Multimodal Large Model. (arXiv:2312.02233v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02233
Code URL: null
Copy Paste: [[2312.02233]] MedXChat: Bridging CXR Modalities with a Unified Multimodal Large Model(http://arxiv.org/abs/2312.02233)
Summary:
Despite the success of Large Language Models (LLMs) in general image tasks, a gap persists in the medical field for a multimodal large model adept at handling the nuanced diversity of medical images. Addressing this, we propose MedXChat, a unified multimodal large model designed for seamless interactions between medical assistants and users. MedXChat encompasses three key functionalities: CXR(Chest X-ray)-to-Report generation, CXR-based visual question-answering (VQA), and Text-to-CXR synthesis. Our contributions are as follows. Firstly, our model showcases exceptional cross-task adaptability, displaying adeptness across all three defined tasks and outperforming the benchmark models on the MIMIC dataset in medical multimodal applications. Secondly, we introduce an innovative Text-to-CXR synthesis approach that utilizes instruction-following capabilities within the Stable Diffusion (SD) architecture. This technique integrates smoothly with the existing model framework, requiring no extra parameters, thereby maintaining the SD's generative strength while also bestowing upon it the capacity to render fine-grained medical images with high fidelity. Comprehensive experiments validate MedXChat's synergistic enhancement across all tasks. Our instruction data and model will be open-sourced.

Title: X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model. (arXiv:2312.02238v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02238
Code URL: null
Copy Paste: [[2312.02238]] X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model(http://arxiv.org/abs/2312.02238)
Summary:
We introduce X-Adapter, a universal upgrader that enables pretrained plug-and-play modules (e.g., ControlNet, LoRA) to work directly with an upgraded text-to-image diffusion model (e.g., SDXL) without further retraining. We achieve this goal by training an additional network to control the frozen upgraded model with new text-image data pairs. In detail, X-Adapter keeps a frozen copy of the old model to preserve the connectors of different plugins. Additionally, X-Adapter adds trainable mapping layers that bridge the decoders from models of different versions for feature remapping. The remapped features are used as guidance for the upgraded model. To enhance the guidance ability of X-Adapter, we employ a null-text training strategy for the upgraded model. After training, we also introduce a two-stage denoising strategy to align the initial latents of X-Adapter and the upgraded model. Thanks to our strategies, X-Adapter demonstrates universal compatibility with various plugins and also enables plugins of different versions to work together, thereby expanding the functionalities of the diffusion community. To verify the effectiveness of the proposed method, we conduct extensive experiments, and the results show that X-Adapter may facilitate wider application in the upgraded foundational diffusion model.

Title: Conditional Variational Diffusion Models. (arXiv:2312.02246v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02246
Code URL: null
Copy Paste: [[2312.02246]] Conditional Variational Diffusion Models(http://arxiv.org/abs/2312.02246)
Summary:
Inverse problems aim to determine parameters from observations, a crucial task in engineering and science. Lately, generative models, especially diffusion models, have gained popularity in this area for their ability to produce realistic solutions and their good mathematical properties. Despite their success, an important drawback of diffusion models is their sensitivity to the choice of variance schedule, which controls the dynamics of the diffusion process. Fine-tuning this schedule for specific applications is crucial but time-costly and does not guarantee an optimal result. We propose a novel approach for learning the schedule as part of the training process. Our method supports probabilistic conditioning on data, provides high-quality solutions, and is flexible, proving able to adapt to different applications with minimum overhead. This approach is tested in two unrelated inverse problems: super-resolution microscopy and quantitative phase imaging, yielding comparable or superior results to previous methods and fine-tuned diffusion models. We conclude that fine-tuning the schedule by experimentation should be avoided because it can be learned during training in a stable way that yields better results.

Title: Large Language Models as Consistent Story Visualizers. (arXiv:2312.02252v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02252
Code URL: null
Copy Paste: [[2312.02252]] Large Language Models as Consistent Story Visualizers(http://arxiv.org/abs/2312.02252)
Summary:
Recent generative models have demonstrated impressive capabilities in generating realistic and visually pleasing images grounded on textual prompts. Nevertheless, a significant challenge remains in applying these models to the more intricate task of story visualization, since it requires resolving pronouns (he, she, they) in the frame descriptions, i.e., anaphora resolution, and ensuring consistent characters and background synthesis across frames. Yet, the emerging Large Language Model (LLM) showcases robust reasoning abilities to navigate through ambiguous references and process extensive sequences. Therefore, we introduce StoryGPT-V, which leverages the merits of the latent diffusion model (LDM) and LLM to produce images with consistent and high-quality characters grounded on given story descriptions. First, we train a character-aware LDM, which takes character-augmented semantic embedding as input and includes the supervision of the cross-attention map using character segmentation masks, aiming to enhance character generation accuracy and faithfulness. In the second stage, we enable an alignment between the output of the LLM and the character-augmented embedding residing in the input space of the first-stage model. This harnesses the reasoning ability of the LLM to address ambiguous references and its comprehension capability to memorize the context. We conduct comprehensive experiments on two story visualization benchmarks. Our model reports superior quantitative results and consistently generates accurate characters of remarkable quality with low memory consumption. Our code will be made publicly available.

Title: Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images. (arXiv:2312.02253v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02253
Code URL: null
Copy Paste: [[2312.02253]] Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images(http://arxiv.org/abs/2312.02253)
Summary:
Recent advances in generative deep learning have enabled the creation of high-quality synthetic images in text-to-image generation. Prior work shows that fine-tuning a pretrained diffusion model on ImageNet and generating synthetic training images from the fine-tuned model can enhance an ImageNet classifier's performance. However, performance degrades as synthetic images outnumber real ones. In this paper, we explore whether generative fine-tuning is essential for this improvement and whether it is possible to further scale up training using more synthetic data. We present a new framework leveraging off-the-shelf generative models to generate synthetic training images, addressing multiple challenges: class name ambiguity, lack of diversity in naive prompts, and domain shifts. Specifically, we leverage large language models (LLMs) and CLIP to resolve class name ambiguity. To diversify images, we propose contextualized diversification (CD) and stylized diversification (SD) methods, also prompted by LLMs. Finally, to mitigate domain shifts, we leverage domain adaptation techniques with auxiliary batch normalization for synthetic images. Our framework consistently enhances recognition model performance with more synthetic data, up to 6x the original ImageNet size, showcasing the potential of synthetic data for improved recognition models and strong out-of-domain generalization.

Title: EMDM: Efficient Motion Diffusion Model for Fast, High-Quality Motion Generation. (arXiv:2312.02256v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02256
Code URL: null
Copy Paste: [[2312.02256]] EMDM: Efficient Motion Diffusion Model for Fast, High-Quality Motion Generation(http://arxiv.org/abs/2312.02256)
Summary:
We introduce Efficient Motion Diffusion Model (EMDM) for fast and high-quality human motion generation. Although previous motion diffusion models have shown impressive results, they struggle to achieve fast generation while maintaining high-quality human motions. Motion latent diffusion has been proposed for efficient motion generation. However, effectively learning a latent space can be non-trivial in such a two-stage manner. Meanwhile, accelerating motion sampling by increasing the step size, e.g., DDIM, typically leads to a decline in motion quality because complex data distributions are poorly approximated when the step size is naively increased. In this paper, we propose EMDM, which allows for far fewer sampling steps for fast motion generation by modeling the complex denoising distribution during multiple sampling steps. Specifically, we develop a Conditional Denoising Diffusion GAN to capture multimodal data distributions conditioned on both control signals, i.e., textual description and denoising time step. By modeling the complex data distribution, a larger sampling step size and fewer steps are achieved during motion synthesis, significantly accelerating the generation process. To effectively capture the human dynamics and reduce undesired artifacts, we employ motion geometric loss during network training, which improves the motion quality and training efficiency. As a result, EMDM achieves a remarkable speed-up at the generation stage while maintaining high-quality motion generation in terms of fidelity and diversity.

Title: Towards Granularity-adjusted Pixel-level Semantic Annotation. (arXiv:2312.02420v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02420
Code URL: null
Copy Paste: [[2312.02420]] Towards Granularity-adjusted Pixel-level Semantic Annotation(http://arxiv.org/abs/2312.02420)
Summary:
Recent advancements in computer vision predominantly rely on learning-based systems, leveraging annotations as the driving force to develop specialized models. However, annotating pixel-level information, particularly in semantic segmentation, presents a challenging and labor-intensive task, prompting the need for autonomous processes. In this work, we propose GranSAM, which distinguishes itself by providing semantic segmentation at a user-defined granularity level on unlabeled data without the need for any manual supervision, offering a unique contribution in the realm of semantic mask annotation methods. Specifically, we propose an approach to enable the Segment Anything Model (SAM) with semantic recognition capability to generate pixel-level annotations for images without any manual supervision. For this, we accumulate semantic information from synthetic images generated by the Stable Diffusion model or from web-crawled images and employ this data to learn a mapping function between SAM mask embeddings and object class labels. As a result, SAM, enabled with granularity-adjusted mask recognition, can be used for pixel-level semantic annotation purposes. We conducted experiments on the PASCAL VOC 2012 and COCO-80 datasets and observed a +17.95% and +5.17% increase in mIoU, respectively, compared to existing state-of-the-art methods when evaluated under our problem setting.

Title: Orthogonal Adaptation for Modular Customization of Diffusion Models. (arXiv:2312.02432v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02432
Code URL: null
Copy Paste: [[2312.02432]] Orthogonal Adaptation for Modular Customization of Diffusion Models(http://arxiv.org/abs/2312.02432)
Summary:
Customization techniques for text-to-image models have paved the way for a wide range of previously unattainable applications, enabling the generation of specific concepts across diverse contexts and styles. While existing methods facilitate high-fidelity customization for individual concepts or a limited, pre-defined set of them, they fall short of achieving scalability, where a single model can seamlessly render countless concepts. In this paper, we address a new problem called Modular Customization, with the goal of efficiently merging customized models that were fine-tuned independently for individual concepts. This allows the merged model to jointly synthesize concepts in one image without compromising fidelity or incurring any additional computational costs.

To address this problem, we introduce Orthogonal Adaptation, a method designed to encourage the customized models, which do not have access to each other during fine-tuning, to have orthogonal residual weights. This ensures that during inference time, the customized models can be summed with minimal interference.

Our proposed method is both simple and versatile, applicable to nearly all optimizable weights in the model architecture. Through an extensive set of quantitative and qualitative evaluations, our method consistently outperforms relevant baselines in terms of efficiency and identity preservation, demonstrating a significant leap toward scalable customization of diffusion models.
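
Why orthogonal residual weights can be summed with little interference is easiest to see on a toy linear layer. The sketch below is an illustration only, not the paper's training procedure: it assumes each customized model adds a low-rank residual B_i A_i to a shared weight W0, and that the rows of A1 and A2 are chosen orthogonal. All matrices and names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add(a, b):
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


# Shared pretrained weight (2 outputs, 4 inputs).
W0 = [[1, 0, 2, 0],
      [0, 1, 0, 2]]

# Two independently trained rank-1 residuals with orthogonal input directions.
A1, B1 = [[1, 0, 0, 0]], [[2], [3]]
A2, B2 = [[0, 1, 0, 0]], [[5], [7]]
dW1 = matmul(B1, A1)
dW2 = matmul(B2, A2)

# Orthogonality check: the residuals have no overlapping input directions.
cross = matmul(dW1, transpose(dW2))

# Merging is a plain sum of residuals on top of the shared weight.
merged = add(add(W0, dW1), dW2)

# An input aligned with concept 1 sees the same output before and after merging.
x1 = [1, 0, 0, 0]
out_single = matvec(add(W0, dW1), x1)
out_merged = matvec(merged, x1)
print(out_single, out_merged)
```

Because dW2 maps every input in concept 1's direction to zero, adding it leaves concept 1's response unchanged, which is the "minimal interference" property in its simplest form.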

Title: Retrieving Conditions from Reference Images for Diffusion Models. (arXiv:2312.02521v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02521
Code URL: null
Copy Paste: [[2312.02521]] Retrieving Conditions from Reference Images for Diffusion Models(http://arxiv.org/abs/2312.02521)
Summary:
Recent diffusion-based subject-driven generative methods have enabled image generation with good fidelity for specific objects or human portraits. However, to achieve better versatility for applications, we argue that not only are improved datasets and evaluations desired, but more careful methods to retrieve only relevant information from conditional images are also needed. To this end, we propose an anime figures dataset, RetriBooru-V1, with enhanced identity and clothing labels. We state new tasks enabled by this dataset and introduce a new diversity metric to measure success in completing these tasks, quantifying the flexibility of image generation. We establish a RAG-inspired baseline method designed to retrieve precise conditional information from reference images. Then, we compare with current methods on existing tasks to demonstrate the capability of the proposed method. Finally, we provide baseline experiment results on the new tasks and conduct ablation studies on the possible structural choices.

Title: GeNIe: Generative Hard Negative Images Through Diffusion. (arXiv:2312.02548v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02548
Code URL: https://github.com/ucdvision/genie
Copy Paste: [[2312.02548]] GeNIe: Generative Hard Negative Images Through Diffusion(http://arxiv.org/abs/2312.02548)
Summary:
Data augmentation is crucial in training deep models, preventing them from overfitting to limited data. Common data augmentation methods are effective, but recent advancements in generative AI, such as diffusion models for image generation, enable more sophisticated augmentation techniques that produce data resembling natural images. We recognize that augmented samples closer to the ideal decision boundary of a classifier are particularly effective and efficient in guiding the learning process. We introduce GeNIe which leverages a diffusion model conditioned on a text prompt to merge contrasting data points (an image from the source category and a text prompt from the target category) to generate challenging samples for the target category. Inspired by recent image editing methods, we limit the number of diffusion iterations and the amount of noise. This ensures that the generated image retains low-level and contextual features from the source image, potentially conflicting with the target category. Our extensive experiments, in few-shot and also long-tail distribution settings, demonstrate the effectiveness of our novel augmentation method, especially benefiting categories with a limited number of examples.
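
Limiting the amount of noise and the number of diffusion iterations can be sketched as a partial forward-noising step, in the style of image-to-image editing. The code below assumes the standard DDPM forward process with a linear beta schedule; the function names, the schedule constants, and the ratio 0.7 are illustrative assumptions, not values taken from the paper.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alpha_bar, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


def partially_noise(image, noise_ratio, alpha_bar, rng):
    """Noise `image` only up to step t = noise_ratio * T, not to pure noise."""
    t = max(1, round(noise_ratio * len(alpha_bar)))
    a = alpha_bar[t - 1]
    noisy = [math.sqrt(a) * p + math.sqrt(1.0 - a) * rng.gauss(0, 1)
             for p in image]
    # A sampler would then run only t reverse steps, conditioned on the
    # text prompt of the target category.
    return noisy, t


rng = random.Random(0)
alpha_bar = linear_alpha_bar(1000)
source_image = [0.5, -0.25, 0.75, 0.0]  # toy stand-in for source-category pixels

noisy, steps = partially_noise(source_image, 0.7, alpha_bar, rng)
signal_kept = math.sqrt(alpha_bar[steps - 1])
_, fewer_steps = partially_noise(source_image, 0.2, alpha_bar, rng)
print(steps, fewer_steps, signal_kept)
```

A smaller ratio keeps more of the source image's signal and needs fewer reverse steps, which is how the generated sample can retain low-level and contextual features of the source category.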

Title: Prompt2NeRF-PIL: Fast NeRF Generation via Pretrained Implicit Latent. (arXiv:2312.02568v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02568
Code URL: null
Copy Paste: [[2312.02568]] Prompt2NeRF-PIL: Fast NeRF Generation via Pretrained Implicit Latent(http://arxiv.org/abs/2312.02568)
Summary:
This paper explores promptable NeRF generation (e.g., text prompt or single image prompt) for direct conditioning and fast generation of NeRF parameters for the underlying 3D scenes, thus undoing complex intermediate steps while providing full 3D generation with conditional control. Unlike previous diffusion-CLIP-based pipelines that involve tedious per-prompt optimizations, Prompt2NeRF-PIL is capable of generating a variety of 3D objects with a single forward pass, leveraging a pre-trained implicit latent space of NeRF parameters. Furthermore, in zero-shot tasks, our experiments demonstrate that the NeRFs produced by our method serve as semantically informative initializations, significantly accelerating the inference process of existing prompt-to-NeRF methods. Specifically, we will show that our approach speeds up the text-to-NeRF model DreamFusion and the 3D reconstruction speed of the image-to-NeRF method Zero-1-to-3 by 3 to 5 times.

Title: Projection Regret: Reducing Background Bias for Novelty Detection via Diffusion Models. (arXiv:2312.02615v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02615
Code URL: null
Copy Paste: [[2312.02615]] Projection Regret: Reducing Background Bias for Novelty Detection via Diffusion Models(http://arxiv.org/abs/2312.02615)
Summary:
Novelty detection is a fundamental task of machine learning which aims to detect abnormal (i.e., out-of-distribution (OOD)) samples. Since diffusion models have recently emerged as the de facto standard generative framework with surprising generation results, novelty detection via diffusion models has also gained much attention. Recent methods have mainly utilized the reconstruction property of in-distribution samples. However, they often struggle to detect OOD samples that share similar background information with the in-distribution data. Based on our observation that diffusion models can project any sample to an in-distribution sample with similar background information, we propose Projection Regret (PR), an efficient novelty detection method that mitigates the bias of non-semantic information. To be specific, PR computes the perceptual distance between the test image and its diffusion-based projection to detect abnormality. Since the perceptual distance often fails to capture semantic changes when the background information is dominant, we cancel out the background bias by comparing it against recursive projections. Extensive experiments demonstrate that PR outperforms the prior art of generative-model-based novelty detection methods by a significant margin.
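
The idea of cancelling background bias with a recursive projection can be illustrated on a toy problem. The sketch below assumes one plausible form of the score, d(x, P(x)) - d(P(x), P(P(x))), which may not match the paper's exact definition. The projection, the distance, and the two-number "images" are hypothetical stand-ins for a diffusion-based projection and a perceptual distance.

```python
def naive_score(x, project, distance):
    """Distance between a sample and its projection (no bias correction)."""
    return distance(x, project(x))


def projection_regret(x, project, distance):
    """Subtract the distance incurred by projecting an already projected sample."""
    p = project(x)
    return distance(x, p) - distance(p, project(p))


# Toy sample = (semantic, background). In-distribution data has semantic == 0.
def toy_project(sample):
    semantic, background = sample
    # Projection removes the semantic anomaly but also perturbs the background
    # in proportion to how busy it is.
    return (0.0, 0.9 * background)


def toy_distance(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


in_dist_busy = (0.0, 10.0)  # normal sample with a busy background
ood_plain = (0.5, 0.0)      # anomalous sample with a plain background

naive_in = naive_score(in_dist_busy, toy_project, toy_distance)
naive_ood = naive_score(ood_plain, toy_project, toy_distance)
pr_in = projection_regret(in_dist_busy, toy_project, toy_distance)
pr_ood = projection_regret(ood_plain, toy_project, toy_distance)
print(naive_in, naive_ood, pr_in, pr_ood)
```

In this toy setting the naive score ranks the busy in-distribution sample as more anomalous than the true outlier, while the regret-style score reverses that ordering because the background contribution largely cancels.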

Title: DreaMo: Articulated 3D Reconstruction From A Single Casual Video. (arXiv:2312.02617v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02617
Code URL: null
Copy Paste: [[2312.02617]] DreaMo: Articulated 3D Reconstruction From A Single Casual Video(http://arxiv.org/abs/2312.02617)
Summary:
Articulated 3D reconstruction has valuable applications in various domains, yet it remains costly and demands intensive work from domain experts. Recent advancements in template-free learning methods show promising results with monocular videos. Nevertheless, these approaches necessitate a comprehensive coverage of all viewpoints of the subject in the input video, thus limiting their applicability to casually captured videos from online sources. In this work, we study articulated 3D shape reconstruction from a single and casually captured internet video, where the subject's view coverage is incomplete. We propose DreaMo that jointly performs shape reconstruction while solving the challenging low-coverage regions with view-conditioned diffusion prior and several tailored regularizations. In addition, we introduce a skeleton generation strategy to create human-interpretable skeletons from the learned neural bones and skinning weights. We conduct our study on a self-collected internet video collection characterized by incomplete view coverage. DreaMo shows promising quality in novel-view rendering, detailed articulated shape reconstruction, and skeleton generation. Extensive qualitative and quantitative studies validate the efficacy of each proposed component, and show existing methods are unable to solve correct geometry due to the incomplete view coverage.

Title: Diffusion Noise Feature: Accurate and Fast Generated Image Detection. (arXiv:2312.02625v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02625
Code URL: null
Copy Paste: [[2312.02625]] Diffusion Noise Feature: Accurate and Fast Generated Image Detection(http://arxiv.org/abs/2312.02625)
Summary:
Generative models have reached an advanced stage where they can produce remarkably realistic images. However, this remarkable generative capability also introduces the risk of disseminating false or misleading information. Notably, existing detectors for generated images encounter challenges such as low accuracy and limited generalization. This paper addresses this issue by seeking a representation with strong generalization capabilities to enhance the detection of generated images. Our investigation has revealed that real and generated images display distinct latent Gaussian representations when subjected to an inverse diffusion process within a pre-trained diffusion model. Exploiting this disparity, we can amplify subtle artifacts in generated images. Building upon this insight, we introduce a novel image representation known as Diffusion Noise Feature (DNF). DNF is an ensemble representation that estimates the noise generated during the inverse diffusion process. A simple classifier, e.g., ResNet, trained on DNF achieves high accuracy, robustness, and generalization capabilities for detecting generated images, even from previously unseen classes or models. We conducted experiments using a widely recognized and standard dataset, achieving state-of-the-art detection performance.

Title: TPA3D: Triplane Attention for Fast Text-to-3D Generation. (arXiv:2312.02647v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02647
Code URL: null
Copy Paste: [[2312.02647]] TPA3D: Triplane Attention for Fast Text-to-3D Generation(http://arxiv.org/abs/2312.02647)
Summary:
Due to the lack of large-scale text-3D correspondence data, recent text-to-3D generation works mainly rely on utilizing 2D diffusion models for synthesizing 3D data. Since diffusion-based methods typically require significant optimization time for both training and inference, the use of GAN-based models would still be desirable for fast 3D generation. In this work, we propose Triplane Attention for text-guided 3D generation (TPA3D), an end-to-end trainable GAN-based deep learning model for fast text-to-3D generation. With only 3D shape data and their rendered 2D images observed during training, our TPA3D is designed to retrieve detailed visual descriptions for synthesizing the corresponding 3D mesh data. This is achieved by the proposed attention mechanisms on the extracted sentence and word-level text features. In our experiments, we show that TPA3D generates high-quality 3D textured shapes aligned with fine-grained descriptions, while impressive computation efficiency can be observed.

Title: Analyzing and Improving the Training Dynamics of Diffusion Models. (arXiv:2312.02696v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02696
Code URL: null
Copy Paste: [[2312.02696]] Analyzing and Improving the Training Dynamics of Diffusion Models(http://arxiv.org/abs/2312.02696)
Summary:
Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets. In this paper, we identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture, without altering its high-level structure. Observing uncontrolled magnitude changes and imbalances in both the network activations and weights over the course of training, we redesign the network layers to preserve activation, weight, and update magnitudes on expectation. We find that systematic application of this philosophy eliminates the observed drifts and imbalances, resulting in considerably better networks at equal computational complexity. Our modifications improve the previous record FID of 2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic sampling.

As an independent contribution, we present a method for setting the exponential moving average (EMA) parameters post-hoc, i.e., after completing the training run. This allows precise tuning of EMA length without the cost of performing several training runs, and reveals its surprising interactions with network architecture, training time, and guidance.

Title: Neural Sign Actors: A diffusion model for 3D sign language production from text. (arXiv:2312.02702v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02702
Code URL: null
Copy Paste: [[2312.02702]] Neural Sign Actors: A diffusion model for 3D sign language production from text(http://arxiv.org/abs/2312.02702)
Summary:
Sign Languages (SL) serve as the predominant mode of communication for the Deaf and Hard of Hearing communities. The advent of deep learning has aided numerous methods in SL recognition and translation, achieving remarkable results. However, Sign Language Production (SLP) poses a challenge for the computer vision community as the motions generated must be realistic and have precise semantic meanings. Most SLP methods rely on 2D data, thus impeding their ability to attain a necessary level of realism. In this work, we propose a diffusion-based SLP model trained on a curated large-scale dataset of 4D signing avatars and their corresponding text transcripts. The proposed method can generate dynamic sequences of 3D avatars from an unconstrained domain of discourse using a diffusion process formed on a novel and anatomically informed graph neural network defined on the SMPL-X body skeleton. Through a series of quantitative and qualitative experiments, we show that the proposed method considerably outperforms previous methods of SLP. We believe that this work presents an important and necessary step towards realistic neural sign avatars, bridging the communication gap between Deaf and hearing communities. The code, method and generated data will be made publicly available.

Title: A Conditional Denoising Diffusion Probabilistic Model for Point Cloud Upsampling. (arXiv:2312.02719v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02719
Code URL: null
Copy Paste: [[2312.02719]] A Conditional Denoising Diffusion Probabilistic Model for Point Cloud Upsampling(http://arxiv.org/abs/2312.02719)
Summary:
Point cloud upsampling (PCU) enriches the representation of raw point clouds, significantly improving the performance in downstream tasks such as classification and reconstruction. Most of the existing point cloud upsampling methods focus on sparse point cloud feature extraction and upsampling module design. In contrast, we dive deeper into directly modelling the gradient of data distribution from dense point clouds. In this paper, we propose a conditional denoising diffusion probabilistic model (DDPM) for point cloud upsampling, called PUDM. Specifically, PUDM treats the sparse point cloud as a condition, and iteratively learns the transformation relationship between the dense point cloud and the noise. Simultaneously, PUDM aligns with a dual mapping paradigm to further improve the discernment of point features. In this context, PUDM enables learning complex geometry details in the ground truth through the dominant features, while avoiding an additional upsampling module design. Furthermore, to generate high-quality arbitrary-scale point clouds during inference, PUDM exploits the prior knowledge of the scale between sparse point clouds and dense point clouds during training by parameterizing a rate factor. Moreover, PUDM exhibits strong noise robustness in experimental results. In the quantitative and qualitative evaluations on PU1K and PUGAN, PUDM significantly outperforms existing methods in terms of Chamfer Distance (CD) and Hausdorff Distance (HD), achieving state-of-the-art (SOTA) performance.
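
The two metrics named in this summary can be illustrated with a minimal pure-Python sketch. It assumes one common convention (Chamfer Distance as the sum of the two mean squared nearest-neighbour distances, Hausdorff Distance as the larger of the two worst-case nearest-neighbour distances); the abstract does not say which convention PUDM uses, and the point clouds below are made up for illustration.

```python
import math

def _nearest_sq(p, cloud):
    """Squared Euclidean distance from point p to its nearest neighbour in cloud."""
    return min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in cloud)

def chamfer_distance(x, y):
    """Symmetric Chamfer Distance: mean squared nearest-neighbour distance, summed over both directions."""
    forward = sum(_nearest_sq(p, y) for p in x) / len(x)
    backward = sum(_nearest_sq(q, x) for q in y) / len(y)
    return forward + backward

def hausdorff_distance(x, y):
    """Symmetric Hausdorff Distance: the worst-case nearest-neighbour distance in either direction."""
    forward = max(math.sqrt(_nearest_sq(p, y)) for p in x)
    backward = max(math.sqrt(_nearest_sq(q, x)) for q in y)
    return max(forward, backward)

# Toy clouds (illustrative only): the third point is displaced by 1 along z.
dense = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
upsampled = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 1.0)]

print(chamfer_distance(dense, upsampled))
print(hausdorff_distance(dense, upsampled))
```

This brute-force version is quadratic in the number of points; evaluation code for real benchmarks would normally use a spatial index or a batched GPU implementation.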

Title: Generating Fine-Grained Human Motions Using ChatGPT-Refined Descriptions. (arXiv:2312.02772v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02772
Code URL: null
Copy Paste: [[2312.02772]] Generating Fine-Grained Human Motions Using ChatGPT-Refined Descriptions(http://arxiv.org/abs/2312.02772)
Summary:
Recently, significant progress has been made in text-based motion generation, enabling the generation of diverse and high-quality human motions that conform to textual descriptions. However, it remains challenging to generate fine-grained or stylized motions due to the lack of datasets annotated with detailed textual descriptions. By adopting a divide-and-conquer strategy, we propose a new framework named Fine-Grained Human Motion Diffusion Model (FG-MDM) for human motion generation. Specifically, we first parse previous vague textual annotation into fine-grained description of different body parts by leveraging a large language model (GPT-3.5). We then use these fine-grained descriptions to guide a transformer-based diffusion model. FG-MDM can generate fine-grained and stylized motions even outside of the distribution of the training data. Our experimental results demonstrate the superiority of FG-MDM over previous methods, especially the strong generalization capability. We will release our fine-grained textual annotations for HumanML3D and KIT.

Title: BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models. (arXiv:2312.02813v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02813
Code URL: null
Copy Paste: [[2312.02813]] BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models(http://arxiv.org/abs/2312.02813)
Summary:
Diffusion models have made tremendous progress in text-driven image and video generation. Now text-to-image foundation models are widely applied to various downstream image synthesis tasks, such as controllable image generation and image editing, while downstream video synthesis tasks are less explored for several reasons. First, it requires huge memory and compute overhead to train a video generation foundation model. Even with video foundation models, additional costly training is still required for downstream video synthesis tasks. Second, although some works extend image diffusion models into videos in a training-free manner, temporal consistency cannot be well kept. Finally, these adaptation methods are specifically designed for one task and fail to generalize to different downstream video synthesis tasks. To mitigate these issues, we propose a training-free general-purpose video synthesis framework, coined as BIVDiff, via bridging specific image diffusion models and general text-to-video foundation diffusion models. Specifically, we first use an image diffusion model (like ControlNet, Instruct Pix2Pix) for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally input the inverted latents into the video diffusion model for temporal smoothing. Decoupling image and video models enables flexible image model selection for different purposes, which endows the framework with strong task generalization and high efficiency. To validate the effectiveness and general use of BIVDiff, we perform a wide range of video generation tasks, including controllable video generation, video editing, video inpainting and outpainting. Our project page is available at https://bivdiff.github.io.

Title: Deterministic Guidance Diffusion Model for Probabilistic Weather Forecasting. (arXiv:2312.02819v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02819
Code URL: null
Copy Paste: [[2312.02819]] Deterministic Guidance Diffusion Model for Probabilistic Weather Forecasting(http://arxiv.org/abs/2312.02819)
Summary:
Weather forecasting requires not only accuracy but also the ability to perform probabilistic prediction. However, deterministic weather forecasting methods do not support probabilistic predictions, and conversely, probabilistic models tend to be less accurate. To address these challenges, in this paper, we introduce the Deterministic Guidance Diffusion Model (DGDM) for probabilistic weather forecasting, integrating benefits of both deterministic and probabilistic approaches. During the forward process, both the deterministic and probabilistic models are trained end-to-end. In the reverse process, weather forecasting leverages the predicted result from the deterministic model, using it as an intermediate starting point for the probabilistic model. By fusing deterministic models with probabilistic models in this manner, DGDM is capable of providing accurate forecasts while also offering probabilistic predictions. To evaluate DGDM, we assess it on the global weather forecasting dataset (WeatherBench) and the common video frame prediction benchmark (Moving MNIST). We also introduce and evaluate the Pacific Northwest Windstorm (PNW)-Typhoon weather satellite dataset to verify the effectiveness of DGDM in high-resolution regional forecasting. As a result of our experiments, DGDM achieves state-of-the-art results not only in global forecasting but also in regional forecasting. The code is available at: https://github.com/DongGeun-Yoon/DGDM.

self-supervised

Title: TailorMe: Self-Supervised Learning of an Anatomically Constrained Volumetric Human Shape Model. (arXiv:2312.02173v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02173
Code URL: null
Copy Paste: [[2312.02173]] TailorMe: Self-Supervised Learning of an Anatomically Constrained Volumetric Human Shape Model(http://arxiv.org/abs/2312.02173)
Summary:
Human shape spaces have been extensively studied, as they are a core element of human shape and pose inference tasks. Classic methods for creating a human shape model register a surface template mesh to a database of 3D scans and use dimensionality reduction techniques, such as Principal Component Analysis, to learn a compact representation. While these shape models enable global shape modifications by correlating anthropometric measurements with the learned subspace, they only provide limited localized shape control. We instead register a volumetric anatomical template, consisting of skeleton bones and soft tissue, to the surface scans of the CAESAR database. We further enlarge our training data to the full Cartesian product of all skeletons and all soft tissues using physically plausible volumetric deformation transfer. This data is then used to learn an anatomically constrained volumetric human shape model in a self-supervised fashion. The resulting TailorMe model enables shape sampling, localized shape manipulation, and fast inference from given surface scans.

Title: Local Masking Meets Progressive Freezing: Crafting Efficient Vision Transformers for Self-Supervised Learning. (arXiv:2312.02194v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02194
Code URL: https://github.com/utkutpcgl/vitfreeze
Copy Paste: [[2312.02194]] Local Masking Meets Progressive Freezing: Crafting Efficient Vision Transformers for Self-Supervised Learning(http://arxiv.org/abs/2312.02194)
Summary:
In this paper, we present an innovative approach to self-supervised learning for Vision Transformers (ViTs), integrating local masked image modeling with progressive layer freezing. This method focuses on enhancing the efficiency and speed of initial layer training in ViTs. By systematically freezing specific layers at strategic points during training, we reduce computational demands while maintaining or improving learning capabilities. Our approach employs a novel multi-scale reconstruction process that fosters efficient learning in initial layers and enhances semantic comprehension across scales. The results demonstrate a substantial reduction in training time (~12.5%) with a minimal impact on model accuracy (decrease in top-1 accuracy by 0.6%). Our method achieves top-1 and top-5 accuracies of 82.6% and 96.2%, respectively, underscoring its potential in scenarios where computational resources and time are critical. This work marks an advancement in the field of self-supervised learning for computer vision. The implementation of our approach is available at our project's GitHub repository: github.com/utkutpcgl/ViTFreeze.

Title: USat: A Unified Self-Supervised Encoder for Multi-Sensor Satellite Imagery. (arXiv:2312.02199v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02199
Code URL: https://github.com/stanfordmlgroup/usat
Copy Paste: [[2312.02199]] USat: A Unified Self-Supervised Encoder for Multi-Sensor Satellite Imagery(http://arxiv.org/abs/2312.02199)
Summary:
Large, self-supervised vision models have led to substantial advancements for automatically interpreting natural images. Recent works have begun tailoring these methods to remote sensing data which has rich structure with multi-sensor, multi-spectral, and temporal information providing massive amounts of self-labeled data that can be used for self-supervised pre-training. In this work, we develop a new encoder architecture called USat that can input multi-spectral data from multiple sensors for self-supervised pre-training. USat is a vision transformer with modified patch projection layers and positional encodings to model spectral bands with varying spatial scales from multiple sensors. We integrate USat into a Masked Autoencoder (MAE) self-supervised pre-training procedure and find that a pre-trained USat outperforms state-of-the-art self-supervised MAE models trained on remote sensing data on multiple remote sensing benchmark datasets (up to 8%) and leads to improvements in low data regimes (up to 7%). Code and pre-trained weights are available at https://github.com/stanfordmlgroup/USat .

Title: Disentangling the Effects of Data Augmentation and Format Transform in Self-Supervised Learning of Image Representations. (arXiv:2312.02205v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02205
Code URL: null
Copy Paste: [[2312.02205]] Disentangling the Effects of Data Augmentation and Format Transform in Self-Supervised Learning of Image Representations(http://arxiv.org/abs/2312.02205)
Summary:
Self-Supervised Learning (SSL) enables training performant models using limited labeled data. One of the pillars underlying vision SSL is the use of data augmentations/perturbations of the input which do not significantly alter its semantic content. For audio and other temporal signals, augmentations are commonly used alongside format transforms such as Fourier transforms or wavelet transforms. Unlike augmentations, format transforms do not change the information contained in the data; rather, they express the same information in different coordinates. In this paper, we study the effects of format transforms and augmentations both separately and together on vision SSL. We define augmentations in frequency space called Fourier Domain Augmentations (FDA) and show that training SSL models on a combination of these and image augmentations can improve the downstream classification accuracy by up to 1.3% on ImageNet-1K. We also show improvements against SSL baselines in few-shot and transfer learning setups using FDA. Surprisingly, we also observe that format transforms can improve the quality of learned representations even without augmentations; however, the combination of the two techniques yields better quality.
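
The distinction this summary draws between a format transform and an augmentation can be sketched in a few lines of pure Python. The discrete Fourier transform below is invertible, so the round trip recovers the signal, whereas a frequency-space edit changes it. The `mask_high_frequencies` function is a hypothetical stand-in chosen for illustration; the abstract does not specify how the paper's Fourier Domain Augmentations are defined.

```python
import cmath

def dft(signal):
    """Discrete Fourier transform: the same information in frequency coordinates."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT: maps the spectrum back to the original coordinates."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def mask_high_frequencies(spectrum, keep):
    """Hypothetical frequency-space augmentation: zero every bin above `keep`.

    The mask is symmetric in k and n - k, so a real input stays real.
    """
    n = len(spectrum)
    return [c if min(k, n - k) <= keep else 0j for k, c in enumerate(spectrum)]

signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 1.5]  # illustrative 1D signal
spectrum = dft(signal)

restored = [v.real for v in idft(spectrum)]                            # format transform only
augmented = [v.real for v in idft(mask_high_frequencies(spectrum, 1))]  # transform + augmentation

print(max(abs(a - b) for a, b in zip(signal, restored)))   # close to zero
print(max(abs(a - b) for a, b in zip(signal, augmented)))  # clearly nonzero
```

The same idea carries over to images with a 2D transform; the 1D case is used here only to keep the sketch dependency-free.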

Title: A Data-efficient Framework for Robotics Large-scale LiDAR Scene Parsing. (arXiv:2312.02208v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02208
Code URL: null
Copy Paste: [[2312.02208]] A Data-efficient Framework for Robotics Large-scale LiDAR Scene Parsing(http://arxiv.org/abs/2312.02208)
Summary:
Existing state-of-the-art 3D point cloud understanding methods only perform well in a fully supervised manner. To the best of our knowledge, there exists no unified framework which simultaneously solves the downstream high-level understanding tasks, especially when labels are extremely limited. This work presents a general and simple framework to tackle point cloud understanding when labels are limited. We propose a novel unsupervised region expansion based clustering method for generating clusters. More importantly, we innovatively propose to learn to merge the over-divided clusters based on the local low-level geometric property similarities and the learned high-level feature similarities supervised by weak labels. Hence, the true weak labels guide pseudo label merging, taking both geometric and semantic feature correlations into consideration. Finally, the self-supervised reconstruction and data augmentation optimization modules are proposed to guide the propagation of labels among semantically similar points within a scene. Experimental results demonstrate that our framework has the best performance among the three most important weakly supervised point cloud understanding tasks, including semantic segmentation, instance segmentation, and object detection, even when limited points are labeled, under the data-efficient settings for the large-scale 3D semantic scene parsing. The developed techniques have the potential to be applied to downstream tasks for better representations in robotic manipulation and robotic autonomous navigation. Codes and models are publicly available at: https://github.com/KangchengLiu.

Title: Class-Discriminative Attention Maps for Vision Transformers. (arXiv:2312.02364v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02364
Code URL: https://github.com/lenbrocki/cdam
Copy Paste: [[2312.02364]] Class-Discriminative Attention Maps for Vision Transformers(http://arxiv.org/abs/2312.02364)
Summary:
Interpretability methods are critical components for examining and exploring deep neural networks (DNN), as well as increasing our understanding of and trust in them. Vision transformers (ViT), which can be trained to state-of-the-art performance with a self-supervised learning (SSL) training method, provide built-in attention maps (AM). While AMs can provide high-quality semantic segmentation of input images, they do not account for any signal coming from a downstream classifier. We introduce class-discriminative attention maps (CDAM), a novel post-hoc explanation method that is highly sensitive to the target class. Our method essentially scales attention scores by how relevant the corresponding tokens are for the predictions of a classifier head. Alternative to classifier outputs, CDAM can also explain a user-defined concept by targeting similarity measures in the latent space of the ViT. This allows for explanations of arbitrary concepts, defined by the user through a few sample images. We investigate the operating characteristics of CDAM in comparison with relevance propagation (RP) and token ablation maps (TAM), an alternative to pixel occlusion methods. CDAM is highly class-discriminative and semantically relevant, while providing implicit regularization of relevance scores.

PyTorch implementation: https://github.com/lenbrocki/CDAM

Web live demo: https://cdam.informatism.com/

Title: AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation. (arXiv:2312.02512v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02512
Code URL: null
Copy Paste: [[2312.02512]] AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation(http://arxiv.org/abs/2312.02512)
Summary:
This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, where the input and output of the system are multimodal (i.e., audio and visual speech). With the proposed AV2AV, two key advantages can be brought: 1) We can perform real-like conversations with individuals worldwide in a virtual meeting by utilizing our own primary languages. In contrast to Speech-to-Speech Translation (A2A), which solely translates between audio modalities, the proposed AV2AV directly translates between audio-visual speech. This capability enhances the dialogue experience by presenting synchronized lip movements along with the translated speech. 2) We can improve the robustness of the spoken language translation system. By employing the complementary information of audio-visual speech, the system can effectively translate spoken language even in the presence of acoustic noise, showcasing robust performance. To mitigate the problem of the absence of a parallel AV2AV translation dataset, we propose to train our spoken language translation system with the audio-only dataset of A2A. This is done by learning unified audio-visual speech representations through self-supervised learning in advance to train the translation system. Moreover, we propose an AV-Renderer that can generate raw audio and video in parallel. It is designed with zero-shot speaker modeling, thus the speaker in source audio-visual speech can be maintained at the target translated audio-visual speech. The effectiveness of AV2AV is evaluated with extensive experiments in a many-to-many language translation setting. The demo page is available on https://choijeongsoo.github.io/av2av.

Title: Rethinking and Simplifying Bootstrapped Graph Latents. (arXiv:2312.02619v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02619
Code URL: null
Copy Paste: [[2312.02619]] Rethinking and Simplifying Bootstrapped Graph Latents(http://arxiv.org/abs/2312.02619)
Summary:
Graph contrastive learning (GCL) has emerged as a representative paradigm in graph self-supervised learning, where negative samples are commonly regarded as the key to preventing model collapse and producing distinguishable representations. Recent studies have shown that GCL without negative samples can achieve state-of-the-art performance as well as scalability improvement, with bootstrapped graph latent (BGRL) as a prominent step forward. However, BGRL relies on a complex architecture to maintain the ability to scatter representations, and the underlying mechanisms enabling the success remain largely unexplored. In this paper, we introduce an instance-level decorrelation perspective to tackle the aforementioned issue and leverage it as a springboard to reveal the potential unnecessary model complexity within BGRL. Based on our findings, we present SGCL, a simple yet effective GCL framework that utilizes the outputs from two consecutive iterations as positive pairs, eliminating the negative samples. SGCL only requires a single graph augmentation and a single graph encoder without additional parameters. Extensive experiments conducted on various graph benchmarks demonstrate that SGCL can achieve competitive performance with fewer parameters, lower time and space costs, and significant convergence speedup.

foundation model

Title: Towards General Purpose Vision Foundation Models for Medical Image Analysis: An Experimental Study of DINOv2 on Radiology Benchmarks. (arXiv:2312.02366v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02366
Code URL: https://github.com/mohammedsb/dinov2formedical
Copy Paste: [[2312.02366]] Towards General Purpose Vision Foundation Models for Medical Image Analysis: An Experimental Study of DINOv2 on Radiology Benchmarks(http://arxiv.org/abs/2312.02366)
Summary:
The integration of deep learning systems into the medical domain has been hindered by the resource-intensive process of data annotation and the inability of these systems to generalize to different data distributions. Foundation models, which are models pre-trained on large datasets, have emerged as a solution to reduce reliance on annotated data and enhance model generalizability and robustness. DINOv2, an open-source foundation model pre-trained with self-supervised learning on 142 million curated natural images, excels in extracting general-purpose visual representations, exhibiting promising capabilities across various vision tasks. Nevertheless, a critical question remains unanswered regarding DINOv2's adaptability to radiological imaging, and the clarity on whether its features are sufficiently general to benefit radiology image analysis is yet to be established. Therefore, this study comprehensively evaluates DINOv2 for radiology, conducting over 100 experiments across diverse modalities (X-ray, CT, and MRI). Tasks include disease classification and organ segmentation on both 2D and 3D images, evaluated under different settings like kNN, few-shot learning, linear-probing, end-to-end fine-tuning, and parameter-efficient fine-tuning, to measure the effectiveness and generalizability of the DINOv2 feature embeddings. Comparative analyses with established medical image analysis models, U-Net and TransUnet for segmentation, and CNN and ViT models pre-trained via supervised, weakly supervised, and self-supervised learning for classification, reveal DINOv2's superior performance in segmentation tasks and competitive results in disease classification. The findings contribute insights to potential avenues for optimizing pre-training strategies for medical imaging and enhancing the broader understanding of DINOv2's role in bridging the gap between natural and radiological image analysis.

generative

Title: The SVHN Dataset Is Deceptive for Probabilistic Generative Models Due to a Distribution Mismatch. (arXiv:2312.02168v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02168
Code URL: null
Copy Paste: [[2312.02168]] The SVHN Dataset Is Deceptive for Probabilistic Generative Models Due to a Distribution Mismatch(http://arxiv.org/abs/2312.02168)
Summary:
The Street View House Numbers (SVHN) dataset is a popular benchmark dataset in deep learning. Originally designed for digit classification tasks, the SVHN dataset has been widely used as a benchmark for various other tasks including generative modeling. However, with this work, we aim to warn the community about an issue of the SVHN dataset as a benchmark for generative modeling tasks: we discover that the official split into training set and test set of the SVHN dataset are not drawn from the same distribution. We empirically show that this distribution mismatch has little impact on the classification task (which may explain why this issue has not been detected before), but it severely affects the evaluation of probabilistic generative models, such as Variational Autoencoders and diffusion models. As a workaround, we propose to mix and re-split the official training and test set when SVHN is used for tasks other than classification. We publish a new split and the indices we used to create it at https://jzenn.github.io/svhn-remix/ .
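
The workaround described in this summary, mixing and re-splitting the official training and test sets, can be sketched as follows. This is an assumption-laden illustration, not the authors' procedure: the function name `remix_split`, the seed, and the toy sizes are hypothetical, and anyone comparing against published results should use the indices the authors release rather than regenerating a split.

```python
import random

def remix_split(n_train, n_test, seed=0):
    """Pool the official train and test indices, shuffle, and re-split to the original sizes.

    Each element is an (origin, index) pair so every sample can be traced back
    to its place in the official split.
    """
    pooled = [("train", i) for i in range(n_train)] + [("test", i) for i in range(n_test)]
    rng = random.Random(seed)  # fixed seed keeps the new split reproducible
    rng.shuffle(pooled)
    return pooled[:n_train], pooled[n_train:]

# Toy sizes for illustration; substitute the real set sizes when applying this.
new_train, new_test = remix_split(n_train=8, n_test=4, seed=0)

print(len(new_train), len(new_test))
print(sorted(origin for origin, _ in new_test))
```

Because both new subsets are drawn from the same shuffled pool, they follow the same distribution by construction, which is the property the official split is reported to lack.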

Title: InvertAvatar: Incremental GAN Inversion for Generalized Head Avatars. (arXiv:2312.02222v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02222
Code URL: null
Copy Paste: [[2312.02222]] InvertAvatar: Incremental GAN Inversion for Generalized Head Avatars(http://arxiv.org/abs/2312.02222)
Summary:
While high fidelity and efficiency are central to the creation of digital head avatars, recent methods relying on 2D or 3D generative models often experience limitations such as shape distortion, expression inaccuracy, and identity flickering. Additionally, existing one-shot inversion techniques fail to fully leverage multiple input images for detailed feature extraction. We propose a novel framework, Incremental 3D GAN Inversion, that enhances avatar reconstruction performance using an algorithm designed to increase the fidelity from multiple frames, resulting in improved reconstruction quality proportional to frame count. Our method introduces a unique animatable 3D GAN prior with two crucial modifications for enhanced expression controllability alongside an innovative neural texture encoder that categorizes texture feature spaces based on UV parameterization. Differentiating from traditional techniques, our architecture emphasizes pixel-aligned image-to-image translation, mitigating the need to learn correspondences between observation and canonical spaces. Furthermore, we incorporate ConvGRU-based recurrent networks for temporal data aggregation from multiple frames, boosting geometry and texture detail reconstruction. The proposed paradigm demonstrates state-of-the-art performance on one-shot and few-shot avatar animation tasks.

Title: Tracing Hyperparameter Dependencies for Model Parsing via Learnable Graph Pooling Network. (arXiv:2312.02224v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02224
Code URL: null
Copy Paste: [[2312.02224]] Tracing Hyperparameter Dependencies for Model Parsing via Learnable Graph Pooling Network(http://arxiv.org/abs/2312.02224)
Summary:
Model Parsing defines the research task of predicting hyperparameters of the generative model (GM), given a generated image as input. Since a diverse set of hyperparameters is jointly employed by the generative model, and dependencies often exist among them, it is crucial to learn these hyperparameter dependencies for improved model parsing performance. To explore such important dependencies, we propose a novel model parsing method called Learnable Graph Pooling Network (LGPN). Specifically, we transform model parsing into a graph node classification task, using graph nodes and edges to represent hyperparameters and their dependencies, respectively. Furthermore, LGPN incorporates a learnable pooling-unpooling mechanism tailored to model parsing, which adaptively learns hyperparameter dependencies of GMs used to generate the input image. We also extend our proposed method to CNN-generated image detection and coordinated attack detection. Empirically, we achieve state-of-the-art results in model parsing and its extended applications, showing the effectiveness of our method. Our source code is available.

Title: GenEM: Physics-Informed Generative Cryo-Electron Microscopy. (arXiv:2312.02235v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02235
Code URL: null
Copy Paste: [[2312.02235]] GenEM: Physics-Informed Generative Cryo-Electron Microscopy(http://arxiv.org/abs/2312.02235)
Summary:
In the past decade, deep conditional generative models have revolutionized the generation of realistic images, extending their application from entertainment to scientific domains. Single-particle cryo-electron microscopy (cryo-EM) is crucial in resolving near-atomic resolution 3D structures of proteins, such as the SARS-CoV-2 spike protein. To achieve high-resolution reconstruction, AI models for particle picking and pose estimation have been adopted. However, their performance is still limited as they lack high-quality annotated datasets. To address this, we introduce physics-informed generative cryo-electron microscopy (GenEM), which for the first time integrates physics-based cryo-EM simulation with a generative unpaired noise translation to generate physically correct synthetic cryo-EM datasets with realistic noise. Initially, GenEM simulates the cryo-EM imaging process based on a virtual specimen. To generate realistic noise, we leverage an unpaired noise translation via contrastive learning with a novel mask-guided sampling scheme. Extensive experiments show that GenEM is capable of generating realistic cryo-EM images. The generated dataset can further enhance particle picking and pose estimation models, eventually improving the reconstruction resolution. We will release our code and annotated synthetic datasets.

Title: PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation. (arXiv:2312.02284v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02284
Code URL: https://github.com/zhyever/PatchFusion
Copy Paste: [[2312.02284]] PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation(http://arxiv.org/abs/2312.02284)
Summary:
Single image depth estimation is a foundational task in computer vision and generative modeling. However, prevailing depth estimation models grapple with accommodating the increasing resolutions commonplace in today's consumer cameras and devices. Existing high-resolution strategies show promise, but they often face limitations, ranging from error propagation to the loss of high-frequency details. We present PatchFusion, a novel tile-based framework with three key components to improve the current state of the art: (1) A patch-wise fusion network that fuses a globally-consistent coarse prediction with finer, inconsistent tiled predictions via high-level feature guidance, (2) A Global-to-Local (G2L) module that adds vital context to the fusion network, discarding the need for patch selection heuristics, and (3) A Consistency-Aware Training (CAT) and Inference (CAI) approach, emphasizing patch overlap consistency and thereby eradicating the necessity for post-processing. Experiments on UnrealStereo4K, MVS-Synth, and Middlebury 2014 demonstrate that our framework can generate high-resolution depth maps with intricate details. PatchFusion is independent of the base model for depth estimation. Notably, our framework built on top of SOTA ZoeDepth brings total improvements of 17.3% and 29.4% in terms of the root mean squared error (RMSE) on UnrealStereo4K and MVS-Synth, respectively.

Title: How Generative-AI can be Effectively used in Government Chatbots. (arXiv:2312.02181v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02181
Code URL: null
Copy Paste: [[2312.02181]] How Generative-AI can be Effectively used in Government Chatbots(http://arxiv.org/abs/2312.02181)
Summary:
With the rapid development of artificial intelligence and breakthroughs in machine learning and natural language processing, intelligent question-answering robots have become widely used in government affairs. This paper conducts a horizontal comparison between Guangdong Province's government chatbots, ChatGPT, and Wenxin Ernie, two large language models, to analyze the strengths and weaknesses of existing government chatbots and AIGC technology. The study finds significant differences between government chatbots and large language models. China's government chatbots are still in an exploratory stage and have a gap to close to achieve "intelligence." To explore the future direction of government chatbots more deeply, this research proposes targeted optimization paths to help generative AI be effectively applied in government chatbot conversations.

Title: An Evaluation Framework for Mapping News Headlines to Event Classes in a Knowledge Graph. (arXiv:2312.02334v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02334
Code URL: https://github.com/mbouadeus/news-headline-event-linking
Copy Paste: [[2312.02334]] An Evaluation Framework for Mapping News Headlines to Event Classes in a Knowledge Graph(http://arxiv.org/abs/2312.02334)
Summary:
Mapping ongoing news headlines to event-related classes in a rich knowledge base can be an important component in a knowledge-based event analysis and forecasting solution. In this paper, we present a methodology for creating a benchmark dataset of news headlines mapped to event classes in Wikidata, and resources for the evaluation of methods that perform the mapping. We use the dataset to study two classes of unsupervised methods for this task: 1) adaptations of classic entity linking methods, and 2) methods that treat the problem as a zero-shot text classification problem. For the first approach, we evaluate off-the-shelf entity linking systems. For the second approach, we explore a) pre-trained natural language inference (NLI) models, and b) pre-trained large generative language models. We present the results of our evaluation, lessons learned, and directions for future work. The dataset and scripts for evaluation are made publicly available.

Title: Visually Grounded Language Learning: a review of language games, datasets, tasks, and models. (arXiv:2312.02431v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02431
Code URL: null
Copy Paste: [[2312.02431]] Visually Grounded Language Learning: a review of language games, datasets, tasks, and models(http://arxiv.org/abs/2312.02431)
Summary:
In recent years, several machine learning models have been proposed. They are trained with a language modelling objective on large-scale text-only data. With such pretraining, they can achieve impressive results on many Natural Language Understanding and Generation tasks. However, many facets of meaning cannot be learned by "listening to the radio" only. In the literature, many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality. In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field. We rely on Wittgenstein's idea of 'language games' to categorise such tasks into 3 different families: 1) discriminative games, 2) generative games, and 3) interactive games. Our analysis of the literature provides evidence that future work should be focusing on interactive games where communication in Natural Language is important to resolve ambiguities about object referents and action plans and that physical embodiment is essential to understand the semantics of situations and events. Overall, these represent key requirements for developing grounded meanings in neural models.

Title: MKA: A Scalable Medical Knowledge Assisted Mechanism for Generative Models on Medical Conversation Tasks. (arXiv:2312.02496v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.02496
Code URL: https://github.com/liangke23/knowledge_assisted_medical_dialogue_generation_mechanism
Copy Paste: [[2312.02496]] MKA: A Scalable Medical Knowledge Assisted Mechanism for Generative Models on Medical Conversation Tasks(http://arxiv.org/abs/2312.02496)
Summary:
Using natural language processing (NLP) technologies to develop medical chatbots makes the diagnosis of the patient more convenient and efficient, which is a typical application in healthcare AI. Because of its importance, much research has been published on the topic. Recently, neural generative models have shown impressive ability as the core of chatbots, but they do not scale well when directly applied to medical conversation due to the lack of medical-specific knowledge. To address this limitation, a scalable Medical Knowledge Assisted mechanism, MKA, is proposed in this paper. The mechanism aims to assist general neural generative models to achieve better performance on the medical conversation task. A medical-specific knowledge graph is designed within the mechanism, which contains 6 types of medical-related information: department, drug, check, symptom, disease, and food. In addition, a specific token concatenation policy is defined to effectively inject medical information into the input data. Evaluation of our method is carried out on two typical medical datasets, MedDG and MedDialog-CN. The evaluation results demonstrate that models combined with our mechanism outperform original methods in multiple automatic evaluation metrics. Moreover, MKA-Bert-GPT achieves state-of-the-art performance. The code is publicly available: https://github.com/LIANGKE23/Knowledge_Assisted_Medical_Dialogue_Generation_Mechanism

Title: H-GAP: Humanoid Control with a Generalist Planner. (arXiv:2312.02682v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02682
Code URL: null
Copy Paste: [[2312.02682]] H-GAP: Humanoid Control with a Generalist Planner(http://arxiv.org/abs/2312.02682)
Summary:
Humanoid control is an important research challenge offering avenues for integration into human-centric infrastructures and enabling physics-driven humanoid animations. The daunting challenges in this field stem from the difficulty of optimizing in high-dimensional action spaces and the instability introduced by the bipedal morphology of humanoids. However, the extensive collection of human motion-captured data and the derived datasets of humanoid trajectories, such as MoCapAct, pave the way to tackle these challenges. In this context, we present Humanoid Generalist Autoencoding Planner (H-GAP), a state-action trajectory generative model trained on humanoid trajectories derived from human motion-captured data, capable of adeptly handling downstream control tasks with Model Predictive Control (MPC). For a humanoid with 56 degrees of freedom, we empirically demonstrate that H-GAP learns to represent and generate a wide range of motor behaviours. Further, without any learning from online interactions, it can also flexibly transfer these behaviours to solve novel downstream control tasks via planning. Notably, H-GAP outperforms established MPC baselines that have access to the ground truth dynamics model, and is superior or comparable to offline RL methods trained for individual tasks. Finally, we conduct a series of empirical studies on the scaling properties of H-GAP, showing the potential for performance gains via additional data but not additional compute. Code and videos are available at https://ycxuyingchen.github.io/hgap/.

Title: Toward autocorrection of chemical process flowsheets using large language models. (arXiv:2312.02873v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02873
Code URL: null
Copy Paste: [[2312.02873]] Toward autocorrection of chemical process flowsheets using large language models(http://arxiv.org/abs/2312.02873)
Summary:
The process engineering domain widely uses Process Flow Diagrams (PFDs) and Process and Instrumentation Diagrams (P&IDs) to represent process flows and equipment configurations. However, the P&IDs and PFDs, hereafter called flowsheets, can contain errors causing safety hazards, inefficient operation, and unnecessary expenses. Correcting and verifying flowsheets is a tedious, manual process. We propose a novel generative AI methodology for automatically identifying errors in flowsheets and suggesting corrections to the user, i.e., autocorrecting flowsheets. Inspired by the breakthrough of Large Language Models (LLMs) for grammatical autocorrection of human language, we investigate LLMs for the autocorrection of flowsheets. The input to the model is a potentially erroneous flowsheet and the output of the model is a set of suggestions for a corrected flowsheet. We train our autocorrection model on a synthetic dataset in a supervised manner. The model achieves a top-1 accuracy of 80% and a top-5 accuracy of 84% on an independent test dataset of synthetically generated flowsheets. The results suggest that the model can learn to autocorrect the synthetic flowsheets. We envision that flowsheet autocorrection will become a useful tool for chemical engineers.
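
For readers unfamiliar with the top-1 and top-5 figures quoted above, the following is a minimal sketch of how top-k accuracy is commonly computed for a model that returns ranked suggestions. The function name `top_k_accuracy` and the string-based flowsheet placeholders are illustrative assumptions; the abstract does not describe the authors' evaluation code.

```python
def top_k_accuracy(ranked_suggestions, references, k):
    """Fraction of cases whose reference appears among the first k suggestions.

    ranked_suggestions: list of lists, best suggestion first (hypothetical format).
    references: list of expected corrections, one per case.
    """
    if len(ranked_suggestions) != len(references):
        raise ValueError("need one reference per list of suggestions")
    if not references:
        return 0.0
    hits = sum(
        1
        for suggestions, reference in zip(ranked_suggestions, references)
        if reference in suggestions[:k]
    )
    return hits / len(references)


# Toy data: three erroneous flowsheets, two ranked suggestions each.
suggestions = [["a", "b"], ["c", "d"], ["e", "f"]]
expected = ["a", "d", "x"]
top1 = top_k_accuracy(suggestions, expected, k=1)
top2 = top_k_accuracy(suggestions, expected, k=2)
```

Top-k accuracy can never decrease as k grows, which is why a top-5 score is at least as high as the top-1 score.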

anomaly

Title: A Unified Simulation Framework for Visual and Behavioral Fidelity in Crowd Analysis. (arXiv:2312.02613v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02613
Code URL: null
Copy Paste: [[2312.02613]] A Unified Simulation Framework for Visual and Behavioral Fidelity in Crowd Analysis(http://arxiv.org/abs/2312.02613)
Summary:
Simulation is a powerful tool to easily generate annotated data, and a highly desirable feature, especially in those domains where learning models need large training datasets. Machine learning and deep learning solutions have proven to be extremely data-hungry, and sometimes the available real-world data are not sufficient to effectively model the given task. Despite the initial skepticism of a portion of the scientific community, the potential of simulation has been largely confirmed in many application areas, and recent developments in rendering and virtualization engines have also shown a good ability to represent complex scenes. This includes environmental factors, such as weather conditions and surface reflectance, as well as human-related events, like human actions and behaviors. We present a human crowd simulator, called UniCrowd, and its associated validation pipeline. We show how the simulator can generate annotated data suitable for computer vision tasks, in particular detection and segmentation, as well as related applications such as crowd counting, human pose estimation, trajectory analysis and prediction, and anomaly detection.

Title: Pseudo Replay-based Class Continual Learning for Online New Category Anomaly Detection in Additive Manufacturing. (arXiv:2312.02491v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02491
Code URL: null
Copy Paste: [[2312.02491]] Pseudo Replay-based Class Continual Learning for Online New Category Anomaly Detection in Additive Manufacturing(http://arxiv.org/abs/2312.02491)
Summary:
The incorporation of advanced sensors and machine learning techniques has enabled modern manufacturing enterprises to perform data-driven in-situ quality monitoring based on the sensor data collected in manufacturing processes. However, one critical challenge is that new defect categories may emerge as the manufacturing process continues, degrading the monitoring performance of previously trained machine learning models. Hence, there is an increasing need to enable machine learning models to learn continually. Among all continual learning methods, memory-based continual learning has the best performance but faces the constraints of data storage capacity. To address this issue, this paper develops a novel pseudo replay-based continual learning method by integrating class incremental learning and oversampling-based data generation. Without storing all the data, the developed framework can generate high-quality data representing previous classes to train the machine learning model incrementally when a new category of anomaly occurs. In addition, it can even enhance the monitoring performance since it also effectively improves the data quality. The effectiveness of the proposed framework is validated in an additive manufacturing process, which formulates anomaly detection as a supervised classification problem. The experimental results show that the developed method is very promising in detecting novel anomalies while maintaining good performance on the previous task, and that it offers more flexibility in model architecture.

Title: MEMTO: Memory-guided Transformer for Multivariate Time Series Anomaly Detection. (arXiv:2312.02530v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02530
Code URL: null
Copy Paste: [[2312.02530]] MEMTO: Memory-guided Transformer for Multivariate Time Series Anomaly Detection(http://arxiv.org/abs/2312.02530)
Summary:
Detecting anomalies in real-world multivariate time series data is challenging due to complex temporal dependencies and inter-variable correlations. Recently, reconstruction-based deep models have been widely used to solve the problem. However, these methods still suffer from an over-generalization issue and fail to deliver consistently high performance. To address this issue, we propose MEMTO, a memory-guided Transformer using a reconstruction-based approach. It is designed to incorporate a novel memory module that can learn the degree to which each memory item should be updated in response to the input data. To stabilize the training procedure, we use a two-phase training paradigm which involves using K-means clustering for initializing memory items. Additionally, we introduce a bi-dimensional deviation-based detection criterion that calculates anomaly scores considering both input space and latent space. We evaluate our proposed method on five real-world datasets from diverse domains, and it achieves an average anomaly detection F1-score of 95.74%, significantly outperforming the previous state-of-the-art methods. We also conduct extensive experiments to empirically validate the effectiveness of our proposed model's key components.

Title: A Self-Commissioning Edge Computing Method for Data-Driven Anomaly Detection in Power Electronic Systems. (arXiv:2312.02661v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02661
Code URL: null
Copy Paste: [[2312.02661]] A Self-Commissioning Edge Computing Method for Data-Driven Anomaly Detection in Power Electronic Systems(http://arxiv.org/abs/2312.02661)
Summary:
Ensuring the reliability of power electronic converters is a matter of great importance, and data-driven condition monitoring techniques are cementing themselves as an important tool for this purpose. However, translating methods that work well in controlled lab environments to field applications presents significant challenges, notably because of the limited diversity and accuracy of the lab training data. By enabling the use of field data, online machine learning can be a powerful tool to overcome this problem, but it introduces additional challenges in ensuring the stability and predictability of the training processes. This work presents an edge computing method that mitigates these shortcomings with minimal additional memory usage, by employing an autonomous algorithm that prioritizes the storage of training samples with larger prediction errors. The method is demonstrated on the use case of a self-commissioning condition monitoring system, in the form of a thermal anomaly detection scheme for a variable frequency motor drive, where the algorithm self-learned to distinguish normal and anomalous operation with minimal prior knowledge. The obtained results, based on experimental data, show a significant improvement in prediction accuracy and training speed, when compared to equivalent models trained online without the proposed data selection process.
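
The sample-selection idea in this abstract, keeping the training samples with larger prediction errors within a small memory budget, can be sketched as a fixed-capacity buffer. The class name `ErrorPriorityBuffer`, the `capacity` parameter, and the min-heap eviction rule are illustrative assumptions; the abstract does not specify the authors' actual algorithm.

```python
import heapq
import itertools


class ErrorPriorityBuffer:
    """Fixed-size store that keeps the samples with the largest prediction errors.

    Hypothetical sketch: the eviction rule here (drop the smallest-error
    sample) is an assumption, not the method from the paper.
    """

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._heap = []  # min-heap of (error, tie_breaker, sample)
        self._counter = itertools.count()  # avoids comparing samples on equal errors

    def offer(self, sample, error):
        """Try to store a sample; return True if it was kept."""
        entry = (error, next(self._counter), sample)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return True
        if error > self._heap[0][0]:
            # Replace the stored sample that currently has the smallest error.
            heapq.heapreplace(self._heap, entry)
            return True
        return False

    def samples(self):
        """Return stored samples, largest error first."""
        return [sample for _, _, sample in sorted(self._heap, reverse=True)]


buffer = ErrorPriorityBuffer(capacity=2)
kept = [
    buffer.offer("a", 0.1),
    buffer.offer("b", 0.5),
    buffer.offer("c", 0.05),
    buffer.offer("d", 0.9),
]
```

A min-heap keeps each insertion at O(log capacity) and the memory footprint constant, which suits the edge-computing setting the abstract describes.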

Title: Semi-Supervised Health Index Monitoring with Feature Generation and Fusion. (arXiv:2312.02867v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02867
Code URL: null
Copy Paste: [[2312.02867]] Semi-Supervised Health Index Monitoring with Feature Generation and Fusion(http://arxiv.org/abs/2312.02867)
Summary:
The Health Index (HI) is crucial for evaluating system health, aiding tasks like anomaly detection and predicting remaining useful life for systems demanding high safety and reliability. Tight monitoring is crucial for achieving high precision at a lower cost, with applications such as spray coating. Obtaining HI labels in real-world applications is often cost-prohibitive, requiring continuous, precise health measurements. Therefore, it is more convenient to leverage run-to-failure datasets that may provide potential indications of machine wear condition, making it necessary to apply semi-supervised tools for HI construction. In this study, we adapt the Deep Semi-supervised Anomaly Detection (DeepSAD) method for HI construction. We use the DeepSAD embedding as condition indicators to address interpretability challenges and sensitivity to system-specific factors. Then, we introduce a diversity loss to enrich condition indicators. We employ an alternating projection algorithm with isotonic constraints to transform the DeepSAD embedding into a normalized HI with an increasing trend. Validation on the PHME 2010 milling dataset, a recognized benchmark with ground truth HIs, demonstrates meaningful HI estimations. Our methodology is then applied to monitor wear states of thermal spray coatings using high-frequency voltage. Our contributions create opportunities for more accessible and reliable HI estimation, particularly in cases where obtaining ground truth HI labels is unfeasible.
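
To illustrate what an isotonic constraint does to a noisy indicator, the sketch below fits a non-decreasing sequence with the standard pool-adjacent-violators procedure and rescales it to [0, 1]. This is a simplified stand-in, not the alternating projection algorithm from the paper; the function names and the toy indicator values are assumptions.

```python
def isotonic_increasing(values):
    """Least-squares non-decreasing fit via pool-adjacent-violators."""
    blocks = []  # each block is [sum, count]
    for value in values:
        blocks.append([float(value), 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def normalize_01(values):
    """Rescale a non-empty sequence to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


raw_indicator = [1, 3, 2, 4]  # toy condition indicator with one violation
monotone = isotonic_increasing(raw_indicator)
health_index = normalize_01(monotone)
```

The violating pair (3, 2) is pooled into its mean, so the result rises monotonically as a wear-related index is expected to.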

in-context

Title: Towards More Unified In-context Visual Understanding. (arXiv:2312.02520v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02520
Code URL: null
Copy Paste: [[2312.02520]] Towards More Unified In-context Visual Understanding(http://arxiv.org/abs/2312.02520)
Summary:
The rapid advancement of large language models (LLMs) has accelerated the emergence of in-context learning (ICL) as a cutting-edge approach in the natural language processing domain. Recently, ICL has been employed in visual understanding tasks, such as semantic segmentation and image captioning, yielding promising results. However, existing visual ICL frameworks cannot produce content across multiple modalities, which limits their potential usage scenarios. To address this issue, we present a new ICL framework for visual understanding with multi-modal output enabled. First, we quantize and embed both text and visual prompts into a unified representational space, structured as interleaved in-context sequences. Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them, facilitating in-context learning. Thanks to this design, the model is capable of handling in-context vision understanding tasks with multimodal output in a unified pipeline. Experimental results demonstrate that our model achieves competitive performance compared with specialized models and previous ICL baselines. Overall, our research takes a further step toward unified multimodal in-context learning.

Title: Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning. (arXiv:2312.02546v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.02546
Code URL: https://github.com/tmllab/machine_vision_therapy
Copy Paste: [[2312.02546]] Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning(http://arxiv.org/abs/2312.02546)
Summary:
Although vision models such as Contrastive Language-Image Pre-Training (CLIP) show impressive generalization performance, their zero-shot robustness is still limited under Out-of-Distribution (OOD) scenarios without fine-tuning. Instead of undesirably providing human supervision as commonly done, it is possible to take advantage of Multi-modal Large Language Models (MLLMs) that hold powerful visual understanding abilities. However, MLLMs are shown to struggle with vision problems due to the incompatibility of tasks, thus hindering their utilization. In this paper, we propose to effectively leverage MLLMs to conduct Machine Vision Therapy, which aims to rectify the noisy predictions from vision models. By fine-tuning with the denoised labels, the learning model performance can be boosted in an unsupervised manner. To solve the incompatibility issue, we propose a novel Denoising In-Context Learning (DICL) strategy to align vision tasks with MLLMs. Concretely, by estimating a transition matrix that captures the probability of one class being confused with another, an instruction containing a correct exemplar and an erroneous one from the most probable noisy class can be constructed. Such an instruction can help any MLLM with ICL ability to detect and rectify incorrect predictions of vision models. Through extensive experiments on ImageNet, WILDS, DomainBed, and other OOD datasets, we carefully validate the quantitative and qualitative effectiveness of our method. Our code is available at https://github.com/tmllab/Machine_Vision_Therapy.
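
As a rough illustration of the transition matrix mentioned above, the sketch below row-normalizes confusion counts from paired true and predicted labels and looks up the class a given class is most often confused with. Estimating the matrix from a small labeled set is an assumption made here for simplicity; the abstract does not say how the authors estimate it, and the function names are hypothetical.

```python
def estimate_transition_matrix(true_labels, predicted_labels, num_classes):
    """Row i holds the empirical distribution of predictions for true class i."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for true_cls, pred_cls in zip(true_labels, predicted_labels):
        counts[true_cls][pred_cls] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Leave rows for unseen classes as all zeros instead of dividing by zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def most_confused_class(matrix, cls):
    """Return the class other than `cls` that `cls` is most often predicted as.

    Returns None when `cls` is never confused with another class.
    """
    candidates = [
        (prob, other)
        for other, prob in enumerate(matrix[cls])
        if other != cls and prob > 0
    ]
    if not candidates:
        return None
    return max(candidates)[1]


true_labels = [0, 0, 0, 1, 1, 2]
predicted_labels = [0, 1, 1, 1, 1, 0]
transition = estimate_transition_matrix(true_labels, predicted_labels, num_classes=3)
```

In a DICL-style prompt, the class returned by `most_confused_class` would be the natural source for the erroneous exemplar paired with a correct one.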

Title: Prompt Optimization via Adversarial In-Context Learning. (arXiv:2312.02614v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.02614
Code URL: null
Copy Paste: [[2312.02614]] Prompt Optimization via Adversarial In-Context Learning(http://arxiv.org/abs/2312.02614)
Summary:
We propose a new method, Adversarial In-Context Learning (adv-ICL), to optimize prompts for in-context learning (ICL) by employing one LLM as a generator, another as a discriminator, and a third as a prompt modifier. As in traditional adversarial learning, adv-ICL is implemented as a two-player game between the generator and discriminator, where the generator tries to generate realistic enough output to fool the discriminator. In each round, given an input prefixed by task instructions and several exemplars, the generator produces an output. The discriminator is then tasked with classifying the generator input-output pair as model-generated or real data. Based on the discriminator loss, the prompt modifier proposes possible edits to the generator and discriminator prompts, and the edits that most improve the adversarial loss are selected. We show that adv-ICL results in significant improvements over state-of-the-art prompt optimization techniques for both open and closed-source models on 11 generation and classification tasks including summarization, arithmetic reasoning, machine translation, data-to-text generation, and the MMLU and BIG-Bench Hard benchmarks. In addition, because our method uses pre-trained models and updates only prompts rather than model parameters, it is computationally efficient, easy to extend to any LLM and task, and effective in low-resource settings.