diffusion

Title: Gaussian3Diff: 3D Gaussian Diffusion for 3D Full Head Synthesis and Editing. (arXiv:2312.03763v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03763
Code URL: null
Copy Paste: [[2312.03763]] Gaussian3Diff: 3D Gaussian Diffusion for 3D Full Head Synthesis and Editing(http://arxiv.org/abs/2312.03763)
Summary:
We present a novel framework for generating photorealistic 3D human heads and subsequently manipulating and reposing them with remarkable flexibility. The proposed approach leverages an implicit function representation of 3D human heads, employing 3D Gaussians anchored on a parametric face model. To enhance representational capabilities and encode spatial information, we embed a lightweight tri-plane payload within each Gaussian rather than directly storing color and opacity. Additionally, we parameterize the Gaussians in a 2D UV space via a 3DMM, enabling effective utilization of the diffusion model for 3D head avatar generation. Our method facilitates the creation of diverse and realistic 3D human heads with fine-grained editing over facial features and expressions. Extensive experiments demonstrate the effectiveness of our method.

Title: DreamInpainter: Text-Guided Subject-Driven Image Inpainting with Diffusion Models. (arXiv:2312.03771v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03771
Code URL: null
Copy Paste: [[2312.03771]] DreamInpainter: Text-Guided Subject-Driven Image Inpainting with Diffusion Models(http://arxiv.org/abs/2312.03771)
Summary:
This study introduces Text-Guided Subject-Driven Image Inpainting, a novel task that combines text and exemplar images for image inpainting. While both text and exemplar images have been used independently in previous efforts, their combined utilization remains unexplored. Simultaneously accommodating both conditions poses a significant challenge due to the inherent balance required between editability and subject fidelity. To tackle this challenge, we propose a two-step approach DreamInpainter. First, we compute dense subject features to ensure accurate subject replication. Then, we employ a discriminative token selection module to eliminate redundant subject details, preserving the subject's identity while allowing changes according to other conditions such as mask shape and text prompts. Additionally, we introduce a decoupling regularization technique to enhance text control in the presence of exemplar images. Our extensive experiments demonstrate the superior performance of our method in terms of visual quality, identity preservation, and text control, showcasing its effectiveness in the context of text-guided subject-driven image inpainting.

Title: DiffusionAtlas: High-Fidelity Consistent Diffusion Video Editing. (arXiv:2312.03772v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03772
Code URL: null
Copy Paste: [[2312.03772]] DiffusionAtlas: High-Fidelity Consistent Diffusion Video Editing(http://arxiv.org/abs/2312.03772)
Summary:
We present a diffusion-based video editing framework, namely DiffusionAtlas, which can achieve both frame consistency and high fidelity in editing video object appearance. Despite the success in image editing, diffusion models still encounter significant hindrances when it comes to video editing due to the challenge of maintaining spatiotemporal consistency in the object's appearance across frames. On the other hand, atlas-based techniques allow propagating edits on the layered representations consistently back to frames. However, they often struggle to create editing effects that adhere correctly to the user-provided textual or visual conditions due to the limitation of editing the texture atlas on a fixed UV mapping field. Our method leverages a visual-textual diffusion model to edit objects directly on the diffusion atlases, ensuring coherent object identity across frames. We design a loss term with atlas-based constraints and build a pretrained text-driven diffusion model as pixel-wise guidance for refining shape distortions and correcting texture deviations. Qualitative and quantitative experiments show that our method outperforms state-of-the-art methods in achieving consistent high-fidelity video-object editing.

Title: FAAC: Facial Animation Generation with Anchor Frame and Conditional Control for Superior Fidelity and Editability. (arXiv:2312.03775v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03775
Code URL: null
Copy Paste: [[2312.03775]] FAAC: Facial Animation Generation with Anchor Frame and Conditional Control for Superior Fidelity and Editability(http://arxiv.org/abs/2312.03775)
Summary:
Over recent years, diffusion models have facilitated significant advancements in video generation. Yet, the creation of face-related videos still confronts issues such as low facial fidelity, lack of frame consistency, limited editability and uncontrollable human poses. To address these challenges, we introduce a facial animation generation method that enhances both face identity fidelity and editing capabilities while ensuring frame consistency. This approach incorporates the concept of an anchor frame to counteract the degradation of generative ability in original text-to-image models when incorporating a motion module. We propose two strategies towards this objective: training-free and training-based anchor frame methods. Our method's efficacy has been validated on multiple representative DreamBooth and LoRA models, delivering substantial improvements over the original outcomes in terms of facial fidelity, text-to-image editability, and video motion. Moreover, we introduce conditional control using a 3D parametric face model to capture accurate facial movements and expressions. This solution augments the creative possibilities for facial animation generation through the integration of multiple control signals. For additional samples, please visit https://anonymous.4open.science/r/FAAC.

Title: AnimateZero: Video Diffusion Models are Zero-Shot Image Animators. (arXiv:2312.03793v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03793
Code URL: https://github.com/vvictoryuki/animatezero
Copy Paste: [[2312.03793]] AnimateZero: Video Diffusion Models are Zero-Shot Image Animators(http://arxiv.org/abs/2312.03793)
Summary:
Large-scale text-to-video (T2V) diffusion models have made great progress in recent years in terms of visual quality, motion and temporal consistency. However, the generation process is still a black box, where all attributes (e.g., appearance, motion) are learned and generated jointly without precise control ability other than rough text descriptions. Inspired by image animation, which decouples the video into one specific appearance with the corresponding motion, we propose AnimateZero to unveil the pre-trained text-to-video diffusion model, i.e., AnimateDiff, and provide more precise appearance and motion control abilities for it. For appearance control, we borrow intermediate latents and their features from the text-to-image (T2I) generation to ensure that the generated first frame is equal to the given generated image. For temporal control, we replace the global temporal attention of the original T2V model with our proposed positional-corrected window attention to ensure that other frames align well with the first frame. Empowered by the proposed methods, AnimateZero can successfully control the generation process without further training. As a zero-shot image animator for given images, AnimateZero also enables multiple new applications, including interactive video generation and real image animation. The detailed experiments demonstrate the effectiveness of the proposed method in both T2V and related applications.

Title: AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and Reconstruction with Canonical Score Distillation. (arXiv:2312.03795v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03795
Code URL: null
Copy Paste: [[2312.03795]] AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and Reconstruction with Canonical Score Distillation(http://arxiv.org/abs/2312.03795)
Summary:
Text-to-3D model adaptations have advanced static 3D model quality, but sequential 3D model generation, particularly for animatable objects with large motions, is still scarce. Our work proposes AnimatableDreamer, a text-to-4D generation framework capable of generating diverse categories of non-rigid objects while adhering to the object motions extracted from a monocular video. At its core, AnimatableDreamer is equipped with our novel optimization design dubbed Canonical Score Distillation (CSD), which simplifies the generation dimension from 4D to 3D by denoising over different frames in the time-varying camera spaces while conducting the distillation process in a unique canonical space shared per video. Concretely, CSD ensures that score gradients back-propagate to the canonical space through differentiable warping, hence guaranteeing the time-consistent generation and maintaining morphological plausibility across different poses. By lifting the 3D generator to 4D with warping functions, AnimatableDreamer offers a novel perspective on non-rigid 3D model generation and reconstruction. Besides, with inductive knowledge from a multi-view consistent diffusion model, CSD regularizes reconstruction from novel views, thus cyclically enhancing the generation process. Extensive experiments demonstrate the capability of our method in generating high-flexibility text-guided 3D models from the monocular video, while also showing improved reconstruction performance over typical non-rigid reconstruction methods. Project page https://AnimatableDreamer.github.io.

Title: AVID: Any-Length Video Inpainting with Diffusion Model. (arXiv:2312.03816v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03816
Code URL: null
Copy Paste: [[2312.03816]] AVID: Any-Length Video Inpainting with Diffusion Model(http://arxiv.org/abs/2312.03816)
Summary:
Recent advances in diffusion models have successfully enabled text-guided image inpainting. While it seems straightforward to extend such editing capability into the video domain, there have been fewer works regarding text-guided video inpainting. Given a video, a masked region at its initial frame, and an editing prompt, it requires a model to do infilling at each frame following the editing guidance while keeping the out-of-mask region intact. There are three main challenges in text-guided video inpainting: (i) temporal consistency of the edited video, (ii) supporting different inpainting types at different structural fidelity levels, and (iii) dealing with variable video length. To address these challenges, we introduce Any-Length Video Inpainting with Diffusion Model, dubbed AVID. At its core, our model is equipped with effective motion modules and adjustable structure guidance, for fixed-length video inpainting. Building on top of that, we propose a novel Temporal MultiDiffusion sampling pipeline with a middle-frame attention guidance mechanism, facilitating the generation of videos with any desired duration. Our comprehensive experiments show our model can robustly deal with various inpainting types at different video duration ranges, with high quality. More visualization results are made publicly available at https://zhang-zx.github.io/AVID/ .
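
The idea behind sampling a long video with a fixed-length model can be sketched as follows. This is a minimal illustration only, assuming that each denoising step is run on overlapping temporal windows and that frames covered by several windows take the average of the per-window predictions. The function names, the scalar "latents", and the `denoise_window` callback are hypothetical and are not taken from the AVID paper; the middle-frame attention guidance is not modeled.

```python
def sliding_windows(num_frames, window, stride):
    """Return start indices of overlapping windows that cover every frame."""
    if num_frames <= window:
        return [0]
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        # Add a final window flush with the end so no frame is left out.
        starts.append(num_frames - window)
    return starts


def fuse_windows(latents, window, stride, denoise_window):
    """Run one fused step: denoise each window, then average overlaps per frame.

    `latents` is a list with one (toy, scalar) latent per frame.
    `denoise_window` is a stand-in for a fixed-length video model.
    """
    num_frames = len(latents)
    totals = [0.0] * num_frames
    counts = [0] * num_frames
    for start in sliding_windows(num_frames, window, stride):
        end = min(start + window, num_frames)
        prediction = denoise_window(latents[start:end])
        for offset, value in enumerate(prediction):
            totals[start + offset] += value
            counts[start + offset] += 1
    return [total / count for total, count in zip(totals, counts)]


if __name__ == "__main__":
    frames = [float(i) for i in range(11)]
    # A toy "model" that halves every latent in its window.
    print(fuse_windows(frames, window=4, stride=3,
                       denoise_window=lambda w: [0.5 * v for v in w]))
```

Averaging in the overlap is what ties neighbouring windows together; a smaller stride gives more overlap at the cost of more model calls per step.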

Title: Diffusion Illusions: Hiding Images in Plain Sight. (arXiv:2312.03817v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03817
Code URL: null
Copy Paste: [[2312.03817]] Diffusion Illusions: Hiding Images in Plain Sight(http://arxiv.org/abs/2312.03817)
Summary:
We explore the problem of computationally generating special 'prime' images that produce optical illusions when physically arranged and viewed in a certain way. First, we propose a formal definition for this problem. Next, we introduce Diffusion Illusions, the first comprehensive pipeline designed to automatically generate a wide range of these illusions. Specifically, we both adapt the existing 'score distillation loss' and propose a new 'dream target loss' to optimize a group of differentially parametrized prime images, using a frozen text-to-image diffusion model. We study three types of illusions, each where the prime images are arranged in different ways and optimized using the aforementioned losses such that images derived from them align with user-chosen text prompts or images. We conduct comprehensive experiments on these illusions and verify the effectiveness of our proposed method qualitatively and quantitatively. Additionally, we showcase the successful physical fabrication of our illusions, as they are all designed to work in the real world. Our code and examples are publicly available at our interactive project website: https://diffusionillusions.com

Title: LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning. (arXiv:2312.03849v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03849
Code URL: null
Copy Paste: [[2312.03849]] LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning(http://arxiv.org/abs/2312.03849)
Summary:
Generating instructional images of human daily actions from an egocentric viewpoint serves as a key step towards efficient skill transfer. In this paper, we introduce a novel problem: egocentric action frame generation. The goal is to synthesize the action frame conditioned on the user prompt question and an input egocentric image that captures the user's environment. Notably, existing egocentric datasets lack the detailed annotations that describe the execution of actions. Additionally, the diffusion-based image manipulation models fail to control the state change of an action within the corresponding egocentric image pixel space. To this end, we finetune a visual large language model (VLLM) via visual instruction tuning for curating the enriched action descriptions to address our proposed problem. Moreover, we propose to Learn EGOcentric (LEGO) action frame generation using image and text embeddings from the VLLM as additional conditioning. We validate our proposed model on two egocentric datasets, Ego4D and Epic-Kitchens. Our experiments show prominent improvement over prior image manipulation models in both quantitative and qualitative evaluation. We also conduct detailed ablation studies and analysis to provide insights on our method.

Title: Inpaint3D: 3D Scene Content Generation using 2D Inpainting Diffusion. (arXiv:2312.03869v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03869
Code URL: null
Copy Paste: [[2312.03869]] Inpaint3D: 3D Scene Content Generation using 2D Inpainting Diffusion(http://arxiv.org/abs/2312.03869)
Summary:
This paper presents a novel approach to inpainting 3D regions of a scene, given masked multi-view images, by distilling a 2D diffusion model into a learned 3D scene representation (e.g. a NeRF). Unlike 3D generative methods that explicitly condition the diffusion model on camera pose or multi-view information, our diffusion model is conditioned only on a single masked 2D image. Nevertheless, we show that this 2D diffusion model can still serve as a generative prior in a 3D multi-view reconstruction problem where we optimize a NeRF using a combination of score distillation sampling and NeRF reconstruction losses. Predicted depth is used as additional supervision to encourage accurate geometry. We compare our approach to 3D inpainting methods that focus on object removal. Because our method can generate content to fill any 3D masked region, we additionally demonstrate 3D object completion, 3D object replacement, and 3D scene completion.
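
The score distillation sampling (SDS) signal mentioned above can be illustrated with a toy example. This is a hedged sketch of the generic SDS update only, in which the gradient with respect to the rendered image is taken as `weight * (eps_hat - eps)`. The "image" is a short list of numbers optimized directly (no NeRF, no renderer), and `toy_eps_model` is a hypothetical stand-in for a diffusion model whose prior is concentrated on a single target. None of these names come from the Inpaint3D paper.

```python
import math
import random


def toy_eps_model(target):
    """Hypothetical noise predictor for a prior concentrated at `target`."""
    def model(x_t, alpha_bar):
        a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
        # If the clean image were exactly `target`, this is the noise in x_t.
        return [(xt - a * m) / s for xt, m in zip(x_t, target)]
    return model


def sds_grad(x, eps_model, alpha_bar, rng, weight=1.0):
    """Generic SDS gradient w.r.t. the image: weight * (eps_hat - eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_t = [a * xi + s * e for xi, e in zip(x, eps)]  # forward-noised image
    eps_hat = eps_model(x_t, alpha_bar)
    return [weight * (eh - e) for eh, e in zip(eps_hat, eps)]


def optimize(x, eps_model, steps=200, lr=0.1, alpha_bar=0.5, seed=0):
    """Plain gradient descent on the image using the SDS gradient."""
    rng = random.Random(seed)
    x = list(x)
    for _ in range(steps):
        grad = sds_grad(x, eps_model, alpha_bar, rng)
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return x


if __name__ == "__main__":
    target = [1.0, -2.0, 0.5]
    print(optimize([0.0, 0.0, 0.0], toy_eps_model(target)))
```

In a real pipeline the gradient would be back-propagated through a differentiable renderer into the 3D representation and combined with reconstruction losses, as the abstract describes.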

Title: Controllable Human-Object Interaction Synthesis. (arXiv:2312.03913v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03913
Code URL: null
Copy Paste: [[2312.03913]] Controllable Human-Object Interaction Synthesis(http://arxiv.org/abs/2312.03913)
Summary:
Synthesizing semantic-aware, long-horizon, human-object interaction is critical to simulate realistic human behaviors. In this work, we address the challenging problem of generating synchronized object motion and human motion guided by language descriptions in 3D scenes. We propose Controllable Human-Object Interaction Synthesis (CHOIS), an approach that generates object motion and human motion simultaneously using a conditional diffusion model given a language description, initial object and human states, and sparse object waypoints. While language descriptions inform style and intent, waypoints ground the motion in the scene and can be effectively extracted using high-level planning methods. Naively applying a diffusion model fails to predict object motion aligned with the input waypoints and cannot ensure the realism of interactions that require precise hand-object contact and appropriate contact grounded by the floor. To overcome these problems, we introduce an object geometry loss as additional supervision to improve the matching between generated object motion and input object waypoints. In addition, we design guidance terms to enforce contact constraints during the sampling process of the trained diffusion model.

Title: Adapting HouseDiffusion for conditional Floor Plan generation on Modified Swiss Dwellings dataset. (arXiv:2312.03938v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03938
Code URL: null
Copy Paste: [[2312.03938]] Adapting HouseDiffusion for conditional Floor Plan generation on Modified Swiss Dwellings dataset(http://arxiv.org/abs/2312.03938)
Summary:
Automated floor plan generation has recently gained momentum, with several methods proposed. The CVAAD Floor Plan Auto-Completion workshop challenge introduced MSD, a new dataset that includes existing structural walls of the building as an additional input constraint. This technical report presents an approach for extending a recent work, HouseDiffusion (arXiv:2211.13287 [cs.CV]), to the MSD dataset. The adaptation involves modifying the model's transformer layers to condition on a set of wall lines. The report introduces a pre-processing pipeline to extract wall lines from the binary mask of the building structure provided as input. Additionally, it was found that a data processing procedure that simplifies all room polygons to rectangles leads to better performance. This indicates that future work should explore better representations of variable-length polygons in diffusion models. The code will be made available at a later date.
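
One simple way to simplify a room polygon to a rectangle is to take its axis-aligned bounding box. The report does not state which simplification it uses, so the sketch below is an assumption for illustration, and `polygon_to_rectangle` is a hypothetical helper name.

```python
def polygon_to_rectangle(polygon):
    """Replace a polygon (list of (x, y) vertices) by its axis-aligned bounding box.

    Returns four corners in counter-clockwise order starting at the
    minimum-x, minimum-y corner. This is an assumed simplification,
    not necessarily the procedure used in the report.
    """
    if not polygon:
        raise ValueError("polygon must contain at least one vertex")
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    return [(x_min, y_min), (x_max, y_min), (x_max, y_max), (x_min, y_max)]


if __name__ == "__main__":
    # An L-shaped room with six vertices becomes a four-vertex rectangle.
    l_shaped_room = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 3), (0, 3)]
    print(polygon_to_rectangle(l_shaped_room))
```

Every room then has exactly four vertices, which removes the variable-length polygon problem the report points to.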

Title: Style Transfer to Calvin and Hobbes comics using Stable Diffusion. (arXiv:2312.03993v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03993
Code URL: null
Copy Paste: [[2312.03993]] Style Transfer to Calvin and Hobbes comics using Stable Diffusion(http://arxiv.org/abs/2312.03993)
Summary:
This project report summarizes our journey to perform stable diffusion fine-tuning on a dataset containing Calvin and Hobbes comics. The purpose is to convert any given input image into the comic style of Calvin and Hobbes, essentially performing style transfer. We train stable-diffusion-v1.5 using Low Rank Adaptation (LoRA) to efficiently speed up the fine-tuning process. The diffusion itself is handled by a Variational Autoencoder (VAE), which is a U-net. Our results were visually appealing for the amount of training time and the quality of input data that went into training.
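
For readers unfamiliar with Low Rank Adaptation, the core idea can be sketched in a few lines: the pretrained weight matrix `W` stays frozen, and only a low-rank update `B @ A` scaled by `alpha / r` is trained. The code below is a generic illustration in plain Python, not the report's training code; the helper names and the tiny matrices are made up for the example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute (W + (alpha / r) * B @ A) @ x without forming the full update.

    W: frozen weights, shape (d_out, d_in)
    A: trainable,      shape (r, d_in)
    B: trainable,      shape (d_out, r), usually initialized to zeros
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def trainable_parameters(d_out, d_in, r):
    """Parameters trained by LoRA versus full fine-tuning of one matrix."""
    return {"lora": r * (d_in + d_out), "full": d_in * d_out}


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    A = [[1.0, 1.0]]
    B = [[1.0], [2.0]]
    print(lora_forward(W, A, B, [2.0, 3.0], alpha=2, r=1))
    print(trainable_parameters(768, 768, 4))
```

With `B` initialized to zeros the adapted layer starts out identical to the pretrained one, and the parameter count shows why fine-tuning is cheaper: the update has `r * (d_in + d_out)` trainable values instead of `d_in * d_out`.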

Title: Stable diffusion for Data Augmentation in COCO and Weed Datasets. (arXiv:2312.03996v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03996
Code URL: null
Copy Paste: [[2312.03996]] Stable diffusion for Data Augmentation in COCO and Weed Datasets(http://arxiv.org/abs/2312.03996)
Summary:
Generative models have increasingly impacted related tasks, ranging from image revision and object detection in computer vision to interior design and idea illustration in more general fields. Stable diffusion is an outstanding model series that paves the way for producing high-resolution images with thorough details from text prompts or reference images. How to leverage the capability of stable diffusion to increase the image variation of certain categories (e.g., vehicles, humans, and daily objects) is an interesting topic; in particular, it has the potential to yield improvements for small datasets with image-sparse categories. This study utilized seven categories in the popular COCO dataset and three widespread weed species in Michigan to evaluate the efficiency of a recent version of stable diffusion. In detail, stable diffusion was used to generate synthetic images belonging to these classes; then, YOLOv8 models were trained on these synthetic images, and their performance was compared to that of models trained on the original images. In addition, several stable diffusion techniques (e.g., image-to-image translation, Dreambooth, ControlNet) were leveraged for image generation with different focuses. Although the overall results were disappointing, promising results were achieved in some classes, illustrating the potential of stable diffusion models to improve the performance of detection models by conveying more helpful information through the generated images. This seminal study may expedite the adaptation of stable diffusion models to classification and detection tasks in different fields.

Title: KOALA: Self-Attention Matters in Knowledge Distillation of Latent Diffusion Models for Memory-Efficient and Fast Image Synthesis. (arXiv:2312.04005v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.04005
Code URL: null
Copy Paste: [[2312.04005]] KOALA: Self-Attention Matters in Knowledge Distillation of Latent Diffusion Models for Memory-Efficient and Fast Image Synthesis(http://arxiv.org/abs/2312.04005)
Summary:
Stable diffusion is the mainstay of text-to-image (T2I) synthesis in the community due to its generation performance and open-source nature. Recently, Stable Diffusion XL (SDXL), the successor of stable diffusion, has received a lot of attention due to its significant performance improvements with a higher resolution of 1024x1024 and a larger model. However, its increased computation cost and model size require higher-end hardware (e.g., a GPU with more VRAM) for end-users, incurring higher costs of operation. To address this problem, in this work, we propose an efficient latent diffusion model for text-to-image synthesis obtained by distilling the knowledge of SDXL. To this end, we first perform an in-depth analysis of the denoising U-Net in SDXL, which is the main bottleneck of the model, and then design a more efficient U-Net based on the analysis. Secondly, we explore how to effectively distill the generation capability of SDXL into an efficient U-Net and eventually identify four essential factors, the core of which is that self-attention is the most important part. With our efficient U-Net and self-attention-based knowledge distillation strategy, we build our efficient T2I models, called KOALA-1B & -700M, while reducing the model size up to 54% and 69% of the original SDXL model. In particular, the KOALA-700M is more than twice as fast as SDXL while still retaining a decent generation quality. We hope that due to its balanced speed-performance tradeoff, our KOALA models can serve as a cost-effective alternative to SDXL in resource-constrained environments.

Title: DiffusionPhase: Motion Diffusion in Frequency Domain. (arXiv:2312.04036v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.04036
Code URL: null
Copy Paste: [[2312.04036]] DiffusionPhase: Motion Diffusion in Frequency Domain(http://arxiv.org/abs/2312.04036)
Summary:
In this study, we introduce a learning-based method for generating high-quality human motion sequences from text descriptions (e.g., "A person walks forward"). Existing techniques struggle with motion diversity and smooth transitions in generating arbitrary-length motion sequences, due to limited text-to-motion datasets and the pose representations used that often lack expressiveness or compactness. To address these issues, we propose the first method for text-conditioned human motion generation in the frequency domain of motions. We develop a network encoder that converts the motion space into a compact yet expressive parameterized phase space with high-frequency details encoded, capturing the local periodicity of motions in time and space with high accuracy. We also introduce a conditional diffusion model for predicting periodic motion parameters based on text descriptions and a start pose, efficiently achieving smooth transitions between motion sequences associated with different text descriptions. Experiments demonstrate that our approach outperforms current methods in generating a broader variety of high-quality motions, and synthesizing long sequences with natural transitions.

Title: MTVG : Multi-text Video Generation with Text-to-Video Models. (arXiv:2312.04086v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.04086
Code URL: null
Copy Paste: [[2312.04086]] MTVG : Multi-text Video Generation with Text-to-Video Models(http://arxiv.org/abs/2312.04086)
Summary:
Recently, video generation has attracted massive attention and yielded noticeable outcomes. Concerning the characteristics of video, multi-text conditioning incorporating sequential events is necessary for next-step video generation. In this work, we propose a novel multi-text video generation (MTVG) method that directly utilizes a pre-trained diffusion-based text-to-video (T2V) generation model without additional fine-tuning. To generate consecutive video segments, visual consistency across segments generated by distinct prompts is necessary, along with diverse variations such as motion and content-related transitions. Our proposed MTVG includes Dynamic Noise and Last Frame Aware Inversion, which reinitialize the noise latent to preserve visual coherence between videos of different prompts and prevent repetitive motion or contents. Furthermore, we present Structure Guiding Sampling to maintain the global appearance across the frames in a single video clip, where we leverage iterative latent updates across the preceding frame. Additionally, our Prompt Generator allows for arbitrary formats of text conditions consisting of diverse events. As a result, our extensive experiments, including diverse transitions of descriptions, demonstrate that our proposed methods show superior generated outputs in terms of semantically coherent and temporally seamless video. Video examples are available on our project page: https://kuai-lab.github.io/mtvg-page.

Title: Diffusing Colors: Image Colorization with Text Guided Diffusion. (arXiv:2312.04145v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.04145
Code URL: null
Copy Paste: [[2312.04145]] Diffusing Colors: Image Colorization with Text Guided Diffusion(http://arxiv.org/abs/2312.04145)
Summary:
The colorization of grayscale images is a complex and subjective task with significant challenges. Despite recent progress in employing large-scale datasets with deep neural networks, difficulties with controllability and visual quality persist. To tackle these issues, we present a novel image colorization framework that utilizes image diffusion techniques with granular text prompts. This integration not only produces colorization outputs that are semantically appropriate but also greatly improves the level of control users have over the colorization process. Our method provides a balance between automation and control, outperforming existing techniques in terms of visual quality and semantic coherence. We leverage a pretrained generative Diffusion Model, and show that we can finetune it for the colorization task without losing its generative power or attention to text prompts. Moreover, we present a novel CLIP-based ranking model that evaluates color vividness, enabling automatic selection of the most suitable level of vividness based on the specific scene semantics. Our approach holds potential particularly for color enhancement and historical image colorization.

Title: Detecting and Restoring Non-Standard Hands in Stable Diffusion Generated Images. (arXiv:2312.04236v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.04236
Code URL: null
Copy Paste: [[2312.04236]] Detecting and Restoring Non-Standard Hands in Stable Diffusion Generated Images(http://arxiv.org/abs/2312.04236)
Summary:
We introduce a pipeline to address anatomical inaccuracies in Stable Diffusion generated hand images. The initial step involves constructing a specialized dataset, focusing on hand anomalies, to train our models effectively. A finetuned detection model is pivotal for precise identification of these anomalies, ensuring targeted correction. Body pose estimation aids in understanding hand orientation and positioning, crucial for accurate anomaly correction. The integration of ControlNet and InstructPix2Pix facilitates sophisticated inpainting and pixel-level transformation, respectively. This dual approach allows for high-fidelity image adjustments. This comprehensive approach ensures the generation of images with anatomically accurate hands, closely resembling real-world appearances. Our experimental results demonstrate the pipeline's efficacy in enhancing hand image realism in Stable Diffusion outputs. We provide an online demo at https://fixhand.yiqun.io

Title: Prompt Highlighter: Interactive Control for Multi-Modal LLMs. (arXiv:2312.04302v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.04302
Code URL: https://github.com/dvlab-research/prompt-highlighter
Copy Paste: [[2312.04302]] Prompt Highlighter: Interactive Control for Multi-Modal LLMs(http://arxiv.org/abs/2312.04302)
Summary:
This study targets a critical aspect of multi-modal LLMs' (LLMs&VLMs) inference: explicit controllable text generation. Multi-modal LLMs empower multi-modality understanding with the capability of semantic generation yet bring less explainability and heavier reliance on prompt contents due to their autoregressive generative nature. While manipulating prompt formats could improve outputs, designing specific and precise prompts per task can be challenging and ineffective. To tackle this issue, we introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans to interactively control the focus during generation. Motivated by the classifier-free diffusion guidance, we form regular and unconditional context pairs based on highlighted tokens, demonstrating that the autoregressive generation in models can be guided in a classifier-free way. Notably, we find that, during inference, guiding the models with highlighted tokens through the attention weights leads to more desired outputs. Our approach is compatible with current LLMs and VLMs, achieving impressive customized generation results without training. Experiments confirm its effectiveness in focusing on input contexts and generating reliable content. Without tuning on LLaVA-v1.5, our method secured 69.5 in the MMBench test and 1552.5 in MME-perception. The code is available at: https://github.com/dvlab-research/Prompt-Highlighter/
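
The classifier-free guidance idea that motivates this method can be sketched for next-token logits. This is a minimal illustration of the generic guidance formula only, assuming the guided logits are `uncond + scale * (cond - uncond)`. It does not reproduce Prompt Highlighter's attention-weight manipulation or its construction of context pairs, and the function names and the three-token vocabulary are hypothetical.

```python
import math


def guided_logits(cond_logits, uncond_logits, scale):
    """Generic classifier-free guidance applied to next-token logits.

    scale = 0 gives the unconditional logits, scale = 1 the conditional
    ones, and scale > 1 pushes further toward the conditional context.
    """
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    cond = [2.0, 1.0, 0.0]    # logits with the highlighted span in context
    uncond = [1.0, 1.0, 1.0]  # logits with the span masked or down-weighted
    for scale in (0.0, 1.0, 2.0):
        print(scale, softmax(guided_logits(cond, uncond, scale)))
```

Raising the scale sharpens the distribution toward tokens that the highlighted context makes more likely, which is the same trade-off guidance scales control in diffusion sampling.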

Title: iDesigner: A High-Resolution and Complex-Prompt Following Text-to-Image Diffusion Model for Interior Design. (arXiv:2312.04326v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.04326
Code URL: null
Copy Paste: [[2312.04326]] iDesigner: A High-Resolution and Complex-Prompt Following Text-to-Image Diffusion Model for Interior Design(http://arxiv.org/abs/2312.04326)
Summary:
With the open-sourcing of text-to-image (T2I) models such as stable diffusion (SD) and stable diffusion XL (SD-XL), there is an influx of models fine-tuned in specific domains based on the open-source SD model, such as anime and character portraits. However, there are few specialized models in certain domains, such as interior design, which is attributed to the complex textual descriptions and detailed visual elements inherent in design, alongside the necessity for adaptable resolution. Therefore, text-to-image models for interior design are required to have outstanding prompt-following capabilities and to support iterative collaboration with design professionals to achieve the desired outcome. In this paper, we collect and optimize text-image data in the design field and continue training in both English and Chinese on the basis of the open-source CLIP model. We also propose a fine-tuning strategy with curriculum learning and reinforcement learning from CLIP feedback to enhance the prompt-following capabilities of our approach so as to improve the quality of image generation. The experimental results on the collected dataset demonstrate the effectiveness of the proposed approach, which achieves impressive results and outperforms strong baselines.

Title: Multi-View Unsupervised Image Generation with Cross Attention Guidance. (arXiv:2312.04337v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.04337
Code URL: null
Copy Paste: [[2312.04337]] Multi-View Unsupervised Image Generation with Cross Attention Guidance(http://arxiv.org/abs/2312.04337)
Summary:
The growing interest in novel view synthesis, driven by Neural Radiance Field (NeRF) models, is hindered by scalability issues due to their reliance on precisely annotated multi-view images. Recent models address this by fine-tuning large text2image diffusion models on synthetic multi-view data. Despite robust zero-shot generalization, they may need post-processing and can face quality issues due to the synthetic-real domain gap. This paper introduces a novel pipeline for unsupervised training of a pose-conditioned diffusion model on single-category datasets. With the help of pretrained self-supervised Vision Transformers (DINOv2), we identify object poses by clustering the dataset through comparing visibility and locations of specific object parts. The pose-conditioned diffusion model, trained on pose labels and equipped with cross-frame attention at inference time, ensures cross-view consistency, which is further aided by our novel hard-attention guidance. Our model, MIRAGE, surpasses prior work in novel view synthesis on real images. Furthermore, MIRAGE is robust to diverse textures and geometries, as demonstrated with our experiments on synthetic images generated with pretrained Stable Diffusion.

Title: Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models. (arXiv:2312.04410v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.04410
Code URL: https://github.com/shi-labs/smooth-diffusion
Copy Paste: [[2312.04410]] Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models(http://arxiv.org/abs/2312.04410)
Summary:
Recently, diffusion models have made remarkable progress in text-to-image (T2I) generation, synthesizing images with high fidelity and diverse contents. Despite this advancement, latent space smoothness within diffusion models remains largely unexplored. Smooth latent spaces ensure that a perturbation on an input latent corresponds to a steady change in the output image. This property proves beneficial in downstream tasks, including image interpolation, inversion, and editing. In this work, we expose the non-smoothness of diffusion latent spaces by observing noticeable visual fluctuations resulting from minor latent variations. To tackle this issue, we propose Smooth Diffusion, a new category of diffusion models that can be simultaneously high-performing and smooth. Specifically, we introduce Step-wise Variation Regularization to enforce that the ratio between the variation of an arbitrary input latent and that of the output image is constant at any diffusion training step. In addition, we devise an interpolation standard deviation (ISTD) metric to effectively assess the latent space smoothness of a diffusion model. Extensive quantitative and qualitative experiments demonstrate that Smooth Diffusion stands out as a more desirable solution not only in T2I generation but also across various downstream tasks. Smooth Diffusion is implemented as a plug-and-play Smooth-LoRA to work with various community models. Code is available at https://github.com/SHI-Labs/Smooth-Diffusion.

Title: Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views. (arXiv:2312.04424v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.04424
Code URL: null
Copy Paste: [[2312.04424]] Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views(http://arxiv.org/abs/2312.04424)
Summary:
Synthesizing multi-view 3D from a single image is a significant and challenging task. For this goal, Zero-1-to-3 methods aim to extend a 2D latent diffusion model to the 3D scope. These approaches generate the target-view image with a single-view source image and the camera pose as condition information. However, the one-to-one manner adopted in Zero-1-to-3 makes it difficult to build geometric and visual consistency across views, especially for complex objects. To tackle this issue, we propose a cascade generation framework constructed with two Zero-1-to-3 models, named Cascade-Zero123, which progressively extracts 3D information from the source image. Specifically, a self-prompting mechanism is designed to generate several nearby views at first. These views are then fed into the second-stage model along with the source image as generation conditions. With self-prompted multiple views as the supplementary information, our Cascade-Zero123 generates more consistent novel-view images than Zero-1-to-3. The improvement is significant for various complex and challenging scenes, involving insects, humans, transparent objects, and multiple stacked objects. The project page is at https://cascadezero123.github.io/.

Title: Approximate Caching for Efficiently Serving Diffusion Models. (arXiv:2312.04429v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.04429
Code URL: null
Copy Paste: [[2312.04429]] Approximate Caching for Efficiently Serving Diffusion Models(http://arxiv.org/abs/2312.04429)
Summary:
Text-to-image generation using diffusion models has seen explosive popularity owing to their ability to produce high-quality images adhering to text prompts. However, production-grade diffusion model serving is a resource-intensive task that not only requires expensive high-end GPUs but also incurs considerable latency. In this paper, we introduce a technique called approximate-caching that can reduce such iterative denoising steps for an image generation based on a prompt by reusing intermediate noise states created during a prior image generation for similar prompts. Based on this idea, we present an end-to-end text-to-image system, Nirvana, that uses approximate-caching with a novel cache-management policy, Least Computationally Beneficial and Frequently Used (LCBFU), to provide % GPU compute savings, 19.8% end-to-end latency reduction and 19% dollar savings, on average, on two real production workloads. We further present an extensive characterization of real production text-to-image prompts from the perspective of caching, popularity and reuse of intermediate states in a large production environment.
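
The reuse of intermediate states for similar prompts can be pictured with a toy lookup structure. The following is a minimal sketch, not the Nirvana system: the class `ApproxCache`, the cosine-similarity threshold, and the helper `steps_to_run` are hypothetical names chosen for illustration, prompt embeddings are assumed to be non-zero vectors, and the LCBFU eviction policy is not modeled.

```python
import math

def cosine(a, b):
    # Cosine similarity between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class ApproxCache:
    """Stores (prompt embedding, denoising step, intermediate state)."""

    def __init__(self, threshold):
        self.threshold = threshold  # minimum similarity for a cache hit
        self.entries = []

    def put(self, embedding, step, state):
        self.entries.append((embedding, step, state))

    def lookup(self, embedding):
        # Return (step, state) of the most similar cached prompt,
        # or (0, None) when nothing is similar enough.
        best, best_sim = None, self.threshold
        for emb, step, state in self.entries:
            sim = cosine(embedding, emb)
            if sim >= best_sim:
                best, best_sim = (step, state), sim
        return best if best is not None else (0, None)

def steps_to_run(cache, embedding, total_steps):
    # On a hit, denoising resumes from the cached step instead of step 0.
    start, _ = cache.lookup(embedding)
    return total_steps - start

cache = ApproxCache(threshold=0.9)
cache.put([1.0, 0.0], step=10, state="latent-after-10-steps")
remaining_hit = steps_to_run(cache, [0.99, 0.05], total_steps=50)
remaining_miss = steps_to_run(cache, [0.0, 1.0], total_steps=50)
```

In this toy setup a near-duplicate prompt skips the first 10 of 50 denoising steps, while an unrelated prompt runs all 50.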

Title: DreamVideo: Composing Your Dream Videos with Customized Subject and Motion. (arXiv:2312.04433v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.04433
Code URL: null
Copy Paste: [[2312.04433]] DreamVideo: Composing Your Dream Videos with Customized Subject and Motion(http://arxiv.org/abs/2312.04433)
Summary:
Customized generation using diffusion models has made impressive progress in image generation, but remains unsatisfactory in the challenging video generation task, as it requires the controllability of both subjects and motions. To that end, we present DreamVideo, a novel approach to generating personalized videos from a few static images of the desired subject and a few videos of target motion. DreamVideo decouples this task into two stages, subject learning and motion learning, by leveraging a pre-trained video diffusion model. The subject learning aims to accurately capture the fine appearance of the subject from provided images, which is achieved by combining textual inversion and fine-tuning of our carefully designed identity adapter. In motion learning, we architect a motion adapter and fine-tune it on the given videos to effectively model the target motion pattern. Combining these two lightweight and efficient adapters allows for flexible customization of any subject with any motion. Extensive experimental results demonstrate the superior performance of our DreamVideo over the state-of-the-art methods for customized video generation. Our project page is at https://dreamvideo-t2v.github.io.

Title: Improved Efficient Two-Stage Denoising Diffusion Power System Measurement Recovery Against False Data Injection Attacks and Data Losses. (arXiv:2312.04346v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.04346
Code URL: null
Copy Paste: [[2312.04346]] Improved Efficient Two-Stage Denoising Diffusion Power System Measurement Recovery Against False Data Injection Attacks and Data Losses(http://arxiv.org/abs/2312.04346)
Summary:
Measurement uncertainties, represented by cyber-attacks and data losses, seriously degrade the quality of power system measurements. Fortunately, the powerful generation ability of denoising diffusion models can enable more precise measurement generation for power system data recovery. However, controllable data generation and efficient computing methods of denoising diffusion models for deterministic trajectories still need further investigation. To this end, this paper proposes an improved two-stage denoising diffusion model (TSDM) to identify and reconstruct the measurements with various measurement uncertainties. The first stage of the model comprises a classifier-guided conditional anomaly detection component, while the second stage involves a diffusion-based measurement imputation component. Moreover, the proposed TSDM adopts precise means and optimal variances to accelerate the diffusion generation process with subsequence sampling. Extensive numerical case studies demonstrate that the proposed TSDM can accurately recover power system measurements despite strong randomness under renewable energy integration and highly nonlinear dynamics under complex cyber-physical contingencies. Additionally, the proposed TSDM has stronger robustness compared to existing reconstruction networks and exhibits lower computational complexity than general denoising diffusion models.

self-supervised

Title: Intelligent Anomaly Detection for Lane Rendering Using Transformer with Self-Supervised Pre-Training and Customized Fine-Tuning. (arXiv:2312.04398v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.04398
Code URL: null
Copy Paste: [[2312.04398]] Intelligent Anomaly Detection for Lane Rendering Using Transformer with Self-Supervised Pre-Training and Customized Fine-Tuning(http://arxiv.org/abs/2312.04398)
Summary:
The burgeoning navigation services using digital maps provide great convenience to drivers. Nevertheless, the presence of anomalies in lane rendering map images occasionally introduces potential hazards, as such anomalies can be misleading to human drivers and consequently contribute to unsafe driving conditions. In response to this concern and to accurately and effectively detect the anomalies, this paper transforms lane rendering image anomaly detection into a classification problem and proposes a four-phase pipeline consisting of data pre-processing, self-supervised pre-training with the masked image modeling (MiM) method, customized fine-tuning using cross-entropy based loss with label smoothing, and post-processing to tackle it leveraging state-of-the-art deep learning techniques, especially those involving Transformer models. Various experiments verify the effectiveness of the proposed pipeline. Results indicate that the proposed pipeline exhibits superior performance in lane rendering image anomaly detection, and notably, the self-supervised pre-training with MiM can greatly enhance the detection accuracy while significantly reducing the total training time. For instance, employing the Swin Transformer with Uniform Masking as self-supervised pretraining (Swin-Trans-UM) yielded a heightened accuracy at 94.77% and an improved Area Under The Curve (AUC) score of 0.9743 compared with the pure Swin Transformer without pre-training (Swin-Trans) with an accuracy of 94.01% and an AUC of 0.9498. The fine-tuning epochs were dramatically reduced to 41 from the original 280. In conclusion, the proposed pipeline, with its incorporation of self-supervised pre-training using MiM and other advanced deep learning techniques, emerges as a robust solution for enhancing the accuracy and efficiency of lane rendering image anomaly detection in digital navigation systems.
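
For readers unfamiliar with the fine-tuning loss named above, here is a minimal sketch of cross-entropy with label smoothing in its common uniform-smoothing form. The function names and the smoothing value `eps` are assumptions for illustration; the abstract does not state which variant or value the authors use.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def smoothed_cross_entropy(logits, label, eps):
    # Uniform label smoothing: the true class gets 1 - eps + eps / k,
    # every other class gets eps / k, where k is the number of classes.
    k = len(logits)
    logp = log_softmax(logits)
    target = [eps / k + (1.0 - eps if i == label else 0.0) for i in range(k)]
    return -sum(t * lp for t, lp in zip(target, logp))

# A confident, correct prediction is penalised slightly once smoothing is on.
loss_plain = smoothed_cross_entropy([10.0, 0.0], label=0, eps=0.0)
loss_smooth = smoothed_cross_entropy([10.0, 0.0], label=0, eps=0.1)
```

Setting `eps` to 0 recovers ordinary cross-entropy, so the smoothed loss can be dropped in wherever the plain loss was used.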

Title: SCStory: Self-supervised and Continual Online Story Discovery. (arXiv:2312.03725v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.03725
Code URL: null
Copy Paste: [[2312.03725]] SCStory: Self-supervised and Continual Online Story Discovery(http://arxiv.org/abs/2312.03725)
Summary:
We present a framework SCStory for online story discovery, that helps people digest rapidly published news article streams in real-time without human annotations. To organize news article streams into stories, existing approaches directly encode the articles and cluster them based on representation similarity. However, these methods yield noisy and inaccurate story discovery results because the generic article embeddings do not effectively reflect the story-indicative semantics in an article and cannot adapt to the rapidly evolving news article streams. SCStory employs self-supervised and continual learning with a novel idea of story-indicative adaptive modeling of news article streams. With a lightweight hierarchical embedding module that first learns sentence representations and then article representations, SCStory identifies story-relevant information of news articles and uses them to discover stories. The embedding module is continuously updated to adapt to evolving news streams with a contrastive learning objective, backed up by two unique techniques, confidence-aware memory replay and prioritized-augmentation, employed for label absence and data scarcity problems. Thorough experiments on real and the latest news data sets demonstrate that SCStory outperforms existing state-of-the-art algorithms for unsupervised online story discovery.

Title: MultiGPrompt for Multi-Task Pre-Training and Prompting on Graphs. (arXiv:2312.03731v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.03731
Code URL: null
Copy Paste: [[2312.03731]] MultiGPrompt for Multi-Task Pre-Training and Prompting on Graphs(http://arxiv.org/abs/2312.03731)
Summary:
Graphs can inherently model interconnected objects on the Web, thereby facilitating a series of Web applications, such as web analysis and content recommendation. Recently, Graph Neural Networks (GNNs) have emerged as a mainstream technique for graph representation learning. However, their efficacy within an end-to-end supervised framework is significantly tied to the availability of task-specific labels. To mitigate labeling costs and enhance robustness in few-shot settings, pre-training on self-supervised tasks has emerged as a promising method, while prompting has been proposed to further narrow the objective gap between pretext and downstream tasks. Although there has been some initial exploration of prompt-based learning on graphs, existing methods primarily leverage a single pretext task, resulting in a limited subset of general knowledge that could be learned from the pre-training data. Hence, in this paper, we propose MultiGPrompt, a novel multi-task pre-training and prompting framework to exploit multiple pretext tasks for more comprehensive pre-trained knowledge. First, in pre-training, we design a set of pretext tokens to synergize multiple pretext tasks. Second, we propose a dual-prompt mechanism consisting of composed and open prompts to leverage task-specific and global pre-training knowledge, to guide downstream tasks in few-shot settings. Finally, we conduct extensive experiments on six public datasets to evaluate and analyze MultiGPrompt.

Title: Learning Genomic Sequence Representations using Graph Neural Networks over De Bruijn Graphs. (arXiv:2312.03865v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03865
Code URL: https://github.com/ratschlab/genomic-gnn
Copy Paste: [[2312.03865]] Learning Genomic Sequence Representations using Graph Neural Networks over De Bruijn Graphs(http://arxiv.org/abs/2312.03865)
Summary:
The rapid expansion of genomic sequence data calls for new methods to achieve robust sequence representations. Existing techniques often neglect intricate structural details, emphasizing mainly contextual information. To address this, we developed k-mer embeddings that merge contextual and structural string information by enhancing De Bruijn graphs with structural similarity connections. Subsequently, we crafted a self-supervised method based on Contrastive Learning that employs a heterogeneous Graph Convolutional Network encoder and constructs positive pairs based on node similarities. Our embeddings consistently outperform prior techniques for Edit Distance Approximation and Closest String Retrieval tasks.
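
To make the underlying data structure concrete, the sketch below builds one common variant of a De Bruijn graph in which nodes are k-mers and a weighted edge links each k-mer to the one that follows it in the sequence. The function name and the edge-weight convention are assumptions for illustration; the structural similarity connections and the GNN encoder described in the abstract are not modeled.

```python
from collections import defaultdict

def de_bruijn_edges(sequence, k):
    # Slide a window of length k over the sequence to list its k-mers.
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    # Consecutive k-mers overlap in k - 1 characters; count each transition.
    edges = defaultdict(int)
    for current, following in zip(kmers, kmers[1:]):
        edges[(current, following)] += 1
    return dict(edges)

graph = de_bruijn_edges("ACGTACG", k=3)
```

Here `graph` maps each (k-mer, next k-mer) pair to how often that transition occurs, which is the contextual signal a graph encoder could consume.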

Title: Rapid detection of rare events from in situ X-ray diffraction data using machine learning. (arXiv:2312.03989v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03989
Code URL: null
Copy Paste: [[2312.03989]] Rapid detection of rare events from in situ X-ray diffraction data using machine learning(http://arxiv.org/abs/2312.03989)
Summary:
High-energy X-ray diffraction methods can non-destructively map the 3D microstructure and associated attributes of metallic polycrystalline engineering materials in their bulk form. These methods are often combined with external stimuli such as thermo-mechanical loading to take snapshots over time of the evolving microstructure and attributes. However, the extreme data volumes and the high costs of traditional data acquisition and reduction approaches pose a barrier to quickly extracting actionable insights and improving the temporal resolution of these snapshots. Here we present a fully automated technique capable of rapidly detecting the onset of plasticity in high-energy X-ray microscopy data. Our technique is computationally faster by at least 50 times than the traditional approaches and works for data sets that are up to 9 times sparser than a full data set. This new technique leverages self-supervised image representation learning and clustering to transform massive data into compact, semantic-rich representations of visually salient characteristics (e.g., peak shapes). These characteristics can be a rapid indicator of anomalous events such as changes in diffraction peak shapes. We anticipate that this technique will provide just-in-time actionable information to drive smarter experiments that effectively deploy multi-modal X-ray diffraction methods that span many decades of length scales.

Title: Series2Vec: Similarity-based Self-supervised Representation Learning for Time Series Classification. (arXiv:2312.03998v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03998
Code URL: https://github.com/navidfoumani/series2vec
Copy Paste: [[2312.03998]] Series2Vec: Similarity-based Self-supervised Representation Learning for Time Series Classification(http://arxiv.org/abs/2312.03998)
Summary:
We argue that time series analysis is fundamentally different in nature to either vision or natural language processing with respect to the forms of meaningful self-supervised learning tasks that can be defined. Motivated by this insight, we introduce a novel approach called Series2Vec for self-supervised representation learning. Unlike other self-supervised methods in time series, which carry the risk of positive sample variants being less similar to the anchor sample than series in the negative set, Series2Vec is trained to predict the similarity between two series in both temporal and spectral domains through a self-supervised task. Series2Vec relies primarily on the consistency of the unsupervised similarity step, rather than the intrinsic quality of the similarity measurement, without the need for hand-crafted data augmentation. To further enforce the network to learn similar representations for similar time series, we propose a novel approach that applies order-invariant attention to each representation within the batch during training. Our evaluation of Series2Vec on nine large real-world datasets, along with the UCR/UEA archive, shows enhanced performance compared to current state-of-the-art self-supervised techniques for time series. Additionally, our extensive experiments show that Series2Vec performs comparably with fully supervised training and offers high efficiency in datasets with limited-labeled data. Finally, we show that the fusion of Series2Vec with other representation learning models leads to enhanced performance for time series classification. Code and models are open-source at https://github.com/Navidfoumani/Series2Vec.

Title: TimeDRL: Disentangled Representation Learning for Multivariate Time-Series. (arXiv:2312.04142v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.04142
Code URL: null
Copy Paste: [[2312.04142]] TimeDRL: Disentangled Representation Learning for Multivariate Time-Series(http://arxiv.org/abs/2312.04142)
Summary:
Multivariate time-series data in numerous real-world applications (e.g., healthcare and industry) are informative but challenging due to the lack of labels and high dimensionality. Recent studies in self-supervised learning have shown their potential in learning rich representations without relying on labels, yet they fall short in learning disentangled embeddings and addressing issues of inductive bias (e.g., transformation-invariance). To tackle these challenges, we propose TimeDRL, a generic multivariate time-series representation learning framework with disentangled dual-level embeddings. TimeDRL is characterized by three novel features: (i) disentangled derivation of timestamp-level and instance-level embeddings from patched time-series data using a [CLS] token strategy; (ii) utilization of timestamp-predictive and instance-contrastive tasks for disentangled representation learning, with the former optimizing timestamp-level embeddings with predictive loss, and the latter optimizing instance-level embeddings with contrastive loss; and (iii) avoidance of augmentation methods to eliminate inductive biases, such as transformation-invariance from cropping and masking. Comprehensive experiments on 6 time-series forecasting datasets and 5 time-series classification datasets have shown that TimeDRL consistently surpasses existing representation learning approaches, achieving an average improvement of forecasting by 57.98% in MSE and classification by 1.25% in accuracy. Furthermore, extensive ablation studies confirmed the relative contribution of each component in TimeDRL's architecture, and semi-supervised learning evaluations demonstrated its effectiveness in real-world scenarios, even with limited labeled data.

foundation model

Title: Novel class discovery meets foundation models for 3D semantic segmentation. (arXiv:2312.03782v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03782
Code URL: null
Copy Paste: [[2312.03782]] Novel class discovery meets foundation models for 3D semantic segmentation(http://arxiv.org/abs/2312.03782)
Summary:
The task of Novel Class Discovery (NCD) in semantic segmentation entails training a model able to accurately segment unlabelled (novel) classes, relying on the available supervision from annotated (base) classes. Although extensively investigated in 2D image data, the extension of the NCD task to the domain of 3D point clouds represents a pioneering effort, characterized by assumptions and challenges that are not present in the 2D case. This paper represents an advancement in the analysis of point cloud data in four directions. Firstly, it introduces the novel task of NCD for point cloud semantic segmentation. Secondly, it demonstrates that directly transposing the only existing NCD method for 2D image semantic segmentation to 3D data yields suboptimal results. Thirdly, a new NCD approach based on online clustering, uncertainty estimation, and semantic distillation is presented. Lastly, a novel evaluation protocol is proposed to rigorously assess the performance of NCD in point cloud semantic segmentation. Through comprehensive evaluations on the SemanticKITTI, SemanticPOSS, and S3DIS datasets, the paper demonstrates substantial superiority of the proposed method over the considered baselines.

Title: Improving Medical Report Generation with Adapter Tuning and Knowledge Enhancement in Vision-Language Foundation Models. (arXiv:2312.03970v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03970
Code URL: null
Copy Paste: [[2312.03970]] Improving Medical Report Generation with Adapter Tuning and Knowledge Enhancement in Vision-Language Foundation Models(http://arxiv.org/abs/2312.03970)
Summary:
Medical report generation demands automatic creation of coherent and precise descriptions for medical images. However, the scarcity of labelled medical image-report pairs poses formidable challenges in developing large-scale neural networks capable of harnessing the potential of artificial intelligence, exemplified by large language models. This study builds upon the state-of-the-art vision-language pre-training and fine-tuning approach, BLIP-2, to customize general large-scale foundation models. Integrating adapter tuning and a medical knowledge enhancement loss, our model significantly improves accuracy and coherence. Validation on the dataset of ImageCLEFmedical 2023 demonstrates our model's prowess, achieving the best-averaged results against several state-of-the-art methods. Significant improvements in ROUGE and CIDEr underscore our method's efficacy, highlighting promising outcomes for the rapid medical-domain adaptation of the vision-language foundation models in addressing challenges posed by data scarcity.

Title: An unsupervised approach towards promptable defect segmentation in laser-based additive manufacturing by Segment Anything. (arXiv:2312.04063v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.04063
Code URL: null
Copy Paste: [[2312.04063]] An unsupervised approach towards promptable defect segmentation in laser-based additive manufacturing by Segment Anything(http://arxiv.org/abs/2312.04063)
Summary:
Foundation models are currently driving a paradigm shift in computer vision tasks for various fields including biology, astronomy, and robotics among others, leveraging user-generated prompts to enhance their performance. In the manufacturing domain, accurate image-based defect segmentation is imperative to ensure product quality and facilitate real-time process control. However, such tasks are often characterized by multiple challenges including the absence of labels and the requirement for low latency inference among others. To address these issues, we construct a framework for image segmentation using a state-of-the-art Vision Transformer (ViT) based foundation model (Segment Anything Model) with a novel multi-point prompt generation scheme using unsupervised clustering. We apply our framework to perform real-time porosity segmentation in a case study of laser-based powder bed fusion (L-PBF) and obtain high Dice Similarity Coefficients (DSC) without the necessity for any supervised fine-tuning in the model. Using such lightweight foundation model inference in conjunction with unsupervised prompt generation, we envision the construction of a real-time anomaly detection pipeline that has the potential to revolutionize the current laser-based additive manufacturing processes, thereby facilitating the shift towards Industry 4.0 and promoting defect-free production along with operational efficiency.
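
The Dice Similarity Coefficient reported above is the standard overlap measure DSC = 2|A ∩ B| / (|A| + |B|) between a predicted mask A and a reference mask B. The sketch below computes it for flat binary masks; the function name and the convention of returning 1.0 when both masks are empty are assumptions made for this illustration.

```python
def dice_coefficient(pred, target):
    # pred and target are flat sequences of 0/1 labels of equal length.
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks are empty; treat this as perfect agreement (assumed convention).
        return 1.0
    return 2.0 * intersection / total

# One shared foreground pixel out of two in each mask gives a DSC of 0.5.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
```

A DSC of 1.0 means the masks coincide exactly and 0.0 means they share no foreground pixel.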

Title: VRPTEST: Evaluating Visual Referring Prompting in Large Multimodal Models. (arXiv:2312.04087v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.04087
Code URL: null
Copy Paste: [[2312.04087]] VRPTEST: Evaluating Visual Referring Prompting in Large Multimodal Models(http://arxiv.org/abs/2312.04087)
Summary:
With recent advancements in Large Multimodal Models (LMMs) across various domains, a novel prompting method called visual referring prompting has emerged, showing significant potential in enhancing human-computer interaction within multimodal systems. This method offers a more natural and flexible approach to human interaction with these systems compared to traditional text descriptions or coordinates. However, the categorization of visual referring prompting remains undefined, and its impact on the performance of LMMs has yet to be formally examined. In this study, we conduct the first comprehensive analysis of LMMs using a variety of visual referring prompting strategies. We introduce a benchmark dataset called VRPTEST, comprising 3 different visual tasks and 2,275 images, spanning diverse combinations of prompt strategies. Using VRPTEST, we conduct a comprehensive evaluation of eight versions of prominent open-source and proprietary foundation models, including two early versions of GPT-4V. We develop an automated assessment framework based on software metamorphic testing techniques to evaluate the accuracy of LMMs without the need for human intervention or manual labeling. We find that the current proprietary models generally outperform the open-source ones, showing an average accuracy improvement of 22.70%; however, there is still potential for improvement. Moreover, our quantitative analysis shows that the choice of prompt strategy significantly affects the accuracy of LMMs, with variations ranging from -17.5% to +7.3%. Further case studies indicate that an appropriate visual referring prompting strategy can improve LMMs' understanding of context and location information, while an unsuitable one might lead to answer rejection. We also provide insights on minimizing the negative impact of visual referring prompting on LMMs.

Title: Fine-tune vision foundation model for crack segmentation in civil infrastructures. (arXiv:2312.04233v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.04233
Code URL: null
Copy Paste: [[2312.04233]] Fine-tune vision foundation model for crack segmentation in civil infrastructures(http://arxiv.org/abs/2312.04233)
Summary:
Large-scale foundation models have become the mainstream method in the field of deep learning, while in civil engineering, the scale of AI models is strictly limited. In this work, a vision foundation model is introduced for crack segmentation. Two parameter-efficient fine-tuning methods, adapter and low-rank adaptation, are adopted to fine-tune a foundation model for semantic segmentation: the Segment Anything Model (SAM). The fine-tuned model, CrackSAM, is much larger than all the existing crack segmentation models, but shows excellent performance. To test the zero-shot performance of the proposed method, two unique datasets related to road and exterior wall cracks are collected, annotated and open-sourced, totaling 810 images. Comparative experiments are conducted with twelve mature semantic segmentation models. On datasets with artificial noise and previously unseen datasets, the performance of CrackSAM far exceeds that of all state-of-the-art models. CrackSAM exhibits remarkable superiority, particularly in challenging conditions such as dim lighting, shadows, road markings, construction joints, and other interference factors. Such cross-scenario results demonstrate the outstanding zero-shot capability of foundation models, and provide new ideas for the development of vision models in civil engineering.

Title: Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation. (arXiv:2312.04265v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.04265
Code URL: null
Copy Paste: [[2312.04265]] Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation(http://arxiv.org/abs/2312.04265)
Summary:
In this paper, we first assess and harness various Vision Foundation Models (VFMs) in the context of Domain Generalized Semantic Segmentation (DGSS). Driven by the motivation that Leveraging Stronger pre-trained models and Fewer trainable parameters for Superior generalizability, we introduce a robust fine-tuning approach, namely Rein, to parameter-efficiently harness VFMs for DGSS. Built upon a set of trainable tokens, each linked to distinct instances, Rein precisely refines and forwards the feature maps from each layer to the next layer within the backbone. This process produces diverse refinements for different categories within a single image. With fewer trainable parameters, Rein efficiently fine-tunes VFMs for DGSS tasks, surprisingly surpassing full parameter fine-tuning. Extensive experiments across various settings demonstrate that Rein significantly outperforms state-of-the-art methods. Remarkably, with just an extra 1% of trainable parameters within the frozen backbone, Rein achieves a mIoU of 68.1% on the Cityscapes, without accessing any real urban-scene datasets.

Title: FoMo Rewards: Can we cast foundation models as reward functions? (arXiv:2312.03881v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03881
Code URL: null
Copy Paste: [[2312.03881]] FoMo Rewards: Can we cast foundation models as reward functions?(http://arxiv.org/abs/2312.03881)
Summary:
We explore the viability of casting foundation models as generic reward functions for reinforcement learning. To this end, we propose a simple pipeline that interfaces an off-the-shelf vision model with a large language model. Specifically, given a trajectory of observations, we infer the likelihood of an instruction describing the task that the user wants an agent to perform. We show that this generic likelihood function exhibits the characteristics ideally expected from a reward function: it associates high values with the desired behaviour and lower values for several similar, but incorrect policies. Overall, our work opens the possibility of designing open-ended agents for interactive tasks via foundation models.

Title: Jointly spatial-temporal representation learning for individual trajectories. (arXiv:2312.04055v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.04055
Code URL: null
Copy Paste: [[2312.04055]] Jointly spatial-temporal representation learning for individual trajectories(http://arxiv.org/abs/2312.04055)
Summary:
Individual trajectories, containing substantial information on human-environment interactions across space and time, are a crucial input for geospatial foundation models (GeoFMs). However, existing attempts to leverage trajectory data for various applications have overlooked the implicit spatial-temporal dependency within trajectories and failed to encode and represent it in a format friendly to deep learning, posing a challenge in obtaining general-purpose trajectory representations. Therefore, this paper proposes a spatial-temporal joint representation learning method (ST-GraphRL) to formalize learnable spatial-temporal dependencies into trajectory representations. The proposed ST-GraphRL consists of three components: (i) a weighted directed spatial-temporal graph to explicitly construct mobility interactions over both space and time dimensions; (ii) a two-stage joint encoder (i.e., decoupling and fusion) to learn entangled spatial-temporal dependencies by independently decomposing and jointly aggregating space and time information; (iii) a decoder that guides ST-GraphRL to learn explicit mobility regularities by simulating the spatial-temporal distributions of trajectories. Tested on three real-world human mobility datasets, the proposed ST-GraphRL outperformed all the baseline models in predicting movement spatial-temporal distributions and preserving trajectory similarity with high spatial-temporal correlations. We also explore how spatial-temporal features are presented in latent space, validating that ST-GraphRL understands spatial-temporal patterns. This method is also transferable to general-purpose geospatial data representations for broad downstream tasks, as well as advancing the development of GeoFMs.

generative

Title: XCube ($\mathcal{X}^3$): Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies. (arXiv:2312.03806v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03806
Code URL: null
Copy Paste: [[2312.03806]] XCube ($\mathcal{X}^3$): Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies(http://arxiv.org/abs/2312.03806)
Summary:
We present $\mathcal{X}^3$ (pronounced XCube), a novel generative model for high-resolution sparse 3D voxel grids with arbitrary attributes. Our model can generate millions of voxels with a finest effective resolution of up to $1024^3$ in a feed-forward fashion without time-consuming test-time optimization. To achieve this, we employ a hierarchical voxel latent diffusion model which generates progressively higher resolution grids in a coarse-to-fine manner using a custom framework built on the highly efficient VDB data structure. Apart from generating high-resolution objects, we demonstrate the effectiveness of XCube on large outdoor scenes at scales of 100m$\times$100m with a voxel size as small as 10cm. We observe clear qualitative and quantitative improvements over past approaches. In addition to unconditional generation, we show that our model can be used to solve a variety of tasks such as user-guided editing, scene completion from a single scan, and text-to-3D. More results and details can be found at https://research.nvidia.com/labs/toronto-ai/xcube/.

Title: PartDistill: 3D Shape Part Segmentation by Vision-Language Model Distillation. (arXiv:2312.04016v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.04016
Code URL: null
Copy Paste: [[2312.04016]] PartDistill: 3D Shape Part Segmentation by Vision-Language Model Distillation(http://arxiv.org/abs/2312.04016)
Summary:
This paper proposes a cross-modal distillation framework, PartDistill, which transfers 2D knowledge from vision-language models (VLMs) to facilitate 3D shape part segmentation. PartDistill addresses three major challenges in this task: the lack of 3D segmentation in invisible or undetected regions in the 2D projections, inaccurate and inconsistent 2D predictions by VLMs, and the lack of knowledge accumulation across different 3D shapes. PartDistill consists of a teacher network that uses a VLM to make 2D predictions and a student network that learns from the 2D predictions while extracting geometrical features from multiple 3D shapes to carry out 3D part segmentation. A bi-directional distillation, including forward and backward distillations, is carried out within the framework, where the former forward distills the 2D predictions to the student network, and the latter improves the quality of the 2D predictions, which subsequently enhances the final 3D part segmentation. Moreover, PartDistill can exploit generative models that facilitate effortless 3D shape creation for generating knowledge sources to be distilled. Through extensive experiments, PartDistill boosts the existing methods with substantial margins on widely used ShapeNetPart and PartE datasets, by more than 15% and 12% higher mIoU scores, respectively.

Title: Comparing Generative Chatbots Based on Process Requirements. (arXiv:2312.03741v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.03741
Code URL: null
Copy Paste: [[2312.03741]] Comparing Generative Chatbots Based on Process Requirements(http://arxiv.org/abs/2312.03741)
Summary:
Business processes are commonly represented by modelling languages, such as Event-driven Process Chain (EPC), Yet Another Workflow Language (YAWL), and the most popular standard notation for modelling business processes, the Business Process Model and Notation (BPMN). Most recently, chatbots, programs that allow users to interact with a machine using natural language, have been increasingly used for business process execution support. A recent category of chatbots worth mentioning is generative-based chatbots, powered by Large Language Models (LLMs) such as OpenAI's Generative Pre-Trained Transformer (GPT) model and Google's Pathways Language Model (PaLM), which have billions of parameters and support conversational intelligence. However, it is not clear whether generative-based chatbots are able to understand and meet the requirements of constructs such as those provided by BPMN for process execution support. This paper presents a case study to compare the performance of prominent generative models, GPT and PaLM, in the context of process execution support. The research sheds light on the challenging problem of using conversational approaches supported by generative chatbots as a means to understand process-aware modelling notations and support users in executing their tasks.

Title: Beyond Surface: Probing LLaMA Across Scales and Layers. (arXiv:2312.04333v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.04333
Code URL: https://github.com/nuochenpku/llama_analysis
Copy Paste: [[2312.04333]] Beyond Surface: Probing LLaMA Across Scales and Layers(http://arxiv.org/abs/2312.04333)
Summary:
This paper presents an in-depth analysis of Large Language Models (LLMs), focusing on LLaMA, a prominent open-source foundational model in natural language processing. Instead of assessing LLaMA through its generative output, we design multiple-choice tasks to probe its intrinsic understanding in high-order tasks such as reasoning and computation. We examine the model horizontally, comparing different sizes, and vertically, assessing different layers. We unveil several key and uncommon findings based on the designed probing tasks: (1) Horizontally, enlarging the model size barely imparts additional knowledge or computational prowess automatically. Instead, it can enhance reasoning abilities, especially in math problem solving, and helps reduce hallucinations, but only beyond certain size thresholds; (2) In vertical analysis, the lower layers of LLaMA lack substantial arithmetic and factual knowledge while showcasing logical thinking, multilingual and recognitive abilities, with top layers housing most computational power and real-world knowledge.

Title: Improving Gradient-guided Nested Sampling for Posterior Inference. (arXiv:2312.03911v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03911
Code URL: https://github.com/pablo-lemos/ggns
Copy Paste: [[2312.03911]] Improving Gradient-guided Nested Sampling for Posterior Inference(http://arxiv.org/abs/2312.03911)
Summary:
We present a performant, general-purpose gradient-guided nested sampling algorithm, ${\tt GGNS}$, combining the state of the art in differentiable programming, Hamiltonian slice sampling, clustering, mode separation, dynamic nested sampling, and parallelization. This unique combination allows ${\tt GGNS}$ to scale well with dimensionality and perform competitively on a variety of synthetic and real-world problems. We also show the potential of combining nested sampling with generative flow networks to obtain large amounts of high-quality samples from the posterior distribution. This combination leads to faster mode discovery and more accurate estimates of the partition function.

Title: Mixture of Dynamical Variational Autoencoders for Multi-Source Trajectory Modeling and Separation. (arXiv:2312.04167v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.04167
Code URL: null
Copy Paste: [[2312.04167]] Mixture of Dynamical Variational Autoencoders for Multi-Source Trajectory Modeling and Separation(http://arxiv.org/abs/2312.04167)
Summary:
In this paper, we propose a latent-variable generative model called mixture of dynamical variational autoencoders (MixDVAE) to model the dynamics of a system composed of multiple moving sources. A DVAE model is pre-trained on a single-source dataset to capture the source dynamics. Then, multiple instances of the pre-trained DVAE model are integrated into a multi-source mixture model with a discrete observation-to-source assignment latent variable. The posterior distributions of both the discrete observation-to-source assignment variable and the continuous DVAE variables representing the sources' content/position are estimated using a variational expectation-maximization algorithm, leading to multi-source trajectory estimation. We illustrate the versatility of the proposed MixDVAE model on two tasks: a computer vision task, namely multi-object tracking, and an audio processing task, namely single-channel audio source separation. Experimental results show that the proposed method works well on these two tasks, and outperforms several baseline methods.

Title: Learning to sample in Cartesian MRI. (arXiv:2312.04327v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.04327
Code URL: null
Copy Paste: [[2312.04327]] Learning to sample in Cartesian MRI(http://arxiv.org/abs/2312.04327)
Summary:
Despite its exceptional soft tissue contrast, Magnetic Resonance Imaging (MRI) faces the challenge of long scanning times compared to other modalities like X-ray radiography. Shortening scanning times is crucial in clinical settings, as it increases patient comfort, decreases examination costs and improves throughput. Recent advances in compressed sensing (CS) and deep learning allow accelerated MRI acquisition by reconstructing high-quality images from undersampled data. While reconstruction algorithms have received most of the focus, designing acquisition trajectories to optimize reconstruction quality remains an open question. This thesis explores two approaches to address this gap in the context of Cartesian MRI. First, we propose two algorithms, lazy LBCS and stochastic LBCS, that significantly improve upon Gözcü et al.'s greedy learning-based CS (LBCS) approach. These algorithms scale to large, clinically relevant scenarios like multi-coil 3D MR and dynamic MRI, previously inaccessible to LBCS. Additionally, we demonstrate that generative adversarial networks (GANs) can serve as a natural criterion for adaptive sampling by leveraging variance in the measurement domain to guide acquisition. Second, we delve into the underlying structures or assumptions that enable mask design algorithms to perform well in practice. Our experiments reveal that state-of-the-art deep reinforcement learning (RL) approaches, while capable of adaptation and long-horizon planning, offer only marginal improvements over stochastic LBCS, which is neither adaptive nor does long-term planning. Altogether, our findings suggest that stochastic LBCS and similar methods represent promising alternatives to deep RL. They shine in particular by their scalability and computational efficiency and could be key in the deployment of optimized acquisition trajectories in Cartesian MRI.

anomaly

Title: How Low Can You Go? Surfacing Prototypical In-Distribution Samples for Unsupervised Anomaly Detection. (arXiv:2312.03804v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.03804
Code URL: null
Copy Paste: [[2312.03804]] How Low Can You Go? Surfacing Prototypical In-Distribution Samples for Unsupervised Anomaly Detection(http://arxiv.org/abs/2312.03804)
Summary:
Unsupervised anomaly detection (UAD) alleviates large labeling efforts by training exclusively on unlabeled in-distribution data and detecting outliers as anomalies. Generally, the assumption prevails that large training datasets allow the training of higher-performing UAD models. However, in this work, we show that using only very few training samples can already match - and in some cases even improve - anomaly detection compared to training with the whole training dataset. We propose three methods to identify prototypical samples from a large dataset of in-distribution samples. We demonstrate that by training with a subset of just ten such samples, we achieve an area under the receiver operating characteristics curve (AUROC) of $96.37 \%$ on CIFAR10, $92.59 \%$ on CIFAR100, $95.37 \%$ on MNIST, $95.38 \%$ on Fashion-MNIST, $96.37 \%$ on MVTec-AD, $98.81 \%$ on BraTS, and $81.95 \%$ on RSNA pneumonia detection, even exceeding the performance of full training in $25/67$ classes we tested. Additionally, we show that the prototypical in-distribution samples identified by our proposed methods translate well to different models and other datasets and that using their characteristics as guidance allows for successful manual selection of small subsets of high-performing samples. Our code is available at https://anonymous.4open.science/r/uad_prototypical_samples/
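
For readers unfamiliar with the AUROC metric reported in this summary, the sketch below computes it from anomaly scores using its rank interpretation: the probability that a randomly chosen anomalous sample scores higher than a randomly chosen normal one, with ties counted as one half. The function name `auroc` and the toy scores are assumptions for illustration and are unrelated to the paper's data.

```python
def auroc(normal_scores, anomaly_scores):
    """AUROC as the fraction of (anomaly, normal) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for a in anomaly_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomaly_scores) * len(normal_scores))

# Toy anomaly scores (higher means "more anomalous"); values are made up.
normal = [0.1, 0.2, 0.3]
anomalous = [0.25, 0.9]
score = auroc(normal, anomalous)   # 5 of the 6 pairs are ranked correctly
```

The pairwise loop is quadratic in the number of samples, which is fine for a small illustration; library implementations use sorting instead.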

Title: A Multilevel Guidance-Exploration Network and Behavior-Scene Matching Method for Human Behavior Anomaly Detection. (arXiv:2312.04119v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2312.04119
Code URL: null
Copy Paste: [[2312.04119]] A Multilevel Guidance-Exploration Network and Behavior-Scene Matching Method for Human Behavior Anomaly Detection(http://arxiv.org/abs/2312.04119)
Summary:
Human behavior anomaly detection aims to identify unusual human actions, playing a crucial role in intelligent surveillance and other areas. The current mainstream methods still adopt reconstruction or future frame prediction techniques. However, reconstructing or predicting low-level pixel features easily enables the network to achieve overly strong generalization ability, allowing anomalies to be reconstructed or predicted as effectively as normal data. Unlike these methods, and inspired by the Student-Teacher Network, we propose a novel framework called the Multilevel Guidance-Exploration Network (MGENet), which detects anomalies through the difference in high-level representation between the Guidance and Exploration networks. Specifically, we first utilize a pre-trained Normalizing Flow that takes skeletal keypoints as input to guide an RGB encoder, which takes unmasked RGB frames as input, to explore motion latent features. Then, the RGB encoder guides the mask encoder, which takes masked RGB frames as input, to explore the latent appearance feature. Additionally, we design a Behavior-Scene Matching Module (BSMM) to detect scene-related behavioral anomalies. Extensive experiments demonstrate that our proposed method achieves state-of-the-art performance on the ShanghaiTech and UBnormal datasets, with AUC of 86.9% and 73.5%, respectively. The code will be available at https://github.com/molu-ggg/GENet.

in-context

Title: DP-OPT: Make Large Language Model Your Privacy-Preserving Prompt Engineer. (arXiv:2312.03724v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.03724
Code URL: null
Copy Paste: [[2312.03724]] DP-OPT: Make Large Language Model Your Privacy-Preserving Prompt Engineer(http://arxiv.org/abs/2312.03724)
Summary:
Large Language Models (LLMs) have emerged as dominant tools for various tasks, particularly when tailored for a specific target by prompt tuning. Nevertheless, concerns surrounding data privacy present obstacles due to the tuned prompts' dependency on sensitive private information. A practical solution is to host a local LLM and optimize a soft prompt privately using data. Yet, hosting a local model becomes problematic when model ownership is protected. Alternative methods, like sending data to the model's provider for training, intensify these privacy issues when facing an untrusted provider. In this paper, we present a novel solution called Differentially-Private Offsite Prompt Tuning (DP-OPT) to address this challenge. Our approach involves tuning a discrete prompt on the client side and then applying it to the desired cloud models. We demonstrate that prompts suggested by LLMs themselves can be transferred without compromising performance significantly. To ensure that the prompts do not leak private information, we introduce the first private prompt generation mechanism, by a differentially-private (DP) ensemble of in-context learning with private demonstrations. With DP-OPT, generating privacy-preserving prompts by Vicuna-7b can yield competitive performance compared to non-private in-context learning on GPT3.5 or local private prompt tuning. Code is available at https://github.com/VITA-Group/DP-OPT .

Title: Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration. (arXiv:2312.03987v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.03987
Code URL: https://github.com/fmh1art/batcher
Copy Paste: [[2312.03987]] Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration(http://arxiv.org/abs/2312.03987)
Summary:
Entity resolution (ER) is an important data integration task with a wide spectrum of applications. The state-of-the-art solutions on ER rely on pre-trained language models (PLMs), which require fine-tuning on a lot of labeled matching/non-matching entity pairs. Recently, large language models (LLMs), such as GPT-4, have shown the ability to perform many tasks without tuning model parameters, which is known as in-context learning (ICL) that facilitates effective learning from a few labeled input context demonstrations. However, existing ICL approaches to ER typically necessitate providing a task description and a set of demonstrations for each entity pair and thus incur a high monetary cost when interfacing with LLMs. To address the problem, in this paper, we provide a comprehensive study to investigate how to develop a cost-effective batch prompting approach to ER. We introduce a framework BATCHER consisting of demonstration selection and question batching and explore different design choices that support batch prompting for ER. We also devise a covering-based demonstration selection strategy that achieves an effective balance between matching accuracy and monetary cost. We conduct a thorough evaluation to explore the design space and evaluate our proposed strategies. Through extensive experiments, we find that batch prompting is very cost-effective for ER, compared with not only PLM-based methods fine-tuned with extensive labeled data but also LLM-based methods with manually designed prompting. We also provide guidance for selecting appropriate design choices for batch prompting.
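
The cost argument behind batch prompting can be sketched in a few lines: when several entity pairs share one task description and one set of demonstrations, that fixed overhead is paid once per batch rather than once per pair. The prompt layout, the helper names, and the word-count cost proxy below are all hypothetical choices for illustration; they are not taken from BATCHER.

```python
def build_prompt(task_description, demonstrations, questions):
    """Assemble one prompt: task description, shared demonstrations, then the questions."""
    lines = [task_description, ""]
    for left, right, label in demonstrations:
        lines.append(f"Demo: [{left}] vs [{right}] -> {label}")
    lines.append("")
    for i, (left, right) in enumerate(questions, start=1):
        lines.append(f"Q{i}: [{left}] vs [{right}] -> ?")
    return "\n".join(lines)

def batch_prompts(task_description, demonstrations, questions, batch_size):
    """Split the questions into consecutive batches and build one prompt per batch."""
    return [build_prompt(task_description, demonstrations, questions[i:i + batch_size])
            for i in range(0, len(questions), batch_size)]

def word_cost(prompts):
    """Crude cost proxy: whitespace-separated word count (real APIs bill by tokens)."""
    return sum(len(p.split()) for p in prompts)

# Made-up task text, demonstrations, and entity pairs.
task = "Decide whether the two records refer to the same entity. Answer yes or no."
demos = [("iPhone 13 128GB", "Apple iPhone 13 (128 GB)", "yes"),
         ("Galaxy S21", "Galaxy S22", "no")]
pairs = [(f"record A{i}", f"record B{i}") for i in range(8)]

single = batch_prompts(task, demos, pairs, batch_size=1)    # one prompt per pair
batched = batch_prompts(task, demos, pairs, batch_size=4)   # four pairs per prompt
```

Here `word_cost(batched)` is lower than `word_cost(single)` because the description and demonstrations are repeated twice instead of eight times. This sketch uses fixed demonstrations and consecutive batching only; it does not model how demonstrations or batches are selected.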

Title: A Study on the Calibration of In-context Learning. (arXiv:2312.04021v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2312.04021
Code URL: null
Copy Paste: [[2312.04021]] A Study on the Calibration of In-context Learning(http://arxiv.org/abs/2312.04021)
Summary:
Modern auto-regressive language models are trained to minimize log loss on broad data by predicting the next token, so they are expected to give calibrated answers when a problem is framed as a next-token prediction task. We study this for in-context learning (ICL), a widely used way to adapt frozen large language models (LLMs) via crafting prompts, and investigate the trade-offs between performance and calibration on a wide range of natural language understanding and reasoning tasks. We conduct extensive experiments to show that such trade-offs may get worse as we increase model size, incorporate more ICL examples, and fine-tune models using instruction, dialog, or reinforcement learning from human feedback (RLHF) on carefully curated datasets. Furthermore, we find that common recalibration techniques that are widely effective such as temperature scaling provide limited gains in calibration errors, suggesting that new methods may be required for settings where models are expected to be reliable.
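
To make the terms in this summary concrete, the sketch below shows the mechanics of expected calibration error (ECE) and temperature scaling on a tiny, deliberately overconfident toy classifier. It illustrates the definitions only and does not reproduce the paper's experiments or findings; the logits, labels, bin count, and temperature grid are assumptions chosen for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T > 1 flattens the distribution)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: sample-weighted average of |confidence - accuracy| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(a for _, a in b) / len(b)
            ece += len(b) / len(labels) * abs(avg_conf - acc)
    return ece

def nll(logits_list, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    return -sum(math.log(softmax(z, temperature)[y])
                for z, y in zip(logits_list, labels)) / len(labels)

# Toy held-out set: the model always predicts class 0 with ~98% confidence,
# but is right only 3 times out of 4.
logits = [[4.0, 0.0]] * 4
labels = [0, 0, 0, 1]

ece_before = expected_calibration_error([softmax(z) for z in logits], labels)

# Temperature scaling: pick the single scalar T that minimises held-out NLL (grid search).
candidates = [0.5 + 0.05 * i for i in range(191)]   # 0.5 .. 10.0
best_t = min(candidates, key=lambda t: nll(logits, labels, t))

ece_after = expected_calibration_error([softmax(z, best_t) for z in logits], labels)
```

In this toy case one scalar temperature removes nearly all of the miscalibration because every example is overconfident in the same way; real models need not behave so uniformly.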

Title: Generalization to New Sequential Decision Making Tasks with In-Context Learning. (arXiv:2312.03801v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.03801
Code URL: null
Copy Paste: [[2312.03801]] Generalization to New Sequential Decision Making Tasks with In-Context Learning(http://arxiv.org/abs/2312.03801)
Summary:
Training autonomous agents that can learn new tasks from only a handful of demonstrations is a long-standing problem in machine learning. Recently, transformers have been shown to learn new language or vision tasks without any weight updates from only a few examples, also referred to as in-context learning. However, the sequential decision making setting poses additional challenges, with a lower tolerance for errors, since the environment's stochasticity or the agent's actions can lead to unseen, and sometimes unrecoverable, states. In this paper, we use an illustrative example to show that naively applying transformers to sequential decision making problems does not enable in-context learning of new tasks. We then demonstrate how training on sequences of trajectories with certain distributional properties leads to in-context learning of new sequential decision making tasks. We investigate different design choices and find that larger model and dataset sizes, as well as more task diversity, environment stochasticity, and trajectory burstiness, all result in better in-context learning of new out-of-distribution tasks. By training on large diverse offline datasets, our model is able to learn new MiniHack and Procgen tasks without any weight updates from just a handful of demonstrations.

Title: On the adaptation of in-context learners for system identification. (arXiv:2312.04083v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2312.04083
Code URL: null
Copy Paste: [[2312.04083]] On the adaptation of in-context learners for system identification(http://arxiv.org/abs/2312.04083)
Summary:
In-context system identification aims at constructing meta-models to describe classes of systems, unlike traditional approaches that model single systems. This paradigm facilitates the leveraging of knowledge acquired from observing the behaviour of different, yet related dynamics. This paper discusses the role of meta-model adaptation. Through numerical examples, we demonstrate how meta-model adaptation can enhance predictive performance in three realistic scenarios: tailoring the meta-model to describe a specific system rather than a class; extending the meta-model to capture the behaviour of systems beyond the initial training class; and recalibrating the model for new prediction tasks. Results highlight the effectiveness of meta-model adaptation to achieve a more robust and versatile meta-learning framework for system identification.