diffusion

Title: MagicDance: Realistic Human Dance Video Generation with Motions & Facial Expressions Transfer. (arXiv:2311.12052v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.12052
Code URL: https://github.com/boese0601/magicdance
Copy Paste: [[2311.12052]] MagicDance: Realistic Human Dance Video Generation with Motions & Facial Expressions Transfer(http://arxiv.org/abs/2311.12052)
Summary:
In this work, we propose MagicDance, a diffusion-based model for 2D human motion and facial expression transfer on challenging human dance videos. Specifically, we aim to generate human dance videos of any target identity driven by novel pose sequences while keeping the identity unchanged. To this end, we propose a two-stage training strategy to disentangle human motions and appearance (e.g., facial expressions, skin tone and dressing), consisting of the pretraining of an appearance-control block and fine-tuning of an appearance-pose-joint-control block over human dance poses of the same dataset. Our novel design enables robust appearance control with temporally consistent upper body, facial attributes, and even background. The model also generalizes well on unseen human identities and complex motion sequences without the need for any fine-tuning with additional data with diverse human attributes by leveraging the prior knowledge of image diffusion models. Moreover, the proposed model is easy to use and can be considered as a plug-in module/extension to Stable Diffusion. We also demonstrate the model's ability for zero-shot 2D animation generation, enabling not only the appearance transfer from one identity to another but also allowing for cartoon-like stylization given only pose inputs. Extensive experiments demonstrate our superior performance on the TikTok dataset.

Title: Quality and Quantity: Unveiling a Million High-Quality Images for Text-to-Image Synthesis in Fashion Design. (arXiv:2311.12067v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.12067
Code URL: null
Copy Paste: [[2311.12067]] Quality and Quantity: Unveiling a Million High-Quality Images for Text-to-Image Synthesis in Fashion Design(http://arxiv.org/abs/2311.12067)
Summary:
The fusion of AI and fashion design has emerged as a promising research area. However, the lack of extensive, interrelated data on clothing and try-on stages has hindered the full potential of AI in this domain. Addressing this, we present the Fashion-Diffusion dataset, a product of multiple years' rigorous effort. This dataset, the first of its kind, comprises over a million high-quality fashion images, paired with detailed text descriptions. Sourced from a diverse range of geographical locations and cultural backgrounds, the dataset encapsulates global fashion trends. The images have been meticulously annotated with fine-grained attributes related to clothing and humans, simplifying the fashion design process into a Text-to-Image (T2I) task. The Fashion-Diffusion dataset not only provides high-quality text-image pairs and diverse human-garment pairs but also serves as a large-scale resource about humans, thereby facilitating research in T2I generation. Moreover, to foster standardization in the T2I-based fashion design field, we propose a new benchmark comprising multiple datasets for evaluating the performance of fashion design models. This work represents a significant leap forward in the realm of AI-driven fashion design, setting a new standard for future research in this field.

Title: Pyramid Diffusion for Fine 3D Large Scene Generation. (arXiv:2311.12085v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.12085
Code URL: null
Copy Paste: [[2311.12085]] Pyramid Diffusion for Fine 3D Large Scene Generation(http://arxiv.org/abs/2311.12085)
Summary:
Directly transferring the 2D techniques to 3D scene generation is challenging due to significant resolution reduction and the scarcity of comprehensive real-world 3D scene datasets. To address these issues, our work introduces the Pyramid Discrete Diffusion model (PDD) for 3D scene generation. This novel approach employs a multi-scale model capable of progressively generating high-quality 3D scenes from coarse to fine. In this way, the PDD can generate high-quality scenes within limited resource constraints and does not require additional data sources. To the best of our knowledge, we are the first to adopt the simple but effective coarse-to-fine strategy for 3D large scene generation. Our experiments, covering both unconditional and conditional generation, have yielded impressive results, showcasing the model's effectiveness and robustness in generating realistic and detailed 3D scenes. Our code will be available to the public.

Title: FrePolad: Frequency-Rectified Point Latent Diffusion for Point Cloud Generation. (arXiv:2311.12090v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.12090
Code URL: null
Copy Paste: [[2311.12090]] FrePolad: Frequency-Rectified Point Latent Diffusion for Point Cloud Generation(http://arxiv.org/abs/2311.12090)
Summary:
We propose FrePolad: frequency-rectified point latent diffusion, a point cloud generation pipeline integrating a variational autoencoder (VAE) with a denoising diffusion probabilistic model (DDPM) for the latent distribution. FrePolad simultaneously achieves high quality, diversity, and flexibility in point cloud cardinality for generation tasks while maintaining high computational efficiency. The improvement in generation quality and diversity is achieved through (1) a novel frequency rectification module via spherical harmonics designed to retain high-frequency content while learning the point cloud distribution; and (2) a latent DDPM to learn the regularized yet complex latent distribution. In addition, FrePolad supports variable point cloud cardinality by formulating the sampling of points as conditional distributions over a latent shape distribution. Finally, the low-dimensional latent space encoded by the VAE contributes to FrePolad's fast and scalable sampling. Our quantitative and qualitative results demonstrate the state-of-the-art performance of FrePolad in terms of quality, diversity, and computational efficiency.

Title: Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models. (arXiv:2311.12092v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.12092
Code URL: https://github.com/rohitgandikota/sliders
Copy Paste: [[2311.12092]] Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models(http://arxiv.org/abs/2311.12092)
Summary:
We present a method to create interpretable concept sliders that enable precise control over attributes in image generations from diffusion models. Our approach identifies a low-rank parameter direction corresponding to one concept while minimizing interference with other attributes. A slider is created using a small set of prompts or sample images; thus slider directions can be created for either textual or visual concepts. Concept Sliders are plug-and-play: they can be composed efficiently and continuously modulated, enabling precise control over image generation. In quantitative experiments comparing to previous editing techniques, our sliders exhibit stronger targeted edits with lower interference. We showcase sliders for weather, age, styles, and expressions, as well as slider compositions. We show how sliders can transfer latents from StyleGAN for intuitive editing of visual concepts for which textual description is difficult. We also find that our method can help address persistent quality issues in Stable Diffusion XL including repair of object deformations and fixing distorted hands. Our code, data, and trained sliders are available at https://sliders.baulab.info/

Title: Overcoming Pathology Image Data Deficiency: Generating Images from Pathological Transformation Process. (arXiv:2311.12316v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.12316
Code URL: https://github.com/rowerliu/adbd
Copy Paste: [[2311.12316]] Overcoming Pathology Image Data Deficiency: Generating Images from Pathological Transformation Process(http://arxiv.org/abs/2311.12316)
Summary:
Histopathology serves as the gold standard for medical diagnosis but faces application limitations due to the shortage of medical resources. Leveraging deep learning, computer-aided diagnosis has the potential to alleviate the pathologist scarcity and provide timely clinical analysis. However, developing a reliable model generally necessitates substantial data for training, which is challenging in pathological field. In response, we propose an adaptive depth-controlled bidirectional diffusion (ADBD) network for image data generation. The domain migration approach can work with small trainset and overcome the diffusion overfitting by source information guidance. Specifically, we developed a hybrid attention strategy to blend global and local attention priorities, which guides the bidirectional diffusion and ensures the migration success. In addition, we developed the adaptive depth-controlled strategy to simulate physiological transformations, capable of yielding unlimited cross-domain intermediate images with corresponding soft labels. ADBD is effective for overcoming pathological image data deficiency and supportable for further pathology-related research.

Title: LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis. (arXiv:2311.12342v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.12342
Code URL: null
Copy Paste: [[2311.12342]] LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis(http://arxiv.org/abs/2311.12342)
Summary:
Recent text-to-image diffusion models have reached an unprecedented level in generating high-quality images. However, their exclusive reliance on textual prompts often falls short in accurately conveying fine-grained spatial compositions. In this paper, we propose LoCo, a training-free approach for layout-to-image synthesis that excels in producing high-quality images aligned with both textual prompts and spatial layouts. Our method introduces a Localized Attention Constraint to refine cross-attention for individual objects, ensuring their precise placement in designated regions. We further propose a Padding Token Constraint to leverage the semantic information embedded in previously neglected padding tokens, thereby preventing the undesired fusion of synthesized objects. LoCo seamlessly integrates into existing text-to-image and layout-to-image models, significantly amplifying their performance and effectively addressing semantic failures observed in prior methods. Through extensive experiments, we showcase the superiority of our approach, surpassing existing state-of-the-art training-free layout-to-image methods both qualitatively and quantitatively across multiple benchmarks.

Title: Stable Diffusion For Aerial Object Detection. (arXiv:2311.12345v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.12345
Code URL: null
Copy Paste: [[2311.12345]] Stable Diffusion For Aerial Object Detection(http://arxiv.org/abs/2311.12345)
Summary:
Aerial object detection is a challenging task, in which one major obstacle lies in the limitations of large-scale data collection and the long-tail distribution of certain classes. Synthetic data offers a promising solution, especially with recent advances in diffusion-based methods like stable diffusion (SD). However, the direct application of diffusion methods to aerial domains poses unique challenges: stable diffusion's optimization for rich ground-level semantics doesn't align with the sparse nature of aerial objects, and the extraction of post-synthesis object coordinates remains problematic. To address these challenges, we introduce a synthetic data augmentation framework tailored for aerial images. It encompasses sparse-to-dense region of interest (ROI) extraction to bridge the semantic gap, fine-tuning the diffusion model with low-rank adaptation (LORA) to circumvent exhaustive retraining, and finally, a Copy-Paste method to compose synthesized objects with backgrounds, providing a nuanced approach to aerial object detection through synthetic data.

Title: GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning. (arXiv:2311.12631v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.12631
Code URL: null
Copy Paste: [[2311.12631]] GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning(http://arxiv.org/abs/2311.12631)
Summary:
Recent advances in text-to-video generation have harnessed the power of diffusion models to create visually compelling content conditioned on text prompts. However, they usually encounter high computational costs and often struggle to produce videos with coherent physical motions. To tackle these issues, we propose GPT4Motion, a training-free framework that leverages the planning capability of large language models such as GPT, the physical simulation strength of Blender, and the excellent image generation ability of text-to-image diffusion models to enhance the quality of video synthesis. Specifically, GPT4Motion employs GPT-4 to generate a Blender script based on a user textual prompt, which commands Blender's built-in physics engine to craft fundamental scene components that encapsulate coherent physical motions across frames. Then these components are inputted into Stable Diffusion to generate a video aligned with the textual prompt. Experimental results on three basic physical motion scenarios, including rigid object drop and collision, cloth draping and swinging, and liquid flow, demonstrate that GPT4Motion can generate high-quality videos efficiently in maintaining motion coherency and entity consistency. GPT4Motion offers new insights in text-to-video research, enhancing its quality and broadening its horizon for future explorations.

Title: EditShield: Protecting Unauthorized Image Editing by Instruction-guided Diffusion Models. (arXiv:2311.12066v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2311.12066
Code URL: null
Copy Paste: [[2311.12066]] EditShield: Protecting Unauthorized Image Editing by Instruction-guided Diffusion Models(http://arxiv.org/abs/2311.12066)
Summary:
Text-to-image diffusion models have emerged as an evolutionary for producing creative content in image synthesis. Based on the impressive generation abilities of these models, instruction-guided diffusion models can edit images with simple instructions and input images. While they empower users to obtain their desired edited images with ease, they have raised concerns about unauthorized image manipulation. Prior research has delved into the unauthorized use of personalized diffusion models; however, this problem of instruction-guided diffusion models remains largely unexplored. In this paper, we first propose a protection method EditShield against unauthorized modifications from such models. Specifically, EditShield works by adding imperceptible perturbations that can shift the latent representation used in the diffusion process, forcing models to generate unrealistic images with mismatched subjects. Our extensive experiments demonstrate EditShield's effectiveness among synthetic and real-world datasets. Besides, EditShield also maintains robustness against various editing types and synonymous instruction phrases.

self-supervised

Title: Kuro Siwo: 12.1 billion $m^2$ under the water. A global multi-temporal satellite dataset for rapid flood mapping. (arXiv:2311.12056v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.12056
Code URL: null
Copy Paste: [[2311.12056]] Kuro Siwo: 12(http://arxiv.org/abs/2311.12056)
Summary:
Global floods, exacerbated by climate change, pose severe threats to human life, infrastructure, and the environment. This urgency is highlighted by recent catastrophic events in Pakistan and New Zealand, underlining the critical need for precise flood mapping for guiding restoration efforts, understanding vulnerabilities, and preparing for future events. While Synthetic Aperture Radar (SAR) offers day-and-night, all-weather imaging capabilities, harnessing it for deep learning is hindered by the absence of a large annotated dataset. To bridge this gap, we introduce Kuro Siwo, a meticulously curated multi-temporal dataset, spanning 32 flood events globally. Our dataset maps more than 63 billion m2 of land, with 12.1 billion of them being either a flooded area or a permanent water body. Kuro Siwo stands out for its unparalleled annotation quality to facilitate rapid flood mapping in a supervised setting. We also augment learning by including a large unlabeled set of SAR samples, aimed at self-supervised pretraining. We provide an extensive benchmark and strong baselines for a diverse set of flood events from Europe, America, Africa and Australia. Our benchmark demonstrates the quality of Kuro Siwo annotations, training models that can achieve $\approx$ 85% and $\approx$ 87% in F1-score for flooded areas and general water detection respectively. This work calls on the deep learning community to develop solution-driven algorithms for rapid flood mapping, with the potential to aid civil protection and humanitarian agencies amid climate change challenges. Our code and data will be made available at https://github.com/Orion-AI-Lab/KuroSiwo

Title: Leveraging Unlabeled Data for 3D Medical Image Segmentation through Self-Supervised Contrastive Learning. (arXiv:2311.12617v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.12617
Code URL: null
Copy Paste: [[2311.12617]] Leveraging Unlabeled Data for 3D Medical Image Segmentation through Self-Supervised Contrastive Learning(http://arxiv.org/abs/2311.12617)
Summary:
Current 3D semi-supervised segmentation methods face significant challenges such as limited consideration of contextual information and the inability to generate reliable pseudo-labels for effective unsupervised data use. To address these challenges, we introduce two distinct subnetworks designed to explore and exploit the discrepancies between them, ultimately correcting the erroneous prediction results. More specifically, we identify regions of inconsistent predictions and initiate a targeted verification training process. This procedure strategically fine-tunes and harmonizes the predictions of the subnetworks, leading to enhanced utilization of contextual information. Furthermore, to adaptively fine-tune the network's representational capacity and reduce prediction uncertainty, we employ a self-supervised contrastive learning paradigm. For this, we use the network's confidence to distinguish between reliable and unreliable predictions. The model is then trained to effectively minimize unreliable predictions. Our experimental results for organ segmentation, obtained from clinical MRI and CT scans, demonstrate the effectiveness of our approach when compared to state-of-the-art methods. The codebase is accessible on \href{https://github.com/xmindflow/SSL-contrastive}{GitHub}.

Title: Bridging Generalization Gaps in High Content Imaging Through Online Self-Supervised Domain Adaptation. (arXiv:2311.12623v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.12623
Code URL: null
Copy Paste: [[2311.12623]] Bridging Generalization Gaps in High Content Imaging Through Online Self-Supervised Domain Adaptation(http://arxiv.org/abs/2311.12623)
Summary:
High Content Imaging (HCI) plays a vital role in modern drug discovery and development pipelines, facilitating various stages from hit identification to candidate drug characterization. Applying machine learning models to these datasets can prove challenging as they typically consist of multiple batches, affected by experimental variation, especially if different imaging equipment have been used. Moreover, as new data arrive, it is preferable that they are analyzed in an online fashion. To overcome this, we propose CODA, an online self-supervised domain adaptation approach. CODA divides the classifier's role into a generic feature extractor and a task-specific model. We adapt the feature extractor's weights to the new domain using cross-batch self-supervision while keeping the task-specific model unchanged. Our results demonstrate that this strategy significantly reduces the generalization gap, achieving up to a 300% improvement when applied to data from different labs utilizing different microscopes. CODA can be applied to new, unlabeled out-of-domain data sources of different sizes, from a single plate to multiple experimental batches.

Title: SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction. (arXiv:2311.12754v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.12754
Code URL: https://github.com/huang-yh/selfocc
Copy Paste: [[2311.12754]] SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction(http://arxiv.org/abs/2311.12754)
Summary:
3D occupancy prediction is an important task for the robustness of vision-centric autonomous driving, which aims to predict whether each point is occupied in the surrounding 3D space. Existing methods usually require 3D occupancy labels to produce meaningful results. However, it is very laborious to annotate the occupancy status of each voxel. In this paper, we propose SelfOcc to explore a self-supervised way to learn 3D occupancy using only video sequences. We first transform the images into the 3D space (e.g., bird's eye view) to obtain 3D representation of the scene. We directly impose constraints on the 3D representations by treating them as signed distance fields. We can then render 2D images of previous and future frames as self-supervision signals to learn the 3D representations. We propose an MVS-embedded strategy to directly optimize the SDF-induced weights with multiple depth proposals. Our SelfOcc outperforms the previous best method SceneRF by 58.7% using a single frame as input on SemanticKITTI and is the first self-supervised work that produces reasonable 3D occupancy for surround cameras on Occ3D. SelfOcc produces high-quality depth and achieves state-of-the-art results on novel depth synthesis, monocular depth estimation, and surround-view depth estimation on the SemanticKITTI, KITTI-2015, and nuScenes, respectively. Code: https://github.com/huang-yh/SelfOcc.

Title: Enabling On-Device Large Language Model Personalization with Self-Supervised Data Selection and Synthesis. (arXiv:2311.12275v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.12275
Code URL: null
Copy Paste: [[2311.12275]] Enabling On-Device Large Language Model Personalization with Self-Supervised Data Selection and Synthesis(http://arxiv.org/abs/2311.12275)
Summary:
After a large language model (LLM) is deployed on edge devices, it is desirable for these devices to learn from user-generated conversation data to generate user-specific and personalized responses in real-time. However, user-generated data usually contains sensitive and private information, and uploading such data to the cloud for annotation is not preferred if not prohibited. While it is possible to obtain annotation locally by directly asking users to provide preferred responses, such annotations have to be sparse to not affect user experience. In addition, the storage of edge devices is usually too limited to enable large-scale fine-tuning with full user-generated data. It remains an open question how to enable on-device LLM personalization, considering sparse annotation and limited on-device storage. In this paper, we propose a novel framework to select and store the most representative data online in a self-supervised way. Such data has a small memory footprint and allows infrequent requests of user annotations for further fine-tuning. To enhance fine-tuning quality, multiple semantically similar pairs of question texts and expected responses are generated using the LLM. Our experiments show that the proposed framework achieves the best user-specific content-generating capability (accuracy) and fine-tuning speed (performance) compared with vanilla baselines. To the best of our knowledge, this is the very first on-device LLM personalization framework.

Title: Contrastive Left-Right Wearable Sensors (IMUs) Consistency Matching for HAR. (arXiv:2311.12674v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.12674
Code URL: null
Copy Paste: [[2311.12674]] Contrastive Left-Right Wearable Sensors (IMUs) Consistency Matching for HAR(http://arxiv.org/abs/2311.12674)
Summary:
Machine learning algorithms are improving rapidly, but annotating training data remains a bottleneck for many applications. In this paper, we show how real data can be used for self-supervised learning without any transformations by taking advantage of the symmetry present in the activities. Our approach involves contrastive matching of two different sensors (left and right wrist or leg-worn IMUs) to make representations of co-occurring sensor data more similar and those of non-co-occurring sensor data more different. We test our approach on the Opportunity and MM-Fit datasets. In MM-Fit we show significant improvement over the baseline supervised and self-supervised method SimCLR, while for Opportunity there is significant improvement over the supervised baseline and slight improvement when compared to SimCLR. Moreover, our method improves supervised baselines even when using only a small amount of the data for training. Future work should explore under which conditions our method is beneficial for human activity recognition systems and other related applications.

foundation model

Title: Applications of Large Scale Foundation Models for Autonomous Driving. (arXiv:2311.12144v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.12144
Code URL: null
Copy Paste: [[2311.12144]] Applications of Large Scale Foundation Models for Autonomous Driving(http://arxiv.org/abs/2311.12144)
Summary:
Since DARPA Grand Challenges (rural) in 2004/05 and Urban Challenges in 2007, autonomous driving has been the most active field of AI applications. Recently powered by large language models (LLMs), chat systems, such as chatGPT and PaLM, emerge and rapidly become a promising direction to achieve artificial general intelligence (AGI) in natural language processing (NLP). There comes a natural thinking that we could employ these abilities to reformulate autonomous driving. By combining LLM with foundation models, it is possible to utilize the human knowledge, commonsense and reasoning to rebuild autonomous driving systems from the current long-tailed AI dilemma. In this paper, we investigate the techniques of foundation models and LLMs applied for autonomous driving, categorized as simulation, world model, data annotation and planning or E2E solutions etc.

Title: Point, Segment and Count: A Generalized Framework for Object Counting. (arXiv:2311.12386v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.12386
Code URL: null
Copy Paste: [[2311.12386]] Point, Segment and Count: A Generalized Framework for Object Counting(http://arxiv.org/abs/2311.12386)
Summary:
Class-agnostic object counting aims to count all objects in an image with respect to example boxes or class names, \emph{a.k.a} few-shot and zero-shot counting. Current state-of-the-art methods highly rely on density maps to predict object counts, which lacks model interpretability. In this paper, we propose a generalized framework for both few-shot and zero-shot object counting based on detection. Our framework combines the superior advantages of two foundation models without compromising their zero-shot capability: (\textbf{i}) SAM to segment all possible objects as mask proposals, and (\textbf{ii}) CLIP to classify proposals to obtain accurate object counts. However, this strategy meets the obstacles of efficiency overhead and the small crowded objects that cannot be localized and distinguished. To address these issues, our framework, termed PseCo, follows three steps: point, segment, and count. Specifically, we first propose a class-agnostic object localization to provide accurate but least point prompts for SAM, which consequently not only reduces computation costs but also avoids missing small objects. Furthermore, we propose a generalized object classification that leverages CLIP image/text embeddings as the classifier, following a hierarchical knowledge distillation to obtain discriminative classifications among hierarchical mask proposals. Extensive experimental results on FSC-147 dataset demonstrate that PseCo achieves state-of-the-art performance in both few-shot/zero-shot object counting/detection, with additional results on large-scale COCO and LVIS datasets. The source code is available at \url{https://github.com/Hzzone/PseCo}.

Title: AcademicGPT: Empowering Academic Research. (arXiv:2311.12315v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.12315
Code URL: null
Copy Paste: [[2311.12315]] AcademicGPT: Empowering Academic Research(http://arxiv.org/abs/2311.12315)
Summary:
Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks. Yet, many of these advanced LLMs are tailored for broad, general-purpose applications. In this technical report, we introduce AcademicGPT, designed specifically to empower academic research. AcademicGPT is a continual training model derived from LLaMA2-70B. Our training corpus mainly consists of academic papers, thesis, content from some academic domain, high-quality Chinese data and others. While it may not be extensive in data scale, AcademicGPT marks our initial venture into a domain-specific GPT tailored for research area. We evaluate AcademicGPT on several established public benchmarks such as MMLU and CEval, as well as on some specialized academic benchmarks like PubMedQA, SCIEval, and our newly-created ComputerScienceQA, to demonstrate its ability from general knowledge ability, to Chinese ability, and to academic ability. Building upon AcademicGPT's foundation model, we also developed several applications catered to the academic area, including General Academic Question Answering, AI-assisted Paper Reading, Paper Review, and AI-assisted Title and Abstract Generation.

Title: nach0: Multimodal Natural and Chemical Languages Foundation Model. (arXiv:2311.12410v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.12410
Code URL: null
Copy Paste: [[2311.12410]] nach0: Multimodal Natural and Chemical Languages Foundation Model(http://arxiv.org/abs/2311.12410)
Summary:
Large Language Models (LLMs) have substantially driven scientific progress in various domains, and many papers have demonstrated their ability to tackle complex problems with creative solutions. Our paper introduces a new foundation model, nach0, capable of solving various chemical and biological tasks: biomedical question answering, named entity recognition, molecular generation, molecular synthesis, attributes prediction, and others. nach0 is a multi-domain and multi-task encoder-decoder LLM pre-trained on unlabeled text from scientific literature, patents, and molecule strings to incorporate a range of chemical and linguistic knowledge. We employed instruction tuning, where specific task-related instructions are utilized to fine-tune nach0 for the final set of tasks. To train nach0 effectively, we leverage the NeMo framework, enabling efficient parallel optimization of both base and large model versions. Extensive experiments demonstrate that our model outperforms state-of-the-art baselines on single-domain and cross-domain tasks. Furthermore, it can generate high-quality outputs in molecular and textual formats, showcasing its effectiveness in multi-domain setups.

Title: A Supervised Contrastive Learning Pretrain-Finetune Approach for Time Series. (arXiv:2311.12290v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.12290
Code URL: null
Copy Paste: [[2311.12290]] A Supervised Contrastive Learning Pretrain-Finetune Approach for Time Series(http://arxiv.org/abs/2311.12290)
Summary:
Foundation models have recently gained attention within the field of machine learning thanks to its efficiency in broad data processing. While researchers had attempted to extend this success to time series models, the main challenge is effectively extracting representations and transferring knowledge from pretraining datasets to the target finetuning dataset. To tackle this issue, we introduce a novel pretraining procedure that leverages supervised contrastive learning to distinguish features within each pretraining dataset. This pretraining phase enables a probabilistic similarity metric, which assesses the likelihood of a univariate sample being closely related to one of the pretraining datasets. Subsequently, using this similarity metric as a guide, we propose a fine-tuning procedure designed to enhance the accurate prediction of the target data by aligning it more closely with the learned dynamics of the pretraining datasets. Our experiments have shown promising results which demonstrate the efficacy of our approach.

generative

Title: Efficient Domain Adaptation via Generative Prior for 3D Infant Pose Estimation. (arXiv:2311.12043v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.12043
Code URL: null
Copy Paste: [[2311.12043]] Efficient Domain Adaptation via Generative Prior for 3D Infant Pose Estimation(http://arxiv.org/abs/2311.12043)
Summary:
Although 3D human pose estimation has gained impressive development in recent years, only a few works focus on infants, that have different bone lengths and also have limited data. Directly applying adult pose estimation models typically achieves low performance in the infant domain and suffers from out-of-distribution issues. Moreover, the limitation of infant pose data collection also heavily constrains the efficiency of learning-based models to lift 2D poses to 3D. To deal with the issues of small datasets, domain adaptation and data augmentation are commonly used techniques. Following this paradigm, we take advantage of an optimization-based method that utilizes generative priors to predict 3D infant keypoints from 2D keypoints without the need of large training data. We further apply a guided diffusion model to domain adapt 3D adult pose to infant pose to supplement small datasets. Besides, we also prove that our method, ZeDO-i, could attain efficient domain adaptation, even if only a small number of data is given. Quantitatively, we claim that our model attains state-of-the-art MPJPE performance of 43.6 mm on the SyRIP dataset and 21.2 mm on the MINI-RGBD dataset.

Title: DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields. (arXiv:2311.12063v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.12063
Code URL: null
Copy Paste: [[2311.12063]] DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields(http://arxiv.org/abs/2311.12063)
Summary:
Progress in 3D computer vision tasks demands a huge amount of data, yet annotating multi-view images with 3D-consistent annotations, or point clouds with part segmentation is both time-consuming and challenging. This paper introduces DatasetNeRF, a novel approach capable of generating infinite, high-quality 3D-consistent 2D annotations alongside 3D point cloud segmentations, while utilizing minimal 2D human-labeled annotations. Specifically, we leverage the strong semantic prior within a 3D generative model to train a semantic decoder, requiring only a handful of fine-grained labeled samples. Once trained, the decoder efficiently generalizes across the latent space, enabling the generation of infinite data. The generated data is applicable across various computer vision tasks, including video segmentation and 3D point cloud segmentation. Our approach not only surpasses baseline models in segmentation quality, achieving superior 3D consistency and segmentation precision on individual images, but also demonstrates versatility by being applicable to both articulated and non-articulated generative models. Furthermore, we explore applications stemming from our approach, such as 3D-aware semantic editing and 3D inversion.

Title: Rich and Poor Texture Contrast: A Simple yet Effective Approach for AI-generated Image Detection. (arXiv:2311.12397v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.12397
Code URL: null
Copy Paste: [[2311.12397]] Rich and Poor Texture Contrast: A Simple yet Effective Approach for AI-generated Image Detection(http://arxiv.org/abs/2311.12397)
Summary:
Recent generative models show impressive performance in generating photographic images. Humans can hardly distinguish such incredibly realistic-looking AI-generated images from real ones. AI-generated images may lead to ubiquitous disinformation dissemination. Therefore, it is of utmost urgency to develop a detector to identify AI-generated images. Most existing detectors suffer from sharp performance drops over unseen generative models. In this paper, we propose a novel AI-generated image detector capable of identifying fake images created by a wide range of generative models. Our approach leverages the inter-pixel correlation contrast between rich and poor texture regions within an image. Pixels in rich texture regions exhibit more significant fluctuations than those in poor texture regions. This discrepancy reflects that the entropy of rich texture regions is larger than that of poor ones. Consequently, synthesizing realistic rich texture regions proves to be more challenging for existing generative models. Based on this principle, we divide an image into multiple patches and reconstruct them into two images, comprising rich-texture and poor-texture patches respectively. Subsequently, we extract the inter-pixel correlation discrepancy feature between rich and poor texture regions. This feature serves as a universal fingerprint used for AI-generated image forensics across different generative models. In addition, we build a comprehensive AI-generated image detection benchmark, which includes 16 kinds of prevalent generative models, to evaluate the effectiveness of existing baselines and our approach. Our benchmark provides a leaderboard for follow-up studies. Extensive experimental results show that our approach outperforms state-of-the-art baselines by a significant margin. Our project: https://fdmas.github.io/AIGCDetect/

Title: Visual Analytics for Generative Transformer Models. (arXiv:2311.12418v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.12418
Code URL: null
Copy Paste: [[2311.12418]] Visual Analytics for Generative Transformer Models(http://arxiv.org/abs/2311.12418)
Summary:
While transformer-based models have achieved state-of-the-art results in a variety of classification and generation tasks, their black-box nature makes them challenging for interpretability. In this work, we present a novel visual analytical framework to support the analysis of transformer-based generative networks. In contrast to previous work, which has mainly focused on encoder-based models, our framework is one of the first dedicated to supporting the analysis of transformer-based encoder-decoder models and decoder-only models for generative and classification tasks. Hence, we offer an intuitive overview that allows the user to explore different facets of the model through interactive visualization. To demonstrate the feasibility and usefulness of our framework, we present three detailed case studies based on real-world NLP research problems.

Title: Explainable Anomaly Detection using Masked Latent Generative Modeling. (arXiv:2311.12550v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.12550
Code URL: null
Copy Paste: [[2311.12550]] Explainable Anomaly Detection using Masked Latent Generative Modeling(http://arxiv.org/abs/2311.12550)
Summary:
We present a novel time series anomaly detection method that achieves excellent detection accuracy while offering a superior level of explainability. Our proposed method, TimeVQVAE-AD, leverages masked generative modeling adapted from the cutting-edge time series generation method known as TimeVQVAE. The prior model is trained on the discrete latent space of a time-frequency domain. Notably, the dimensional semantics of the time-frequency domain are preserved in the latent space, enabling us to compute anomaly scores across different frequency bands, which provides a better insight into the detected anomalies. Additionally, the generative nature of the prior model allows for sampling likely normal states for detected anomalies, enhancing the explainability of the detected anomalies through counterfactuals. Our experimental evaluation on the UCR Time Series Anomaly archive demonstrates that TimeVQVAE-AD significantly surpasses the existing methods in terms of detection accuracy and explainability.

anomaly

in-context

Title: Few-Shot Classification & Segmentation Using Large Language Models Agent. (arXiv:2311.12065v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.12065
Code URL: null
Copy Paste: [[2311.12065]] Few-Shot Classification & Segmentation Using Large Language Models Agent(http://arxiv.org/abs/2311.12065)
Summary:
The task of few-shot image classification and segmentation (FS-CS) requires the classification and segmentation of target objects in a query image, given only a few examples of the target classes. We introduce a method that utilises large language models (LLM) as an agent to address the FS-CS problem in a training-free manner. By making the LLM the task planner and off-the-shelf vision models the tools, the proposed method is capable of classifying and segmenting target objects using only image-level labels. Specifically, chain-of-thought prompting and in-context learning guide the LLM to observe support images like human; vision models such as Segment Anything Model (SAM) and GPT-4Vision assist LLM understand spatial and semantic information at the same time. Ultimately, the LLM uses its summarizing and reasoning capabilities to classify and segment the query image. The proposed method's modular framework makes it easily extendable. Our approach achieves state-of-the-art performance on the Pascal-5i dataset.

Title: In-Context Learning Functions with Varying Number of Minima. (arXiv:2311.12538v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.12538
Code URL: https://github.com/pittnail/icl-minima
Copy Paste: [[2311.12538]] In-Context Learning Functions with Varying Number of Minima(http://arxiv.org/abs/2311.12538)
Summary:
Large Language Models (LLMs) have proven effective at In-Context Learning (ICL), an ability that allows them to create predictors from labeled examples. Few studies have explored the interplay between ICL and specific properties of functions it attempts to approximate. In our study, we use a formal framework to explore ICL and propose a new task of approximating functions with varying number of minima. We implement a method that allows for producing functions with given inputs as minima. We find that increasing the number of minima degrades ICL performance. At the same time, our evaluation shows that ICL outperforms 2-layer Neural Network (2NN) model. Furthermore, ICL learns faster than 2NN in all settings. We validate the findings through a set of few-shot experiments across various hyperparameter configurations.

Title: The DURel Annotation Tool: Human and Computational Measurement of Semantic Proximity, Sense Clusters and Semantic Change. (arXiv:2311.12664v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.12664
Code URL: null
Copy Paste: [[2311.12664]] The DURel Annotation Tool: Human and Computational Measurement of Semantic Proximity, Sense Clusters and Semantic Change(http://arxiv.org/abs/2311.12664)
Summary:
We present the DURel tool that implements the annotation of semantic proximity between uses of words into an online, open source interface. The tool supports standardized human annotation as well as computational annotation, building on recent advances with Word-in-Context models. Annotator judgments are clustered with automatic graph clustering techniques and visualized for analysis. This allows to measure word senses with simple and intuitive micro-task judgments between use pairs, requiring minimal preparation efforts. The tool offers additional functionalities to compare the agreement between annotators to guarantee the inter-subjectivity of the obtained judgments and to calculate summary statistics giving insights into sense frequency distributions, semantic variation or changes of senses over time.

Title: Looped Transformers are Better at Learning Learning Algorithms. (arXiv:2311.12424v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.12424
Code URL: null
Copy Paste: [[2311.12424]] Looped Transformers are Better at Learning Learning Algorithms(http://arxiv.org/abs/2311.12424)
Summary:
Transformers have demonstrated effectiveness in \emph{in-context solving} data-fitting problems from various (latent) models, as reported by Garg et al. However, the absence of an inherent iterative structure in the transformer architecture presents a challenge in emulating the iterative algorithms, which are commonly employed in traditional machine learning methods. To address this, we propose the utilization of \emph{looped} transformer architecture and its associated training methodology, with the aim of incorporating iterative characteristics into the transformer architectures. Experimental results suggest that the looped transformer achieves performance comparable to the standard transformer in solving various data-fitting problems, while utilizing less than 10\% of the parameter count.