diffusion

Title: Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models. (arXiv:2311.16117v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16117
Code URL: null
Copy Paste: [[2311.16117]] Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models(http://arxiv.org/abs/2311.16117)
Summary:
Diffusion models have achieved remarkable results in generating high-quality, diverse, and creative images. However, when it comes to text-based image generation, they often fail to capture the intended meaning presented in the text. For instance, a specified object may not be generated, an unnecessary object may be generated, and an adjective may alter objects it was not intended to modify. Moreover, we found that relationships indicating possession between objects are often overlooked. While users' intentions in text are diverse, existing methods tend to specialize in only some aspects of these. In this paper, we propose Predicated Diffusion, a unified framework to express users' intentions. We consider that the root of the above issues lies in the text encoder, which often focuses only on individual words and neglects the logical relationships between them. The proposed method does not solely rely on the text encoder, but instead, represents the intended meaning in the text as propositions using predicate logic and treats the pixels in the attention maps as the fuzzy predicates. This enables us to obtain a differentiable loss function that makes the image fulfill the proposition by minimizing it. When compared to several existing methods, we demonstrated that Predicated Diffusion can generate images that are more faithful to various text prompts, as verified by human evaluators and pretrained image-text models.
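
Illustrative sketch (not taken from the paper): the abstract gives no formulas, so the following assumes product fuzzy logic as one plausible reading of "pixels in the attention maps as fuzzy predicates". Under that assumption, the proposition "some pixel shows the object" has truth value 1 - prod(1 - a_i), and its negative log is a differentiable loss. The names exists_truth and existence_loss are hypothetical.

```python
import math

def exists_truth(attention):
    """Fuzzy truth of 'there exists a pixel where the predicate holds'.

    attention: flat list of values in [0, 1], one per pixel.
    Product logic is an assumption made for this illustration.
    """
    none_hold = 1.0
    for a in attention:
        none_hold *= (1.0 - a)   # probability-like score that no pixel holds
    return 1.0 - none_hold

def existence_loss(attention, eps=1e-12):
    """Negative log of the truth value; small when some pixel attends strongly."""
    return -math.log(exists_truth(attention) + eps)

weak = [0.01, 0.02, 0.01, 0.03]    # object barely attended anywhere
strong = [0.01, 0.90, 0.02, 0.03]  # one pixel strongly attends to the object
print(existence_loss(weak), existence_loss(strong))
```

Minimizing such a loss with respect to the latent would push at least one attention value toward 1, which is the sense in which an image is made to "fulfill the proposition".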

Title: Effective Quantization for Diffusion Models on CPUs. (arXiv:2311.16133v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16133
Code URL: null
Copy Paste: [[2311.16133]] Effective Quantization for Diffusion Models on CPUs(http://arxiv.org/abs/2311.16133)
Summary:
Diffusion models have gained popularity for generating images from textual descriptions. Nonetheless, the substantial need for computational resources continues to present a noteworthy challenge, contributing to time-consuming processes. Quantization, a technique employed to compress deep learning models for enhanced efficiency, presents challenges when applied to diffusion models. These models are notably more sensitive to quantization compared to other model types, potentially resulting in a degradation of image quality. In this paper, we introduce a novel approach to quantize the diffusion models by leveraging both quantization-aware training and distillation. Our results show the quantized models can maintain the high image quality while demonstrating the inference efficiency on CPUs.
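
Illustrative sketch (not the paper's scheme): quantization-aware training commonly simulates low precision in the forward pass by quantizing and immediately dequantizing values ("fake quantization"). The example below assumes symmetric uniform quantization with a single scale; the function name and bit width are hypothetical choices for illustration.

```python
def fake_quantize(values, num_bits=8):
    """Symmetric uniform quantize-then-dequantize of a list of floats."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)                 # nothing to scale
    scale = max_abs / qmax                  # real value per integer step
    out = []
    for v in values:
        q = round(v / scale)                # nearest integer level
        q = max(-qmax, min(qmax, q))        # clamp to the representable range
        out.append(q * scale)               # map back to float
    return out

weights = [0.5, -1.27, 0.003, 1.0]
approx = fake_quantize(weights)
max_err = max(abs(a - b) for a, b in zip(weights, approx))
print(max_err)  # bounded by half a quantization step
```

The rounding error per value is at most half a step, which is the perturbation a quantization-aware training loop exposes the model to.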

Title: Shortcut Bias Mitigation via Ensemble Diversity Using Diffusion Probabilistic Models. (arXiv:2311.16176v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.16176
Code URL: null
Copy Paste: [[2311.16176]] Shortcut Bias Mitigation via Ensemble Diversity Using Diffusion Probabilistic Models(http://arxiv.org/abs/2311.16176)
Summary:
Spurious correlations in the data, where multiple cues are predictive of the target labels, often lead to a phenomenon known as simplicity bias, where a model relies on erroneous, easy-to-learn cues while ignoring reliable ones. In this work, we propose an ensemble diversification framework exploiting Diffusion Probabilistic Models (DPMs) for shortcut bias mitigation. We show that at particular training intervals, DPMs can generate images with novel feature combinations, even when trained on images displaying correlated input features. We leverage this crucial property to generate synthetic counterfactuals to increase model diversity via ensemble disagreement. We show that DPM-guided diversification is sufficient to remove dependence on primary shortcut cues, without a need for additional supervised signals. We further empirically quantify its efficacy on several diversification objectives, and finally show improved generalization and diversification performance on par with prior work that relies on auxiliary data collection.

Title: Improving Denoising Diffusion Probabilistic Models via Exploiting Shared Representations. (arXiv:2311.16353v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.16353
Code URL: null
Copy Paste: [[2311.16353]] Improving Denoising Diffusion Probabilistic Models via Exploiting Shared Representations(http://arxiv.org/abs/2311.16353)
Summary:
In this work, we address the challenge of multi-task image generation with limited data for denoising diffusion probabilistic models (DDPM), a class of generative models that produce high-quality images by reversing a noisy diffusion process. We propose a novel method, SR-DDPM, that leverages representation-based techniques from few-shot learning to effectively learn from fewer samples across different tasks. Our method consists of a core meta architecture with shared parameters and task-specific layers with exclusive parameters. By exploiting the similarity between diverse data distributions, our method can scale to multiple tasks without compromising the image quality. We evaluate our method on standard image datasets and show that it outperforms both unconditional and conditional DDPM in terms of FID and SSIM metrics.

Title: Manifold Preserving Guided Diffusion. (arXiv:2311.16424v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.16424
Code URL: null
Copy Paste: [[2311.16424]] Manifold Preserving Guided Diffusion(http://arxiv.org/abs/2311.16424)
Summary:
Despite the recent advancements, conditional image generation still faces challenges of cost, generalizability, and the need for task-specific training. In this paper, we propose Manifold Preserving Guided Diffusion (MPGD), a training-free conditional generation framework that leverages pretrained diffusion models and off-the-shelf neural networks with minimal additional inference cost for a broad range of tasks. Specifically, we leverage the manifold hypothesis to refine the guided diffusion steps and introduce a shortcut algorithm in the process. We then propose two methods for on-manifold training-free guidance using pre-trained autoencoders and demonstrate that our shortcut inherently preserves the manifolds when applied to latent diffusion models. Our experiments show that MPGD is efficient and effective for solving a variety of conditional generation applications in low-compute settings, and can consistently offer up to 3.8x speed-ups with the same number of diffusion steps while maintaining high sample quality compared to the baselines.

Title: TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering. (arXiv:2311.16465v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16465
Code URL: null
Copy Paste: [[2311.16465]] TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering(http://arxiv.org/abs/2311.16465)
Summary:
The diffusion model has been proven a powerful generative model in recent years, yet generating visual text remains a challenge. Several methods alleviated this issue by incorporating explicit text position and content as guidance on where and what text to render. However, these methods still suffer from several drawbacks, such as limited flexibility and automation, constrained capability of layout prediction, and restricted style diversity. In this paper, we present TextDiffuser-2, aiming to unleash the power of language models for text rendering. Firstly, we fine-tune a large language model for layout planning. The large language model is capable of automatically generating keywords for text rendering and also supports layout modification through chatting. Secondly, we utilize the language model within the diffusion model to encode the position and texts at the line level. Unlike previous methods that employed tight character-level guidance, this approach generates more diverse text images. We conduct extensive experiments and incorporate user studies involving human participants as well as GPT-4V, validating TextDiffuser-2's capacity to achieve a more rational text layout and generation with enhanced diversity. The code and model will be available at https://aka.ms/textdiffuser-2.

Title: Efficient Multimodal Diffusion Models Using Joint Data Infilling with Partially Shared U-Net. (arXiv:2311.16488v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16488
Code URL: null
Copy Paste: [[2311.16488]] Efficient Multimodal Diffusion Models Using Joint Data Infilling with Partially Shared U-Net(http://arxiv.org/abs/2311.16488)
Summary:
Recently, diffusion models have been used successfully to fit distributions for cross-modal data translation and multimodal data generation. However, these methods rely on extensive scaling, overlooking the inefficiency and interference between modalities. We develop the Partially Shared U-Net (PS-U-Net) architecture, an efficient multimodal diffusion model that allows text and image inputs to pass through dedicated layers and skip-connections for preserving modality-specific fine-grained details. Inspired by image inpainting, we also propose a new efficient multimodal sampling method that introduces new scenarios for conditional generation while only requiring a simple joint distribution to be learned. Our empirical exploration of the MS-COCO dataset demonstrates that our method generates multimodal text and image data with higher quality compared to existing multimodal diffusion models while having a comparable size, faster training, faster multimodal sampling, and more flexible generation.

Title: $Z^*$: Zero-shot Style Transfer via Attention Rearrangement. (arXiv:2311.16491v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16491
Code URL: null
Copy Paste: [[2311.16491]] $Z^*$: Zero-shot Style Transfer via Attention Rearrangement(http://arxiv.org/abs/2311.16491)
Summary:
Despite the remarkable progress in image style transfer, formulating style in the context of art is inherently subjective and challenging. In contrast to existing learning/tuning methods, this study shows that vanilla diffusion models can directly extract style information and seamlessly integrate the generative prior into the content image without retraining. Specifically, we adopt dual denoising paths to represent content/style references in latent space and then guide the content image denoising process with style latent codes. We further reveal that the cross-attention mechanism in latent diffusion models tends to blend the content and style images, resulting in stylized outputs that deviate from the original content image. To overcome this limitation, we introduce a cross-attention rearrangement strategy. Through theoretical analysis and experiments, we demonstrate the effectiveness and superiority of the diffusion-based Zero-shot Style Transfer via Attention Rearrangement, Z-STAR.

Title: Egocentric Whole-Body Motion Capture with FisheyeViT and Diffusion-Based Motion Refinement. (arXiv:2311.16495v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16495
Code URL: null
Copy Paste: [[2311.16495]] Egocentric Whole-Body Motion Capture with FisheyeViT and Diffusion-Based Motion Refinement(http://arxiv.org/abs/2311.16495)
Summary:
In this work, we explore egocentric whole-body motion capture using a single fisheye camera, which simultaneously estimates human body and hand motion. This task presents significant challenges due to three factors: the lack of high-quality datasets, fisheye camera distortion, and human body self-occlusion. To address these challenges, we propose a novel approach that leverages FisheyeViT to extract fisheye image features, which are subsequently converted into pixel-aligned 3D heatmap representations for 3D human body pose prediction. For hand tracking, we incorporate dedicated hand detection and hand pose estimation networks for regressing 3D hand poses. Finally, we develop a diffusion-based whole-body motion prior model to refine the estimated whole-body motion while accounting for joint uncertainties. To train these networks, we collect a large synthetic dataset, EgoWholeBody, comprising 840,000 high-quality egocentric images captured across a diverse range of whole-body motion sequences. Quantitative and qualitative evaluations demonstrate the effectiveness of our method in producing high-quality whole-body motion estimates from a single egocentric camera.

Title: MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model. (arXiv:2311.16498v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16498
Code URL: null
Copy Paste: [[2311.16498]] MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model(http://arxiv.org/abs/2311.16498)
Summary:
This paper studies the human image animation task, which aims to generate a video of a certain reference identity following a particular motion sequence. Existing animation works typically employ the frame-warping technique to animate the reference image towards the target motion. Despite achieving reasonable results, these approaches face challenges in maintaining temporal consistency throughout the animation due to the lack of temporal modeling and poor preservation of reference identity. In this work, we introduce MagicAnimate, a diffusion-based framework that aims at enhancing temporal consistency, preserving reference image faithfully, and improving animation fidelity. To achieve this, we first develop a video diffusion model to encode temporal information. Second, to maintain the appearance coherence across frames, we introduce a novel appearance encoder to retain the intricate details of the reference image. Leveraging these two innovations, we further employ a simple video fusion technique to encourage smooth transitions for long video animation. Empirical results demonstrate the superiority of our method over baseline approaches on two benchmarks. Notably, our approach outperforms the strongest baseline by over 38% in terms of video fidelity on the challenging TikTok dancing dataset. Code and model will be made available.

Title: Deceptive-Human: Prompt-to-NeRF 3D Human Generation with 3D-Consistent Synthetic Images. (arXiv:2311.16499v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16499
Code URL: null
Copy Paste: [[2311.16499]] Deceptive-Human: Prompt-to-NeRF 3D Human Generation with 3D-Consistent Synthetic Images(http://arxiv.org/abs/2311.16499)
Summary:
This paper presents Deceptive-Human, a novel Prompt-to-NeRF framework capitalizing on state-of-the-art control diffusion models (e.g., ControlNet) to generate a high-quality controllable 3D human NeRF. Different from direct 3D generative approaches, e.g., DreamFusion and DreamHuman, Deceptive-Human employs a progressive refinement technique to elevate the reconstruction quality. This is achieved by utilizing high-quality synthetic human images generated through the ControlNet with view-consistent loss. Our method is versatile and readily extensible, accommodating multimodal inputs, including a text prompt and additional data such as 3D mesh, poses, and seed images. The resulting 3D human NeRF model empowers the synthesis of highly photorealistic novel views from 360-degree perspectives. The key to our Deceptive-Human for hallucinating multi-view consistent synthetic human images lies in our progressive finetuning strategy. This strategy involves iteratively enhancing views using the provided multimodal inputs at each intermediate step to improve the human NeRF model. Within this iterative refinement process, view-dependent appearances are systematically eliminated to prevent interference with the underlying density estimation. Extensive qualitative and quantitative experimental comparisons show that our Deceptive-Human models achieve state-of-the-art application quality.

Title: LLMGA: Multimodal Large Language Model based Generation Assistant. (arXiv:2311.16500v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16500
Code URL: null
Copy Paste: [[2311.16500]] LLMGA: Multimodal Large Language Model based Generation Assistant(http://arxiv.org/abs/2311.16500)
Summary:
In this paper, we introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA), leveraging the vast reservoir of knowledge and proficiency in reasoning, comprehension, and response inherent in Large Language Models (LLMs) to assist users in image generation and editing. Diverging from existing approaches where Multimodal Large Language Models (MLLMs) generate fixed-size embeddings to control Stable Diffusion (SD), our LLMGA provides a detailed language generation prompt for precise control over SD. This not only augments LLM context understanding but also reduces noise in generation prompts, yields images with more intricate and precise content, and elevates the interpretability of the network. To this end, we curate a comprehensive dataset comprising prompt refinement, similar image generation, inpainting and outpainting, and visual question answering. Moreover, we propose a two-stage training scheme. In the first stage, we train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts. In the second stage, we optimize SD to align with the MLLM's generation prompts. Additionally, we propose a reference-based restoration network to alleviate texture, brightness, and contrast disparities between generated and preserved regions during image editing. Extensive results show that LLMGA has promising generative capabilities and can enable wider applications in an interactive manner.

Title: TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models. (arXiv:2311.16503v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16503
Code URL: null
Copy Paste: [[2311.16503]] TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models(http://arxiv.org/abs/2311.16503)
Summary:
The Diffusion model, a prevalent framework for image generation, encounters significant challenges in terms of broad applicability due to its extended inference times and substantial memory requirements. Efficient Post-training Quantization (PTQ) is pivotal for addressing these issues in traditional models. Different from traditional models, diffusion models heavily depend on the time-step $t$ to achieve satisfactory multi-round denoising. Usually, $t$ from the finite set $\{1, \ldots, T\}$ is encoded to a temporal feature by a few modules totally irrespective of the sampling data. However, existing PTQ methods do not optimize these modules separately. They adopt inappropriate reconstruction targets and complex calibration methods, resulting in a severe disturbance of the temporal feature and denoising trajectory, as well as a low compression efficiency. To solve these, we propose a Temporal Feature Maintenance Quantization (TFMQ) framework building upon a Temporal Information Block which is just related to the time-step $t$ and unrelated to the sampling data. Powered by the pioneering block design, we devise temporal information aware reconstruction (TIAR) and finite set calibration (FSC) to align the full-precision temporal features in a limited time. Equipped with the framework, we can maintain the most temporal information and ensure the end-to-end generation quality. Extensive experiments on various datasets and diffusion models prove our state-of-the-art results. Remarkably, our quantization approach, for the first time, achieves model performance nearly on par with the full-precision model under 4-bit weight quantization. Additionally, our method incurs almost no extra computational cost and accelerates quantization time by $2.0 \times$ on LSUN-Bedrooms $256 \times 256$ compared to previous works.
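
Illustrative sketch (not the paper's TIAR or FSC procedure): because the time-step comes from a finite set and the temporal feature does not depend on the sampled data, every temporal feature can be enumerated ahead of time. The example below assumes a standard sinusoidal time-step embedding and records its value range over all time-steps, as one might do before choosing a quantization range. Function names, the embedding dimension, and T=1000 are hypothetical.

```python
import math

def timestep_embedding(t, dim=8, max_period=10000.0):
    """Sinusoidal embedding of an integer time-step (a common DDPM-style choice)."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]

def calibrate_over_finite_set(T, dim=8):
    """Enumerate every t in {1, ..., T} and record the embedding value range."""
    lo, hi = float("inf"), float("-inf")
    for t in range(1, T + 1):
        emb = timestep_embedding(t, dim)
        lo, hi = min(lo, min(emb)), max(hi, max(emb))
    return lo, hi

lo, hi = calibrate_over_finite_set(T=1000)
print(lo, hi)  # range observed over the whole finite set, no image data needed
```

No sampled images enter this loop, which is what makes calibration over the finite set cheap compared with data-dependent calibration.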

Title: Exploring Straighter Trajectories of Flow Matching with Diffusion Guidance. (arXiv:2311.16507v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16507
Code URL: null
Copy Paste: [[2311.16507]] Exploring Straighter Trajectories of Flow Matching with Diffusion Guidance(http://arxiv.org/abs/2311.16507)
Summary:
Flow matching, a paradigm of generative modeling, achieves notable success across various domains. However, existing methods use either multi-round training or knowledge within minibatches, posing challenges in finding a favorable coupling strategy for straight trajectories. To address this issue, we propose a novel approach, Straighter trajectories of Flow Matching (StraightFM). It straightens trajectories with a coupling strategy guided by a diffusion model at the level of the entire distribution. First, we propose a coupling strategy to straighten trajectories, creating couplings between image and noise samples under diffusion model guidance. Second, StraightFM also integrates real data to enhance training, employing a neural network to parameterize another coupling process from images to noise samples. StraightFM is jointly optimized with couplings from the above two mutually complementary directions, resulting in straighter trajectories and enabling both one-step and few-step generation. Extensive experiments demonstrate that StraightFM yields high-quality samples with fewer steps. StraightFM generates visually appealing images with a lower FID among diffusion and traditional flow matching methods within 5 sampling steps when trained in pixel space. In the latent space (i.e., Latent Diffusion), StraightFM achieves a lower KID value compared to existing methods on the CelebA-HQ 256 dataset in fewer than 10 sampling steps.
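
Illustrative sketch (generic flow matching, not StraightFM's coupling strategy): the reason straight trajectories enable one-step generation can be seen with a linear path between a noise sample and an image. Along x_t = (1 - t) * noise + t * image the velocity is the constant image - noise, so a single Euler step from t = 0 to t = 1 lands on the image. All names and the toy 3-dimensional "image" are hypothetical.

```python
def interpolate(noise, image, t):
    """Point at time t on the straight path from noise (t=0) to image (t=1)."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, image)]

def velocity(noise, image):
    """Constant velocity of the straight path; the flow-matching regression target."""
    return [x - n for n, x in zip(noise, image)]

def euler_sample(noise, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for _ in range(steps):
        v = velocity_fn(x)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise = [0.2, -1.0, 0.5]
image = [1.0, 0.0, -0.5]
v = velocity(noise, image)
one_step = euler_sample(noise, lambda x: v, steps=1)
print(one_step)  # matches `image` up to floating-point rounding
```

With curved trajectories the velocity changes along the path, so a single Euler step would miss the target and more steps would be needed.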

Title: GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation. (arXiv:2311.16511v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16511
Code URL: null
Copy Paste: [[2311.16511]] GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation(http://arxiv.org/abs/2311.16511)
Summary:
While the recent advances in Multimodal Large Language Models (MLLMs) constitute a significant leap forward in the field, these models are predominantly confined to the realm of input-side multimodal comprehension, lacking the capacity for multimodal content generation. To fill this gap, we present GPT4Video, a unified multi-model framework that empowers Large Language Models (LLMs) with the capability of both video understanding and generation. Specifically, we develop an instruction-following-based approach integrated with the stable diffusion generative model, which has been demonstrated to handle video generation scenarios effectively and securely. GPT4Video offers the following benefits: 1) It exhibits impressive capabilities in both video understanding and generation scenarios. For example, GPT4Video outperforms Valley by 11.8% on the Video Question Answering task, and surpasses NExt-GPT by 2.3% on the Text to Video generation task. 2) It endows the LLM/MLLM with video generation capabilities without requiring additional training parameters and can flexibly interface with a wide range of models to perform video generation. 3) It maintains a safe and healthy conversation not only on the output side but also on the input side in an end-to-end manner. Quantitative and qualitative experiments demonstrate that GPT4Video holds the potential to function as an effective, safe and humanoid-like video assistant that can handle both video understanding and generation scenarios.

Title: CoSeR: Bridging Image and Language for Cognitive Super-Resolution. (arXiv:2311.16512v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16512
Code URL: null
Copy Paste: [[2311.16512]] CoSeR: Bridging Image and Language for Cognitive Super-Resolution(http://arxiv.org/abs/2311.16512)
Summary:
Existing super-resolution (SR) models primarily focus on restoring local texture details, often neglecting the global semantic information within the scene. This oversight can lead to the omission of crucial semantic details or the introduction of inaccurate textures during the recovery process. In our work, we introduce the Cognitive Super-Resolution (CoSeR) framework, empowering SR models with the capacity to comprehend low-resolution images. We achieve this by marrying image appearance and language understanding to generate a cognitive embedding, which not only activates prior information from large text-to-image diffusion models but also facilitates the generation of high-quality reference images to optimize the SR process. To further improve image fidelity, we propose a novel condition injection scheme called "All-in-Attention", consolidating all conditional information into a single module. Consequently, our method successfully restores semantically correct and photorealistic details, demonstrating state-of-the-art performance across multiple benchmarks.

Title: Fine-grained Appearance Transfer with Diffusion Models. (arXiv:2311.16513v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16513
Code URL: https://github.com/babahui/fine-grained-appearance-transfer
Copy Paste: [[2311.16513]] Fine-grained Appearance Transfer with Diffusion Models(http://arxiv.org/abs/2311.16513)
Summary:
Image-to-image translation (I2I), and particularly its subfield of appearance transfer, which seeks to alter the visual appearance between images while maintaining structural coherence, presents formidable challenges. Despite significant advancements brought by diffusion models, achieving fine-grained transfer remains complex, particularly in terms of retaining detailed structural elements and ensuring information fidelity. This paper proposes an innovative framework designed to surmount these challenges by integrating various aspects of semantic matching, appearance transfer, and latent deviation. A pivotal aspect of our approach is the strategic use of the predicted $x_0$ space by diffusion models within the latent space of diffusion processes. This is identified as a crucial element for the precise and natural transfer of fine-grained details. Our framework exploits this space to accomplish semantic alignment between source and target images, facilitating mask-wise appearance transfer for improved feature acquisition. A significant advancement of our method is the seamless integration of these features into the latent space, enabling more nuanced latent deviations without necessitating extensive model retraining or fine-tuning. The effectiveness of our approach is demonstrated through extensive experiments, which showcase its ability to adeptly handle fine-grained appearance transfers across a wide range of categories and domains. We provide our code at https://github.com/babahui/Fine-grained-Appearance-Transfer

Title: SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution. (arXiv:2311.16518v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16518
Code URL: https://github.com/cswry/seesr
Copy Paste: [[2311.16518]] SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution(http://arxiv.org/abs/2311.16518)
Summary:
Owing to their powerful generative priors, pre-trained text-to-image (T2I) diffusion models have become increasingly popular in solving the real-world image super-resolution problem. However, as a consequence of the heavy quality degradation of input low-resolution (LR) images, the destruction of local structures can lead to ambiguous image semantics. As a result, the content of the reproduced high-resolution image may have semantic errors, deteriorating the super-resolution performance. To address this issue, we present a semantics-aware approach to better preserve the semantic fidelity of generative real-world image super-resolution. First, we train a degradation-aware prompt extractor, which can generate accurate soft and hard semantic prompts even under strong degradation. The hard semantic prompts refer to the image tags, aiming to enhance the local perception ability of the T2I model, while the soft semantic prompts compensate for the hard ones to provide additional representation information. These semantic prompts can encourage the T2I model to generate detailed and semantically accurate results. Furthermore, during the inference process, we integrate the LR images into the initial sampling noise to mitigate the diffusion model's tendency to generate excessive random details. The experiments show that our method can reproduce more realistic image details and better preserve the semantics.
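
Illustrative sketch (an assumption about the mechanism, not confirmed by the abstract): one simple way to "integrate the LR images into the initial sampling noise" is to apply the standard forward-diffusion formula to the LR latent at the final time-step, start = sqrt(alpha_bar_T) * z_LR + sqrt(1 - alpha_bar_T) * eps, instead of starting from pure Gaussian noise. The function name, the toy latent, and the value of alpha_bar_T are hypothetical.

```python
import math
import random

def lr_initialized_noise(lr_latent, alpha_bar_T, rng):
    """Mix the LR latent into Gaussian noise using the forward-diffusion weights."""
    signal = math.sqrt(alpha_bar_T)        # weight kept from the LR latent
    sigma = math.sqrt(1.0 - alpha_bar_T)   # weight of the fresh Gaussian noise
    return [signal * z + sigma * rng.gauss(0.0, 1.0) for z in lr_latent]

rng = random.Random(0)                     # fixed seed for reproducibility
lr_latent = [0.8, -0.3, 0.1, 0.5]
start = lr_initialized_noise(lr_latent, alpha_bar_T=0.005, rng=rng)
print(start)  # mostly noise, with a small trace of the LR latent
```

When alpha_bar_T is close to 0 the start point is almost pure noise; the small retained signal is what biases sampling toward the LR content.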

Title: Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models. (arXiv:2311.16555v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16555
Code URL: null
Copy Paste: [[2311.16555]] Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models(http://arxiv.org/abs/2311.16555)
Summary:
Scene text detection techniques have garnered significant attention due to their wide-ranging applications. However, existing methods have a high demand for training data, and obtaining accurate human annotations is labor-intensive and time-consuming. As a solution, researchers have widely adopted synthetic text images as a complementary resource to real text images during pre-training. Yet there is still room for synthetic datasets to enhance the performance of scene text detectors. We contend that one main limitation of existing generation methods is the insufficient integration of foreground text with the background. To alleviate this problem, we present the Diffusion Model based Text Generator (DiffText), a pipeline that utilizes the diffusion model to seamlessly blend foreground text regions with the background's intrinsic features. Additionally, we propose two strategies to generate visually coherent text with fewer spelling errors. With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors. Extensive experiments on detecting horizontal, rotated, curved, and line-level texts demonstrate the effectiveness of DiffText in producing realistic text images.

Title: DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser. (arXiv:2311.16565v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16565
Code URL: null
Copy Paste: [[2311.16565]] DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser(http://arxiv.org/abs/2311.16565)
Summary:
Speech-driven 3D facial animation has been an attractive task in both academia and industry. Traditional methods mostly focus on learning a deterministic mapping from speech to animation. Recent approaches have started to consider the non-deterministic nature of speech-driven 3D face animation and employ the diffusion model for the task. However, personalizing facial animation and accelerating animation generation are still two major limitations of existing diffusion-based methods. To address the above limitations, we propose DiffusionTalker, a diffusion-based method that utilizes contrastive learning to personalize 3D facial animation and knowledge distillation to accelerate 3D animation generation. Specifically, to enable personalization, we introduce a learnable talking identity to aggregate knowledge in audio sequences. The proposed identity embeddings extract customized facial cues across different people in a contrastive learning manner. During inference, users can obtain personalized facial animation based on input audio, reflecting a specific talking style. With a trained diffusion model with hundreds of steps, we distill it into a lightweight model with 8 steps for acceleration. Extensive experiments are conducted to demonstrate that our method outperforms state-of-the-art methods. The code will be released.

Title: MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices. (arXiv:2311.16567v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16567
Code URL: null
Copy Paste: [[2311.16567]] MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices(http://arxiv.org/abs/2311.16567)
Summary:
The deployment of large-scale text-to-image diffusion models on mobile devices is impeded by their substantial model size and slow inference speed. In this paper, we propose MobileDiffusion, a highly efficient text-to-image diffusion model obtained through extensive optimizations in both architecture and sampling techniques. We conduct a comprehensive examination of model architecture design to reduce redundancy, enhance computational efficiency, and minimize the model's parameter count, while preserving image generation quality. Additionally, we employ distillation and diffusion-GAN finetuning techniques on MobileDiffusion to achieve 8-step and 1-step inference respectively. Empirical studies, conducted both quantitatively and qualitatively, demonstrate the effectiveness of our proposed techniques. MobileDiffusion achieves a remarkable sub-second inference speed for generating a $512\times512$ image on mobile devices, establishing a new state of the art.

Title: LEDITS++: Limitless Image Editing using Text-to-Image Models. (arXiv:2311.16711v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16711
Code URL: null
Copy Paste: [[2311.16711]] LEDITS++: Limitless Image Editing using Text-to-Image Models(http://arxiv.org/abs/2311.16711)
Summary:
Text-to-image diffusion models have recently received increasing interest for their astonishing ability to produce high-fidelity images from solely text inputs. Subsequent research efforts aim to exploit and apply their capabilities to real image editing. However, existing image-to-image methods are often inefficient, imprecise, and of limited versatility. They either require time-consuming fine-tuning, deviate unnecessarily strongly from the input image, and/or lack support for multiple, simultaneous edits. To address these issues, we introduce LEDITS++, an efficient yet versatile and precise textual image manipulation technique. First, LEDITS++'s novel inversion approach requires neither tuning nor optimization and produces high-fidelity results with a few diffusion steps. Second, our methodology supports multiple simultaneous edits and is architecture-agnostic. Third, we use a novel implicit masking technique that limits changes to relevant image regions. We propose the novel TEdBench++ benchmark as part of our exhaustive evaluation. Our results demonstrate the capabilities of LEDITS++ and its improvements over previous methods. The project page is available at https://leditsplusplus-project.static.hf.space .

Title: As-Plausible-As-Possible: Plausibility-Aware Mesh Deformation Using 2D Diffusion Priors. (arXiv:2311.16739v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16739
Code URL: null
Copy Paste: [[2311.16739]] As-Plausible-As-Possible: Plausibility-Aware Mesh Deformation Using 2D Diffusion Priors(http://arxiv.org/abs/2311.16739)
Summary:
We present the As-Plausible-As-Possible (APAP) mesh deformation technique, which leverages 2D diffusion priors to preserve the plausibility of a mesh under user-controlled deformation. Our framework uses per-face Jacobians to represent mesh deformations, where mesh vertex coordinates are computed via a differentiable Poisson Solve. The deformed mesh is rendered, and the resulting 2D image is used in the Score Distillation Sampling (SDS) process, which enables extracting meaningful plausibility priors from a pretrained 2D diffusion model. To better preserve the identity of the edited mesh, we fine-tune our 2D diffusion model with LoRA. Gradients extracted by SDS and a user-prescribed handle displacement are then backpropagated to the per-face Jacobians, and we use iterative gradient descent to compute the final deformation that balances between the user edit and the output plausibility. We evaluate our method with 2D and 3D meshes and demonstrate qualitative and quantitative improvements when using plausibility priors over geometry-preservation or distortion-minimization priors used by previous techniques.

Title: ChatTraffic: Text-to-Traffic Generation via Diffusion Model. (arXiv:2311.16203v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.16203
Code URL: https://github.com/ChyaZhang/ChatTraffic
Copy Paste: [[2311.16203]] ChatTraffic: Text-to-Traffic Generation via Diffusion Model(http://arxiv.org/abs/2311.16203)
Summary:
Traffic prediction is one of the most significant foundations in Intelligent Transportation Systems (ITS). Traditional traffic prediction methods rely only on historical traffic data to predict traffic trends and face two main challenges: 1) insensitivity to unusual events, and 2) poor performance in long-term prediction. In this work, we explore how generative models combined with text describing the traffic system can be applied for traffic generation and name the task Text-to-Traffic Generation (TTG). The key challenge of the TTG task is how to associate text with the spatial structure of the road network and traffic data for generating traffic situations. To this end, we propose ChatTraffic, the first diffusion model for text-to-traffic generation. To guarantee the consistency between synthetic and real data, we augment a diffusion model with the Graph Convolutional Network (GCN) to extract spatial correlations of traffic data. In addition, we construct a large dataset containing text-traffic pairs for the TTG task. We benchmarked our model qualitatively and quantitatively on the released dataset. The experimental results indicate that ChatTraffic can generate realistic traffic situations from the text. Our code and dataset are available at https://github.com/ChyaZhang/ChatTraffic.

Title: DiffAttack: Evasion Attacks Against Diffusion-Based Adversarial Purification. (arXiv:2311.16124v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2311.16124
Code URL: null
Copy Paste: [[2311.16124]] DiffAttack: Evasion Attacks Against Diffusion-Based Adversarial Purification(http://arxiv.org/abs/2311.16124)
Summary:
Diffusion-based purification defenses leverage diffusion models to remove crafted perturbations of adversarial examples and achieve state-of-the-art robustness. Recent studies show that even advanced attacks cannot break such defenses effectively, since the purification process induces an extremely deep computational graph which poses the potential problem of gradient obfuscation, high memory cost, and unbounded randomness. In this paper, we propose a unified framework DiffAttack to perform effective and efficient attacks against diffusion-based purification defenses, including both DDPM and score-based approaches. In particular, we propose a deviated-reconstruction loss at intermediate diffusion steps to induce inaccurate density gradient estimation to tackle the problem of vanishing/exploding gradients. We also provide a segment-wise forwarding-backwarding algorithm, which leads to memory-efficient gradient backpropagation. We validate the attack effectiveness of DiffAttack compared with existing adaptive attacks on CIFAR-10 and ImageNet. We show that DiffAttack decreases the robust accuracy of models compared with SOTA attacks by over 20% on CIFAR-10 under $\ell_\infty$ attack $(\epsilon=8/255)$, and over 10% on ImageNet under $\ell_\infty$ attack $(\epsilon=4/255)$. We conduct a series of ablation studies, and we find 1) DiffAttack with the deviated-reconstruction loss added over uniformly sampled time steps is more effective than that added over only initial/final steps, and 2) diffusion-based purification with a moderate diffusion length is more robust under DiffAttack.

Title: Federated Learning with Diffusion Models for Privacy-Sensitive Vision Tasks. (arXiv:2311.16538v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.16538
Code URL: null
Copy Paste: [[2311.16538]] Federated Learning with Diffusion Models for Privacy-Sensitive Vision Tasks(http://arxiv.org/abs/2311.16538)
Summary:
Diffusion models have shown great potential for vision-related tasks, particularly for image generation. However, their training is typically conducted in a centralized manner, relying on data collected from publicly available sources. This approach may not be feasible or practical in many domains, such as the medical field, which involves privacy concerns over data collection. Despite the challenges associated with privacy-sensitive data, such domains could still benefit from valuable vision services provided by diffusion models. Federated learning (FL) plays a crucial role in enabling decentralized model training without compromising data privacy. Instead of collecting data, an FL system gathers model parameters, effectively safeguarding the private data of different parties involved. This makes FL systems vital for managing decentralized learning tasks, especially in scenarios where privacy-sensitive data is distributed across a network of clients. Nonetheless, FL presents its own set of challenges due to its distributed nature and privacy-preserving properties. Therefore, in this study, we explore the FL strategy to train diffusion models, paving the way for the development of federated diffusion models. We conduct experiments on various FL scenarios, and our findings demonstrate that federated diffusion models have great potential to deliver vision services to privacy-sensitive domains.
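The summary states that an FL system gathers model parameters rather than data. A minimal sketch of that aggregation step follows, assuming the common FedAvg rule (a size-weighted average of client parameters); the summary does not name the aggregation rule, and `federated_average`, `client_params`, and `client_sizes` are hypothetical names used only for illustration.

```python
def federated_average(client_params, client_sizes):
    """Size-weighted average of client parameter vectors (FedAvg-style).

    Assumption: each client sends a flat list of floats plus the number of
    local training examples. Raw data never leaves the client.
    """
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]


# Two hypothetical clients holding 1 and 3 local examples respectively.
global_params = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

In a diffusion setting the flat lists would stand in for the denoising network's weights; only those weights are exchanged in each round.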

Title: Inexpensive High Fidelity Melt Pool Models in Additive Manufacturing Using Generative Deep Diffusion. (arXiv:2311.16168v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.16168
Code URL: null
Copy Paste: [[2311.16168]] Inexpensive High Fidelity Melt Pool Models in Additive Manufacturing Using Generative Deep Diffusion(http://arxiv.org/abs/2311.16168)
Summary:
Defects in laser powder bed fusion (L-PBF) parts often result from the meso-scale dynamics of the molten alloy near the laser, known as the melt pool. For instance, the melt pool can directly contribute to the formation of undesirable porosity, residual stress, and surface roughness in the final part. Experimental in-situ monitoring of the three-dimensional melt pool physical fields is challenging, due to the short length and time scales involved in the process. Multi-physics simulation methods can describe the three-dimensional dynamics of the melt pool, but are computationally expensive at the mesh refinement required for accurate predictions of complex effects, such as the formation of keyhole porosity. Therefore, in this work, we develop a generative deep learning model based on the probabilistic diffusion framework to map low-fidelity, coarse-grained simulation information to the high-fidelity counterpart. By doing so, we bypass the computational expense of conducting multiple high-fidelity simulations for analysis by instead upscaling lightweight coarse mesh simulations. Specifically, we implement a 2-D diffusion model to spatially upscale cross-sections of the coarsely simulated melt pool to their high-fidelity equivalent. We demonstrate the preservation of key metrics of the melting process between the ground truth simulation data and the diffusion model output, such as the temperature field, the melt pool dimensions and the variability of the keyhole vapor cavity. Specifically, we predict the melt pool depth within 3 $\mu m$ based on low-fidelity input data 4$\times$ coarser than the high-fidelity simulations, reducing analysis time by two orders of magnitude.

Title: Symphony: Symmetry-Equivariant Point-Centered Spherical Harmonics for Molecule Generation. (arXiv:2311.16199v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.16199
Code URL: null
Copy Paste: [[2311.16199]] Symphony: Symmetry-Equivariant Point-Centered Spherical Harmonics for Molecule Generation(http://arxiv.org/abs/2311.16199)
Summary:
We present Symphony, an $E(3)$-equivariant autoregressive generative model for 3D molecular geometries that iteratively builds a molecule from molecular fragments. Existing autoregressive models such as G-SchNet and G-SphereNet for molecules utilize rotationally invariant features to respect the 3D symmetries of molecules. In contrast, Symphony uses message-passing with higher-degree $E(3)$-equivariant features. This allows a novel representation of probability distributions via spherical harmonic signals to efficiently model the 3D geometry of molecules. We show that Symphony is able to accurately generate small molecules from the QM9 dataset, outperforming existing autoregressive models and approaching the performance of diffusion models.

Title: Personalized Predictions of Glioblastoma Infiltration: Mathematical Models, Physics-Informed Neural Networks and Multimodal Scans. (arXiv:2311.16536v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.16536
Code URL: null
Copy Paste: [[2311.16536]] Personalized Predictions of Glioblastoma Infiltration: Mathematical Models, Physics-Informed Neural Networks and Multimodal Scans(http://arxiv.org/abs/2311.16536)
Summary:
Predicting the infiltration of Glioblastoma (GBM) from medical MRI scans is crucial for understanding tumor growth dynamics and designing personalized radiotherapy treatment plans. Mathematical models of GBM growth can complement the data in the prediction of spatial distributions of tumor cells. However, this requires estimating patient-specific parameters of the model from clinical data, which is a challenging inverse problem due to limited temporal data and the limited time between imaging and diagnosis. This work proposes a method that uses Physics-Informed Neural Networks (PINNs) to estimate patient-specific parameters of a reaction-diffusion PDE model of GBM growth from a single 3D structural MRI snapshot. PINNs embed both the data and the PDE into a loss function, thus integrating theory and data. Key innovations include the identification and estimation of characteristic non-dimensional parameters, a pre-training step that utilizes the non-dimensional parameters and a fine-tuning step to determine the patient-specific parameters. Additionally, the diffuse domain method is employed to handle the complex brain geometry within the PINN framework. Our method is validated both on synthetic and patient datasets, and shows promise for real-time parametric inference in the clinical setting for personalized GBM treatment.
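As a rough illustration of how a PINN can embed both data and a PDE in one loss, the block below assumes a Fisher-KPP-type reaction-diffusion equation, a form often used for tumor growth. The summary only says "reaction-diffusion PDE", so the specific equation, the weight $\lambda$, and the symbols $u_\theta$ (network output), $D$ (diffusion coefficient), and $\rho$ (proliferation rate) are assumptions for illustration.

```latex
\begin{aligned}
\frac{\partial u}{\partial t} &= \nabla \cdot \left( D \nabla u \right) + \rho\, u (1 - u), \\
\mathcal{L}(\theta, D, \rho) &= \mathcal{L}_{\text{data}} + \lambda\, \mathcal{L}_{\text{PDE}}, \\
\mathcal{L}_{\text{PDE}} &= \frac{1}{N} \sum_{i=1}^{N} \left| \frac{\partial u_\theta}{\partial t} - \nabla \cdot \left( D \nabla u_\theta \right) - \rho\, u_\theta (1 - u_\theta) \right|^2_{(x_i, t_i)}
\end{aligned}
```

Under this sketch, minimizing $\mathcal{L}$ jointly over the network weights $\theta$ and the parameters $D$ and $\rho$ is what would turn the fit into patient-specific parameter estimation.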

self-supervised

Title: Progressive Target-Styled Feature Augmentation for Unsupervised Domain Adaptation on Point Clouds. (arXiv:2311.16474v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16474
Code URL: https://github.com/xiaoyao3302/ptsfa
Copy Paste: [[2311.16474]] Progressive Target-Styled Feature Augmentation for Unsupervised Domain Adaptation on Point Clouds(http://arxiv.org/abs/2311.16474)
Summary:
Unsupervised domain adaptation is a critical challenge in the field of point cloud analysis, as models trained on one set of data often struggle to perform well in new scenarios due to domain shifts. Previous works tackle the problem by using adversarial training or self-supervised learning for feature extractor adaptation, but ensuring that features extracted from the target domain can be distinguished by the source-supervised classifier remains challenging. In this work, we propose a novel approach called progressive target-styled feature augmentation (PTSFA). Unlike previous works that focus on feature extractor adaptation, our PTSFA approach focuses on classifier adaptation. It aims to empower the classifier to recognize target-styled source features and progressively adapt to the target domain. To enhance the reliability of predictions within the PTSFA framework and encourage discriminative feature extraction, we further introduce a new intermediate domain approaching (IDA) strategy. We validate our method on the benchmark datasets, where our method achieves new state-of-the-art performance. Our code is available at https://github.com/xiaoyao3302/PTSFA.

Title: Augmenting x-ray single particle imaging reconstruction with self-supervised machine learning. (arXiv:2311.16652v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16652
Code URL: null
Copy Paste: [[2311.16652]] Augmenting x-ray single particle imaging reconstruction with self-supervised machine learning(http://arxiv.org/abs/2311.16652)
Summary:
The development of X-ray Free Electron Lasers (XFELs) has opened numerous opportunities to probe atomic structure and ultrafast dynamics of various materials. Single Particle Imaging (SPI) with XFELs enables the investigation of biological particles in their natural physiological states with unparalleled temporal resolution, while circumventing the need for cryogenic conditions or crystallization. However, reconstructing real-space structures from reciprocal-space x-ray diffraction data is highly challenging due to the absence of phase and orientation information, which is further complicated by weak scattering signals and considerable fluctuations in the number of photons per pulse. In this work, we present an end-to-end, self-supervised machine learning approach to recover particle orientations and estimate reciprocal space intensities from diffraction images only. Our method demonstrates great robustness under demanding experimental conditions with significantly enhanced reconstruction capabilities compared with conventional algorithms, and signifies a paradigm shift in SPI as currently practiced at XFELs.

Title: StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models. (arXiv:2311.16509v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.16509
Code URL: null
Copy Paste: [[2311.16509]] StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models(http://arxiv.org/abs/2311.16509)
Summary:
We propose StyleCap, a method to generate natural language descriptions of speaking styles appearing in speech. Although most conventional techniques for para-/non-linguistic information recognition focus on the category classification or the intensity estimation of pre-defined labels, they cannot provide the reasoning of the recognition result in an interpretable manner. As a first step towards an end-to-end method for generating speaking-style prompts from speech, i.e., automatic speaking-style captioning, StyleCap uses paired data of speech and natural language descriptions to train neural networks that predict prefix vectors fed into a large language model (LLM)-based text decoder from a speech representation vector. We explore an appropriate text decoder and speech feature representation suitable for this new task. The experimental results demonstrate that our StyleCap leveraging richer LLMs for the text decoder, speech self-supervised learning (SSL) features, and sentence rephrasing augmentation improves the accuracy and diversity of generated speaking-style captions. Samples of speaking-style captions generated by our StyleCap are publicly available.

Title: Making Self-supervised Learning Robust to Spurious Correlation via Learning-speed Aware Sampling. (arXiv:2311.16361v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.16361
Code URL: null
Copy Paste: [[2311.16361]] Making Self-supervised Learning Robust to Spurious Correlation via Learning-speed Aware Sampling(http://arxiv.org/abs/2311.16361)
Summary:
Self-supervised learning (SSL) has emerged as a powerful technique for learning rich representations from unlabeled data. The data representations are able to capture many underlying attributes of data, and be useful in downstream prediction tasks. In real-world settings, spurious correlations between some attributes (e.g. race, gender and age) and labels for downstream tasks often exist, e.g. cancer is usually more prevalent among elderly patients. In this paper, we investigate SSL in the presence of spurious correlations and show that the SSL training loss can be minimized by capturing only a subset of the conspicuous features relevant to those sensitive attributes, despite the presence of other important predictive features for the downstream tasks. To address this issue, we investigate the learning dynamics of SSL and observe that the learning is slower for samples that conflict with such correlations (e.g. elderly patients without cancer). Motivated by these findings, we propose a learning-speed aware SSL (LA-SSL) approach, in which we sample each training example with a probability that is inversely related to its learning speed. We evaluate LA-SSL on three datasets that exhibit spurious correlations between different attributes, demonstrating that it improves the robustness of pretrained representations on downstream classification tasks.
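The sampling rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each example has a positive scalar learning-speed score and uses a plain reciprocal as the "inversely related" mapping. `sampling_probabilities` and `draw_batch` are hypothetical helper names.

```python
import random


def sampling_probabilities(learning_speeds, eps=1e-8):
    """Map per-example learning speeds to normalized sampling probabilities.

    Assumption: speeds are positive scalars; slower-learned examples
    (smaller speed) receive proportionally larger probability.
    """
    inverse = [1.0 / (speed + eps) for speed in learning_speeds]
    total = sum(inverse)
    return [weight / total for weight in inverse]


def draw_batch(probabilities, batch_size, seed=0):
    """Draw example indices with replacement according to the probabilities."""
    rng = random.Random(seed)
    return rng.choices(range(len(probabilities)), weights=probabilities, k=batch_size)


# Three hypothetical examples; the first is learned most slowly.
probs = sampling_probabilities([1.0, 2.0, 4.0])
batch = draw_batch(probs, batch_size=8)
```

With this choice, examples that conflict with a spurious correlation (and are therefore learned slowly) appear more often in each batch.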

Title: Contrastive encoder pre-training-based clustered federated learning for heterogeneous data. (arXiv:2311.16535v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.16535
Code URL: null
Copy Paste: [[2311.16535]] Contrastive encoder pre-training-based clustered federated learning for heterogeneous data(http://arxiv.org/abs/2311.16535)
Summary:
Federated learning (FL) is a promising approach that enables distributed clients to collaboratively train a global model while preserving their data privacy. However, FL often suffers from data heterogeneity problems, which can significantly affect its performance. To address this, clustered federated learning (CFL) has been proposed to construct personalized models for different client clusters. One effective client clustering strategy is to allow clients to choose their own local models from a model pool based on their performance. However, without pre-trained model parameters, such a strategy is prone to clustering failure, in which all clients choose the same model. Unfortunately, collecting a large amount of labeled data for pre-training can be costly and impractical in distributed environments. To overcome this challenge, we leverage self-supervised contrastive learning to exploit unlabeled data for the pre-training of FL systems. Together, self-supervised pre-training and client clustering can be crucial components for tackling the data heterogeneity issues of FL. Leveraging these two crucial strategies, we propose contrastive pre-training-based clustered federated learning (CP-CFL) to improve the model convergence and overall performance of FL systems. In this work, we demonstrate the effectiveness of CP-CFL through extensive experiments in heterogeneous FL settings, and present various interesting observations.
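The client clustering strategy and its failure mode can be sketched as below. This is an illustrative reading of the summary, assuming "performance" means the local loss of each pooled model on a client's data; `assign_clusters`, `clustering_failed`, and `loss_table` are hypothetical names.

```python
def assign_clusters(loss_table):
    """Let each client pick the pooled model with the lowest local loss.

    loss_table[c][m] is the loss of model m evaluated on client c's data
    (an assumed stand-in for "performance").
    """
    return [min(range(len(row)), key=row.__getitem__) for row in loss_table]


def clustering_failed(assignments):
    """Clustering failure: several clients, but all chose the same model."""
    return len(assignments) > 1 and len(set(assignments)) == 1


# Two hypothetical clients and a pool of two models.
healthy = assign_clusters([[0.2, 0.9], [0.8, 0.1]])    # clients split across models
collapsed = assign_clusters([[0.2, 0.9], [0.3, 0.7]])  # both prefer model 0
```

Under this reading, a randomly initialized pool can leave one model slightly better for every client, which yields the collapsed case that pre-training is meant to avoid.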

Title: MultiModal-Learning for Predicting Molecular Properties: A Framework Based on Image and Graph Structures. (arXiv:2311.16666v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.16666
Code URL: null
Copy Paste: [[2311.16666]] MultiModal-Learning for Predicting Molecular Properties: A Framework Based on Image and Graph Structures(http://arxiv.org/abs/2311.16666)
Summary:
The quest for accurate prediction of drug molecule properties poses a fundamental challenge in the realm of Artificial Intelligence Drug Discovery (AIDD). An effective representation of drug molecules emerges as a pivotal component in this pursuit. Contemporary leading-edge research predominantly resorts to self-supervised learning (SSL) techniques to extract meaningful structural representations from large-scale, unlabeled molecular data, subsequently fine-tuning these representations for an array of downstream tasks. However, an inherent shortcoming of these studies lies in their singular reliance on one modality of molecular information, such as molecule image or SMILES representations, thus neglecting the potential complementarity of various molecular modalities. In response to this limitation, we propose MolIG, a novel MultiModaL molecular pre-training framework for predicting molecular properties based on Image and Graph structures. The MolIG model leverages the coherence and correlation between the molecule graph and the molecule image to execute self-supervised tasks, effectively amalgamating the strengths of both molecular representation forms. This holistic approach allows for the capture of pivotal molecular structural characteristics and high-level semantic information. Upon completion of pre-training, the Graph Neural Network (GNN) encoder is used for the prediction of downstream tasks. In comparison to advanced baseline models, MolIG exhibits enhanced performance in downstream tasks pertaining to molecular property prediction within benchmark groups such as MoleculeNet Benchmark Group and ADMET Benchmark Group.

foundation model

Title: Adapting Segment Anything Model (SAM) through Prompt-based Learning for Enhanced Protein Identification in Cryo-EM Micrographs. (arXiv:2311.16140v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16140
Code URL: https://github.com/yangyang-69/Prompt_sam_cryoPPP
Copy Paste: [[2311.16140]] Adapting Segment Anything Model (SAM) through Prompt-based Learning for Enhanced Protein Identification in Cryo-EM Micrographs(http://arxiv.org/abs/2311.16140)
Summary:
Cryo-electron microscopy (cryo-EM) remains pivotal in structural biology, yet the task of protein particle picking, integral for 3D protein structure construction, is laden with manual inefficiencies. While recent AI tools such as Topaz and crYOLO are advancing the field, they do not fully address the challenges of cryo-EM images, including low contrast, complex shapes, and heterogeneous conformations. This study explored prompt-based learning to adapt the state-of-the-art image segmentation foundation model Segment Anything Model (SAM) for cryo-EM. This focus was driven by the desire to optimize model performance with a small number of labeled data without altering pre-trained parameters, aiming for a balance between adaptability and foundational knowledge retention. Through trials with three prompt-based learning strategies, namely head prompt, prefix prompt, and encoder prompt, we observed enhanced performance and reduced computational requirements compared to the fine-tuning approach. This work not only highlights the potential of prompting SAM in protein identification from cryo-EM micrographs but also suggests its broader promise in biomedical image segmentation and object detection.

Title: MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. (arXiv:2311.16502v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.16502
Code URL: null
Copy Paste: [[2311.16502]] MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI(http://arxiv.org/abs/2311.16502)
Summary:
We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. Our evaluation of 14 open-source LMMs and the proprietary GPT-4V(ision) highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V only achieves a 56% accuracy, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.

Title: Source-Free Domain Adaptation with Frozen Multimodal Foundation Model. (arXiv:2311.16510v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16510
Code URL: null
Copy Paste: [[2311.16510]] Source-Free Domain Adaptation with Frozen Multimodal Foundation Model(http://arxiv.org/abs/2311.16510)
Summary:
Source-Free Domain Adaptation (SFDA) aims to adapt a source model for a target domain, with only access to unlabeled target training data and the source model pre-trained on a supervised source domain. Relying on pseudo labeling and/or auxiliary supervision, conventional methods are inevitably error-prone. To mitigate this limitation, in this work we for the first time explore the potential of off-the-shelf vision-language (ViL) multimodal models (e.g., CLIP) with rich whilst heterogeneous knowledge. We find that directly applying the ViL model to the target domain in a zero-shot fashion is unsatisfactory, as it is not specialized for this particular task but largely generic. To make it task specific, we propose a novel Distilling multimodal Foundation model (DIFO) approach. Specifically, DIFO alternates between two steps during adaptation: (i) Customizing the ViL model by maximizing the mutual information with the target model in a prompt learning manner, (ii) Distilling the knowledge of this customized ViL model to the target model. For more fine-grained and reliable distillation, we further introduce two effective regularization terms, namely most-likely category encouragement and predictive consistency. Extensive experiments show that DIFO significantly outperforms the state-of-the-art alternatives. Our source code will be released.

Title: Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine. (arXiv:2311.16452v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.16452
Code URL: null
Copy Paste: [[2311.16452]] Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine(http://arxiv.org/abs/2311.16452)
Summary:
Generalist foundation models such as GPT-4 have displayed surprising capabilities in a wide variety of domains and tasks. Yet, there is a prevalent assumption that they cannot match specialist capabilities of fine-tuned models. For example, most explorations to date on medical competency benchmarks have leveraged domain-specific training, as exemplified by efforts on BioGPT and Med-PaLM. We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training. Rather than using simple prompting to highlight the model's out-of-the-box capabilities, we perform a systematic exploration of prompt engineering. We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks. The prompting methods we explore are general purpose, and make no specific use of domain expertise, removing the need for expert-curated content. Our experimental design carefully controls for overfitting during the prompt engineering process. We introduce Medprompt, based on a composition of several prompting strategies. With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite. The method outperforms leading specialist models such as Med-PaLM 2 by a significant margin with an order of magnitude fewer calls to the model. Steering GPT-4 with Medprompt achieves a 27% reduction in error rate on the MedQA dataset over the best methods to date achieved with specialist models and surpasses a score of 90% for the first time. Beyond medical problems, we show the power of Medprompt to generalize to other domains and provide evidence for the broad applicability of the approach via studies of the strategy on exams in electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology.

generative

Title: Semantic Generative Augmentations for Few-Shot Counting. (arXiv:2311.16122v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16122
Code URL: null
Copy Paste: [[2311.16122]] Semantic Generative Augmentations for Few-Shot Counting(http://arxiv.org/abs/2311.16122)
Summary:
With the availability of powerful text-to-image diffusion models, recent works have explored the use of synthetic data to improve image classification performance. These works show that it can effectively augment or even replace real data. In this work, we investigate how synthetic data can benefit few-shot class-agnostic counting. This requires generating images that correspond to a given input number of objects. However, text-to-image models struggle to grasp the notion of count. We propose to rely on a double conditioning of Stable Diffusion with both a prompt and a density map in order to augment a training dataset for few-shot counting. Due to the small dataset size, the fine-tuned model tends to generate images close to the training images. We propose to enhance the diversity of synthesized images by exchanging captions between images, thus creating unseen configurations of object types and spatial layout. Our experiments show that our diversified generation strategy significantly improves the counting accuracy of two recent, well-performing few-shot counting models on FSC147 and CARPK.

Title: RelVAE: Generative Pretraining for few-shot Visual Relationship Detection. (arXiv:2311.16261v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16261
Code URL: null
Copy Paste: [[2311.16261]] RelVAE: Generative Pretraining for few-shot Visual Relationship Detection(http://arxiv.org/abs/2311.16261)
Summary:
Visual relations are complex, multimodal concepts that play an important role in the way humans perceive the world. As a result of their complexity, high-quality, diverse and large-scale datasets for visual relations are still absent. In an attempt to overcome this data barrier, we choose to focus on the problem of few-shot Visual Relationship Detection (VRD), a setting that has so far been neglected by the community. In this work we present the first pretraining method for few-shot predicate classification that does not require any annotated relations. We achieve this by introducing a generative model that is able to capture the variation of semantic, visual and spatial information of relations inside a latent space and later exploiting its representations in order to achieve efficient few-shot classification. We construct few-shot training splits and show quantitative experiments on VG200 and VRD datasets where our model outperforms the baselines. Lastly, we attempt to interpret the decisions of the model by conducting various qualitative experiments.

Title: MI-Gen: Multiple Instance Generation of Pathology Reports for Gigapixel Whole-Slide Images. (arXiv:2311.16480v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16480
Code URL: null
Copy Paste: [[2311.16480]] MI-Gen: Multiple Instance Generation of Pathology Reports for Gigapixel Whole-Slide Images(http://arxiv.org/abs/2311.16480)
Summary:
Whole slide images are the foundation of digital pathology for the diagnosis and treatment of carcinomas. Writing pathology reports is laborious and error-prone for inexperienced pathologists. To reduce the workload and improve clinical automation, we investigate how to generate pathology reports given whole slide images. On the data end, we curated the largest WSI-text dataset (TCGA-PathoText). Specifically, we collected nearly 10000 high-quality WSI-text pairs for visual-language models by recognizing and cleaning pathology reports which narrate diagnostic slides in TCGA. On the model end, we propose the multiple instance generative model (MI-Gen) which can produce pathology reports for gigapixel WSIs. We benchmark our model on the largest subset of TCGA-PathoText. Experimental results show our model can generate pathology reports which contain multiple clinical clues. Furthermore, WSI-text prediction can be seen as an approach to visual-language pre-training, which enables our model to be transferred to downstream diagnostic tasks like carcinoma grading and phenotyping. We observe that simple semantic extraction from the pathology reports can achieve the best performance (F1 score of 0.838) on BRCA subtyping without adding extra parameters or tricky fine-tuning. Our collected dataset and related code will all be publicly available.

Title: PISA: Point-cloud-based Instructed Scene Augmentation. (arXiv:2311.16501v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16501
Code URL: null
Copy Paste: [[2311.16501]] PISA: Point-cloud-based Instructed Scene Augmentation(http://arxiv.org/abs/2311.16501)
Summary:
Indoor scene augmentation has become an emerging topic in the field of computer vision with applications in augmented and virtual reality. However, existing scene augmentation methods mostly require a pre-built object database with a given position as the desired location. In this paper, we propose the first end-to-end multi-modal deep neural network that can generate point cloud objects consistent with their surroundings, conditioned on text instructions. Our model generates a suitable object in the appropriate position based on the inputs of a query and point clouds, thereby enabling the creation of new scenarios involving previously unseen layouts of objects. A database of pre-stored CAD models is no longer needed. We use Point-E as our generative model and introduce methods including quantified position prediction and Top-K estimation to mitigate the false negative problems caused by ambiguous language description. Moreover, we evaluate the ability of our model by demonstrating the diversity of generated objects, the effectiveness of instruction, and quantitative metric results, which collectively indicate that our model is capable of generating realistic indoor objects. For a more thorough evaluation, we also incorporate visual grounding as a metric to assess the quality of the scenes generated by our model.

Title: Improving Lane Detection Generalization: A Novel Framework using HD Maps for Boosting Diversity. (arXiv:2311.16589v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16589
Code URL: null
Copy Paste: [[2311.16589]] Improving Lane Detection Generalization: A Novel Framework using HD Maps for Boosting Diversity(http://arxiv.org/abs/2311.16589)
Summary:
Lane detection is a vital task for vehicles to navigate and localize their position on the road. To ensure reliable results, lane detection algorithms must have robust generalization performance in various road environments. However, despite the significant performance improvement of deep learning-based lane detection algorithms, their generalization performance in response to changes in road environments still falls short of expectations. In this paper, we present a novel framework for single-source domain generalization (SSDG) in lane detection. By decomposing data into lane structures and surroundings, we enhance diversity using High-Definition (HD) maps and generative models. Rather than expanding data volume, we strategically select a core subset of data, maximizing diversity and optimizing performance. Our extensive experiments demonstrate that our framework enhances the generalization performance of lane detection to a level comparable to the domain adaptation-based method.

Title: MedGen: A Python Natural Language Processing Toolkit for Medical Text Processing. (arXiv:2311.16588v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2311.16588
Code URL: https://github.com/yale-lily/medgen
Copy Paste: [[2311.16588]] MedGen: A Python Natural Language Processing Toolkit for Medical Text Processing(http://arxiv.org/abs/2311.16588)
Summary:
This study introduces MedGen, a comprehensive natural language processing (NLP) toolkit designed for medical text processing. MedGen is tailored for biomedical researchers and healthcare professionals with an easy-to-use, all-in-one solution that requires minimal programming expertise. It includes (1) Generative Functions: For the first time, MedGen includes four advanced generative functions: question answering, text summarization, text simplification, and machine translation; (2) Basic NLP Functions: MedGen integrates 12 essential NLP functions such as word tokenization and sentence segmentation; and (3) Query and Search Capabilities: MedGen provides user-friendly query and search functions on text corpora. We fine-tuned 32 domain-specific language models, evaluated them thoroughly on 24 established benchmarks and conducted manual reviews with clinicians. Additionally, we expanded our toolkit by introducing query and search functions, while also standardizing and integrating functions from third-party libraries. The toolkit, its models, and associated data are publicly available via https://github.com/Yale-LILY/MedGen.

Title: Deep Learning for Time Series Classification of Parkinson's Disease Eye Tracking Data. (arXiv:2311.16381v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.16381
Code URL: null
Copy Paste: [[2311.16381]] Deep Learning for Time Series Classification of Parkinson's Disease Eye Tracking Data(http://arxiv.org/abs/2311.16381)
Summary:
Eye-tracking is an accessible and non-invasive technology that provides information about a subject's motor and cognitive abilities. As such, it has proven to be a valuable resource in the study of neurodegenerative diseases such as Parkinson's disease. Saccade experiments, in particular, have proven useful in the diagnosis and staging of Parkinson's disease. However, to date, no single eye-movement biomarker has been found to conclusively differentiate patients from healthy controls. In the present work, we investigate the use of state-of-the-art deep learning algorithms to perform Parkinson's disease classification using eye-tracking data from saccade experiments. In contrast to previous work, instead of using hand-crafted features from the saccades, we use raw $\sim1.5\,s$ long fixation intervals recorded during the preparatory phase before each trial. Using these short time series as input we implement two different classification models, InceptionTime and ROCKET. We find that the models are able to learn the classification task and generalize to unseen subjects. InceptionTime achieves $78\%$ accuracy, while ROCKET achieves $88\%$ accuracy. We also employ a novel method for pruning the ROCKET model to improve interpretability and generalizability, achieving an accuracy of $96\%$. Our results suggest that fixation data has low inter-subject variability and potentially carries useful information about brain cognitive and motor conditions, making it suitable for use with machine learning in the discovery of disease-relevant biomarkers.

anomaly

Title: Video Anomaly Detection via Spatio-Temporal Pseudo-Anomaly Generation : A Unified Approach. (arXiv:2311.16514v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16514
Code URL: null
Copy Paste: [[2311.16514]] Video Anomaly Detection via Spatio-Temporal Pseudo-Anomaly Generation : A Unified Approach(http://arxiv.org/abs/2311.16514)
Summary:
Video Anomaly Detection (VAD) is an open-set recognition task, which is usually formulated as a one-class classification (OCC) problem, where training data is comprised of videos with normal instances while test data contains both normal and anomalous instances. Recent works have investigated the creation of pseudo-anomalies (PAs) using only the normal data and making strong assumptions about real-world anomalies with regard to abnormality of objects and speed of motion to inject prior information about anomalies in an autoencoder (AE) based reconstruction model during training. This work proposes a novel method for generating generic spatio-temporal PAs by inpainting a masked-out region of an image using a pre-trained Latent Diffusion Model and further perturbing the optical flow using mixup to emulate spatio-temporal distortions in the data. In addition, we present a simple unified framework to detect real-world anomalies under the OCC setting by learning three types of anomaly indicators, namely reconstruction quality, temporal irregularity and semantic inconsistency. Extensive experiments on four VAD benchmark datasets, namely Ped2, Avenue, ShanghaiTech and UBnormal, demonstrate that our method performs on par with other existing state-of-the-art PA generation and reconstruction-based methods under the OCC setting. Our analysis also examines the transferability and generalisation of PAs across these datasets, offering valuable insights by identifying real-world anomalies through PAs.
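
The mixup perturbation mentioned in this summary can be sketched as a convex combination of two flow fields. This is only an illustration of generic mixup, not the authors' pipeline: the `mixup_flow` helper, the flat list-of-(u, v) flow representation, the choice of the second flow field, and the `alpha` value are all assumptions.

```python
import random

def mixup_flow(flow_a, flow_b, alpha=0.4, rng=None):
    """Blend two optical-flow fields with a mixup coefficient.

    Each flow is a list of (u, v) displacement tuples of equal length.
    The coefficient lam is drawn from Beta(alpha, alpha), as in standard
    mixup; alpha=0.4 is an arbitrary value chosen for this sketch.
    """
    rng = rng or random.Random(0)
    lam = rng.betavariate(alpha, alpha)
    mixed = [
        (lam * ua + (1.0 - lam) * ub, lam * va + (1.0 - lam) * vb)
        for (ua, va), (ub, vb) in zip(flow_a, flow_b)
    ]
    return mixed, lam

flow_a = [(1.0, 0.0), (0.0, 2.0)]   # toy flow of a normal frame pair
flow_b = [(0.0, 1.0), (4.0, 0.0)]   # toy flow used as the perturbation source
mixed, lam = mixup_flow(flow_a, flow_b)
print(f"lam={lam:.3f}", mixed)
```

The blended field no longer corresponds to any motion seen in the normal data, which is the sense in which it can stand in for a temporal pseudo-anomaly.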

Title: Segment Every Out-of-Distribution Object. (arXiv:2311.16516v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2311.16516
Code URL: null
Copy Paste: [[2311.16516]] Segment Every Out-of-Distribution Object(http://arxiv.org/abs/2311.16516)
Summary:
Semantic segmentation models, while effective for in-distribution categories, face challenges in real-world deployment due to encountering out-of-distribution (OoD) objects. Detecting these OoD objects is crucial for safety-critical applications. Existing methods rely on anomaly scores, but choosing a suitable threshold for generating masks presents difficulties and can lead to fragmentation and inaccuracy. This paper introduces a method to convert anomaly Score To segmentation Mask, called S2M, a simple and effective framework for OoD detection in semantic segmentation. Unlike assigning anomaly scores to pixels, S2M directly segments the entire OoD object. By transforming anomaly scores into prompts for a promptable segmentation model, S2M eliminates the need for threshold selection. Extensive experiments demonstrate that S2M outperforms the state-of-the-art by approximately 10% in IoU and 30% in mean F1 score, on average, across various benchmarks including Fishyscapes, Segment-Me-If-You-Can, and RoadAnomaly datasets.

Title: A Unified Hardware-based Threat Detector for AI Accelerators. (arXiv:2311.16684v1 [cs.CR])

Paper URL: http://arxiv.org/abs/2311.16684
Code URL: null
Copy Paste: [[2311.16684]] A Unified Hardware-based Threat Detector for AI Accelerators(http://arxiv.org/abs/2311.16684)
Summary:
The proliferation of AI technology gives rise to a variety of security threats, which significantly compromise the confidentiality and integrity of AI models and applications. Existing software-based solutions mainly target one specific attack and require implementation within the models, rendering them less practical. We design UniGuard, a novel unified and non-intrusive detection methodology to safeguard FPGA-based AI accelerators. The core idea of UniGuard is to harness power side-channel information generated during model inference to spot any anomaly. We employ a Time-to-Digital Converter to capture power fluctuations and train a supervised machine learning model to identify various types of threats. Evaluations demonstrate that UniGuard can achieve 94.0% attack detection accuracy, with high generalization over unknown or adaptive attacks and robustness against varied configurations (e.g., sensor frequency and location).

Title: MACE: A Multi-pattern Accommodated and Efficient Anomaly Detection Method in the Frequency Domain. (arXiv:2311.16191v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2311.16191
Code URL: null
Copy Paste: [[2311.16191]] MACE: A Multi-pattern Accommodated and Efficient Anomaly Detection Method in the Frequency Domain(http://arxiv.org/abs/2311.16191)
Summary:
Anomaly detection significantly enhances the robustness of cloud systems. While neural network-based methods have recently demonstrated strong advantages, they encounter practical challenges in cloud environments: the contradiction between the impracticality of maintaining a unique model for each service and the limited ability of a unified model to deal with diverse normal patterns, as well as issues with handling heavy traffic in real time and short-term anomaly detection sensitivity. Thus, we propose MACE, a Multi-pattern Accommodated and Efficient anomaly detection method in the frequency domain for time series anomaly detection. It has three novel characteristics: (i) a pattern extraction mechanism excelling at handling diverse normal patterns, which enables the model to identify anomalies by examining the correlation between the data sample and its service normal pattern, instead of solely focusing on the data sample itself; (ii) a dualistic convolution mechanism that amplifies short-term anomalies in the time domain and hinders the reconstruction of anomalies in the frequency domain, which enlarges the reconstruction error disparity between anomaly and normality and facilitates anomaly detection; (iii) the use of the sparsity and parallelism of the frequency domain to enhance model efficiency. We theoretically and experimentally prove that using a strategically selected subset of Fourier bases not only reduces computational overhead but also helps distinguish anomalies, compared to using the complete spectrum. Moreover, extensive experiments demonstrate MACE's effectiveness in handling diverse normal patterns with a unified model, and it achieves state-of-the-art performance with high efficiency.
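
The claim that a subset of Fourier bases helps separate anomalies can be illustrated with a small, self-contained sketch. It is not MACE itself: the toy signals, the hand-picked frequency index, and the `dft`, `reconstruct`, and `recon_error` helpers are assumptions for illustration, whereas the paper selects bases from a learned normal pattern.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def reconstruct(x, keep):
    """Rebuild x from only the frequency bins in `keep` and their mirrors."""
    n = len(x)
    spec = dft(x)
    bins = {k % n for k in keep} | {(-k) % n for k in keep}
    return [
        sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in bins).real / n
        for t in range(n)
    ]

def recon_error(x, keep):
    """Mean squared error between x and its subset-of-bases reconstruction."""
    r = reconstruct(x, keep)
    return sum((a - b) ** 2 for a, b in zip(x, r)) / len(x)

n = 64
normal = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
anomalous = list(normal)
anomalous[20] += 3.0          # short-term spike injected into the normal pattern
keep = [4]                    # hand-picked basis matching the normal pattern

print("normal error:   ", recon_error(normal, keep))
print("anomalous error:", recon_error(anomalous, keep))
```

Because the spike spreads its energy over all frequency bins, the retained bases reconstruct the normal pattern almost exactly but capture little of the spike, so the reconstruction error of the anomalous series is much larger.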

diffusion

Title: Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models. (arXiv:2311.16117v1 [cs.CV])

Title: Effective Quantization for Diffusion Models on CPUs. (arXiv:2311.16133v1 [cs.CV])

Title: Shortcut Bias Mitigation via Ensemble Diversity Using Diffusion Probabilistic Models. (arXiv:2311.16176v1 [cs.LG])

Title: Improving Denoising Diffusion Probabilistic Models via Exploiting Shared Representations. (arXiv:2311.16353v1 [cs.LG])

Title: Manifold Preserving Guided Diffusion. (arXiv:2311.16424v1 [cs.LG])

Title: TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering. (arXiv:2311.16465v1 [cs.CV])

Title: Efficient Multimodal Diffusion Models Using Joint Data Infilling with Partially Shared U-Net. (arXiv:2311.16488v1 [cs.CV])

Title: $Z^*$: Zero-shot Style Transfer via Attention Rearrangement. (arXiv:2311.16491v1 [cs.CV])

Title: Egocentric Whole-Body Motion Capture with FisheyeViT and Diffusion-Based Motion Refinement. (arXiv:2311.16495v1 [cs.CV])

Title: MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model. (arXiv:2311.16498v1 [cs.CV])

Title: Deceptive-Human: Prompt-to-NeRF 3D Human Generation with 3D-Consistent Synthetic Images. (arXiv:2311.16499v1 [cs.CV])

Title: LLMGA: Multimodal Large Language Model based Generation Assistant. (arXiv:2311.16500v1 [cs.CV])

Title: TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models. (arXiv:2311.16503v1 [cs.CV])

Title: Exploring Straighter Trajectories of Flow Matching with Diffusion Guidance. (arXiv:2311.16507v1 [cs.CV])

Title: GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation. (arXiv:2311.16511v1 [cs.CV])

Title: CoSeR: Bridging Image and Language for Cognitive Super-Resolution. (arXiv:2311.16512v1 [cs.CV])

Title: Fine-grained Appearance Transfer with Diffusion Models. (arXiv:2311.16513v1 [cs.CV])

Title: SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution. (arXiv:2311.16518v1 [cs.CV])

Title: Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models. (arXiv:2311.16555v1 [cs.CV])

Title: DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser. (arXiv:2311.16565v1 [cs.CV])

Title: MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices. (arXiv:2311.16567v1 [cs.CV])

Title: LEDITS++: Limitless Image Editing using Text-to-Image Models. (arXiv:2311.16711v1 [cs.CV])

Title: As-Plausible-As-Possible: Plausibility-Aware Mesh Deformation Using 2D Diffusion Priors. (arXiv:2311.16739v1 [cs.CV])

Title: ChatTraffic: Text-to-Traffic Generation via Diffusion Model. (arXiv:2311.16203v1 [cs.LG])

Title: DiffAttack: Evasion Attacks Against Diffusion-Based Adversarial Purification. (arXiv:2311.16124v1 [cs.CR])

Title: Federated Learning with Diffusion Models for Privacy-Sensitive Vision Tasks. (arXiv:2311.16538v1 [cs.LG])

Title: Inexpensive High Fidelity Melt Pool Models in Additive Manufacturing Using Generative Deep Diffusion. (arXiv:2311.16168v1 [cs.LG])

Title: Symphony: Symmetry-Equivariant Point-Centered Spherical Harmonics for Molecule Generation. (arXiv:2311.16199v1 [cs.LG])

Title: Personalized Predictions of Glioblastoma Infiltration: Mathematical Models, Physics-Informed Neural Networks and Multimodal Scans. (arXiv:2311.16536v1 [cs.LG])

self-supervised

Title: Progressive Target-Styled Feature Augmentation for Unsupervised Domain Adaptation on Point Clouds. (arXiv:2311.16474v1 [cs.CV])

Title: Augmenting x-ray single particle imaging reconstruction with self-supervised machine learning. (arXiv:2311.16652v1 [cs.CV])

Title: StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models. (arXiv:2311.16509v1 [cs.CL])

Title: Making Self-supervised Learning Robust to Spurious Correlation via Learning-speed Aware Sampling. (arXiv:2311.16361v1 [cs.LG])

Title: Contrastive encoder pre-training-based clustered federated learning for heterogeneous data. (arXiv:2311.16535v1 [cs.LG])

Title: MultiModal-Learning for Predicting Molecular Properties: A Framework Based on Image and Graph Structures. (arXiv:2311.16666v1 [cs.LG])

foundation model

Title: Adapting Segment Anything Model (SAM) through Prompt-based Learning for Enhanced Protein Identification in Cryo-EM Micrographs. (arXiv:2311.16140v1 [cs.CV])

Title: MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. (arXiv:2311.16502v1 [cs.CL])

Title: Source-Free Domain Adaptation with Frozen Multimodal Foundation Model. (arXiv:2311.16510v1 [cs.CV])

Title: Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine. (arXiv:2311.16452v1 [cs.CL])

generative

Title: Semantic Generative Augmentations for Few-Shot Counting. (arXiv:2311.16122v1 [cs.CV])

Title: RelVAE: Generative Pretraining for few-shot Visual Relationship Detection. (arXiv:2311.16261v1 [cs.CV])

Title: MI-Gen: Multiple Instance Generation of Pathology Reports for Gigapixel Whole-Slide Images. (arXiv:2311.16480v1 [cs.CV])

Title: PISA: Point-cloud-based Instructed Scene Augmentation. (arXiv:2311.16501v1 [cs.CV])

Title: Improving Lane Detection Generalization: A Novel Framework using HD Maps for Boosting Diversity. (arXiv:2311.16589v1 [cs.CV])

Title: MedGen: A Python Natural Language Processing Toolkit for Medical Text Processing. (arXiv:2311.16588v1 [cs.CL])

Title: Deep Learning for Time Series Classification of Parkinson's Disease Eye Tracking Data. (arXiv:2311.16381v1 [cs.LG])

anomaly

Title: Video Anomaly Detection via Spatio-Temporal Pseudo-Anomaly Generation : A Unified Approach. (arXiv:2311.16514v1 [cs.CV])

Title: Segment Every Out-of-Distribution Object. (arXiv:2311.16516v1 [cs.CV])

Title: A Unified Hardware-based Threat Detector for AI Accelerators. (arXiv:2311.16684v1 [cs.CR])

Title: MACE: A Multi-pattern Accommodated and Efficient Anomaly Detection Method in the Frequency Domain. (arXiv:2311.16191v1 [cs.LG])

in-context