diffusion

Title: Effective Real Image Editing with Accelerated Iterative Diffusion Inversion. (arXiv:2309.04907v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.04907
Code URL: null
Copy Paste: [[2309.04907]] Effective Real Image Editing with Accelerated Iterative Diffusion Inversion(http://arxiv.org/abs/2309.04907)
Summary:
Despite all recent progress, it is still challenging to edit and manipulate natural images with modern generative models. When using Generative Adversarial Network (GAN), one major hurdle is in the inversion process mapping a real image to its corresponding noise vector in the latent space, since its necessary to be able to reconstruct an image to edit its contents. Likewise for Denoising Diffusion Implicit Models (DDIM), the linearization assumption in each inversion step makes the whole deterministic inversion process unreliable. Existing approaches that have tackled the problem of inversion stability often incur in significant trade-offs in computational efficiency. In this work we propose an Accelerated Iterative Diffusion Inversion method, dubbed AIDI, that significantly improves reconstruction accuracy with minimal additional overhead in space and time complexity. By using a novel blended guidance technique, we show that effective results can be obtained on a large range of image editing tasks without large classifier-free guidance in inversion. Furthermore, when compared with other diffusion inversion based works, our proposed process is shown to be more robust for fast image editing in the 10 and 20 diffusion steps' regimes.

Title: Text-driven Editing of 3D Scenes without Retraining. (arXiv:2309.04917v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.04917
Code URL: null
Copy Paste: [[2309.04917]] Text-driven Editing of 3D Scenes without Retraining(http://arxiv.org/abs/2309.04917)
Summary:
Numerous diffusion models have recently been applied to image synthesis and editing. However, editing 3D scenes is still in its early stages. It poses various challenges, such as the requirement to design specific methods for different editing types, retraining new models for various 3D scenes, and the absence of convenient human interaction during editing. To tackle these issues, we introduce a text-driven editing method, termed DN2N, which allows for the direct acquisition of a NeRF model with universal editing capabilities, eliminating the requirement for retraining. Our method employs off-the-shelf text-based editing models of 2D images to modify the 3D scene images, followed by a filtering process to discard poorly edited images that disrupt 3D consistency. We then consider the remaining inconsistency as a problem of removing noise perturbation, which can be solved by generating training data with similar perturbation characteristics for training. We further propose cross-view regularization terms to help the generalized NeRF model mitigate these perturbations. Our text-driven method allows users to edit a 3D scene with their desired description, which is more friendly, intuitive, and practical than prior works. Empirical results show that our method achieves multiple editing types, including but not limited to appearance editing, weather transition, material changing, and style transfer. Most importantly, our method generalizes well with editing abilities shared among a set of model parameters without requiring a customized editing model for some specific scenes, thus inferring novel views with editing effects directly from user input. The project website is available at this http URL

Title: Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning. (arXiv:2309.04965v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.04965
Code URL: null
Copy Paste: [[2309.04965]] Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning(http://arxiv.org/abs/2309.04965)
Summary:
While impressive performance has been achieved in image captioning, the limited diversity of the generated captions and the large parameter scale remain major barriers to the real-word application of these systems. In this work, we propose a lightweight image captioning network in combination with continuous diffusion, called Prefix-diffusion. To achieve diversity, we design an efficient method that injects prefix image embeddings into the denoising process of the diffusion model. In order to reduce trainable parameters, we employ a pre-trained model to extract image features and further design an extra mapping network. Prefix-diffusion is able to generate diverse captions with relatively less parameters, while maintaining the fluency and relevance of the captions benefiting from the generative capabilities of the diffusion model. Our work paves the way for scaling up diffusion models for image captioning, and achieves promising performance compared with recent approaches.

Title: SA-Solver: Stochastic Adams Solver for Fast Sampling of Diffusion Models. (arXiv:2309.05019v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.05019
Code URL: null
Copy Paste: [[2309.05019]] SA-Solver: Stochastic Adams Solver for Fast Sampling of Diffusion Models(http://arxiv.org/abs/2309.05019)
Summary:
Diffusion Probabilistic Models (DPMs) have achieved considerable success in generation tasks. As sampling from DPMs is equivalent to solving diffusion SDE or ODE which is time-consuming, numerous fast sampling methods built upon improved differential equation solvers are proposed. The majority of such techniques consider solving the diffusion ODE due to its superior efficiency. However, stochastic sampling could offer additional advantages in generating diverse and high-quality data. In this work, we engage in a comprehensive analysis of stochastic sampling from two aspects: variance-controlled diffusion SDE and linear multi-step SDE solver. Based on our analysis, we propose SA-Solver, which is an improved efficient stochastic Adams method for solving diffusion SDE to generate data with high quality. Our experiments show that SA-Solver achieves: 1) improved or comparable performance compared with the existing state-of-the-art sampling methods for few-step sampling; 2) SOTA FID scores on substantial benchmark datasets under a suitable number of function evaluations (NFEs).

self-supervised

Title: Frequency-Aware Self-Supervised Long-Tailed Learning. (arXiv:2309.04723v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.04723
Code URL: null
Copy Paste: [[2309.04723]] Frequency-Aware Self-Supervised Long-Tailed Learning(http://arxiv.org/abs/2309.04723)
Summary:
Data collected from the real world typically exhibit long-tailed distributions, where frequent classes contain abundant data while rare ones have only a limited number of samples. While existing supervised learning approaches have been proposed to tackle such data imbalance, the requirement of label supervision would limit their applicability to real-world scenarios in which label annotation might not be available. Without the access to class labels nor the associated class frequencies, we propose Frequency-Aware Self-Supervised Learning (FASSL) in this paper. Targeting at learning from unlabeled data with inherent long-tailed distributions, the goal of FASSL is to produce discriminative feature representations for downstream classification tasks. In FASSL, we first learn frequency-aware prototypes, reflecting the associated long-tailed distribution. Particularly focusing on rare-class samples, the relationships between image data and the derived prototypes are further exploited with the introduced self-supervised learning scheme. Experiments on long-tailed image datasets quantitatively and qualitatively verify the effectiveness of our learning scheme.

Title: Self-Supervised Transformer with Domain Adaptive Reconstruction for General Face Forgery Video Detection. (arXiv:2309.04795v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.04795
Code URL: null
Copy Paste: [[2309.04795]] Self-Supervised Transformer with Domain Adaptive Reconstruction for General Face Forgery Video Detection(http://arxiv.org/abs/2309.04795)
Summary:
Face forgery videos have caused severe social public concern, and various detectors have been proposed recently. However, most of them are trained in a supervised manner with limited generalization when detecting videos from different forgery methods or real source videos. To tackle this issue, we explore to take full advantage of the difference between real and forgery videos by only exploring the common representation of real face videos. In this paper, a Self-supervised Transformer cooperating with Contrastive and Reconstruction learning (CoReST) is proposed, which is first pre-trained only on real face videos in a self-supervised manner, and then fine-tuned a linear head on specific face forgery video datasets. Two specific auxiliary tasks incorporated contrastive and reconstruction learning are designed to enhance the representation learning. Furthermore, a Domain Adaptive Reconstruction (DAR) module is introduced to bridge the gap between different forgery domains by reconstructing on unlabeled target videos when fine-tuning. Extensive experiments on public datasets demonstrate that our proposed method performs even better than the state-of-the-art supervised competitors with impressive generalization.

Title: Redundancy-Free Self-Supervised Relational Learning for Graph Clustering. (arXiv:2309.04694v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.04694
Code URL: https://github.com/yisiyu95/r2fgc
Copy Paste: [[2309.04694]] Redundancy-Free Self-Supervised Relational Learning for Graph Clustering(http://arxiv.org/abs/2309.04694)
Summary:
Graph clustering, which learns the node representations for effective cluster assignments, is a fundamental yet challenging task in data analysis and has received considerable attention accompanied by graph neural networks in recent years. However, most existing methods overlook the inherent relational information among the non-independent and non-identically distributed nodes in a graph. Due to the lack of exploration of relational attributes, the semantic information of the graph-structured data fails to be fully exploited which leads to poor clustering performance. In this paper, we propose a novel self-supervised deep graph clustering method named Relational Redundancy-Free Graph Clustering (R$^2$FGC) to tackle the problem. It extracts the attribute- and structure-level relational information from both global and local views based on an autoencoder and a graph autoencoder. To obtain effective representations of the semantic information, we preserve the consistent relation among augmented nodes, whereas the redundant relation is further reduced for learning discriminative embeddings. In addition, a simple yet valid strategy is utilized to alleviate the over-smoothing issue. Extensive experiments are performed on widely used benchmark datasets to validate the superiority of our R$^2$FGC over state-of-the-art baselines. Our codes are available at https://github.com/yisiyu95/R2FGC.

foundation model

Title: Unified Language-Vision Pretraining with Dynamic Discrete Visual Tokenization. (arXiv:2309.04669v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.04669
Code URL: null
Copy Paste: [[2309.04669]] Unified Language-Vision Pretraining with Dynamic Discrete Visual Tokenization(http://arxiv.org/abs/2309.04669)
Summary:
Recently, the remarkable advance of the Large Language Model (LLM) has inspired researchers to transfer its extraordinary reasoning capability to data across several modalities. The prevailing approaches primarily regard visual input as the prompt and focus exclusively on optimizing the text generation process conditioned upon vision content by a frozen LLM. Such an inequitable treatment of vision and language heavily constrains the model's potential. In this paper, we break through this limitation by representing both vision and language in a unified representation. To this end, we craft a visual tokenizer that translates the non-linguistic image into a sequence of discrete tokens like a foreign language that LLM can read. The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image content. Coped with this visual tokenizer, the presented foundation model called LaVIT (Language-VIsion Transformer) can handle both image and text indiscriminately under a unified generative learning paradigm. Pre-trained on the web-scale image-text corpus, LaVIT is empowered with impressive multi-modal comprehension capability. The extensive experiments showcase that it outperforms existing models by a large margin on downstream tasks. Our code and models will be available at https://github.com/jy0205/LaVIT.

Title: SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning. (arXiv:2309.04766v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.04766
Code URL: null
Copy Paste: [[2309.04766]] SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning(http://arxiv.org/abs/2309.04766)
Summary:
We present SeaEval, a benchmark for multilingual foundation models. In addition to characterizing how these models understand and reason with natural language, we also investigate how well they comprehend cultural practices, nuances, and values. Alongside standard accuracy metrics, we investigate the brittleness of foundation models in the dimensions of semantics and multilinguality. Our analyses span both open-sourced and closed models, leading to empirical results across classic NLP tasks, reasoning, and cultural comprehension. Key findings indicate (1) Most models exhibit varied behavior when given paraphrased instructions. (2) Many models still suffer from exposure bias (e.g., positional bias, majority label bias). (3) For questions rooted in factual, scientific, and commonsense knowledge, consistent responses are expected across multilingual queries that are semantically equivalent. Yet, most models surprisingly demonstrate inconsistent performance on these queries. (4) Multilingually-trained models have not attained "balanced multilingual" capabilities. Our endeavors underscore the need for more generalizable semantic representations and enhanced multilingual contextualization. SeaEval can serve as a launchpad for more thorough investigations and evaluations for multilingual and multicultural scenarios.

generative

Title: Style Generation: Image Synthesis based on Coarsely Matched Texts. (arXiv:2309.04608v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.04608
Code URL: null
Copy Paste: [[2309.04608]] Style Generation: Image Synthesis based on Coarsely Matched Texts(http://arxiv.org/abs/2309.04608)
Summary:
Previous text-to-image synthesis algorithms typically use explicit textual instructions to generate/manipulate images accurately, but they have difficulty adapting to guidance in the form of coarsely matched texts. In this work, we attempt to stylize an input image using such coarsely matched text as guidance. To tackle this new problem, we introduce a novel task called text-based style generation and propose a two-stage generative adversarial network: the first stage generates the overall image style with a sentence feature, and the second stage refines the generated style with a synthetic feature, which is produced by a multi-modality style synthesis module. We re-filter one existing dataset and collect a new dataset for the task. Extensive experiments and ablation studies are conducted to validate our framework. The practical potential of our work is demonstrated by various applications such as text-image alignment and story visualization. Our datasets are published at https://www.kaggle.com/datasets/mengyaocui/style-generation.

Title: VeRi3D: Generative Vertex-based Radiance Fields for 3D Controllable Human Image Synthesis. (arXiv:2309.04800v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.04800
Code URL: null
Copy Paste: [[2309.04800]] VeRi3D: Generative Vertex-based Radiance Fields for 3D Controllable Human Image Synthesis(http://arxiv.org/abs/2309.04800)
Summary:
Unsupervised learning of 3D-aware generative adversarial networks has lately made much progress. Some recent work demonstrates promising results of learning human generative models using neural articulated radiance fields, yet their generalization ability and controllability lag behind parametric human models, i.e., they do not perform well when generalizing to novel pose/shape and are not part controllable. To solve these problems, we propose VeRi3D, a generative human vertex-based radiance field parameterized by vertices of the parametric human template, SMPL. We map each 3D point to the local coordinate system defined on its neighboring vertices, and use the corresponding vertex feature and local coordinates for mapping it to color and density values. We demonstrate that our simple approach allows for generating photorealistic human images with free control over camera pose, human pose, shape, as well as enabling part-level editing.

Title: TCGAN: Convolutional Generative Adversarial Network for Time Series Classification and Clustering. (arXiv:2309.04732v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.04732
Code URL: https://bitbucket.org/lynn1/tcgan
Copy Paste: [[2309.04732]] TCGAN: Convolutional Generative Adversarial Network for Time Series Classification and Clustering(http://arxiv.org/abs/2309.04732)
Summary:
Recent works have demonstrated the superiority of supervised Convolutional Neural Networks (CNNs) in learning hierarchical representations from time series data for successful classification. These methods require sufficiently large labeled data for stable learning, however acquiring high-quality labeled time series data can be costly and potentially infeasible. Generative Adversarial Networks (GANs) have achieved great success in enhancing unsupervised and semi-supervised learning. Nonetheless, to our best knowledge, it remains unclear how effectively GANs can serve as a general-purpose solution to learn representations for time series recognition, i.e., classification and clustering. The above considerations inspire us to introduce a Time-series Convolutional GAN (TCGAN). TCGAN learns by playing an adversarial game between two one-dimensional CNNs (i.e., a generator and a discriminator) in the absence of label information. Parts of the trained TCGAN are then reused to construct a representation encoder to empower linear recognition methods. We conducted comprehensive experiments on synthetic and real-world datasets. The results demonstrate that TCGAN is faster and more accurate than existing time-series GANs. The learned representations enable simple classification and clustering methods to achieve superior and stable performance. Furthermore, TCGAN retains high efficacy in scenarios with few-labeled and imbalanced-labeled data. Our work provides a promising path to effectively utilize abundant unlabeled time series data.

Title: AmbientFlow: Invertible generative models from incomplete, noisy measurements. (arXiv:2309.04856v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.04856
Code URL: null
Copy Paste: [[2309.04856]] AmbientFlow: Invertible generative models from incomplete, noisy measurements(http://arxiv.org/abs/2309.04856)
Summary:
Generative models have gained popularity for their potential applications in imaging science, such as image reconstruction, posterior sampling and data sharing. Flow-based generative models are particularly attractive due to their ability to tractably provide exact density estimates along with fast, inexpensive and diverse samples. Training such models, however, requires a large, high quality dataset of objects. In applications such as computed imaging, it is often difficult to acquire such data due to requirements such as long acquisition time or high radiation dose, while acquiring noisy or partially observed measurements of these objects is more feasible. In this work, we propose AmbientFlow, a framework for learning flow-based generative models directly from noisy and incomplete data. Using variational Bayesian methods, a novel framework for establishing flow-based generative models from noisy, incomplete data is proposed. Extensive numerical studies demonstrate the effectiveness of AmbientFlow in correctly learning the object distribution. The utility of AmbientFlow in a downstream inference task of image reconstruction is demonstrated.

anomaly

Title: Mask2Anomaly: Mask Transformer for Universal Open-set Segmentation. (arXiv:2309.04573v1 [cs.CV])

Paper URL: http://arxiv.org/abs/2309.04573
Code URL: null
Copy Paste: [[2309.04573]] Mask2Anomaly: Mask Transformer for Universal Open-set Segmentation(http://arxiv.org/abs/2309.04573)
Summary:
Segmenting unknown or anomalous object instances is a critical task in autonomous driving applications, and it is approached traditionally as a per-pixel classification problem. However, reasoning individually about each pixel without considering their contextual semantics results in high uncertainty around the objects' boundaries and numerous false positives. We propose a paradigm change by shifting from a per-pixel classification to a mask classification. Our mask-based method, Mask2Anomaly, demonstrates the feasibility of integrating a mask-classification architecture to jointly address anomaly segmentation, open-set semantic segmentation, and open-set panoptic segmentation. Mask2Anomaly includes several technical novelties that are designed to improve the detection of anomalies/unknown objects: i) a global masked attention module to focus individually on the foreground and background regions; ii) a mask contrastive learning that maximizes the margin between an anomaly and known classes; iii) a mask refinement solution to reduce false positives; and iv) a novel approach to mine unknown instances based on the mask-architecture properties. By comprehensive qualitative and qualitative evaluation, we show Mask2Anomaly achieves new state-of-the-art results across the benchmarks of anomaly segmentation, open-set semantic segmentation, and open-set panoptic segmentation.

Title: Knowledge Distillation-Empowered Digital Twin for Anomaly Detection. (arXiv:2309.04616v1 [cs.LG])

Paper URL: http://arxiv.org/abs/2309.04616
Code URL: null
Copy Paste: [[2309.04616]] Knowledge Distillation-Empowered Digital Twin for Anomaly Detection(http://arxiv.org/abs/2309.04616)
Summary:
Cyber-physical systems (CPSs), like train control and management systems (TCMS), are becoming ubiquitous in critical infrastructures. As safety-critical systems, ensuring their dependability during operation is crucial. Digital twins (DTs) have been increasingly studied for this purpose owing to their capability of runtime monitoring and warning, prediction and detection of anomalies, etc. However, constructing a DT for anomaly detection in TCMS necessitates sufficient training data and extracting both chronological and context features with high quality. Hence, in this paper, we propose a novel method named KDDT for TCMS anomaly detection. KDDT harnesses a language model (LM) and a long short-term memory (LSTM) network to extract contexts and chronological features, respectively. To enrich data volume, KDDT benefits from out-of-domain data with knowledge distillation (KD). We evaluated KDDT with two datasets from our industry partner Alstom and obtained the F1 scores of 0.931 and 0.915, respectively, demonstrating the effectiveness of KDDT. We also explored individual contributions of the DT model, LM, and KD to the overall performance of KDDT, via a comprehensive empirical study, and observed average F1 score improvements of 12.4%, 3%, and 6.05%, respectively.

in-context

Title: FIAT: Fusing learning paradigms with Instruction-Accelerated Tuning. (arXiv:2309.04663v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.04663
Code URL: null
Copy Paste: [[2309.04663]] FIAT: Fusing learning paradigms with Instruction-Accelerated Tuning(http://arxiv.org/abs/2309.04663)
Summary:
Learning paradigms for large language models (LLMs) currently tend to fall within either in-context learning (ICL) or full fine-tuning. Each of these comes with their own trade-offs based on available data, model size, compute cost, ease-of-use, and final quality with neither solution performing well across-the-board. In this article, we first describe ICL and fine-tuning paradigms in a way that highlights their natural connections. Based on these connections, we propose a new learning paradigm called FIAT that fuses the best of these paradigms together, enabling prompt-engineered instructions and chain-of-thought reasoning with the very largest models while also using similar methods to perform parameter updates on a modestly-sized LLM with parameter-efficient tuning. We evaluate FIAT's effectiveness on a variety of multilingual tasks and observe that FIAT performs better than both ICL and fine-tuning at scales ranging from 100-10,000 training examples. We hope that FIAT provides a practical way of harnessing the full potential of LLMs without needing to make a hard choice between learning paradigms.

Title: Code-Style In-Context Learning for Knowledge-Based Question Answering. (arXiv:2309.04695v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.04695
Code URL: null
Copy Paste: [[2309.04695]] Code-Style In-Context Learning for Knowledge-Based Question Answering(http://arxiv.org/abs/2309.04695)
Summary:
Current methods for Knowledge-Based Question Answering (KBQA) usually rely on complex training techniques and model frameworks, leading to many limitations in practical applications. Recently, the emergence of In-Context Learning (ICL) capabilities in Large Language Models (LLMs) provides a simple and training-free semantic parsing paradigm for KBQA: Given a small number of questions and their labeled logical forms as demo examples, LLMs can understand the task intent and generate the logic form for a new question. However, current powerful LLMs have little exposure to logic forms during pre-training, resulting in a high format error rate. To solve this problem, we propose a code-style in-context learning method for KBQA, which converts the generation process of unfamiliar logical form into the more familiar code generation process for LLMs. Experimental results on three mainstream datasets show that our method dramatically mitigated the formatting error problem in generating logic forms while realizing a new SOTA on WebQSP, GrailQA, and GraphQ under the few-shot setting.

Title: EPA: Easy Prompt Augmentation on Large Language Models via Multiple Sources and Multiple Targets. (arXiv:2309.04725v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.04725
Code URL: null
Copy Paste: [[2309.04725]] EPA: Easy Prompt Augmentation on Large Language Models via Multiple Sources and Multiple Targets(http://arxiv.org/abs/2309.04725)
Summary:
Large language models (LLMs) have shown promising performance on various NLP tasks via task prompting. And their performance can be further improved by appending task demonstrations to the head of the prompt. And usually, a better performance can be achieved with more demonstrations. However, asking the users to write the demonstrations can be cumbersome. As a simple yet cost-effective workaround, this paper proposes a novel method called EPA (\textbf{E}asy \textbf{P}rompt \textbf{A}ugmentation)\footnote{While this paper considers augmenting prompts via demonstrations, we name it EPA as the name EDA is already taken by a well-known NLP method \citep{wei-zou-2019-eda}.} that effectively minimizes user efforts in writing demonstrations while improving the model performance at the same time. EPA achieves these goals by automatically augmenting the demonstrations with multiple sources/targets, where each of them paraphrases each other. This is well motivated as augmenting data via paraphrasing effectively improves neural language models. EPA thus employs paraphrasing as an augmentation method for in-context learning. Extensive experiments indicate that EPA effectively improves both NLU and NLG tasks, covering from natural language inference to machine translation in translating tens of languages.\footnote{Code and data will be released upon publication.}

Title: MMHQA-ICL: Multimodal In-context Learning for Hybrid Question Answering over Text, Tables and Images. (arXiv:2309.04790v1 [cs.CL])

Paper URL: http://arxiv.org/abs/2309.04790
Code URL: null
Copy Paste: [[2309.04790]] MMHQA-ICL: Multimodal In-context Learning for Hybrid Question Answering over Text, Tables and Images(http://arxiv.org/abs/2309.04790)
Summary:
In the real world, knowledge often exists in a multimodal and heterogeneous form. Addressing the task of question answering with hybrid data types, including text, tables, and images, is a challenging task (MMHQA). Recently, with the rise of large language models (LLM), in-context learning (ICL) has become the most popular way to solve QA problems. We propose MMHQA-ICL framework for addressing this problems, which includes stronger heterogeneous data retriever and an image caption module. Most importantly, we propose a Type-specific In-context Learning Strategy for MMHQA, enabling LLMs to leverage their powerful performance in this task. We are the first to use end-to-end LLM prompting method for this task. Experimental results demonstrate that our framework outperforms all baselines and methods trained on the full dataset, achieving state-of-the-art results under the few-shot setting on the MultimodalQA dataset.