2024-06-18

Title: Towards Signal Processing In Large Language Models

Authors: Prateek Verma, Mert Pilanci
Subjects: cs.CL, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Towards Signal Processing In Large Language Models(https://arxiv.org/abs/)
Keywords: generative
Abstract: This paper introduces the idea of applying signal processing inside a Large Language Model (LLM). With the recent explosion of generative AI, our work can help bridge two fields together, namely the field of signal processing and large language models. We draw parallels between classical Fourier-Transforms and Fourier Transform-like learnable time-frequency representations for every intermediate activation signal of an LLM. Once we decompose every activation signal across tokens into a time-frequency representation, we learn how to filter and reconstruct them, with all components learned from scratch, to predict the next token given the previous context. We show that for GPT-like architectures, our work achieves faster convergence and significantly increases performance by adding a minuscule number of extra parameters when trained for the same epochs. We hope this work paves the way for algorithms exploring signal processing inside the signals found in neural architectures like LLMs and beyond.

Title: Unused information in token probability distribution of generative LLM: improving LLM reading comprehension through calculation of expected values

Authors: Krystian Zawistowski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Unused information in token probability distribution of generative LLM: improving LLM reading comprehension through calculation of expected values(https://arxiv.org/abs/)
Keywords: generative
Abstract: LLM text decoding is a key component of perceived LLM quality. We demonstrate two experiments showing that decoding methods could be improved by manipulation of token probabilities. First, we test a few LLMs on the SummEval summary scoring dataset to measure reading comprehension. We compare scores from greedy decoding to expected values over the next token distribution. We scale logits by a large temperature to increase the entropy of scores. This allows a strong improvement of performance on SummEval (in terms of correlations to human judgement). We see improvement from 6-8% to 13-28% for 7B Mistral and from 20%-46% to 37%-56% for Mixtral, beating the GPT 4 0314 result on two metrics. Part of the gain seems related to positional bias. Secondly, we use a probability-based tree sampling algorithm to examine all most probable generations for a given prompt.

Title: ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets

Authors: Shahin Amiriparian, Filip Packań, Maurice Gerczuk, Björn W. Schuller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets(https://arxiv.org/abs/)
Keywords: foundation model
Abstract: Foundation models have shown great promise in speech emotion recognition (SER) by leveraging their pre-trained representations to capture emotion patterns in speech signals. To further enhance SER performance across various languages and domains, we propose a novel twofold approach. First, we gather EmoSet++, a comprehensive multi-lingual, multi-cultural speech emotion corpus with 37 datasets, 150,907 samples, and a total duration of 119.5 hours. Second, we introduce ExHuBERT, an enhanced version of HuBERT achieved by backbone extension and fine-tuning on EmoSet++. We duplicate each encoder layer and its weights, then freeze the first duplicate, integrating an extra zero-initialized linear layer and skip connections to preserve functionality and ensure its adaptability for subsequent fine-tuning. Our evaluation on unseen datasets shows the efficacy of ExHuBERT, setting a new benchmark for various SER tasks. Model and details on EmoSet++: this https URL.

Title: Hardware-based stack buffer overflow attack detection on RISC-V architectures

Authors: Cristiano Pegoraro Chenet, Ziteng Zhang, Alessandro Savino, Stefano Di Carlo
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Hardware-based stack buffer overflow attack detection on RISC-V architectures(https://arxiv.org/abs/)
Keywords: anomaly
Abstract: This work evaluates how well hardware-based approaches detect stack buffer overflow (SBO) attacks in RISC-V systems. We conducted simulations on the PULP platform and examined micro-architecture events using semi-supervised anomaly detection techniques. The findings showed the challenge of detection performance. Thus, a potential solution combines software and hardware-based detectors concurrently, with hardware as the primary defense. The hardware-based approaches present compelling benefits that could enhance RISC-V-based architectures.

Title: Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection

Authors: Zihan Pan, Tianchi Liu, Hardik B. Sailor, Qiongqiong Wang
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection(https://arxiv.org/abs/)
Keywords: self-supervised
Abstract: Self-supervised learning (SSL) speech representation models, trained on large speech corpora, have demonstrated effectiveness in extracting hierarchical speech embeddings through multiple transformer layers. However, the behavior of these embeddings in specific tasks remains uncertain. This paper investigates the multi-layer behavior of the WavLM model in anti-spoofing and proposes an attentive merging method to leverage the hierarchical hidden embeddings. Results demonstrate the feasibility of fine-tuning WavLM to achieve the best equal error rate (EER) of 0.65%, 3.50%, and 3.19% on the ASVspoof 2019LA, 2021LA, and 2021DF evaluation sets, respectively. Notably, we find that the early hidden transformer layers of the WavLM large model contribute significantly to the anti-spoofing task, enabling computational efficiency by utilizing a partial pre-trained model.

Title: VeraCT Scan: Retrieval-Augmented Fake News Detection with Justifiable Reasoning

Authors: Cheng Niu, Yang Guan, Yuanhao Wu, Juno Zhu, Juntong Song, Randy Zhong, Kaihua Zhu, Siliang Xu, Shizhe Diao, Tong Zhang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] VeraCT Scan: Retrieval-Augmented Fake News Detection with Justifiable Reasoning(https://arxiv.org/abs/)
Keywords: generative
Abstract: The proliferation of fake news poses a significant threat not only by disseminating misleading information but also by undermining the very foundations of democracy. The recent advance of generative artificial intelligence has further exacerbated the challenge of distinguishing genuine news from fabricated stories. In response to this challenge, we introduce VeraCT Scan, a novel retrieval-augmented system for fake news detection. This system operates by extracting the core facts from a given piece of news and subsequently conducting an internet-wide search to identify corroborating or conflicting reports. Then sources' credibility is leveraged for information verification. Besides determining the veracity of news, we also provide transparent evidence and reasoning to support its conclusions, resulting in the interpretability and trust in the results. In addition to GPT-4 Turbo, Llama-2 13B is also fine-tuned for news content understanding, information verification, and reasoning. Both implementations have demonstrated state-of-the-art accuracy in the realm of fake news detection.

Title: CLST: Cold-Start Mitigation in Knowledge Tracing by Aligning a Generative Language Model as a Students' Knowledge Tracer

Authors: Heeseok Jung, Jaesang Yoo, Yohaan Yoon, Yeonju Jang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] CLST: Cold-Start Mitigation in Knowledge Tracing by Aligning a Generative Language Model as a Students' Knowledge Tracer(https://arxiv.org/abs/)
Keywords: generative
Abstract: Knowledge tracing (KT), wherein students' problem-solving histories are used to estimate their current levels of knowledge, has attracted significant interest from researchers. However, most existing KT models were developed with an ID-based paradigm, which exhibits limitations in cold-start performance. These limitations can be mitigated by leveraging the vast quantities of external knowledge possessed by generative large language models (LLMs). In this study, we propose cold-start mitigation in knowledge tracing by aligning a generative language model as a students' knowledge tracer (CLST) as a framework that utilizes a generative LLM as a knowledge tracer. Upon collecting data from math, social studies, and science subjects, we framed the KT task as a natural language processing task, wherein problem-solving data are expressed in natural language, and fine-tuned the generative LLM using the formatted KT dataset. Subsequently, we evaluated the performance of the CLST in situations of data scarcity using various baseline models for comparison. The results indicate that the CLST significantly enhanced performance with a dataset of fewer than 100 students in terms of prediction, reliability, and cross-domain generalization.

Title: What is the best model? Application-driven Evaluation for Large Language Models

Authors: Shiguo Lian, Kaikai Zhao, Xinhui Liu, Xuejiao Lei, Bikun Yang, Wenjing Zhang, Kai Wang, Zhaoxiang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] What is the best model? Application-driven Evaluation for Large Language Models(https://arxiv.org/abs/)
Keywords: foundation model
Abstract: General large language models enhanced with supervised fine-tuning and reinforcement learning from human feedback are increasingly popular in academia and industry as they generalize foundation models to various practical tasks in a prompt manner. To assist users in selecting the best model in practical application scenarios, i.e., choosing the model that meets the application requirements while minimizing cost, we introduce A-Eval, an application-driven LLM evaluation benchmark for general large language models. First, we categorize evaluation tasks into five main categories and 27 sub-categories from a practical application perspective. Next, we construct a dataset comprising 678 question-and-answer pairs through a process of collecting, annotating, and reviewing. Then, we design an objective and effective evaluation method and evaluate a series of LLMs of different scales on A-Eval. Finally, we reveal interesting laws regarding model scale and task difficulty level and propose a feasible method for selecting the best model. Through A-Eval, we provide clear empirical and engineering guidance for selecting the best model, reducing barriers to selecting and using LLMs and promoting their application and development. Our benchmark is publicly available at this https URL.

Title: Creating a Lens of Chinese Culture: A Multimodal Dataset for Chinese Pun Rebus Art Understanding

Authors: Tuo Zhang, Tiantian Feng, Yibin Ni, Mengqin Cao, Ruying Liu, Katharine Butler, Yanjun Weng, Mi Zhang, Shrikanth S. Narayanan, Salman Avestimehr
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Creating a Lens of Chinese Culture: A Multimodal Dataset for Chinese Pun Rebus Art Understanding(https://arxiv.org/abs/)
Keywords: in-context
Abstract: Large vision-language models (VLMs) have demonstrated remarkable abilities in understanding everyday content. However, their performance in the domain of art, particularly culturally rich art forms, remains less explored. As a pearl of human wisdom and creativity, art encapsulates complex cultural narratives and symbolism. In this paper, we offer the Pun Rebus Art Dataset, a multimodal dataset for art understanding deeply rooted in traditional Chinese culture. We focus on three primary tasks: identifying salient visual elements, matching elements with their symbolic meanings, and explanations for the conveyed messages. Our evaluation reveals that state-of-the-art VLMs struggle with these tasks, often providing biased and hallucinated explanations and showing limited improvement through in-context learning. By releasing the Pun Rebus Art Dataset, we aim to facilitate the development of VLMs that can better understand and interpret culturally specific content, promoting greater inclusiveness beyond English-based corpora.

Title: VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs

Authors: Rohit Bharadwaj, Hanan Gani, Muzammal Naseer, Fahad Shahbaz Khan, Salman Khan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs(https://arxiv.org/abs/)
Keywords: anomaly
Abstract: The recent developments in Large Multi-modal Video Models (Video-LMMs) have significantly enhanced our ability to interpret and analyze video data. Despite their impressive capabilities, current Video-LMMs have not been evaluated for anomaly detection tasks, which are critical to their deployment in practical scenarios, e.g., for identifying deepfakes, manipulated video content, traffic accidents, and crimes. In this paper, we introduce VANE-Bench, a benchmark designed to assess the proficiency of Video-LMMs in detecting and localizing anomalies and inconsistencies in videos. Our dataset comprises an array of videos synthetically generated using existing state-of-the-art text-to-video generation models, encompassing a variety of subtle anomalies and inconsistencies grouped into five categories: unnatural transformations, unnatural appearance, pass-through, disappearance, and sudden appearance. Additionally, our benchmark features real-world samples from existing anomaly detection datasets, focusing on crime-related irregularities, atypical pedestrian behavior, and unusual events. The task is structured as a visual question-answering challenge to gauge the models' ability to accurately detect and localize the anomalies within the videos. We evaluate nine existing Video-LMMs, both open- and closed-source, on this benchmarking task and find that most of the models encounter difficulties in effectively identifying the subtle anomalies. In conclusion, our research offers significant insights into the current capabilities of Video-LMMs in the realm of anomaly detection, highlighting the importance of our work in evaluating and improving these models for real-world applications. Our code and data are available at this https URL

Title: SigDiffusions: Score-Based Diffusion Models for Long Time Series via Log-Signature Embeddings

Authors: Barbora Barancikova, Zhuoyue Huang, Cristopher Salvi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] SigDiffusions: Score-Based Diffusion Models for Long Time Series via Log-Signature Embeddings(https://arxiv.org/abs/)
Keywords: diffusion, generative
Abstract: Score-based diffusion models have recently emerged as state-of-the-art generative models for a variety of data modalities. Nonetheless, it remains unclear how to adapt these models to generate long multivariate time series. Viewing a time series as the discretization of an underlying continuous process, we introduce SigDiffusion, a novel diffusion model operating on log-signature embeddings of the data. The forward and backward processes gradually perturb and denoise log-signatures preserving their algebraic structure. To recover a signal from its log-signature, we provide new closed-form inversion formulae expressing the coefficients obtained by expanding the signal in a given basis (e.g. Fourier or orthogonal polynomials) as explicit polynomial functions of the log-signature. Finally, we show that combining SigDiffusion with these inversion formulae results in highly realistic time series generation, competitive with the current state-of-the-art on various datasets of synthetic and real-world examples.

Title: EWEK-QA: Enhanced Web and Efficient Knowledge Graph Retrieval for Citation-based Question Answering Systems

Authors: Mohammad Dehghan, Mohammad Ali Alomrani, Sunyam Bagga, David Alfonso-Hermelo, Khalil Bibi, Abbas Ghaddar, Yingxue Zhang, Xiaoguang Li, Jianye Hao, Qun Liu, Jimmy Lin, Boxing Chen, Prasanna Parthasarathi, Mahdi Biparva, Mehdi Rezagholizadeh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] EWEK-QA: Enhanced Web and Efficient Knowledge Graph Retrieval for Citation-based Question Answering Systems(https://arxiv.org/abs/)
Keywords: generative
Abstract: The emerging citation-based QA systems are gaining more attention, especially in generative AI search applications. The extracted knowledge provided to these systems is vital for both accuracy (completeness of information) and efficiency (extracting the information in a timely manner). In this regard, citation-based QA systems suffer from two shortcomings. First, they usually rely only on the web as a source of extracted knowledge, and adding other external knowledge sources can hamper the efficiency of the system. Second, web-retrieved contents are usually obtained by simple heuristics such as fixed length or breakpoints, which might lead to splitting information into pieces. To mitigate these issues, we propose our enhanced web and efficient knowledge graph (KG) retrieval solution (EWEK-QA) to enrich the content of the extracted knowledge fed to the system. This has been done through designing an adaptive web retriever and incorporating KG triples in an efficient manner. We demonstrate the effectiveness of EWEK-QA over the open-source state-of-the-art (SoTA) web-based and KG baseline models using a comprehensive set of quantitative and human evaluation experiments. Our model is able to: first, improve the web-retriever baseline in terms of extracting more relevant passages (>20%), the coverage of answer span (>25%) and self containment (>35%); second, obtain and integrate KG triples into its pipeline very efficiently (by avoiding any LLM calls) to outperform the web-only and KG-only SoTA baselines significantly in 7 quantitative QA tasks and our human evaluation.

Title: Towards Neural Scaling Laws for Foundation Models on Temporal Graphs

Authors: Razieh Shirzadkhani, Tran Gia Bao Ngo, Kiarash Shamsi, Shenyang Huang, Farimah Poursafaei, Poupak Azad, Reihaneh Rabbany, Baris Coskunuzer, Guillaume Rabusseau, Cuneyt Gurcan Akcora
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Towards Neural Scaling Laws for Foundation Models on Temporal Graphs(https://arxiv.org/abs/)
Keywords: foundation model
Abstract: The field of temporal graph learning aims to learn from evolving network data to forecast future interactions. Given a collection of observed temporal graphs, is it possible to predict the evolution of an unseen network from the same domain? To answer this question, we first present the Temporal Graph Scaling (TGS) dataset, a large collection of temporal graphs consisting of eighty-four ERC20 token transaction networks collected from 2017 to 2023. Next, we evaluate the transferability of Temporal Graph Neural Networks (TGNNs) for the temporal graph property prediction task by pre-training on a collection of up to sixty-four token transaction networks and then evaluating the downstream performance on twenty unseen token networks. We find that the neural scaling law observed in NLP and Computer Vision also applies in temporal graph learning, where pre-training on a greater number of networks leads to improved downstream performance. To the best of our knowledge, this is the first empirical demonstration of the transferability of temporal graph learning. On downstream token networks, the largest pre-trained model outperforms single model TGNNs on thirteen unseen test networks. Therefore, we believe that this is a promising first step towards building foundation models for temporal graphs.

Title: Consistency-diversity-realism Pareto fronts of conditional image generative models

Authors: Pietro Astolfi, Marlene Careil, Melissa Hall, Oscar Mañas, Matthew Muckley, Jakob Verbeek, Adriana Romero Soriano, Michal Drozdzal
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Consistency-diversity-realism Pareto fronts of conditional image generative models(https://arxiv.org/abs/)
Keywords: diffusion, generative
Abstract: Building world models that accurately and comprehensively represent the real world is the utmost aspiration for conditional image generative models as it would enable their use as world simulators. For these models to be successful world models, they should not only excel at image quality and prompt-image consistency but also ensure high representation diversity. However, current research in generative models mostly focuses on creative applications that are predominantly concerned with human preferences of image quality and aesthetics. We note that generative models have inference time mechanisms - or knobs - that allow the control of generation consistency, quality, and diversity. In this paper, we use state-of-the-art text-to-image and image-and-text-to-image models and their knobs to draw consistency-diversity-realism Pareto fronts that provide a holistic view on consistency-diversity-realism multi-objective. Our experiments suggest that realism and consistency can both be improved simultaneously; however there exists a clear tradeoff between realism/consistency and diversity. By looking at Pareto optimal points, we note that earlier models are better at representation diversity and worse in consistency/realism, and more recent models excel in consistency/realism while decreasing significantly the representation diversity. By computing Pareto fronts on a geodiverse dataset, we find that the first version of latent diffusion models tends to perform better than more recent models in all axes of evaluation, and there exist pronounced consistency-diversity-realism disparities between geographical regions. Overall, our analysis clearly shows that there is no best model and the choice of model should be determined by the downstream application. With this analysis, we invite the research community to consider Pareto fronts as an analytical tool to measure progress towards world models.

Title: Enhancing In-Context Learning with Semantic Representations for Relation Extraction

Authors: Peitao Han, Lis Kanashiro Pereira, Fei Cheng, Wan Jou She, Eiji Aramaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Enhancing In-Context Learning with Semantic Representations for Relation Extraction(https://arxiv.org/abs/)
Keywords: in-context
Abstract: In this work, we employ two AMR-enhanced semantic representations for ICL on RE: one that explores the AMR structure generated for a sentence at the subgraph level (shortest AMR path), and another that explores the full AMR structure generated for a sentence. In both cases, we demonstrate that all settings benefit from the fine-grained AMR's semantic structure. We evaluate our model on four RE datasets. Our results show that our model can outperform the GPT-based baselines, and achieve SOTA performance on two of the datasets, and competitive performance on the other two.

Title: Domain-Specific Shorthand for Generation Based on Context-Free Grammar

Authors: Andriy Kanyuka, Elias Mahfoud
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Domain-Specific Shorthand for Generation Based on Context-Free Grammar(https://arxiv.org/abs/)
Keywords: generative
Abstract: The generation of structured data in formats such as JSON, YAML and XML is a critical task in Generative AI (GenAI) applications. These formats, while widely used, contain many redundant constructs that lead to inflated token usage. This inefficiency is particularly evident when employing large language models (LLMs) like GPT-4, where generating extensive structured data incurs increased latency and operational costs. We introduce a domain-specific shorthand (DSS) format, underpinned by a context-free grammar (CFG), and demonstrate its usage to reduce the number of tokens required for structured data generation. The method involves creating a shorthand notation that captures essential elements of the output schema with fewer tokens, ensuring it can be unambiguously converted to and from its verbose form. It employs a CFG to facilitate efficient shorthand generation by the LLM, and to create parsers to translate the shorthand back into standard structured formats. The application of our approach to data visualization with LLMs demonstrates a significant (3x to 5x) reduction in generated tokens, leading to significantly lower latency and cost. This paper outlines the development of the DSS and the accompanying CFG, and the implications of this approach for GenAI applications, presenting a scalable solution to the token inefficiency problem in structured data generation.

Title: The BabyView dataset: High-resolution egocentric videos of infants' and young children's everyday experiences

Authors: Bria Long, Violet Xiang, Stefan Stojanov, Robert Z. Sparks, Zi Yin, Grace E. Keene, Alvin W. M. Tan, Steven Y. Feng, Chengxu Zhuang, Virginia A. Marchman, Daniel L. K. Yamins, Michael C. Frank
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] The BabyView dataset: High-resolution egocentric videos of infants' and young children's everyday experiences(https://arxiv.org/abs/)
Keywords: self-supervised
Abstract: Human children far exceed modern machine learning algorithms in their sample efficiency, achieving high performance in key domains with much less data than current models. This "data gap" is a key challenge both for building intelligent artificial systems and for understanding human development. Egocentric video capturing children's experience -- their "training data" -- is a key ingredient for comparison of humans and models and for the development of algorithmic innovations to bridge this gap. Yet there are few such datasets available, and extant data are low-resolution, have limited metadata, and importantly, represent only a small set of children's experiences. Here, we provide the first release of the largest developmental egocentric video dataset to date -- the BabyView dataset -- recorded using a high-resolution camera with a large vertical field-of-view and gyroscope/accelerometer data. This 493-hour dataset includes egocentric videos from children spanning 6 months - 5 years of age in both longitudinal, at-home contexts and in a preschool environment. We provide gold-standard annotations for the evaluation of speech transcription, speaker diarization, and human pose estimation, and evaluate models in each of these domains. We train self-supervised language and vision models and evaluate their transfer to out-of-distribution tasks including syntactic structure learning, object recognition, depth estimation, and image segmentation. Although performance in each scales with dataset size, overall performance is relatively lower than when models are trained on curated datasets, especially in the visual domain. Our dataset stands as an open challenge for robust, humanlike AI systems: how can such systems achieve human-levels of success on the same scale and distribution of training data as humans?

Title: CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation

Authors: Wei Chen, Lin Li, Yongqi Yang, Bin Wen, Fan Yang, Tingting Gao, Yu Wu, Long Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation(https://arxiv.org/abs/)
Keywords: in-context
Abstract: Interleaved image-text generation has emerged as a crucial multimodal task, aiming at creating sequences of interleaved visual and textual content given a query. Despite notable advancements in recent multimodal large language models (MLLMs), generating integrated image-text sequences that exhibit narrative coherence and entity and style consistency remains challenging due to poor training data quality. To address this gap, we introduce CoMM, a high-quality Coherent interleaved image-text MultiModal dataset designed to enhance the coherence, consistency, and alignment of generated multimodal content. Initially, CoMM harnesses raw data from diverse sources, focusing on instructional content and visual storytelling, establishing a foundation for coherent and consistent content. To further refine the data quality, we devise a multi-perspective filter strategy that leverages advanced pre-trained models to ensure the development of sentences, consistency of inserted images, and semantic alignment between them. Various quality evaluation metrics are designed to prove the high quality of the filtered dataset. Meanwhile, extensive few-shot experiments on various downstream tasks demonstrate CoMM's effectiveness in significantly enhancing the in-context learning capabilities of MLLMs. Moreover, we propose four new tasks to evaluate MLLMs' interleaved generation abilities, supported by a comprehensive evaluation framework. We believe CoMM opens a new avenue for advanced MLLMs with superior multimodal in-context learning and understanding ability.

Title: From Words to Worlds: Transforming One-line Prompt into Immersive Multi-modal Digital Stories with Communicative LLM Agent

Authors: Samuel S. Sohn, Danrui Li, Sen Zhang, Che-Jui Chang, Mubbasir Kapadia
Subjects: cs.CL, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] From Words to Worlds: Transforming One-line Prompt into Immersive Multi-modal Digital Stories with Communicative LLM Agent(https://arxiv.org/abs/)
Keywords: generative
Abstract: Digital storytelling, essential in entertainment, education, and marketing, faces challenges in production scalability and flexibility. The StoryAgent framework, introduced in this paper, utilizes Large Language Models and generative tools to automate and refine digital storytelling. Employing a top-down story drafting and bottom-up asset generation approach, StoryAgent tackles key issues such as manual intervention, interactive scene orchestration, and narrative consistency. This framework enables efficient production of interactive and consistent narratives across multiple modalities, democratizing content creation and enhancing engagement. Our results demonstrate the framework's capability to produce coherent digital stories without reference videos, marking a significant advancement in automated digital storytelling.

Title: Large Language Models as Event Forecasters

Authors: Libo Zhang, Yue Ning
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Large Language Models as Event Forecasters(https://arxiv.org/abs/)
Keywords: generative
Abstract: Key elements of human events are extracted as quadruples that consist of subject, relation, object, and timestamp. This representation can be extended to a quintuple by adding a fifth element: a textual summary that briefly describes the event. These quadruples or quintuples, when organized within a specific domain, form a temporal knowledge graph (TKG). Current learning frameworks focus on a few TKG-related tasks, such as predicting an object given a subject and a relation or forecasting the occurrences of multiple types of events (i.e., relation) in the next time window. They typically rely on complex structural and sequential models like graph neural networks (GNNs) and recurrent neural networks (RNNs) to update intermediate embeddings. However, these methods often neglect the contextual information inherent in each quintuple, which can be effectively captured through concise textual descriptions. In this paper, we investigate how large language models (LLMs) can streamline the design of TKG learning frameworks while maintaining competitive accuracy in prediction and forecasting tasks. We develop multiple prompt templates to frame the object prediction (OP) task as a standard question-answering (QA) task, suitable for instruction fine-tuning with an encoder-decoder generative LLM. For multi-event forecasting (MEF), we design simple yet effective prompt templates for each TKG quintuple. This novel approach removes the need for GNNs and RNNs, instead utilizing an encoder-only LLM to generate fixed intermediate embeddings, which are subsequently processed by a prediction head with a self-attention mechanism to forecast potential future relations. Extensive experiments on multiple real-world datasets using various evaluation metrics validate the effectiveness and robustness of our approach.

Title: Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition

Authors: Weichao Zhao, Wengang Zhou, Hezhen Hu, Min Wang, Houqiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition(https://arxiv.org/abs/)
Keywords: self-supervised
Abstract: Recently, there have been efforts to improve the performance in sign language recognition by designing self-supervised learning methods. However, these methods capture limited information from sign pose data in a frame-wise learning manner, leading to sub-optimal solutions. To this end, we propose a simple yet effective self-supervised contrastive learning framework to excavate rich context via spatial-temporal consistency from two distinct perspectives and learn instance discriminative representation for sign language recognition. On one hand, since the semantics of sign language are expressed by the cooperation of fine-grained hands and coarse-grained trunks, we utilize both granularity information and encode them into latent spaces. The consistency between hand and trunk features is constrained to encourage learning consistent representation of instance samples. On the other hand, inspired by the complementary property of motion and joint modalities, we first introduce first-order motion information into sign language modeling. Additionally, we further bridge the interaction between the embedding spaces of both modalities, facilitating bidirectional knowledge transfer to enhance sign language representation. Our method is evaluated with extensive experiments on four public benchmarks, and achieves new state-of-the-art performance with a notable margin. The source code is publicly available at this https URL.
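
The "first-order motion information" mentioned above can be sketched as follows, assuming it means the frame-to-frame difference of joint coordinates (a common convention for pose sequences; the abstract does not define it). The function name and the toy pose data are hypothetical.

```python
def first_order_motion(joints):
    """Compute per-joint displacement between consecutive frames.

    joints: list of frames, each frame a list of (x, y) keypoints.
    Returns a list with one fewer frame than the input.
    """
    motion = []
    for prev, curr in zip(joints, joints[1:]):
        motion.append(
            [(cx - px, cy - py) for (px, py), (cx, cy) in zip(prev, curr)]
        )
    return motion


# Toy sequence: 3 frames, 2 joints each.
pose = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.5, 0.0), (1.0, 2.0)],
    [(1.5, 0.0), (1.0, 2.0)],
]
motion = first_order_motion(pose)
```

Here the joint stream carries where each keypoint is and the motion stream carries how it moves, which is the sense in which the two modalities are complementary.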

Title: Learning to Adapt Foundation Model DINOv2 for Capsule Endoscopy Diagnosis

Authors: Bowen Zhang, Ying Chen, Yan Zhao, Jianhua Zhang, Hongliang Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Learning to Adapt Foundation Model DINOv2 for Capsule Endoscopy Diagnosis(https://arxiv.org/abs/)
Keywords: foundation model
Abstract: Foundation models have become prominent in computer vision, achieving notable success in various tasks. However, their effectiveness largely depends on pre-training with extensive datasets. Applying foundation models directly to small datasets of capsule endoscopy images from scratch is challenging. Pre-training on broad, general vision datasets is crucial for successfully fine-tuning our model for specific tasks. In this work, we introduce a simplified approach called Adapt foundation models with a low-rank adaptation (LoRA) technique for easier customization. Our method, inspired by the DINOv2 foundation model, applies low-rank adaptation learning to tailor foundation models for capsule endoscopy diagnosis effectively. Unlike traditional fine-tuning methods, our strategy includes LoRA layers designed to absorb specific surgical domain knowledge. During the training process, we keep the main model (the backbone encoder) fixed and focus on optimizing the LoRA layers and the disease classification component. We tested our method on two publicly available datasets for capsule endoscopy disease classification. The results were impressive, with our model achieving 97.75% accuracy on the Kvasir-Capsule dataset and 98.81% on the Kvasirv2 dataset. Our solution demonstrates that foundation models can be adeptly adapted for capsule endoscopy diagnosis, highlighting that mere reliance on straightforward fine-tuning or pre-trained models from general computer vision tasks is inadequate for such specific applications.
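
The low-rank adaptation idea described above (frozen backbone, trainable LoRA layers) can be sketched in plain Python. This is a generic LoRA linear layer, not the authors' implementation; the class name, the deterministic initialisation of `A`, and the `alpha / rank` scaling are assumptions taken from the usual LoRA formulation.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRALinear:
    """y = W x + (alpha / rank) * B (A x), with W frozen and only A, B trainable."""

    def __init__(self, weight, rank, alpha=1.0):
        self.weight = weight  # frozen pre-trained weight, shape (out, in)
        out_dim, in_dim = len(weight), len(weight[0])
        self.scale = alpha / rank
        # A is normally random; fixed small values are used here for reproducibility.
        self.A = [[0.01 * (i + j + 1) for j in range(in_dim)] for i in range(rank)]
        # B starts at zero, so the adapted layer initially equals the frozen one.
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameters(self):
        """Only the low-rank factors would receive gradient updates."""
        return [self.A, self.B]


layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=1)
initial = layer.forward([1.0, 1.0])   # identical to the frozen layer's output
layer.B[0][0] = 1.0                   # stand-in for a training update
adapted = layer.forward([1.0, 1.0])
```

With rank `r`, the adapter adds `r * (in + out)` trainable values per layer instead of `in * out`, which is why the backbone can stay fixed.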

Title: Lift Your Molecules: Molecular Graph Generation in Latent Euclidean Space

Authors: Mohamed Amine Ketata, Nicholas Gao, Johanna Sommer, Tom Wollschläger, Stephan Günnemann
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Lift Your Molecules: Molecular Graph Generation in Latent Euclidean Space(https://arxiv.org/abs/)
Keywords: diffusion, generative
Abstract: We introduce a new framework for molecular graph generation with 3D molecular generative models. Our Synthetic Coordinate Embedding (SyCo) framework maps molecular graphs to Euclidean point clouds via synthetic conformer coordinates and learns the inverse map using an E(n)-Equivariant Graph Neural Network (EGNN). The induced point cloud-structured latent space is well-suited to apply existing 3D molecular generative models. This approach simplifies the graph generation problem, without relying on molecular fragments or autoregressive decoding, into a point cloud generation problem followed by node and edge classification tasks. Further, we propose a novel similarity-constrained optimization scheme for 3D diffusion models based on inpainting and guidance. As a concrete implementation of our framework, we develop EDM-SyCo based on the E(3) Equivariant Diffusion Model (EDM). EDM-SyCo achieves state-of-the-art performance in distribution learning of molecular graphs, outperforming the best non-autoregressive methods by more than 30% on ZINC250K and 16% on the large-scale GuacaMol dataset while improving conditional generation by up to 3.9 times.

Title: MALLM-GAN: Multi-Agent Large Language Model as Generative Adversarial Network for Synthesizing Tabular Data

Authors: Yaobin Ling, Xiaoqian Jiang, Yejin Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] MALLM-GAN: Multi-Agent Large Language Model as Generative Adversarial Network for Synthesizing Tabular Data(https://arxiv.org/abs/)
Keywords: generative
Abstract: In the era of big data, access to abundant data is crucial for driving research forward. However, such data is often inaccessible due to privacy concerns or high costs, particularly in the healthcare domain. Generating synthetic (tabular) data can address this, but existing models typically require substantial amounts of data to train effectively, contradicting our objective of solving data scarcity. To address this challenge, we propose a novel framework to generate synthetic tabular data, powered by large language models (LLMs), that emulates the architecture of a Generative Adversarial Network (GAN). By incorporating the data generation process as contextual information and utilizing an LLM as the optimizer, our approach significantly enhances the quality of synthetic data generation in common scenarios with small sample sizes. Our experimental results on public and private datasets demonstrate that our model outperforms several state-of-the-art models in generating higher-quality synthetic data for downstream tasks while preserving the privacy of the real data.

Title: Self-Supervised Vision Transformer for Enhanced Virtual Clothes Try-On

Authors: Lingxiao Lu, Shengyi Wu, Haoxuan Sun, Junhong Gou, Jianlou Si, Chen Qian, Jianfu Zhang, Liqing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Self-Supervised Vision Transformer for Enhanced Virtual Clothes Try-On(https://arxiv.org/abs/)
Keywords: diffusion, self-supervised
Abstract: Virtual clothes try-on has emerged as a vital feature in online shopping, offering consumers a critical tool to visualize how clothing fits. In our research, we introduce an innovative approach for virtual clothes try-on, utilizing a self-supervised Vision Transformer (ViT) coupled with a diffusion model. Our method emphasizes detail enhancement by contrasting local clothing image embeddings, generated by ViT, with their global counterparts. Techniques such as conditional guidance and focus on key regions have been integrated into our approach. These combined strategies empower the diffusion model to reproduce clothing details with increased clarity and realism. The experimental results showcase substantial advancements in the realism and precision of details in virtual try-on experiences, significantly surpassing the capabilities of existing technologies.

Title: NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows

Authors: Zhenggang Tang, Zhongzheng Ren, Xiaoming Zhao, Bowen Wen, Jonathan Tremblay, Stan Birchfield, Alexander Schwing
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows(https://arxiv.org/abs/)
Keywords: diffusion
Abstract: We present a method for automatically modifying a NeRF representation based on a single observation of a non-rigidly transformed version of the original scene. Our method defines the transformation as a 3D flow, specifically as a weighted linear blending of rigid transformations of 3D anchor points that are defined on the surface of the scene. In order to identify anchor points, we introduce a novel correspondence algorithm that first matches RGB-based pairs, then leverages multi-view information and 3D reprojection to robustly filter false positives in two steps. We also introduce a new dataset for exploring the problem of modifying a NeRF scene through a single observation. Our dataset (this https URL) contains 113 synthetic scenes leveraging 47 3D assets. We show that our proposed method outperforms NeRF editing methods as well as diffusion-based methods, and we also explore different methods for filtering correspondences.
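
A 3D flow defined as a weighted linear blending of rigid transformations can be sketched as below. This is one plausible reading of the abstract, not the paper's formulation: the Gaussian distance weighting, the `sigma` parameter, and the function names are assumptions made for illustration.

```python
import math


def apply_rigid(rotation, translation, p):
    """Apply a rigid transform (3x3 rotation, 3-vector translation) to point p."""
    return tuple(
        sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def blended_flow(point, anchors, transforms, sigma=1.0):
    """Flow at `point`: weighted blend of its rigidly moved copies, minus `point`.

    anchors: list of 3D anchor positions.
    transforms: one (rotation, translation) pair per anchor.
    Weights fall off with distance to each anchor (assumed Gaussian kernel).
    """
    weights = [
        math.exp(-math.dist(point, a) ** 2 / (2 * sigma ** 2)) for a in anchors
    ]
    total = sum(weights)
    moved = [0.0, 0.0, 0.0]
    for w, (rot, trans) in zip(weights, transforms):
        q = apply_rigid(rot, trans, point)
        for i in range(3):
            moved[i] += (w / total) * q[i]
    return tuple(m - p for m, p in zip(moved, point))


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
anchors = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
# Both anchors translate by +1 along x, so every point should flow by (1, 0, 0).
transforms = [(IDENTITY, (1.0, 0.0, 0.0)), (IDENTITY, (1.0, 0.0, 0.0))]
flow = blended_flow((0.5, 0.0, 0.0), anchors, transforms)
```

When the anchors carry different transforms, points near one anchor follow that anchor's motion and points in between receive a smooth mixture, which is what lets a small set of anchors describe a non-rigid deformation.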

Title: A Comprehensive Taxonomy and Analysis of Talking Head Synthesis: Techniques for Portrait Generation, Driving Mechanisms, and Editing

Authors: Ming Meng, Yufei Zhao, Bo Zhang, Yonggui Zhu, Weimin Shi, Maxwell Wen, Zhaoxin Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] A Comprehensive Taxonomy and Analysis of Talking Head Synthesis: Techniques for Portrait Generation, Driving Mechanisms, and Editing(https://arxiv.org/abs/)
Keywords: diffusion
Abstract: Talking head synthesis, an advanced method for generating portrait videos from a still image driven by specific content, has garnered widespread attention in virtual reality, augmented reality and game production. Recently, significant breakthroughs have been made with the introduction of novel models such as the transformer and the diffusion model. Current methods can not only generate new content but also edit the generated material. This survey systematically reviews the technology, categorizing it into three pivotal domains: portrait generation, driving mechanisms, and editing techniques. We summarize milestone studies and critically analyze their innovations and shortcomings within each domain. Additionally, we organize an extensive collection of datasets and provide a thorough performance analysis of current methodologies based on various evaluation metrics, aiming to furnish a clear framework and robust data support for future research. Finally, we explore application scenarios of talking head synthesis, illustrate them with specific cases, and examine potential future directions.

Title: Enhancing Anomaly Detection Generalization through Knowledge Exposure: The Dual Effects of Augmentation

Authors: Mohammad Akhavan Anvari, Rojina Kashefi, Vahid Reza Khazaie, Mohammad Khalooei, Mohammad Sabokrou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Enhancing Anomaly Detection Generalization through Knowledge Exposure: The Dual Effects of Augmentation(https://arxiv.org/abs/)
Keywords: anomaly
Abstract: Anomaly detection involves identifying instances within a dataset that deviate from the norm and occur infrequently. Current benchmarks tend to favor methods biased towards low diversity in normal data, which does not align with real-world scenarios. Despite advancements in these benchmarks, contemporary anomaly detection methods often struggle with out-of-distribution generalization, particularly in classifying samples with subtle transformations during testing. These methods typically assume that normal samples during test time have distributions very similar to those in the training set, while anomalies are distributed much further away. However, real-world test samples often exhibit various levels of distribution shift while maintaining semantic consistency. Therefore, effectively generalizing to samples that have undergone semantic-preserving transformations, while accurately detecting normal samples whose semantic meaning has changed after transformation as anomalies, is crucial for the trustworthiness and reliability of a model. For example, although it is clear that rotation shifts the meaning for a car in the context of anomaly detection but preserves the meaning for a bird, current methods are likely to detect both as abnormal. This complexity underscores the necessity for dynamic learning procedures rooted in the intrinsic concept of outliers. To address this issue, we propose new testing protocols and a novel method called Knowledge Exposure (KE), which integrates external knowledge to comprehend concept dynamics and differentiate transformations that induce semantic shifts. This approach enhances generalization by utilizing insights from a pre-trained CLIP model to evaluate the significance of anomalies for each concept. Evaluation on CIFAR-10, CIFAR-100, and SVHN with the new protocols demonstrates superior performance compared to previous methods.

Title: On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models

Authors: Sree Harsha Tanneru, Dan Ley, Chirag Agarwal, Himabindu Lakkaraju
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models(https://arxiv.org/abs/)
Keywords: in-context
Abstract: As Large Language Models (LLMs) are increasingly being employed in real-world applications in critical domains such as healthcare, it is important to ensure that the Chain-of-Thought (CoT) reasoning generated by these models faithfully captures their underlying behavior. While LLMs are known to generate CoT reasoning that is appealing to humans, prior studies have shown that these explanations do not accurately reflect the actual behavior of the underlying LLMs. In this work, we explore the promise of three broad approaches commonly employed to steer the behavior of LLMs to enhance the faithfulness of the CoT reasoning generated by LLMs: in-context learning, fine-tuning, and activation editing. Specifically, we introduce novel strategies for in-context learning, fine-tuning, and activation editing aimed at improving the faithfulness of the CoT reasoning. We then carry out extensive empirical analyses with multiple benchmark datasets to explore the promise of these strategies. Our analyses indicate that these strategies offer limited success in improving the faithfulness of the CoT reasoning, with only slight performance enhancements in controlled scenarios. Activation editing demonstrated minimal success, while fine-tuning and in-context learning achieved marginal improvements that failed to generalize across diverse reasoning and truthful question-answering benchmarks. In summary, our work underscores the inherent difficulty in eliciting faithful CoT reasoning from LLMs, suggesting that the current array of approaches may not be sufficient to address this complex challenge.

Title: Applications of Generative AI in Healthcare: algorithmic, ethical, legal and societal considerations

Authors: Onyekachukwu R. Okonji, Kamol Yunusov, Bonnie Gordon
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Applications of Generative AI in Healthcare: algorithmic, ethical, legal and societal considerations(https://arxiv.org/abs/)
Keywords: generative
Abstract: Generative AI is rapidly transforming medical imaging and text analysis, offering immense potential for enhanced diagnosis and personalized care. However, this transformative technology raises crucial ethical, societal, and legal questions. This paper delves into these complexities, examining issues of accuracy, informed consent, data privacy, and algorithmic limitations in the context of generative AI's application to medical imaging and text. We explore the legal landscape surrounding liability and accountability, emphasizing the need for robust regulatory frameworks. Furthermore, we dissect the algorithmic challenges, including data biases, model limitations, and workflow integration. By critically analyzing these challenges and proposing responsible solutions, we aim to foster a roadmap for ethical and responsible implementation of generative AI in healthcare, ensuring its transformative potential serves humanity with utmost care and precision.

Title: UniZero: Generalized and Efficient Planning with Scalable Latent World Models

Authors: Yuan Pu, Yazhe Niu, Jiyuan Ren, Zhenjie Yang, Hongsheng Li, Yu Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] UniZero: Generalized and Efficient Planning with Scalable Latent World Models(https://arxiv.org/abs/)
Keywords: self-supervised
Abstract: Learning predictive world models is essential for enhancing the planning capabilities of reinforcement learning agents. Notably, the MuZero-style algorithms, based on the value equivalence principle and Monte Carlo Tree Search (MCTS), have achieved superhuman performance in various domains. However, in environments that require capturing long-term dependencies, MuZero's performance deteriorates rapidly. We identify that this is partially due to the entanglement of latent representations with historical information, which results in incompatibility with the auxiliary self-supervised state regularization. To overcome this limitation, we present UniZero, a novel approach that disentangles latent states from implicit latent history using a transformer-based latent world model. By concurrently predicting latent dynamics and decision-oriented quantities conditioned on the learned latent history, UniZero enables joint optimization of the long-horizon world model and policy, facilitating broader and more efficient planning in latent space. We demonstrate that UniZero, even with single-frame inputs, matches or surpasses the performance of MuZero-style algorithms on the Atari 100k benchmark. Furthermore, it significantly outperforms prior baselines in benchmarks that require long-term memory. Lastly, we validate the effectiveness and scalability of our design choices through extensive ablation studies, visual analyses, and multi-task learning results. The code is available at this https URL.

Title: A Late-Stage Bitemporal Feature Fusion Network for Semantic Change Detection

Authors: Chenyao Zhou, Haotian Zhang, Han Guo, Zhengxia Zou, Zhenwei Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] A Late-Stage Bitemporal Feature Fusion Network for Semantic Change Detection(https://arxiv.org/abs/)
Keywords: foundation model
Abstract: Semantic change detection is an important task in geoscience and earth observation. By producing a semantic change map for each temporal phase, both the land use land cover categories and change information can be interpreted. Recently, some multi-task learning based semantic change detection methods have been proposed to decompose the task into semantic segmentation and binary change detection subtasks. However, previous works combine three branches in an entangled manner, which may not be optimal and makes it hard to adopt foundation models. Besides, the lack of explicit refinement of bitemporal features during fusion may cause low accuracy. In this letter, we propose a novel late-stage bitemporal feature fusion network to address these issues. Specifically, we propose a local-global attentional aggregation module to strengthen feature fusion, and a local-global context enhancement module to highlight pivotal semantics. Comprehensive experiments are conducted on two public datasets, SECOND and Landsat-SCD. Quantitative and qualitative results show that our proposed model achieves new state-of-the-art performance on both datasets.

Title: GenMM: Geometrically and Temporally Consistent Multimodal Data Generation for Video and LiDAR

Authors: Bharat Singh, Viveka Kulharia, Luyu Yang, Avinash Ravichandran, Ambrish Tyagi, Ashish Shrivastava
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] GenMM: Geometrically and Temporally Consistent Multimodal Data Generation for Video and LiDAR(https://arxiv.org/abs/)
Keywords: diffusion
Abstract: Multimodal synthetic data generation is crucial in domains such as autonomous driving, robotics, augmented/virtual reality, and retail. We propose a novel approach, GenMM, for jointly editing RGB videos and LiDAR scans by inserting temporally and geometrically consistent 3D objects. Our method uses a reference image and 3D bounding boxes to seamlessly insert and blend new objects into target videos. We inpaint the 2D Regions of Interest (consistent with 3D boxes) using a diffusion-based video inpainting model. We then compute semantic boundaries of the object and estimate its surface depth using state-of-the-art semantic segmentation and monocular depth estimation techniques. Subsequently, we employ a geometry-based optimization algorithm to recover the 3D shape of the object's surface, ensuring it fits precisely within the 3D bounding box. Finally, LiDAR rays intersecting with the new object surface are updated to reflect consistent depths with its geometry. Our experiments demonstrate the effectiveness of GenMM in inserting various 3D objects across video and LiDAR modalities.

Title: Text-space Graph Foundation Models: Comprehensive Benchmarks and New Insights

Authors: Zhikai Chen, Haitao Mao, Jingzhe Liu, Yu Song, Bingheng Li, Wei Jin, Bahare Fatemi, Anton Tsitsulin, Bryan Perozzi, Hui Liu, Jiliang Tang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Text-space Graph Foundation Models: Comprehensive Benchmarks and New Insights(https://arxiv.org/abs/)
Keywords: foundation model
Abstract: Given the ubiquity of graph data and its applications in diverse domains, building a Graph Foundation Model (GFM) that can work well across different graphs and tasks with a unified backbone has recently garnered significant interest. A major obstacle to achieving this goal stems from the fact that graphs from different domains often exhibit diverse node features. Inspired by multi-modal models that align different modalities with natural language, text has recently been adopted to provide a unified feature space for diverse graphs. Despite the great potential of these text-space GFMs, current research in this field is hampered by two problems. First, the absence of a comprehensive benchmark with unified problem settings hinders a clear understanding of the comparative effectiveness and practical value of different text-space GFMs. Second, there is a lack of sufficient datasets to thoroughly explore the methods' full potential and verify their effectiveness across diverse settings. To address these issues, we conduct a comprehensive benchmark providing novel text-space datasets and comprehensive evaluation under unified problem settings. Empirical results provide new insights and inspire future research directions. Our code and data are publicly available from this https URL.

Title: A Comprehensive Survey of Foundation Models in Medicine

Authors: Wasif Khan, Seowung Leem, Kyle B. See, Joshua K. Wong, Shaoting Zhang, Ruogu Fang
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] A Comprehensive Survey of Foundation Models in Medicine(https://arxiv.org/abs/)
Keywords: self-supervised, foundation model
Abstract: Foundation models (FMs) are large-scale deep-learning models trained on extensive datasets using self-supervised techniques. These models serve as a base for various downstream tasks, including healthcare. FMs have been adopted with great success across various domains within healthcare, including natural language processing (NLP), computer vision, graph learning, biology, and omics. Existing healthcare-based surveys have not yet included all of these domains. Therefore, this survey provides a comprehensive overview of FMs in healthcare. We focus on the history, learning strategies, flagship models, applications, and challenges of FMs. We explore how FMs such as the BERT and GPT families are reshaping various healthcare domains, including clinical large language models, medical image analysis, and omics data. Furthermore, we provide a detailed taxonomy of healthcare applications facilitated by FMs, such as clinical NLP, medical computer vision, graph learning, and other biology-related tasks. Despite the promising opportunities FMs provide, they also have several associated challenges, which are explained in detail. We also outline potential future directions to provide researchers and practitioners with insights into the potential and limitations of FMs in healthcare to advance their deployment and mitigate associated risks.

Title: FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models

Authors: Zhikai Zhang, Yitang Li, Haofeng Huang, Mingxian Lin, Li Yi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models(https://arxiv.org/abs/)
Keywords: foundation model
Abstract: Human motion synthesis is a fundamental task in computer animation. Despite recent progress in this field utilizing deep learning and motion capture data, existing methods are always limited to specific motion categories, environments, and styles. This poor generalizability can be partially attributed to the difficulty and expense of collecting large-scale and high-quality motion data. At the same time, foundation models trained with internet-scale image and text data have demonstrated surprising world knowledge and reasoning ability for various downstream tasks. Utilizing these foundation models may help with human motion synthesis, which some recent works have superficially explored. However, these methods didn't fully unveil the foundation models' potential for this task and only support several simple actions and environments. In this paper, we for the first time, without any motion data, explore open-set human motion synthesis using natural language instructions as user control signals based on MLLMs across any motion task and environment. Our framework can be split into two stages: 1) sequential keyframe generation by utilizing MLLMs as a keyframe designer and animator; 2) motion filling between keyframes through interpolation and motion tracking. Our method can achieve general human motion synthesis for many downstream tasks. The promising results demonstrate the worth of mocap-free human motion synthesis aided by MLLMs and pave the way for future research.

Title: Quantifying Generative Media Bias with a Corpus of Real-world and Generated News Articles

Authors: Filip Trhlik, Pontus Stenetorp
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Quantifying Generative Media Bias with a Corpus of Real-world and Generated News Articles(https://arxiv.org/abs/)
Keywords: generative
Abstract: Large language models (LLMs) are increasingly being utilised across a range of tasks and domains, with a burgeoning interest in their application within the field of journalism. This trend raises concerns due to our limited understanding of LLM behaviour in this domain, especially with respect to political bias. Existing studies predominantly focus on LLMs undertaking political questionnaires, which offers only limited insights into their biases and operational nuances. To address this gap, our study establishes a new curated dataset that contains 2,100 human-written articles and utilises their descriptions to generate 56,700 synthetic articles using nine LLMs. This enables us to analyse shifts in properties between human-authored and machine-generated articles, with this study focusing on political bias, detecting it using both supervised models and LLMs. Our findings reveal significant disparities between base and instruction-tuned LLMs, with instruction-tuned models exhibiting consistent political bias. Furthermore, we are able to study how LLMs behave as classifiers, observing their display of political bias even in this role. Overall, for the first time within the journalistic domain, this study outlines a framework and provides a structured dataset for quantifiable experiments, serving as a foundation for further research into LLM political bias and its implications.

Title: STAR: Scale-wise Text-to-image generation via Auto-Regressive representations

Authors: Xiaoxiao Ma, Mohan Zhou, Tao Liang, Yalong Bai, Tiejun Zhao, Huaian Chen, Yi Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] STAR: Scale-wise Text-to-image generation via Auto-Regressive representations(https://arxiv.org/abs/)
Keywords: diffusion
Abstract: We present STAR, a text-to-image model that employs a scale-wise auto-regressive paradigm. Unlike VAR, which is limited to class-conditioned synthesis within a fixed set of predetermined categories, our STAR enables text-driven open-set generation through three key designs. First, to boost diversity and generalizability with unseen combinations of objects and concepts, we introduce a pre-trained text encoder to extract representations for textual constraints, which we then use as guidance. Second, to improve the interactions between generated images and fine-grained textual guidance, making results more controllable, additional cross-attention layers are incorporated at each scale. Third, given the natural structure correlation across different scales, we leverage 2D Rotary Positional Encoding (RoPE) and tweak it into a normalized version. This ensures consistent interpretation of relative positions across token maps at different scales and stabilizes the training process. Extensive experiments demonstrate that STAR surpasses existing benchmarks in terms of fidelity, image-text consistency, and aesthetic quality. Our findings emphasize the potential of auto-regressive methods in the field of high-quality image synthesis, offering promising new directions for the T2I field currently dominated by diffusion methods.

Title: Diffusion Model With Optimal Covariance Matching

Authors: Zijing Ou, Mingtian Zhang, Andi Zhang, Tim Z. Xiao, Yingzhen Li, David Barber
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Diffusion Model With Optimal Covariance Matching(https://arxiv.org/abs/)
Keywords: diffusion
Abstract: The probabilistic diffusion model has become highly effective across various domains. Typically, sampling from a diffusion model involves using a denoising distribution characterized by a Gaussian with a learned mean and either fixed or learned covariances. In this paper, we leverage the recently proposed full covariance moment matching technique and introduce a novel method for learning covariances. Unlike traditional data-driven covariance approximation approaches, our method involves directly regressing the optimal analytic covariance using a new, unbiased objective named Optimal Covariance Matching (OCM). This approach can significantly reduce the approximation error in covariance prediction. We demonstrate how our method can substantially enhance the sampling efficiency of both Markovian (DDPM) and non-Markovian (DDIM) diffusion model families.

Title: Post-hoc Utterance Refining Method by Entity Mining for Faithful Knowledge Grounded Conversations

Authors: Yoonna Jang, Suhyune Son, Jeongwoo Lee, Junyoung Son, Yuna Hur, Jungwoo Lim, Hyeonseok Moon, Kisu Yang, Heuiseok Lim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Post-hoc Utterance Refining Method by Entity Mining for Faithful Knowledge Grounded Conversations(https://arxiv.org/abs/)
Keywords: generative
Abstract: Despite the striking advances in recent language generation performance, model-generated responses have suffered from the chronic problem of hallucinations that are either untrue or unfaithful to a given source. Especially in the task of knowledge grounded conversation, the models are required to generate informative responses, but hallucinated utterances lead to miscommunication. In particular, entity-level hallucination that causes critical misinformation and undesirable conversation is one of the major concerns. To address this issue, we propose a post-hoc refinement method called REM. It aims to enhance the quality and faithfulness of hallucinated utterances by refining them based on the source knowledge. If the generated utterance has a low source-faithfulness score with the given knowledge, REM mines the key entities in the knowledge and implicitly uses them for refining the utterances. We verify that our method reduces entity hallucination in the utterance. Also, we show the adaptability and efficacy of REM with extensive experiments and generative results. Our code is available at this https URL.

Title: On the Effectiveness of Supervision in Asymmetric Non-Contrastive Learning

Authors: Jeongheon Oh, Kibok Lee
Subjects: cs.LG, cs.AI, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] On the Effectiveness of Supervision in Asymmetric Non-Contrastive Learning(https://arxiv.org/abs/)
Keywords: self-supervised
Abstract: Supervised contrastive representation learning has been shown to be effective in various transfer learning scenarios. However, while asymmetric non-contrastive learning (ANCL) often outperforms its contrastive learning counterpart in self-supervised representation learning, the extension of ANCL to supervised scenarios is less explored. To bridge the gap, we study ANCL for supervised representation learning, coined SupSiam and SupBYOL, leveraging labels in ANCL to achieve better representations. The proposed supervised ANCL framework improves representation learning while avoiding collapse. Our analysis reveals that providing supervision to ANCL reduces intra-class variance, and the contribution of supervision should be adjusted to achieve the best performance. Experiments demonstrate the superiority of supervised ANCL across various datasets and tasks. The code is available at: this https URL.

Title: Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags

Authors: Daiqing Qi, Handong Zhao, Zijun Wei, Sheng Li
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Reminding Multimodal Large Language Models of Object-aware Knowledge with Retrieved Tags(https://arxiv.org/abs/)
Keywords: foundation model
Abstract: Despite recent advances in the general visual instruction-following ability of Multimodal Large Language Models (MLLMs), they still struggle with critical problems when required to provide a precise and detailed response to a visual instruction: (1) failure to identify novel objects or entities, (2) mention of non-existent objects, and (3) neglect of object's attributed details. Intuitive solutions include improving the size and quality of data or using larger foundation models. They show effectiveness in mitigating these issues, but at an expensive cost of collecting a vast amount of new data and introducing a significantly larger model. Standing at the intersection of these approaches, we examine the three object-oriented problems from the perspective of the image-to-text mapping process by the multimodal connector. In this paper, we first identify the limitations of multimodal connectors stemming from insufficient training data. Driven by this, we propose to enhance the mapping with retrieval-augmented tag tokens, which contain rich object-aware information such as object names and attributes. With our Tag-grounded visual instruction tuning with retrieval Augmentation (TUNA), we outperform baselines that share the same language model and training data on 12 benchmarks. Furthermore, we show the zero-shot capability of TUNA when provided with specific datastores.

Title: CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph

Authors: Haitao Lin, Guojiang Zhao, Odin Zhang, Yufei Huang, Lirong Wu, Zicheng Liu, Siyuan Li, Cheng Tan, Zhifeng Gao, Stan Z. Li
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph(https://arxiv.org/abs/)
Keywords: generative
Abstract: Structure-based drug design (SBDD) aims to generate potential drugs that can bind to a target protein and is greatly expedited by the aid of AI techniques in generative models. However, a lack of systematic understanding persists due to the diverse settings, complex implementation, difficult reproducibility, and task singularity. Firstly, the absence of standardization can lead to unfair comparisons and inconclusive insights. To address this dilemma, we propose CBGBench, a comprehensive benchmark for SBDD, that unifies the task as a generative heterogeneous graph completion, analogous to fill-in-the-blank of the 3D complex binding graph. By categorizing existing methods based on their attributes, CBGBench facilitates a modular and extensible framework that implements various cutting-edge methods. Secondly, a single task on de novo molecule generation can hardly reflect their capabilities. To broaden the scope, we have adapted these models to a range of tasks essential in drug design, which are considered sub-tasks within the graph fill-in-the-blank tasks. These tasks include the generative designation of de novo molecules, linkers, fragments, scaffolds, and sidechains, all conditioned on the structures of protein pockets. Our evaluations are conducted with fairness, encompassing comprehensive perspectives on interaction, chemical properties, geometry authenticity, and substructure validity. We further provide the pre-trained versions of the state-of-the-art models and deep insights with analysis from empirical studies. The codebase for CBGBench is publicly accessible at this https URL.

Title: Geometric-informed GFlowNets for Structure-Based Drug Design

Authors: Grayson Lee, Tony Shen, Martin Ester
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Geometric-informed GFlowNets for Structure-Based Drug Design(https://arxiv.org/abs/)
Keywords: generative
Abstract: The rising cost of drug discovery and the current speed at which drugs are discovered underscore the need for more efficient structure-based drug design (SBDD) methods. We employ Generative Flow Networks (GFlowNets) to effectively explore the vast combinatorial space of drug-like molecules, which traditional virtual screening methods fail to cover. We introduce a novel modification to the GFlowNet framework by incorporating trigonometrically consistent embeddings, previously utilized in tasks involving protein conformation and protein-ligand interactions, to enhance the model's ability to generate molecules tailored to specific protein pockets. We have modified the existing protein conditioning used by GFlowNets, blending geometric information from both protein and ligand embeddings to achieve more geometrically consistent embeddings. Experiments conducted using CrossDocked2020 demonstrated an improvement in the binding affinity between generated molecules and protein pockets for both single and multi-objective tasks, compared to previous work. Additionally, we propose future work aimed at further increasing the geometric information captured in protein-ligand interactions.

Title: Graph Neural Reaction Diffusion Models

Authors: Moshe Eliasof, Eldad Haber, Eran Treister
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Graph Neural Reaction Diffusion Models(https://arxiv.org/abs/)
Keywords: diffusion
Abstract: The integration of Graph Neural Networks (GNNs) and Neural Ordinary and Partial Differential Equations has been extensively studied in recent years. GNN architectures powered by neural differential equations allow us to reason about their behavior, and develop GNNs with desired properties such as controlled smoothing or energy conservation. In this paper we take inspiration from Turing instabilities in a Reaction Diffusion (RD) system of partial differential equations, and propose a novel family of GNNs based on neural RD systems. We demonstrate that our RDGNN is powerful for the modeling of various data types, from homophilic, to heterophilic, and spatio-temporal datasets. We discuss the theoretical properties of our RDGNN, its implementation, and show that it improves or offers competitive performance to state-of-the-art methods.

Title: Imperceptible Face Forgery Attack via Adversarial Semantic Mask

Authors: Decheng Liu, Qixuan Su, Chunlei Peng, Nannan Wang, Xinbo Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Imperceptible Face Forgery Attack via Adversarial Semantic Mask(https://arxiv.org/abs/)
Keywords: generative
Abstract: With the great development of generative model techniques, face forgery detection draws more and more attention in the related field. Researchers find that existing face forgery models are still vulnerable to adversarial examples with generated pixel perturbations in the global image. These generated adversarial samples still can't achieve satisfactory performance because of the high detectability. To address these problems, we propose an Adversarial Semantic Mask Attack framework (ASMA) which can generate adversarial examples with good transferability and invisibility. Specifically, we propose a novel adversarial semantic mask generative model, which can constrain generated perturbations in local semantic regions for good stealthiness. The designed adaptive semantic mask selection strategy can effectively leverage the class activation values of different semantic regions, and further ensure better attack transferability and stealthiness. Extensive experiments on the public face forgery dataset prove the proposed method achieves superior performance compared with several representative adversarial attack methods. The code is publicly available at this https URL.

Title: Benchmarking Label Noise in Instance Segmentation: Spatial Noise Matters

Authors: Moshe Kimhi, Eden Grad, Lion Halika, Chaim Baskin
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Benchmarking Label Noise in Instance Segmentation: Spatial Noise Matters(https://arxiv.org/abs/)
Keywords: foundation model
Abstract: Obtaining accurate labels for instance segmentation is particularly challenging due to the complex nature of the task. Each image necessitates multiple annotations, encompassing not only the object's class but also its precise spatial boundaries. These requirements elevate the likelihood of errors and inconsistencies in both manual and automated annotation processes. By simulating different noise conditions, we provide a realistic scenario for assessing the robustness and generalization capabilities of instance segmentation models in different segmentation tasks, introducing COCO-N and Cityscapes-N. We also propose a benchmark for weak annotation noise, dubbed COCO-WAN, which utilizes foundation models and weak annotations to simulate semi-automated annotation tools and their noisy labels. This study sheds light on the quality of segmentation masks produced by various models and challenges the efficacy of popular methods designed to address learning with label noise.

Title: Breaking the Attention Bottleneck

Authors: Kalle Hilsenbek
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Breaking the Attention Bottleneck(https://arxiv.org/abs/)
Keywords: generative
Abstract: Attention-based transformers have become the standard architecture in many deep learning fields, primarily due to their ability to model long-range dependencies and handle variable-length input sequences. However, the attention mechanism with its quadratic complexity is a significant bottleneck in the transformer architecture. This algorithm is only uni-directional in the decoder and converges to a static pattern in over-parametrized decoder-only models. I address this issue by developing a generative function as attention or activation replacement. It still has the auto-regressive character by comparing each token with the previous one. In my test setting with nanoGPT this yields a smaller loss while having a smaller model. The loss further drops by incorporating an average context vector. This concept of attention replacement is distributed under the GNU AGPL v3 license at this https URL.

Title: MICL: Improving In-Context Learning through Multiple-Label Words in Demonstration

Authors: Zhu Zixiao, Feng Zijian, Zhou Hanzhang, Qian Junlang, Mao Kezhi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] MICL: Improving In-Context Learning through Multiple-Label Words in Demonstration(https://arxiv.org/abs/)
Keywords: in-context
Abstract: In-context learning (ICL) enables large language models (LLMs) to perform new tasks by using sample-label pairs as demonstrations. However, variations in demonstrations can lead to significantly different performances. Current research mainly focuses on selecting demonstration samples, presuming the class name to be the label word when creating sample-label pairs. However, the choice of label words is crucial for ICL performance. In addition, we observe that using a single class name in demonstration may not yield optimal results. In this paper, we propose to use multiple label words in one sample-label pair to enhance ICL performance. Further, we select and order sample-label pairs based on LLM's output distribution, aiming to optimize the demonstration examples from both the samples' and labels' perspectives. Evaluation results on seven classification datasets show that the use of multiple label words, strategically organized by their selection, order and quantity, improves ICL performance through diverse label information.

Title: Make Your Home Safe: Time-aware Unsupervised User Behavior Anomaly Detection in Smart Homes via Loss-guided Mask

Authors: Xiao Jingyu, Xu Zhiyao, Zou Qingsong, Li Qing, Zhao Dan, Fang Dong, Li Ruoyu, Tang Wenxin, Li Kang, Zuo Xudong, Hu Penghui, Jiang Yong, Weng Zixuan, Lyv.R Michael
Subjects: cs.CR, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Make Your Home Safe: Time-aware Unsupervised User Behavior Anomaly Detection in Smart Homes via Loss-guided Mask(https://arxiv.org/abs/)
Keywords: anomaly
Abstract: Smart homes, powered by the Internet of Things, offer great convenience but also pose security concerns due to abnormal behaviors, such as improper operations of users and potential attacks from malicious attackers. Several behavior modeling methods have been proposed to identify abnormal behaviors and mitigate potential risks. However, their performance often falls short because they do not effectively learn less frequent behaviors, consider temporal context, or account for the impact of noise in human behaviors. In this paper, we propose SmartGuard, an autoencoder-based unsupervised user behavior anomaly detection framework. First, we design a Loss-guided Dynamic Mask Strategy (LDMS) to encourage the model to learn less frequent behaviors, which are often overlooked during learning. Second, we propose a Three-level Time-aware Position Embedding (TTPE) to incorporate temporal information into positional embedding to detect temporal context anomaly. Third, we propose a Noise-aware Weighted Reconstruction Loss (NWRL) that assigns different weights for routine behaviors and noise behaviors to mitigate the interference of noise behaviors during inference. Comprehensive experiments on three datasets with ten types of anomaly behaviors demonstrate that SmartGuard consistently outperforms state-of-the-art baselines and also offers highly interpretable results.

Title: E-Bench: Towards Evaluating the Ease-of-Use of Large Language Models

Authors: Zhenyu Zhang, Bingguang Hao, Jinpeng Li, Zekai Zhang, Dongyan Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] E-Bench: Towards Evaluating the Ease-of-Use of Large Language Models(https://arxiv.org/abs/)
Keywords: generative
Abstract: Most large language models (LLMs) are sensitive to prompts, and another synonymous expression or a typo may lead to unexpected results for the model. Composing an optimal prompt for a specific demand lacks theoretical support and relies entirely on human experimentation, which poses a considerable obstacle to popularizing generative artificial intelligence. However, there is no systematic analysis of the stability of LLMs in resisting prompt perturbations in real-world scenarios. In this work, we propose to evaluate the ease-of-use of LLMs and construct E-Bench, simulating the actual situation of human use from synonymous perturbation (including paraphrasing, simplification, and colloquialism) and typographical perturbation (such as typing). On this basis, we also discuss the combination of these two types of perturbation and analyze the main reasons for performance degradation. Experimental results indicate that with the increase of model size, although the ease-of-use is significantly improved, there is still a long way to go to build a sufficiently user-friendly model.

Title: ExPLoRA: Parameter-Efficient Extended Pre-Training to Adapt Vision Transformers under Domain Shifts

Authors: Samar Khanna, Medhanie Irgau, David B. Lobell, Stefano Ermon
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] ExPLoRA: Parameter-Efficient Extended Pre-Training to Adapt Vision Transformers under Domain Shifts(https://arxiv.org/abs/)
Keywords: self-supervised, foundation model
Abstract: Parameter-efficient fine-tuning (PEFT) techniques such as low-rank adaptation (LoRA) can effectively adapt large pre-trained foundation models to downstream tasks using only a small fraction (0.1%-10%) of the original trainable weights. An under-explored question of PEFT is in extending the pre-training phase without supervised labels; that is, can we adapt a pre-trained foundation model to a new domain via efficient self-supervised pre-training on this new domain? In this work, we introduce ExPLoRA, a highly effective technique to improve transfer learning of pre-trained vision transformers (ViTs) under domain shifts. Initializing a ViT with pre-trained weights on large, natural-image datasets such as from DinoV2 or MAE, ExPLoRA continues the unsupervised pre-training objective on a new domain. In this extended pre-training phase, ExPLoRA only unfreezes 1-2 pre-trained ViT blocks and all normalization layers, and then tunes all other layers with LoRA. Finally, we fine-tune the resulting model only with LoRA on this new domain for supervised learning. Our experiments demonstrate state-of-the-art results on satellite imagery, even outperforming fully pre-training and fine-tuning ViTs. Using the DinoV2 training objective, we demonstrate up to 7% improvement in linear probing top-1 accuracy on downstream tasks while using <10% of the number of parameters that are used in prior fully-tuned state-of-the-art approaches. Our ablation studies confirm the efficacy of our approach over other baselines, including PEFT and simply unfreezing more transformer blocks.

Title: ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models

Authors: Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models(https://arxiv.org/abs/)
Keywords: diffusion, generative
Abstract: With the advance of diffusion models, today's video generation has achieved impressive quality. But generating temporally consistent long videos is still challenging. A majority of video diffusion models (VDMs) generate long videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frames of the previous clip. However, existing approaches all involve bidirectional computations, which restricts the receptive context of each autoregression step, and results in the model lacking long-term dependencies. Inspired by the huge success of large language models (LLMs) and following GPT (generative pre-trained transformer), we bring causal (i.e., unidirectional) generation into VDMs, and use past frames as prompt to generate future frames. For Causal Generation, we introduce causal temporal attention into VDM, which forces each generated frame to depend on its previous frames. For Frame as Prompt, we inject the conditional frames by concatenating them with noisy frames (frames to be generated) along the temporal axis. Consequently, we present Video Diffusion GPT (ViD-GPT). Based on the two key designs, in each autoregression step, it is able to acquire long-term context from prompting frames concatenated by all previously generated frames. Additionally, we bring the kv-cache mechanism to VDMs, which eliminates the redundant computation from overlapped frames, significantly boosting the inference speed. Extensive experiments demonstrate that our ViD-GPT achieves state-of-the-art performance both quantitatively and qualitatively on long video generation. Code will be available at this https URL.
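Note: the causal (unidirectional) temporal attention described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it only shows the generic idea that frame i may attend to frames 0..i and never to later frames. The function names `causal_mask` and `masked_softmax` are hypothetical, and the attention scores are plain nested lists rather than tensors.

```python
import math

def causal_mask(num_frames):
    """mask[i][j] is True when frame i is allowed to attend to frame j (j <= i)."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]

def masked_softmax(scores, mask):
    """Row-wise softmax over attention scores, giving zero weight to masked entries."""
    weights = []
    for row, mask_row in zip(scores, mask):
        # Subtract the largest visible score for numerical stability.
        peak = max(s for s, keep in zip(row, mask_row) if keep)
        exps = [math.exp(s - peak) if keep else 0.0 for s, keep in zip(row, mask_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical example: 4 frames with uniform raw scores.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
```

With uniform scores, each frame spreads its attention evenly over itself and the frames before it, and assigns zero weight to every later frame.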

Title: Data Shapley in One Training Run

Authors: Jiachen T. Wang, Prateek Mittal, Dawn Song, Ruoxi Jia
Subjects: cs.LG, cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Data Shapley in One Training Run(https://arxiv.org/abs/)
Keywords: generative
Abstract: Generative artificial intelligence (AI) systems are trained on large data corpora to generate new pieces of text, images, videos, and other media. There is growing concern that such systems may infringe on the copyright interests of training data contributors. To address the copyright challenges of generative AI, we propose a framework that compensates copyright owners proportionally to their contributions to the creation of AI-generated content. The metric for contributions is quantitatively determined by leveraging the probabilistic nature of modern generative AI models and using techniques from cooperative game theory in economics. This framework enables a platform where AI developers benefit from access to high-quality training data, thus improving model performance. Meanwhile, copyright owners receive fair compensation, driving the continued provision of relevant data for generative model training. Experiments demonstrate that our framework successfully identifies the most relevant data sources used in artwork generation, ensuring a fair and interpretable distribution of revenues among copyright owners.

Title: Self-supervised Pretraining and Finetuning for Monocular Depth and Visual Odometry

Authors: Boris Chidlovskii, Leonid Antsfeld
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Self-supervised Pretraining and Finetuning for Monocular Depth and Visual Odometry(https://arxiv.org/abs/)
Keywords: self-supervised
Abstract: For the task of simultaneous monocular depth and visual odometry estimation, we propose learning self-supervised transformer-based models in two steps. Our first step consists of generic pretraining to learn 3D geometry, using a cross-view completion objective (CroCo), followed by self-supervised finetuning on non-annotated videos. We show that our self-supervised models can reach state-of-the-art performance 'without bells and whistles' using standard components such as visual transformers, dense prediction transformers and adapters. We demonstrate the effectiveness of our proposed method by running evaluations on six benchmark datasets, both static and dynamic, indoor and outdoor, with synthetic and real images. For all datasets, our method outperforms state-of-the-art methods, in particular for the depth prediction task.

Title: Boosting Medical Image Classification with Segmentation Foundation Model

Authors: Pengfei Gu, Zihan Zhao, Hongxiao Wang, Yaopeng Peng, Yizhe Zhang, Nishchal Sapkota, Chaoli Wang, Danny Z. Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Boosting Medical Image Classification with Segmentation Foundation Model(https://arxiv.org/abs/)
Keywords: foundation model
Abstract: The Segment Anything Model (SAM) exhibits impressive capabilities in zero-shot segmentation for natural images. Recently, SAM has gained a great deal of attention for its applications in medical image segmentation. However, to our best knowledge, no studies have shown how to harness the power of SAM for medical image classification. To fill this gap and make SAM a true "foundation model" for medical image analysis, it is highly desirable to customize SAM specifically for medical image classification. In this paper, we introduce SAMAug-C, an innovative augmentation method based on SAM for augmenting classification datasets by generating variants of the original images. The augmented datasets can be used to train a deep learning classification model, thereby boosting the classification performance. Furthermore, we propose a novel framework that simultaneously processes raw and SAMAug-C augmented image input, capitalizing on the complementary information that is offered by both. Experiments on three public datasets validate the effectiveness of our new approach.

Title: garak: A Framework for Security Probing Large Language Models

Authors: Leon Derczynski, Erick Galinkin, Jeffrey Martin, Subho Majumdar, Nanna Inie
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] garak: A Framework for Security Probing Large Language Models(https://arxiv.org/abs/)
Keywords: generative
Abstract: As Large Language Models (LLMs) are deployed and integrated into thousands of applications, the need for scalable evaluation of how models respond to adversarial attacks grows rapidly. However, LLM security is a moving target: models produce unpredictable output, are constantly updated, and the potential adversary is highly diverse: anyone with access to the internet and a decent command of natural language. Further, what constitutes a security weakness in one context may not be an issue in a different context; one-fits-all guardrails remain theoretical. In this paper, we argue that it is time to rethink what constitutes "LLM security", and pursue a holistic approach to LLM security evaluation, where exploration and discovery of issues are central. To this end, this paper introduces garak (Generative AI Red-teaming and Assessment Kit), a framework which can be used to discover and identify vulnerabilities in a target LLM or dialog system. garak probes an LLM in a structured fashion to discover potential vulnerabilities. The outputs of the framework describe a target model's weaknesses, contribute to an informed discussion of what composes vulnerabilities in unique contexts, and can inform alignment and policy discussions for LLM deployment.

Title: Leveraging Foundation Models for Multi-modal Federated Learning with Incomplete Modality

Authors: Liwei Che, Jiaqi Wang, Xinyue Liu, Fenglong Ma
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Leveraging Foundation Models for Multi-modal Federated Learning with Incomplete Modality(https://arxiv.org/abs/)
Keywords: foundation model
Abstract: Federated learning (FL) has obtained tremendous progress in providing collaborative training solutions for distributed data silos with privacy guarantees. However, few existing works explore a more realistic scenario where the clients hold multiple data modalities. In this paper, we aim to solve a novel challenge in multi-modal federated learning (MFL) -- modality missing -- the clients may lose part of the modalities in their local data sets. To tackle the problems, we propose a novel multi-modal federated learning method, Federated Multi-modal contrastiVe training with Pre-trained completion (FedMVP), which integrates the large-scale pre-trained models to enhance the federated training. In the proposed FedMVP framework, each client deploys a large-scale pre-trained model with frozen parameters for modality completion and representation knowledge transfer, enabling efficient and robust local training. On the server side, we utilize generated data to uniformly measure the representation similarity among the uploaded client models and construct a graph perspective to aggregate them according to their importance in the system. We demonstrate that the model achieves superior performance over two real-world image-text classification datasets and is robust to the performance degradation caused by missing modality.

Title: RAEmoLLM: Retrieval Augmented LLMs for Cross-Domain Misinformation Detection Using In-Context Learning based on Emotional Information

Authors: Zhiwei Liu, Kailai Yang, Qianqian Xie, Christine de Kock, Sophia Ananiadou, Eduard Hovy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] RAEmoLLM: Retrieval Augmented LLMs for Cross-Domain Misinformation Detection Using In-Context Learning based on Emotional Information(https://arxiv.org/abs/)
Keywords: in-context
Abstract: Misinformation is prevalent in various fields such as education, politics, health, etc., causing significant harm to society. However, current methods for cross-domain misinformation detection rely on time- and resource-consuming fine-tuning and complex model structures. With the outstanding performance of LLMs, many studies have employed them for misinformation detection. Unfortunately, they focus on in-domain tasks and do not incorporate significant sentiment and emotion features (which we jointly call affect). In this paper, we propose RAEmoLLM, the first retrieval augmented (RAG) LLMs framework to address cross-domain misinformation detection using in-context learning based on affective information. It accomplishes this by applying an emotion-aware LLM to construct a retrieval database of affective embeddings. This database is used by our retrieval module to obtain source-domain samples, which are subsequently used for the inference module's in-context few-shot learning to detect target domain misinformation. We evaluate our framework on three misinformation benchmarks. Results show that RAEmoLLM achieves significant improvements compared to the zero-shot method on three datasets, with the highest increases of 20.69%, 23.94%, and 39.11% respectively. This work will be released on this https URL.

Title: An Analysis on Quantizing Diffusion Transformers

Authors: Yuewei Yang, Jialiang Wang, Xiaoliang Dai, Peizhao Zhang, Hongbo Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] An Analysis on Quantizing Diffusion Transformers(https://arxiv.org/abs/)
Keywords: diffusion
Abstract: Diffusion Models (DMs) utilize an iterative denoising process to transform random noise into synthetic data. Initially proposed with a UNet structure, DMs excel at producing images that are virtually indistinguishable with or without conditioned text prompts. Later, transformer-only structures were combined with DMs to achieve better performance. Though Latent Diffusion Models (LDMs) reduce the computational requirement by denoising in a latent space, running image inference remains extremely expensive on any device due to the sheer volume of parameters and feature sizes. Post Training Quantization (PTQ) offers an immediate remedy for a smaller storage size and more memory-efficient computation during inference. Prior works on PTQ of DMs with UNet structures have addressed the challenges in calibrating parameters for both activations and weights via moderate optimization. In this work, we pioneer an efficient PTQ on transformer-only structures without any optimization. By analysing challenges in quantizing activations and weights for diffusion transformers, we propose a single-step sampling calibration on activations and adapt group-wise quantization on weights for low-bit quantization. We demonstrate the efficiency and effectiveness of the proposed methods with preliminary experiments on conditional image generation.
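Illustration: the abstract names group-wise weight quantization without giving details. The sketch below shows the generic idea only, assuming symmetric uniform rounding with one scale per contiguous group of weights. The function names, `group_size`, and `bits` are hypothetical choices for this example and are not taken from the paper.

```python
def quantize_groupwise(weights, group_size=4, bits=4):
    """Symmetric uniform quantization with one scale per group (illustrative assumption)."""
    qmax = 2 ** (bits - 1) - 1  # largest signed code, e.g. 7 for 4 bits
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # An all-zero group gets a dummy scale so the division below is safe.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer code and clamp to the signed range.
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def dequantize_groupwise(codes, scales, group_size=4):
    """Map integer codes back to floats using the scale of each code's group."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


# Made-up weights: the first group is small in magnitude, the second is large.
weights = [0.02, -0.01, 0.03, 0.015, 2.0, -1.5, 0.5, 1.0]
codes, scales = quantize_groupwise(weights, group_size=4, bits=4)
restored = dequantize_groupwise(codes, scales, group_size=4)
```

Because each group has its own scale, the small-magnitude group is not rounded with the coarse step size that the large-magnitude group requires, which is the usual motivation for grouping at low bit widths.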

Title: Exploiting Diffusion Prior for Out-of-Distribution Detection

Authors: Armando Zhu, Jiabei Liu, Keqin Li, Shuying Dai, Bo Hong, Peng Zhao, Changsong Wei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Exploiting Diffusion Prior for Out-of-Distribution Detection(https://arxiv.org/abs/)
Keywords: diffusion, generative
Abstract: Out-of-distribution (OOD) detection is crucial for deploying robust machine learning models, especially in areas where security is critical. However, traditional OOD detection methods often fail to capture complex data distributions from large-scale data. In this paper, we present a novel approach for OOD detection that leverages the generative ability of diffusion models and the powerful feature extraction capabilities of CLIP. By using these features as conditional inputs to a diffusion model, we can reconstruct the images after encoding them with CLIP. The difference between the original and reconstructed images is used as a signal for OOD identification. Unlike many other methods, our method does not require class-specific labeled ID data, which increases its practicality and scalability. Extensive experiments on several benchmark datasets demonstrate the robustness and effectiveness of our method, which significantly improves detection accuracy.

Title: Dynamic Order Template Prediction for Generative Aspect-Based Sentiment Analysis

Authors: Yonghyun Jun, Hwanhee Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Dynamic Order Template Prediction for Generative Aspect-Based Sentiment Analysis(https://arxiv.org/abs/)
Keywords: generative
Abstract: Aspect-based sentiment analysis (ABSA) assesses sentiments towards specific aspects within texts, resulting in detailed sentiment tuples. Previous ABSA models often use static templates to predict all of the elements in the tuples, and these models often fail to accurately capture dependencies between elements. The multi-view prompting method improves the performance of ABSA by predicting tuples with various templates and then ensembling the results. However, this method suffers from inefficiencies and out-of-distribution errors. In this paper, we propose a Dynamic Order Template (DOT) method for ABSA, which dynamically generates necessary views for each instance based on instance-level entropy. By ensuring diverse and relevant view generation, our proposed method improves F1-scores on the ASQP and ACOS datasets while significantly reducing inference time.

Title: Diffusion Models in Low-Level Vision: A Survey

Authors: Chunming He, Yuqi Shen, Chengyu Fang, Fengyang Xiao, Longxiang Tang, Yulun Zhang, Wangmeng Zuo, Zhenhua Guo, Xiu Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Diffusion Models in Low-Level Vision: A Survey(https://arxiv.org/abs/)
Keywords: diffusion, generative
Abstract: Deep generative models have garnered significant attention in low-level vision tasks due to their generative capabilities. Among them, diffusion model-based solutions, characterized by a forward diffusion process and a reverse denoising process, have emerged as widely acclaimed for their ability to produce samples of superior quality and diversity. This ensures the generation of visually compelling results with intricate texture information. Despite their remarkable success, a noticeable gap exists in a comprehensive survey that amalgamates these pioneering diffusion model-based works and organizes the corresponding threads. This paper proposes the comprehensive review of diffusion model-based techniques. We present three generic diffusion modeling frameworks and explore their correlations with other deep generative models, establishing the theoretical foundation. Following this, we introduce a multi-perspective categorization of diffusion models, considering both the underlying framework and the target task. Additionally, we summarize extended diffusion models applied in other tasks, including medical, remote sensing, and video scenarios. Moreover, we provide an overview of commonly used benchmarks and evaluation metrics. We conduct a thorough evaluation, encompassing both performance and efficiency, of diffusion model-based techniques in three prominent tasks. Finally, we elucidate the limitations of current diffusion models and propose seven intriguing directions for future research. This comprehensive examination aims to facilitate a profound understanding of the landscape surrounding denoising diffusion models in the context of low-level vision tasks. A curated list of diffusion model-based techniques in over 20 low-level vision tasks can be found at this https URL.

Title: Learning Iterative Reasoning through Energy Diffusion

Authors: Yilun Du, Jiayuan Mao, Joshua B. Tenenbaum
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Learning Iterative Reasoning through Energy Diffusion(https://arxiv.org/abs/)
Keywords: diffusion
Abstract: We introduce iterative reasoning through energy diffusion (IRED), a novel framework for learning to reason for a variety of tasks by formulating reasoning and decision-making problems with energy-based optimization. IRED learns energy functions to represent the constraints between input conditions and desired outputs. After training, IRED adapts the number of optimization steps during inference based on problem difficulty, enabling it to solve problems outside its training distribution -- such as more complex Sudoku puzzles, matrix completion with large value magnitudes, and pathfinding in larger graphs. Key to our method's success is two novel techniques: learning a sequence of annealed energy landscapes for easier inference and a combination of score function and energy landscape supervision for faster and more stable training. Our experiments show that IRED outperforms existing methods in continuous-space reasoning, discrete-space reasoning, and planning tasks, particularly in more challenging scenarios. Code and visualizations at this https URL

Title: A Survey on Human Preference Learning for Large Language Models

Authors: Ruili Jiang, Kehai Chen, Xuefeng Bai, Zhixuan He, Juntao Li, Muyun Yang, Tiejun Zhao, Liqiang Nie, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] A Survey on Human Preference Learning for Large Language Models(https://arxiv.org/abs/)
Keywords: foundation model
Abstract: The recent surge of versatile large language models (LLMs) largely depends on aligning increasingly capable foundation models with human intentions by preference learning, enhancing LLMs with excellent applicability and effectiveness in a wide range of contexts. Despite the numerous related studies conducted, a perspective on how human preferences are introduced into LLMs remains limited, which may prevent a deeper comprehension of the relationships between human preferences and LLMs as well as the realization of their limitations. In this survey, we review the progress in exploring human preference learning for LLMs from a preference-centered perspective, covering the sources and formats of preference feedback, the modeling and usage of preference signals, as well as the evaluation of the aligned LLMs. We first categorize the human feedback according to data sources and formats. We then summarize techniques for human preferences modeling and compare the advantages and disadvantages of different schools of models. Moreover, we present various preference usage methods sorted by the objectives to utilize human preference signals. Finally, we summarize some prevailing approaches to evaluate LLMs in terms of alignment with human intentions and discuss our outlooks on the human intention alignment for LLMs.

Title: In-Context Editing: Learning Knowledge from Self-Induced Distributions

Authors: Siyuan Qi, Bangcheng Yang, Kailin Jiang, Xiaobo Wang, Jiaqi Li, Yifan Zhong, Yaodong Yang, Zilong Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] In-Context Editing: Learning Knowledge from Self-Induced Distributions(https://arxiv.org/abs/)
Keywords: in-context
Abstract: The existing fine-tuning paradigm for language models is brittle in knowledge editing scenarios, where the model must incorporate new information without extensive retraining. This brittleness often results in overfitting, reduced performance, and unnatural language generation. To address this, we propose Consistent In-Context Editing (ICE), a novel approach that leverages the model's in-context learning capability to tune toward a contextual distribution rather than a one-hot target. ICE introduces a straightforward optimization framework that includes both a target and a procedure, enhancing the robustness and effectiveness of gradient-based tuning methods. We provide analytical insights into ICE across four critical aspects of knowledge editing: accuracy, locality, generalization, and linguistic quality, showing its advantages. Experimental results across four datasets confirm the effectiveness of ICE and demonstrate its potential for continual editing, ensuring that updated information is incorporated while preserving the integrity of the model.

Title: Vid3D: Synthesis of Dynamic 3D Scenes using 2D Video Diffusion

Authors: Rishab Parthasarathy, Zack Ankner, Aaron Gokaslan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Vid3D: Synthesis of Dynamic 3D Scenes using 2D Video Diffusion(https://arxiv.org/abs/)
Keywords: diffusion, generative
Abstract: A recent frontier in computer vision has been the task of 3D video generation, which consists of generating a time-varying 3D representation of a scene. To generate dynamic 3D scenes, current methods explicitly model 3D temporal dynamics by jointly optimizing for consistency across both time and views of the scene. In this paper, we instead investigate whether it is necessary to explicitly enforce multiview consistency over time, as current approaches do, or if it is sufficient for a model to generate 3D representations of each timestep independently. We hence propose a model, Vid3D, that leverages 2D video diffusion to generate 3D videos by first generating a 2D "seed" of the video's temporal dynamics and then independently generating a 3D representation for each timestep in the seed video. We evaluate Vid3D against two state-of-the-art 3D video generation methods and find that Vid3D achieves comparable results despite not explicitly modeling 3D temporal dynamics. We further ablate how the quality of Vid3D depends on the number of views generated per frame. While we observe some degradation with fewer views, performance degradation remains minor. Our results thus suggest that 3D temporal knowledge may not be necessary to generate high-quality dynamic 3D scenes, potentially enabling simpler generative algorithms for this task.

Title: Consistency^2: Consistent and Fast 3D Painting with Latent Consistency Models

Authors: Tianfu Wang, Anton Obukhov, Konrad Schindler
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Consistency^2: Consistent and Fast 3D Painting with Latent Consistency Models(https://arxiv.org/abs/)
Keywords: diffusion, generative
Abstract: Generative 3D Painting is among the top productivity boosters in high-resolution 3D asset management and recycling. Ever since text-to-image models became accessible for inference on consumer hardware, the performance of 3D Painting methods has consistently improved and is currently close to plateauing. At the core of most such models lies denoising diffusion in the latent space, an inherently time-consuming iterative process. Multiple techniques have been developed recently to accelerate generation and reduce sampling iterations by orders of magnitude. Designed for 2D generative imaging, these techniques do not come with recipes for lifting them into 3D. In this paper, we address this shortcoming by proposing a Latent Consistency Model (LCM) adaptation for the task at hand. We analyze the strengths and weaknesses of the proposed model and evaluate it quantitatively and qualitatively. Based on the Objaverse dataset samples study, our 3D painting method attains strong preference in all evaluations. Source code is available at this https URL.

Title: Probing the Decision Boundaries of In-context Learning in Large Language Models

Authors: Siyan Zhao, Tung Nguyen, Aditya Grover
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Probing the Decision Boundaries of In-context Learning in Large Language Models(https://arxiv.org/abs/)
Keywords: in-context
Abstract: In-context learning is a key paradigm in large language models (LLMs) that enables them to generalize to new tasks and domains by simply prompting these models with a few exemplars without explicit parameter updates. Many attempts have been made to understand in-context learning in LLMs as a function of model scale, pretraining data, and other factors. In this work, we propose a new mechanism to probe and understand in-context learning from the lens of decision boundaries for in-context binary classification. Decision boundaries are straightforward to visualize and provide important information about the qualitative behavior of the inductive biases of standard classifiers. To our surprise, we find that the decision boundaries learned by current LLMs in simple binary classification tasks are often irregular and non-smooth, regardless of linear separability in the underlying task. This paper investigates the factors influencing these decision boundaries and explores methods to enhance their generalizability. We assess various approaches, including training-free and fine-tuning methods for LLMs, the impact of model architecture, and the effectiveness of active prompting techniques for smoothing decision boundaries in a data-efficient manner. Our findings provide a deeper understanding of in-context learning dynamics and offer practical improvements for enhancing robustness and generalizability of in-context learning.

Title: Accurate and Fast Pixel Retrieval with Spatial and Uncertainty Aware Hypergraph Diffusion

Authors: Guoyuan An, Yuchi Huo, Sung-Eui Yoon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Accurate and Fast Pixel Retrieval with Spatial and Uncertainty Aware Hypergraph Diffusion(https://arxiv.org/abs/)
Keywords: diffusion
Abstract: This paper presents a novel method designed to enhance the efficiency and accuracy of both image retrieval and pixel retrieval. Traditional diffusion methods struggle to propagate spatial information effectively in conventional graphs due to their reliance on scalar edge weights. To overcome this limitation, we introduce a hypergraph-based framework, uniquely capable of efficiently propagating spatial information using local features during query time, thereby accurately retrieving and localizing objects within a database. Additionally, we innovatively utilize the structural information of the image graph through a technique we term "community selection". This approach allows for the assessment of the initial search result's uncertainty and facilitates an optimal balance between accuracy and speed. This is particularly crucial in real-world applications where such trade-offs are often necessary. Our experimental results, conducted on the (P)ROxford and (P)RParis datasets, demonstrate the significant superiority of our method over existing diffusion techniques. We achieve state-of-the-art (SOTA) accuracy in both image-level and pixel-level retrieval, while also maintaining impressive processing speed. This dual achievement underscores the effectiveness of our hypergraph-based framework and community selection technique, marking a notable advancement in the field of content-based image retrieval.

Title: FamiCom: Further Demystifying Prompts for Language Models with Task-Agnostic Performance Estimation

Authors: Bangzheng Li, Ben Zhou, Xingyu Fu, Fei Wang, Dan Roth, Muhao Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] FamiCom: Further Demystifying Prompts for Language Models with Task-Agnostic Performance Estimation(https://arxiv.org/abs/)
Keywords: in-context
Abstract: Language models have shown impressive in-context-learning capabilities, which allow them to benefit from input prompts and perform better on downstream end tasks. Existing works investigate the mechanisms behind this observation, and propose label-agnostic prompt metrics that can better estimate end-task performances. One popular approach is using perplexity as a way to measure models' familiarity with the prompt. While such metrics show consistent improvements on in-domain tasks, we found that familiarity metrics such as perplexity cannot accurately estimate performance in complicated situations such as task or domain transferring scenarios. In this work, we propose a revised measure called FamiCom, providing a more comprehensive measure for task-agnostic performance estimation. Specifically, FamiCom combines familiarity with complexity -- the inherent difficulty of end tasks, which is an important factor missing from current metrics. Experiments show that FamiCom strongly correlates with end-task performances, producing a 0.85 Spearman's correlation, versus 0.43 for familiarity-only metrics. We further apply FamiCom to automatic prompt and demonstration selection, and outperform existing methods and baselines by more than 7.0% in accuracy.
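Illustration: the abstract compares metrics by Spearman's correlation with end-task performance. The sketch below computes that statistic with the textbook rank-difference formula, which is valid only when there are no tied values. The input numbers are made up for the example and are not results from the paper.

```python
def ranks(values):
    """Return 1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); no ties assumed."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length, at least 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical metric scores for four prompts and the accuracies they led to.
metric_scores = [0.2, 0.5, 0.9, 0.4]
accuracies = [55.0, 61.0, 80.0, 70.0]
rho = spearman(metric_scores, accuracies)
```

A value near 1 means the metric orders the prompts almost the same way the measured accuracy does, which is what a performance estimator needs.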

Title: Generative Visual Instruction Tuning

Authors: Jefferson Hernandez, Ruben Villegas, Vicente Ordonez
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Generative Visual Instruction Tuning(https://arxiv.org/abs/)
Keywords: diffusion, generative
Abstract: We propose to use machine-generated instruction-following data to improve the zero-shot capabilities of a large multimodal model with additional support for generative and image editing tasks. We achieve this by curating a new multimodal instruction-following set using GPT-4V and existing datasets for image generation and editing. Using this instruction set and the existing LLaVA-Finetune instruction set for visual understanding tasks, we produce GenLLaVA, a Generative Large Language and Visual Assistant. GenLLaVA is built through a strategy that combines three types of large pre-trained models through instruction finetuning: LLaMA for language modeling, SigLIP for image-text matching, and StableDiffusion for text-to-image generation. Our model demonstrates visual understanding capabilities on par with LLaVA and additionally demonstrates competitive results with native multimodal models such as Unified-IO 2, paving the way for building advanced general-purpose visual assistants by effectively re-using existing multimodal models. We open-source our dataset, codebase, and model checkpoints to foster further research and application in this domain.

Title: Do Not Design, Learn: A Trainable Scoring Function for Uncertainty Estimation in Generative LLMs

Authors: Duygu Nur Yaldiz, Yavuz Faruk Bakman, Baturalp Buyukates, Chenyang Tao, Anil Ramakrishna, Dimitrios Dimitriadis, Salman Avestimehr
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Do Not Design, Learn: A Trainable Scoring Function for Uncertainty Estimation in Generative LLMs(https://arxiv.org/abs/)
Keywords: generative
Abstract: In this work, we introduce the Learnable Response Scoring Function (LARS) for Uncertainty Estimation (UE) in generative Large Language Models (LLMs). Current scoring functions for probability-based UE, such as length-normalized scoring and semantic contribution-based weighting, are designed to solve specific aspects of the problem but exhibit limitations, including the inability to handle biased probabilities and under-performance in low-resource languages like Turkish. To address these issues, we propose LARS, a scoring function that leverages supervised data to capture complex dependencies between tokens and probabilities, thereby producing more reliable and calibrated response scores in computing the uncertainty of generations. Our extensive experiments across multiple datasets show that LARS substantially outperforms existing scoring functions considering various probability-based UE methods.
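Illustration: length-normalized scoring, which the abstract lists among existing baselines, is commonly formulated as the mean token log-probability of a response. The sketch below assumes that formulation. The function names and probabilities are hypothetical, and this is not the LARS scoring function.

```python
import math


def sequence_log_prob(token_probs):
    """Unnormalized score: sum of token log-probabilities."""
    return sum(math.log(p) for p in token_probs)


def length_normalized_score(token_probs):
    """Mean token log-probability, so long responses are not penalized for length alone."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    return sequence_log_prob(token_probs) / len(token_probs)


# Two made-up responses with the same per-token confidence but different lengths.
short_response = [0.5, 0.5]
long_response = [0.5] * 6
raw_gap = sequence_log_prob(short_response) - sequence_log_prob(long_response)
normalized_gap = length_normalized_score(short_response) - length_normalized_score(long_response)
```

The raw sum ranks the longer response as less likely purely because it has more tokens, while the normalized score treats the two as equally confident.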

Title: Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding

Authors: Yunsong Wang, Na Zhao, Gim Hee Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding(https://arxiv.org/abs/)
Keywords: self-supervised, generative
Abstract: The field of self-supervised 3D representation learning has emerged as a promising solution to alleviate the challenge presented by the scarcity of extensive, well-annotated datasets. However, it continues to be hindered by the lack of diverse, large-scale, real-world 3D scene datasets for source data. To address this shortfall, we propose Generalizable Representation Learning (GRL), where we devise a generative Bayesian network to produce diverse synthetic scenes with real-world patterns, and conduct pre-training with a joint objective. By jointly learning a coarse-to-fine contrastive learning task and an occlusion-aware reconstruction task, the model is primed with transferable, geometry-informed representations. Post pre-training on synthetic data, the acquired knowledge of the model can be seamlessly transferred to two principal downstream tasks associated with 3D scene understanding, namely 3D object detection and 3D semantic segmentation, using real-world benchmark datasets. A thorough series of experiments robustly display our method's consistent superiority over existing state-of-the-art pre-training approaches.

Title: VideoVista: A Versatile Benchmark for Video Understanding and Reasoning

Authors: Yunxin Li, Xinyu Chen, Baotian Hu, Longyue Wang, Haoyuan Shi, Min Zhang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] VideoVista: A Versatile Benchmark for Video Understanding and Reasoning(https://arxiv.org/abs/)
Keywords: anomaly
Abstract: Despite significant breakthroughs in video analysis driven by the rapid development of large multimodal models (LMMs), there remains a lack of a versatile evaluation benchmark to comprehensively assess these models' performance in video understanding and reasoning. To address this, we present VideoVista, a video QA benchmark that integrates challenges across diverse content categories, durations, and abilities. Specifically, VideoVista comprises 25,000 questions derived from 3,400 videos spanning 14 categories (e.g., Howto, Film, and Entertainment) with durations ranging from a few seconds to over 10 minutes. Besides, it encompasses 19 types of understanding tasks (e.g., anomaly detection, interaction understanding) and 8 reasoning tasks (e.g., logical reasoning, causal reasoning). To achieve this, we present an automatic data construction framework, leveraging powerful GPT-4o alongside advanced analysis tools (e.g., video splitting, object segmenting, and tracking). We also utilize this framework to construct training data to enhance the capabilities of video-related LMMs (Video-LMMs). Through a comprehensive and quantitative evaluation of cutting-edge models, we reveal that: 1) Video-LMMs face difficulties in fine-grained video tasks involving temporal location, object tracking, and anomaly detection; 2) Video-LMMs present inferior logical and relation reasoning abilities; 3) Open-source Video-LMMs' performance is significantly lower than GPT-4o and Gemini-1.5, lagging by 20 points. This highlights the crucial role VideoVista will play in advancing LMMs that can accurately understand videos and perform precise reasoning.

Title: Hallucination Mitigation Prompts Long-term Video Understanding

Authors: Yiwei Sun, Zhihang Liu, Chuanbin Liu, Bowei Pu, Zhihan Zhang, Hongtao Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Hallucination Mitigation Prompts Long-term Video Understanding(https://arxiv.org/abs/)
Keywords: in-context
Abstract: Recently, multimodal large language models have made significant advancements in video understanding tasks. However, their ability to understand unprocessed long videos is very limited, primarily due to the difficulty in supporting the enormous memory overhead. Although existing methods achieve a balance between memory and information by aggregating frames, they inevitably introduce the severe hallucination issue. To address this issue, this paper constructs a comprehensive hallucination mitigation pipeline based on existing MLLMs. Specifically, we use the CLIP Score to guide the frame sampling process with questions, selecting key frames relevant to the question. Then, we inject question information into the queries of the image Q-former to obtain more important visual features. Finally, during the answer generation stage, we utilize chain-of-thought and in-context learning techniques to explicitly control the generation of answers. It is worth mentioning that for the breakpoint mode, we found that image understanding models achieved better results than video understanding models. Therefore, we aggregated the answers from both types of models using a comparison mechanism. Ultimately, we achieved 84.2% and 62.9% for the global and breakpoint modes, respectively, on the MovieChat dataset, surpassing the official baseline model by 29.1% and 24.1%. Moreover, the proposed method won third place in the CVPR LOVEU 2024 Long-Term Video Question Answering Challenge. The code is available at this https URL
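Illustration: question-guided frame sampling can be pictured as ranking frames by their similarity to the question and keeping the best ones. The sketch below assumes frame and question embeddings have already been produced by some CLIP-like encoder and uses plain cosine similarity as a stand-in for the CLIP Score. The function names, the vectors, and the top-k rule are assumptions for this example, not the authors' pipeline.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_key_frames(frame_embeddings, question_embedding, k):
    """Return indices of the k frames most similar to the question, in temporal order."""
    scores = [cosine(frame, question_embedding) for frame in frame_embeddings]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Toy 2-D embeddings standing in for real encoder outputs.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
question = [1.0, 0.0]
key_frames = select_key_frames(frames, question, k=2)
```

Returning the selected indices in temporal order keeps the chosen frames usable as a short, ordered clip for a downstream model.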

Title: Fine-grained Controllable Text Generation through In-context Learning with Feedback

Authors: Sarubi Thillainathan, Alexander Koller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Fine-grained Controllable Text Generation through In-context Learning with Feedback(https://arxiv.org/abs/)
Keywords: in-context
Abstract: We present a method for rewriting an input sentence to match specific values of nontrivial linguistic features, such as dependency depth. In contrast to earlier work, our method uses in-context learning rather than finetuning, making it applicable in use cases where data is sparse. We show that our model performs accurate rewrites and matches the state of the art on rewriting sentences to a specified school grade level.

Title: A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences

Authors: Leonardo Bertolazzi, Albert Gatt, Raffaella Bernardi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences(https://arxiv.org/abs/)
Keywords: in-context
Abstract: The reasoning abilities of Large Language Models (LLMs) are becoming a central focus of study in NLP. In this paper, we consider the case of syllogistic reasoning, an area of deductive reasoning studied extensively in logic and cognitive psychology. Previous research has shown that pre-trained LLMs exhibit reasoning biases, such as content effects, avoid answering that no conclusion follows, display human-like difficulties, and struggle with multi-step reasoning. We contribute to this research line by systematically investigating the effects of chain-of-thought reasoning, in-context learning (ICL), and supervised fine-tuning (SFT) on syllogistic reasoning, considering syllogisms with conclusions that support or violate world knowledge, as well as ones with multiple premises. Crucially, we go beyond the standard focus on accuracy, with an in-depth analysis of the conclusions generated by the models. Our results suggest that the behavior of pre-trained LLMs can be explained by heuristics studied in cognitive science and that both ICL and SFT improve model performance on valid inferences, although only the latter mitigates most reasoning biases without harming model consistency.

Title: P-TA: Using Proximal Policy Optimization to Enhance Tabular Data Augmentation via Large Language Models

Authors: Shuo Yang, Chenchen Yuan, Yao Rong, Felix Steinbauer, Gjergji Kasneci
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] P-TA: Using Proximal Policy Optimization to Enhance Tabular Data Augmentation via Large Language Models(https://arxiv.org/abs/)
Keywords: generative
Abstract: A multitude of industries depend on accurate and reasonable tabular data augmentation for their business processes. Contemporary methodologies in generating tabular data revolve around utilizing Generative Adversarial Networks (GANs) or fine-tuning Large Language Models (LLMs). However, GAN-based approaches are documented to produce samples with common-sense errors attributed to the absence of external knowledge. On the other hand, LLM-based methods exhibit a limited capacity to capture the disparities between synthesized and actual data distribution due to the absence of feedback from a discriminator during training. Furthermore, the decoding of LLM-based generation introduces gradient breakpoints, impeding the backpropagation of loss from a discriminator, thereby complicating the integration of these two approaches. To solve this challenge, we propose using proximal policy optimization (PPO) to apply GANs, guiding LLMs to enhance the probability distribution of tabular features. This approach enables the utilization of LLMs as generators for GANs in synthesizing tabular data. Our experiments demonstrate that PPO leads to an approximately 4% improvement in the accuracy of models trained on synthetically generated data over state-of-the-art across three real-world datasets.

Title: Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report

Authors: Franz Louis Cesista
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report(https://arxiv.org/abs/)
Keywords: foundation model
Abstract: Multimodal Foundation Models (MMFMs) have shown remarkable performance on various computer vision and natural language processing tasks. However, their performance on particular tasks such as document understanding is still limited. They also require more compute, time, and engineering resources to finetune and deploy compared to traditional, unimodal models. In this report, we present Multimodal Structured Generation, a general framework which constrains the output logits of frozen MMFMs to force them to reason before responding with structured outputs that downstream APIs can parse and use. We provide a detailed account of our approach, including the technical details, theoretical discussions, and final evaluation results in the 2nd Multimodal Foundation Models Challenge hosted by the Computer Vision and Pattern Recognition (CVPR) conference. Our approach achieved the second highest score in the hidden test set for Phase 2 and third highest overall. This shows the method's ability to generalize to unseen tasks, and that simple engineering can beat expensive and complicated modelling steps, as we first discussed in our paper, Retrieval Augmented Structured Generation: Business Document Information Extraction as Tool Use. All of our scripts, deployment steps, and evaluation results can be accessed in this https URL

Title: Cross-domain Open-world Discovery

Authors: Shuo Wen, Maria Brbic
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Cross-domain Open-world Discovery(https://arxiv.org/abs/)
Keywords: foundation model
Abstract: In many real-world applications, test data may commonly exhibit categorical shifts, characterized by the emergence of novel classes, as well as distribution shifts arising from feature distributions different from the ones the model was trained on. However, existing methods either discover novel classes in the open-world setting or assume domain shifts without the ability to discover novel classes. In this work, we consider a cross-domain open-world discovery setting, where the goal is to assign samples to seen classes and discover unseen classes under a domain shift. To address this challenging problem, we present CROW, a prototype-based approach that introduces a cluster-then-match strategy enabled by a well-structured representation space of foundation models. In this way, CROW discovers novel classes by robustly matching clusters with previously seen classes, followed by fine-tuning the representation space using an objective designed for cross-domain open-world discovery. Extensive experimental results on image classification benchmark datasets demonstrate that CROW outperforms alternative baselines, achieving an 8% average performance improvement across 75 experimental settings.

Title: AnyTrans: Translate AnyText in the Image with Large Scale Models

Authors: Zhipeng Qian, Pei Zhang, Baosong Yang, Kai Fan, Yiwei Ma, Derek F. Wong, Xiaoshuai Sun, Rongrong Ji
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] AnyTrans: Translate AnyText in the Image with Large Scale Models(https://arxiv.org/abs/)
Keywords: diffusion
Abstract: This paper introduces AnyTrans, an all-encompassing framework for the task of Translate AnyText in the Image (TATI), which includes multilingual text translation and text fusion within images. Our framework leverages the strengths of large-scale models, such as Large Language Models (LLMs) and text-guided diffusion models, to incorporate contextual cues from both textual and visual elements during translation. The few-shot learning capability of LLMs allows for the translation of fragmented texts by considering the overall context. Meanwhile, the advanced inpainting and editing abilities of diffusion models make it possible to fuse translated text seamlessly into the original image while preserving its style and realism. Additionally, our framework can be constructed entirely using open-source models and requires no training, making it highly accessible and easily expandable. To encourage advancement in the TATI task, we have meticulously compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs.

Title: MedThink: Inducing Medical Large-scale Visual Language Models to Hallucinate Less by Thinking More

Authors: Yue Jiang, Jiawei Chen, Dingkang Yang, Mingcheng Li, Shunli Wang, Tong Wu, Ke Li, Lihua Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] MedThink: Inducing Medical Large-scale Visual Language Models to Hallucinate Less by Thinking More(https://arxiv.org/abs/)
Keywords: generative
Abstract: When Large Vision Language Models (LVLMs) are applied to multimodal medical generative tasks, they suffer from significant model hallucination issues. This severely impairs the model's generative accuracy, making it challenging for LVLMs to be implemented in real-world medical scenarios to assist doctors in diagnosis. Enhancing the training data for downstream medical generative tasks is an effective way to address model hallucination. However, the limited availability of training data in the medical field and privacy concerns greatly hinder the model's accuracy and generalization capabilities. In this paper, we introduce a method that mimics human cognitive processes to construct fine-grained instruction pairs and apply the concept of chain-of-thought (CoT) from inference scenarios to training scenarios, thereby proposing a method called MedThink. Our experiments on various LVLMs demonstrate that our novel data construction method tailored for the medical domain significantly improves the model's performance in medical image report generation tasks and substantially mitigates hallucinations. All resources of this work will be released soon.

Title: Automating Easy Read Text Segmentation

Authors: Jesús Calleja, Thierry Etchegoyhen, David Ponce
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Automating Easy Read Text Segmentation(https://arxiv.org/abs/)
Keywords: generative
Abstract: Easy Read text is one of the main forms of access to information for people with reading difficulties. One of the key characteristics of this type of text is the requirement to split sentences into smaller grammatical segments, to facilitate reading. Automated segmentation methods could foster the creation of Easy Read content, but their viability has yet to be addressed. In this work, we study novel methods for the task, leveraging masked and generative language models, along with constituent parsing. We conduct comprehensive automatic and human evaluations in three languages, analysing the strengths and weaknesses of the proposed alternatives, under scarce resource limitations. Our results highlight the viability of automated ER segmentation and remaining deficiencies compared to expert-driven human segmentation.

Title: Promises, Outlooks and Challenges of Diffusion Language Modeling

Authors: Justin Deschenaux, Caglar Gulcehre
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Promises, Outlooks and Challenges of Diffusion Language Modeling(https://arxiv.org/abs/)
Keywords: diffusion
Abstract: Modern autoregressive Large Language Models (LLMs) have achieved outstanding performance on NLP benchmarks, and they are deployed in the real world. However, they still suffer from limitations of the autoregressive training paradigm. For example, autoregressive token generation is notably slow and can be prone to exposure bias. Diffusion-based language models were proposed as an alternative to autoregressive generation to address some of these limitations. We evaluate the recently proposed Score Entropy Discrete Diffusion (SEDD) approach and show that it is a promising alternative to autoregressive generation, although it has some shortcomings too. We empirically demonstrate the advantages and challenges of SEDD, and observe that SEDD generally matches autoregressive models in perplexity and on benchmarks such as HellaSwag, ARC or WinoGrande. Additionally, we show that in terms of inference latency, SEDD can be up to 4.5x more efficient than GPT-2. While SEDD allows conditioning on tokens at arbitrary positions, SEDD appears slightly weaker than GPT-2 for conditional generation given short prompts. Finally, we reproduced the main results from the original SEDD paper.

Title: How Far Can In-Context Alignment Go? Exploring the State of In-Context Alignment

Authors: Heyan Huang, Yinghao Li, Huashan Sun, Yu Bai, Yang Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] How Far Can In-Context Alignment Go? Exploring the State of In-Context Alignment(https://arxiv.org/abs/)
Keywords: in-context
Abstract: Recent studies have demonstrated that In-Context Learning (ICL), through the use of specific demonstrations, can align Large Language Models (LLMs) with human preferences, an approach known as In-Context Alignment (ICA), indicating that models can comprehend human instructions without requiring parameter adjustments. However, the exploration of the mechanism and applicability of ICA remains limited. In this paper, we begin by dividing the context text used in ICA into three categories: format, system prompt, and example. Through ablation experiments, we investigate how much each part contributes to the effectiveness of ICA. We then examine how variations in these parts impact the model's alignment performance. Our findings indicate that the example part is crucial for enhancing the model's alignment capabilities, with changes in examples significantly affecting alignment performance. We also conduct a comprehensive evaluation of ICA's zero-shot capabilities in various alignment tasks. The results indicate that compared to parameter fine-tuning methods, ICA demonstrates superior performance in knowledge-based tasks and tool-use tasks. However, it still exhibits certain limitations in areas such as multi-turn dialogues and instruction following.

Title: Prior Normality Prompt Transformer for Multi-class Industrial Image Anomaly Detection

Authors: Haiming Yao, Yunkang Cao, Wei Luo, Weihang Zhang, Wenyong Yu, Weiming Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Prior Normality Prompt Transformer for Multi-class Industrial Image Anomaly Detection(https://arxiv.org/abs/)
Keywords: anomaly
Abstract: Image anomaly detection plays a pivotal role in industrial inspection. Traditional approaches often demand distinct models for specific categories, resulting in substantial deployment costs. This raises concerns about multi-class anomaly detection, where a unified model is developed for multiple classes. However, applying conventional methods, particularly reconstruction-based models, directly to multi-class scenarios encounters challenges such as identical shortcut learning, hindering effective discrimination between normal and abnormal instances. To tackle this issue, our study introduces the Prior Normality Prompt Transformer (PNPT) method for multi-class image anomaly detection. PNPT strategically incorporates normal semantics prompting to mitigate the "identical mapping" problem. This entails integrating a prior normality prompt into the reconstruction process, yielding a dual-stream model. This innovative architecture combines normal prior semantics with abnormal samples, enabling dual-stream reconstruction grounded in both prior knowledge and intrinsic sample characteristics. PNPT comprises four essential modules: Class-Specific Normality Prompting Pool (CS-NPP), Hierarchical Patch Embedding (HPE), Semantic Alignment Coupling Encoding (SACE), and Contextual Semantic Conditional Decoding (CSCD). Experimental validation on diverse benchmark datasets and real-world industrial applications highlights PNPT's superior performance in multi-class industrial anomaly detection.

Title: HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model

Authors: Di Wang, Meiqi Hu, Yao Jin, Yuchun Miao, Jiaqi Yang, Yichu Xu, Xiaolei Qin, Jiaqi Ma, Lingyu Sun, Chenxing Li, Chuan Fu, Hongruixuan Chen, Chengxi Han, Naoto Yokoya, Jing Zhang, Minqiang Xu, Lin Liu, Lefei Zhang, Chen Wu, Bo Du, Dacheng Tao, Liangpei Zhang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model(https://arxiv.org/abs/)
Keywords: foundation model
Abstract: Foundation models (FMs) are revolutionizing the analysis and understanding of remote sensing (RS) scenes, including aerial RGB, multispectral, and SAR images. However, hyperspectral images (HSIs), which are rich in spectral information, have not seen much application of FMs, with existing methods often restricted to specific tasks and lacking generality. To fill this gap, we introduce HyperSIGMA, a vision transformer-based foundation model for HSI interpretation, scalable to over a billion parameters. To tackle the spectral and spatial redundancy challenges in HSIs, we introduce a novel sparse sampling attention (SSA) mechanism, which effectively promotes the learning of diverse contextual features and serves as the basic block of HyperSIGMA. HyperSIGMA integrates spatial and spectral features using a specially designed spectral enhancement module. In addition, we construct a large-scale hyperspectral dataset, HyperGlobal-450K, for pre-training, which contains about 450K hyperspectral images, significantly surpassing existing datasets in scale. Extensive experiments on various high-level and low-level HSI tasks demonstrate HyperSIGMA's versatility and superior representational capability compared to current state-of-the-art methods. Moreover, HyperSIGMA shows significant advantages in scalability, robustness, cross-modal transferring capability, and real-world applicability.

Title: Simple Yet Efficient: Towards Self-Supervised FG-SBIR with Unified Sample Feature Alignment

Authors: Jianan Jiang, Di Wu, Zhilin Jiang, Weiren Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Simple Yet Efficient: Towards Self-Supervised FG-SBIR with Unified Sample Feature Alignment(https://arxiv.org/abs/)
Keywords: self-supervised
Abstract: Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) aims to minimize the distance between sketches and corresponding images in the embedding space. However, scalability is hindered by the growing complexity of solutions, mainly due to the abstract nature of fine-grained sketches. In this paper, we propose a simple yet efficient approach to narrow the gap between the two modes. It mainly facilitates unified mutual information sharing both intra- and inter-samples, rather than treating them as a single feature alignment problem between modalities. Specifically, our approach includes: (i) Employing dual weight-sharing networks to optimize alignment within the sketch and image domains, which also effectively mitigates model learning saturation issues. (ii) Introducing an objective optimization function based on contrastive loss to enhance the model's ability to align features intra- and inter-samples. (iii) Presenting a learnable TRSM that combines self-attention and cross-attention to promote feature representations among tokens, further enhancing sample alignment in the embedding space. Our framework achieves excellent results on CNN- and ViT-based backbones. Extensive experiments demonstrate its superiority over existing methods. We also introduce Cloths-V1, the first dataset of professional fashion sketches and images, which we use to validate our method and which will be beneficial for other applications.

Title: Quaternion Generative Adversarial Neural Networks and Applications to Color Image Inpainting

Authors: Duan Wang, Dandan Zhu, Meixiang Zhao, Zhigang Jia
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Quaternion Generative Adversarial Neural Networks and Applications to Color Image Inpainting(https://arxiv.org/abs/)
Keywords: generative
Abstract: Color image inpainting is a challenging task in imaging science. Existing methods are based on real-valued operations and process the red, green and blue channels of the color image separately, ignoring the correlation between channels. In order to make full use of this correlation, this paper proposes a Quaternion Generative Adversarial Neural Network (QGAN) model and related theory, and applies it to the problem of color image inpainting with large missing areas. Firstly, the definition of quaternion deconvolution is given and quaternion batch normalization is proposed. Secondly, these two modules are applied to generative adversarial networks to improve stability. Finally, QGAN is applied to color image inpainting and compared with other state-of-the-art algorithms. The experimental results show that QGAN is superior in color image inpainting with large missing areas.

Title: ChildDiffusion: Unlocking the Potential of Generative AI and Controllable Augmentations for Child Facial Data using Stable Diffusion and Large Language Models

Authors: Muhammad Ali Farooq, Wang Yao, Peter Corcoran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] ChildDiffusion: Unlocking the Potential of Generative AI and Controllable Augmentations for Child Facial Data using Stable Diffusion and Large Language Models(https://arxiv.org/abs/)
Keywords: diffusion, generative
Abstract: In this research work we propose ChildDiffusion, a high-level framework capable of generating photorealistic child facial samples and further embedding several intelligent augmentations on child facial data using short text prompts, detailed textual guidance from LLMs, and image-to-image transformation with text guidance control conditioning, thus providing an opportunity to curate fully synthetic large-scale child datasets. The framework is validated by rendering high-quality child faces representing ethnicity data, micro expressions, face pose variations, eye blinking effects, facial accessories, different hair colours and styles, aging, and multiple child subjects of different genders in a single frame. Addressing privacy concerns regarding child data acquisition requires a comprehensive approach that involves legal, ethical, and technological considerations. With this in view, the framework can be adapted to synthesise child facial data which can be effectively used for numerous downstream machine learning tasks. The proposed method circumvents common issues encountered in generative AI tools, such as temporal inconsistency and limited control over the rendered outputs. As an exemplary use case we have open-sourced child ethnicity data consisting of 2.5k child facial samples of five different classes, namely African, Asian, White, South Asian/Indian, and Hispanic races, generated by deploying the model in the production inference phase. The rendered data undergoes rigorous qualitative as well as quantitative tests to cross-validate its efficacy, and is further used to fine-tune a Yolo architecture for detecting and classifying child ethnicity as an exemplary downstream machine learning task.

Title: Standardizing Structural Causal Models

Authors: Weronika Ormaniec, Scott Sussex, Lars Lorch, Bernhard Schölkopf, Andreas Krause
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Standardizing Structural Causal Models(https://arxiv.org/abs/)
Keywords: generative
Abstract: Synthetic datasets generated by structural causal models (SCMs) are commonly used for benchmarking causal structure learning algorithms. However, the variances and pairwise correlations in SCM data tend to increase along the causal ordering. Several popular algorithms exploit these artifacts, possibly leading to conclusions that do not generalize to real-world settings. Existing metrics like $\operatorname{Var}$-sortability and $\operatorname{R^2}$-sortability quantify these patterns, but they do not provide tools to remedy them. To address this, we propose internally-standardized structural causal models (iSCMs), a modification of SCMs that introduces a standardization operation at each variable during the generative process. By construction, iSCMs are not $\operatorname{Var}$-sortable, and as we show experimentally, not $\operatorname{R^2}$-sortable either for commonly-used graph families. Moreover, contrary to the post-hoc standardization of data generated by standard SCMs, we prove that linear iSCMs are less identifiable from prior knowledge on the weights and do not collapse to deterministic relationships in large systems, which may make iSCMs a useful model in causal inference beyond the benchmarking problem studied here.
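
The contrast between a standard SCM and an internally-standardized one can be illustrated with a minimal sketch. This is not the authors' implementation: the chain graph, the edge weights of 2.0, the unit Gaussian noise, the function names, and the use of sample statistics for standardization are all assumptions made for illustration.

```python
import random
import statistics


def standardize(values):
    """Shift and scale a sample to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def sample_chain(n_samples, weights, internally_standardize, seed=0):
    """Sample a linear chain x0 -> x1 -> ... with unit Gaussian noise.

    If internally_standardize is True, each variable is standardized
    before it is passed on to its child (the iSCM-style variant, here
    approximated with sample statistics, which is an assumption).
    """
    rng = random.Random(seed)
    columns = []
    prev = None
    for i in range(len(weights) + 1):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        if prev is None:
            x = noise  # root variable: noise only
        else:
            w = weights[i - 1]
            x = [w * p + e for p, e in zip(prev, noise)]
        if internally_standardize:
            x = standardize(x)
        columns.append(x)
        prev = x
    return columns


# Hypothetical edge weights, chosen only to make the effect visible.
weights = [2.0, 2.0, 2.0]
scm = sample_chain(2000, weights, internally_standardize=False)
iscm = sample_chain(2000, weights, internally_standardize=True)

scm_vars = [statistics.pvariance(c) for c in scm]
iscm_vars = [statistics.pvariance(c) for c in iscm]
print("SCM variances: ", [round(v, 2) for v in scm_vars])
print("iSCM variances:", [round(v, 2) for v in iscm_vars])
```

In the plain SCM the variance grows along the causal ordering, which is the artifact that sorting-based heuristics can exploit; with standardization applied inside the generative process, every variable has unit variance, so ordering by variance carries no information.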

Title: Can Many-Shot In-Context Learning Help Long-Context LLM Judges? See More, Judge Better!

Authors: Mingyang Song, Mao Zheng, Xuan Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Can Many-Shot In-Context Learning Help Long-Context LLM Judges? See More, Judge Better!(https://arxiv.org/abs/)
Keywords: in-context
Abstract: Leveraging Large Language Models (LLMs) as judges for evaluating the performance of LLMs has recently garnered attention. Nonetheless, this type of approach concurrently introduces potential biases from LLMs, raising concerns about the reliability of the evaluation results. To mitigate this issue, we propose and study two versions of many-shot in-context prompts, Reinforced and Unsupervised ICL, for helping GPT-4o-as-a-Judge in single answer grading. Based on the designed prompts, we investigate the impact of scaling the number of in-context examples on the agreement and quality of the evaluation. Furthermore, we first reveal the symbol bias in GPT-4o-as-a-Judge for pairwise comparison and then propose a simple yet effective approach to mitigate it. Experimental results show that advanced long-context LLMs, such as GPT-4o, perform better in the many-shot regime than in the zero-shot regime. Meanwhile, the experimental results further verify the effectiveness of the symbol bias mitigation approach.

Title: AnyMaker: Zero-shot General Object Customization via Decoupled Dual-Level ID Injection

Authors: Lingjie Kong, Kai Wu, Xiaobin Hu, Wenhui Han, Jinlong Peng, Chengming Xu, Donghao Luo, Jiangning Zhang, Chengjie Wang, Yanwei Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] AnyMaker: Zero-shot General Object Customization via Decoupled Dual-Level ID Injection(https://arxiv.org/abs/)
Keywords: diffusion, self-supervised
Abstract: Text-to-image based object customization, aiming to generate images with the same identity (ID) as objects of interest in accordance with text prompts and reference images, has made significant progress. However, recent customization research is dominated by specialized tasks, such as human customization or virtual try-on, leaving a gap in general object customization. To this end, we introduce AnyMaker, an innovative zero-shot object customization framework capable of generating general objects with high ID fidelity and flexible text editability. The efficacy of AnyMaker stems from its novel general ID extraction, dual-level ID injection, and ID-aware decoupling. Specifically, the general ID extraction module extracts sufficient ID information with an ensemble of self-supervised models to tackle the diverse customization tasks for general objects. Then, to provide the diffusion UNet with as much of the extracted ID as possible without damaging text editability in the generation process, we design a global-local dual-level ID injection module, in which the global-level semantic ID is injected into text descriptions while the local-level ID details are injected directly into the model through newly added cross-attention modules. In addition, we propose an ID-aware decoupling module to disentangle ID-related information from non-ID elements in the extracted representations for high-fidelity generation of both identity and text descriptions. To validate our approach and boost the research of general object customization, we create the first large-scale general ID dataset, the Multi-Category ID-Consistent (MC-IDC) dataset, with 315k text-image samples and 10k categories. Experiments show that AnyMaker presents remarkable performance in general object customization and outperforms specialized methods in corresponding tasks. Code and dataset will be released soon.

Title: HoLLMwood: Unleashing the Creativity of Large Language Models in Screenwriting via Role Playing

Authors: Jing Chen, Xinyu Zhu, Cheng Yang, Chufan Shi, Yadong Xi, Yuxiang Zhang, Junjie Wang, Jiashu Pu, Rongsheng Zhang, Yujiu Yang, Tian Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] HoLLMwood: Unleashing the Creativity of Large Language Models in Screenwriting via Role Playing(https://arxiv.org/abs/)
Keywords: generative
Abstract: Generative AI has demonstrated unprecedented creativity in the field of computer vision, yet such phenomena have not been observed in natural language processing. In particular, large language models (LLMs) can hardly produce written works at the level of human experts due to the extremely high complexity of literature writing. In this paper, we present HoLLMwood, an automated framework for unleashing the creativity of LLMs and exploring their potential in screenwriting, which is a highly demanding task. Mimicking the human creative process, we assign LLMs to different roles involved in the real-world scenario. In addition to the common practice of treating LLMs as Writer, we also apply LLMs as Editor, who is responsible for providing feedback and revision advice to Writer. Besides, to enrich the characters and deepen the plots, we introduce a role-playing mechanism and adopt LLMs as Actors that can communicate and interact with each other. Evaluations on automatically generated screenplays show that HoLLMwood substantially outperforms strong baselines in terms of coherence, relevance, interestingness and overall quality.

Title: Lightweight Model Pre-training via Language Guided Knowledge Distillation

Authors: Mingsheng Li, Lin Zhang, Mingzhen Zhu, Zilong Huang, Gang Yu, Jiayuan Fan, Tao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Lightweight Model Pre-training via Language Guided Knowledge Distillation(https://arxiv.org/abs/)
Keywords: self-supervised
Abstract: This paper studies the problem of pre-training for small models, which is essential for many mobile devices. Current state-of-the-art methods on this problem transfer the representational knowledge of a large network (as a Teacher) into a smaller model (as a Student) using self-supervised distillation, improving the performance of the small model on downstream tasks. However, existing approaches are insufficient in extracting the crucial knowledge that is useful for discerning categories in downstream tasks during the distillation process. In this paper, for the first time, we introduce language guidance to the distillation process and propose a new method named Language-Guided Distillation (LGD) system, which uses category names of the target downstream task to help refine the knowledge transferred between the teacher and student. To this end, we utilize a pre-trained text encoder to extract semantic embeddings from language and construct a textual semantic space called Textual Semantics Bank (TSB). Furthermore, we design a Language-Guided Knowledge Aggregation (LGKA) module to construct the visual semantic space, also named Visual Semantics Bank (VSB). The task-related knowledge is transferred by driving a student encoder to mimic the similarity score distribution inferred by a teacher over TSB and VSB. Compared with other small models obtained by either ImageNet pre-training or self-supervised distillation, experiment results show that the distilled lightweight model using the proposed LGD method presents state-of-the-art performance and is validated on various downstream tasks, including classification, detection, and segmentation. We have made the code available at this https URL.

Title: Meta Reasoning for Large Language Models

Authors: Peizhong Gao, Ao Xie, Shaoguang Mao, Wenshan Wu, Yan Xia, Haipeng Mi, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Meta Reasoning for Large Language Models(https://arxiv.org/abs/)
Keywords: in-context
Abstract: We introduce Meta-Reasoning Prompting (MRP), a novel and efficient system prompting method for large language models (LLMs) inspired by human meta-reasoning. Traditional in-context learning-based reasoning techniques, such as Tree-of-Thoughts, show promise but lack consistent state-of-the-art performance across diverse tasks due to their specialized nature. MRP addresses this limitation by guiding LLMs to dynamically select and apply different reasoning methods based on the specific requirements of each task, optimizing both performance and computational efficiency. With MRP, LLM reasoning operates in two phases. Initially, the LLM identifies the most appropriate reasoning method using task input cues and objective descriptions of available methods. Subsequently, it applies the chosen method to complete the task. This dynamic strategy mirrors human meta-reasoning, allowing the model to excel in a wide range of problem domains. We evaluate the effectiveness of MRP through comprehensive benchmarks. The results demonstrate that MRP achieves or approaches state-of-the-art performance across diverse tasks. MRP represents a significant advancement in enabling LLMs to identify cognitive challenges across problems and leverage benefits across different reasoning approaches, enhancing their ability to handle diverse and complex problem domains efficiently. Every LLM deserves a Meta-Reasoning Prompting to unlock its full potential and ensure adaptability in an ever-evolving landscape of challenges and applications.

Title: Latent Denoising Diffusion GAN: Faster sampling, Higher image quality

Authors: Luan Thanh Trinh, Tomoki Hamagami
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Latent Denoising Diffusion GAN: Faster sampling, Higher image quality(https://arxiv.org/abs/)
Keywords: diffusion
Abstract: Diffusion models are emerging as powerful solutions for generating high-fidelity and diverse images, often surpassing GANs under many circumstances. However, their slow inference speed hinders their potential for real-time applications. To address this, DiffusionGAN leveraged a conditional GAN to drastically reduce the denoising steps and speed up inference. Its advancement, Wavelet Diffusion, further accelerated the process by converting data into wavelet space, thus enhancing efficiency. Nonetheless, these models still fall short of GANs in terms of speed and image quality. To bridge these gaps, this paper introduces the Latent Denoising Diffusion GAN, which employs pre-trained autoencoders to compress images into a compact latent space, significantly improving inference speed and image quality. Furthermore, we propose a Weighted Learning strategy to enhance diversity and image quality. Experimental results on the CIFAR-10, CelebA-HQ, and LSUN-Church datasets prove that our model achieves state-of-the-art running speed among diffusion models. Compared to its predecessors, DiffusionGAN and Wavelet Diffusion, our model shows remarkable improvements in all evaluation metrics. Code and pre-trained checkpoints: this https URL

Title: Transcendence: Generative Models Can Outperform The Experts That Train Them

Authors: Edwin Zhang, Vincent Zhu, Naomi Saphra, Anat Kleiman, Benjamin L. Edelman, Milind Tambe, Sham M. Kakade, Eran Malach
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Transcendence: Generative Models Can Outperform The Experts That Train Them(https://arxiv.org/abs/)
Keywords: generative
Abstract: Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on. Therefore, when trained on data generated by humans, we may not expect the artificial model to outperform the humans on their original objectives. In this work, we study the phenomenon of transcendence: when a generative model achieves capabilities that surpass the abilities of the experts generating its data. We demonstrate transcendence by training an autoregressive transformer to play chess from game transcripts, and show that the trained model can sometimes achieve better performance than all players in the dataset. We theoretically prove that transcendence is enabled by low-temperature sampling, and rigorously assess this experimentally. Finally, we discuss other sources of transcendence, laying the groundwork for future investigation of this phenomenon in a broader setting.
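
As a rough intuition for why low-temperature sampling can help, consider a toy case (not the paper's proof or its chess experiments): two hypothetical experts each choose the good move only 55% of the time, but make different mistakes. A model imitating their mixture still places the most mass on the good move, and sharpening that mixture with a temperature below 1 concentrates nearly all probability there. The names `sharpen`, `expert_a`, and `expert_b`, and all numbers, are illustrative assumptions.

```python
import math


def sharpen(probs, temperature):
    """Rescale a distribution as softmax(log p / T); T < 1 concentrates mass on the mode."""
    logits = [math.log(p) / temperature for p in probs]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical move distributions over [good, blunder_a, blunder_b].
expert_a = [0.55, 0.40, 0.05]
expert_b = [0.55, 0.05, 0.40]

# A model that imitates the data perfectly learns the average of the experts.
mixture = [(a + b) / 2 for a, b in zip(expert_a, expert_b)]

if __name__ == "__main__":
    print("T=1.0:", sharpen(mixture, 1.0))
    print("T=0.1:", sharpen(mixture, 0.1))
```

At temperature 1 the model plays the good move about as often as each expert does; at temperature 0.1 it plays it almost always, because the experts' uncorrelated errors are averaged down and then suppressed.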

Title: CELL your Model: Contrastive Explanation Methods for Large Language Models

Authors: Ronny Luss, Erik Miehling, Amit Dhurandhar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] CELL your Model: Contrastive Explanation Methods for Large Language Models(https://arxiv.org/abs/)
Keywords: generative
Abstract: The advent of black-box deep neural network classification models has sparked the need to explain their decisions. However, in the case of generative AI such as large language models (LLMs), there is no class prediction to explain. Rather, one can ask why an LLM output a particular response to a given prompt. In this paper, we answer this question by proposing, to the best of our knowledge, the first contrastive explanation methods requiring simply black-box/query access. Our explanations suggest that an LLM outputs a reply to a given prompt because if the prompt was slightly modified, the LLM would have given a different response that is either less preferable or contradicts the original response. The key insight is that contrastive explanations simply require a distance function that has meaning to the user and not necessarily a real valued representation of a specific response (viz. class label). We offer two algorithms for finding contrastive explanations: i) A myopic algorithm, which although effective in creating contrasts, requires many model calls and ii) A budgeted algorithm, our main algorithmic contribution, which intelligently creates contrasts adhering to a query budget, necessary for longer contexts. We show the efficacy of these methods on diverse natural language tasks such as open-text generation, automated red teaming, and explaining conversational degradation.
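
The idea of a query-budgeted contrastive search can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say how prompts are perturbed, so single-word deletion is used as a hypothetical perturbation, and `contrastive_search`, `toy_model`, and `zero_one_distance` are invented names rather than the authors' algorithms.

```python
from typing import Callable, Optional, Tuple


def contrastive_search(prompt: str,
                       model: Callable[[str], str],
                       distance: Callable[[str, str], float],
                       threshold: float,
                       budget: int) -> Optional[Tuple[str, str]]:
    """Return (modified_prompt, response) for the first single-word deletion whose
    response is at least `threshold` away from the original response, using at
    most `budget` model calls on modified prompts. Returns None if none is found."""
    words = prompt.split()
    original = model(prompt)  # baseline call, not counted against the budget here
    for i in range(min(budget, len(words))):
        candidate = " ".join(words[:i] + words[i + 1:])
        response = model(candidate)
        if distance(original, response) >= threshold:
            return candidate, response
    return None


def toy_model(prompt: str) -> str:
    # Hypothetical black box: refuses whenever the word "not" appears.
    return "refuse" if "not" in prompt.split() else "comply"


def zero_one_distance(a: str, b: str) -> float:
    return 0.0 if a == b else 1.0


if __name__ == "__main__":
    print(contrastive_search("do not open the door", toy_model,
                             zero_one_distance, threshold=1.0, budget=5))
```

The only requirement on `distance` is that it is meaningful to the user, which matches the abstract's point that no class label or real-valued representation of the response is needed.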

Title: MegaScenes: Scene-Level View Synthesis at Scale

Authors: Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, Noah Snavely
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] MegaScenes: Scene-Level View Synthesis at Scale(https://arxiv.org/abs/)
Keywords: diffusion, foundation model
Abstract: Scene-level novel view synthesis (NVS) is fundamental to many vision and graphics applications. Recently, pose-conditioned diffusion models have led to significant progress by extracting 3D information from 2D foundation models, but these methods are limited by the lack of scene-level training data. Common dataset choices either consist of isolated objects (Objaverse), or of object-centric scenes with limited pose distributions (DTU, CO3D). In this paper, we create a large-scale scene-level dataset from Internet photo collections, called MegaScenes, which contains over 100K structure from motion (SfM) reconstructions from around the world. Internet photos represent a scalable data source but come with challenges such as lighting and transient objects. We address these issues to further create a subset suitable for the task of NVS. Additionally, we analyze failure cases of state-of-the-art NVS methods and significantly improve generation consistency. Through extensive experiments, we validate the effectiveness of both our dataset and method on generating in-the-wild scenes. For details on the dataset and code, see our project page at this https URL.

Title: Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models

Authors: Bingqi Ma, Zhuofan Zong, Guanglu Song, Hongsheng Li, Yu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models(https://arxiv.org/abs/)
Keywords: diffusion
Abstract: Large language models (LLMs) based on decoder-only transformers have demonstrated superior text understanding capabilities compared to CLIP and T5-series models. However, the paradigm for utilizing current advanced LLMs in text-to-image diffusion models remains to be explored. We observed an unusual phenomenon: directly using a large language model as the prompt encoder significantly degrades the prompt-following ability in image generation. We identified two main obstacles behind this issue. One is the misalignment between the next token prediction training in LLM and the requirement for discriminative prompt features in diffusion models. The other is the intrinsic positional bias introduced by the decoder-only architecture. To deal with this issue, we propose a novel framework to fully harness the capabilities of LLMs. Through the carefully designed usage guidance, we effectively enhance the text representation capability for prompt encoding and eliminate its inherent positional bias. This allows us to integrate state-of-the-art LLMs into the text-to-image generation model flexibly. Furthermore, we also provide an effective manner to fuse multiple LLMs into our framework. Considering the excellent performance and scaling capabilities demonstrated by the transformer architecture, we further design an LLM-Infused Diffusion Transformer (LI-DiT) based on the framework. We conduct extensive experiments to validate LI-DiT across model size and data size. Benefiting from the inherent ability of the LLMs and our innovative designs, the prompt understanding performance of LI-DiT easily surpasses state-of-the-art open-source models as well as mainstream closed-source commercial models including Stable Diffusion 3, DALL-E 3, and Midjourney V6. The powerful LI-DiT-10B will be available after further optimization and security checks.

Title: OoDIS: Anomaly Instance Segmentation Benchmark

Authors: Alexey Nekrasov, Rui Zhou, Miriam Ackermann, Alexander Hermans, Bastian Leibe, Matthias Rottmann
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] OoDIS: Anomaly Instance Segmentation Benchmark(https://arxiv.org/abs/)
Keywords: anomaly
Abstract: Autonomous vehicles require a precise understanding of their environment to navigate safely. Reliable identification of unknown objects, especially those that are absent during training, such as wild animals, is critical due to their potential to cause serious accidents. Significant progress in semantic segmentation of anomalies has been driven by the availability of out-of-distribution (OOD) benchmarks. However, a comprehensive understanding of scene dynamics requires the segmentation of individual objects, and thus the segmentation of instances is essential. Development in this area has been lagging, largely due to the lack of dedicated benchmarks. To address this gap, we have extended the most commonly used anomaly segmentation benchmarks to include the instance segmentation task. Our evaluation of anomaly instance segmentation methods shows that this challenge remains an unsolved problem. The benchmark website and the competition page can be found at: this https URL.

Title: Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99%

Authors: Lei Zhu, Fangyun Wei, Yanye Lu, Dong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99%(https://arxiv.org/abs/)
Keywords: diffusion, generative
Abstract: In the realm of image quantization exemplified by VQGAN, the process encodes images into discrete tokens drawn from a codebook with a predefined size. Recent advancements, particularly with LLAMA 3, reveal that enlarging the codebook significantly enhances model performance. However, VQGAN and its derivatives, such as VQGAN-FC (Factorized Codes) and VQGAN-EMA, continue to grapple with challenges related to expanding the codebook size and enhancing codebook utilization. For instance, VQGAN-FC is restricted to learning a codebook with a maximum size of 16,384, maintaining a typically low utilization rate of less than 12% on ImageNet. In this work, we propose a novel image quantization model named VQGAN-LC (Large Codebook), which extends the codebook size to 100,000, achieving a utilization rate exceeding 99%. Unlike previous methods that optimize each codebook entry, our approach begins with a codebook initialized with 100,000 features extracted by a pre-trained vision encoder. Optimization then focuses on training a projector that aligns the entire codebook with the feature distributions of the encoder in VQGAN-LC. We demonstrate the superior performance of our model over its counterparts across a variety of tasks, including image reconstruction, image classification, auto-regressive image generation using GPT, and image creation with diffusion- and flow-based generative models. Code and models are available at this https URL.
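
For readers unfamiliar with the term, a codebook utilization rate is commonly computed as the fraction of codebook entries that are selected at least once when a set of features is quantized to its nearest entries. The sketch below shows that computation on toy data; it assumes squared Euclidean nearest-neighbour assignment, and the paper's exact measurement protocol is not given in the abstract. The names `nearest_code` and `utilization_rate` are illustrative.

```python
def nearest_code(feature, codebook):
    """Index of the codebook entry closest to `feature` (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((f - c) ** 2 for f, c in zip(feature, codebook[i])),
    )


def utilization_rate(features, codebook):
    """Fraction of codebook entries chosen at least once when quantizing `features`."""
    used = {nearest_code(f, codebook) for f in features}
    return len(used) / len(codebook)


# Toy data: four 2-D codes, but every feature falls near the first two.
codebook = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0), (20.0, 20.0)]
features = [(0.1, 0.0), (0.9, 1.2), (0.2, 0.1)]

if __name__ == "__main__":
    print(utilization_rate(features, codebook))
```

In this toy case half of the codebook is never selected, which is the kind of under-utilization the abstract reports for earlier VQGAN variants.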

Title: Autoregressive Image Generation without Vector Quantization

Authors: Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, Kaiming He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/
Pdf URL: https://arxiv.org/pdf/
Copy Paste: [[]] Autoregressive Image Generation without Vector Quantization(https://arxiv.org/abs/)
Keywords: diffusion
Abstract: Conventional wisdom holds that autoregressive models for image generation are typically accompanied by vector-quantized tokens. We observe that while a discrete-valued space can facilitate representing a categorical distribution, it is not a necessity for autoregressive modeling. In this work, we propose to model the per-token probability distribution using a diffusion procedure, which allows us to apply autoregressive models in a continuous-valued space. Rather than using categorical cross-entropy loss, we define a Diffusion Loss function to model the per-token probability. This approach eliminates the need for discrete-valued tokenizers. We evaluate its effectiveness across a wide range of cases, including standard autoregressive models and generalized masked autoregressive (MAR) variants. By removing vector quantization, our image generator achieves strong results while enjoying the speed advantage of sequence modeling. We hope this work will motivate the use of autoregressive generation in other continuous-valued domains and applications.