2024-04-03

Title: Holo-VQVAE: VQ-VAE for phase-only holograms

Authors: Joohyun Park, Hyeongyeop Kang
Subjects: cs.CV, cs.GR, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2404.01330
Pdf URL: https://arxiv.org/pdf/2404.01330
Copy Paste: [[2404.01330]] Holo-VQVAE: VQ-VAE for phase-only holograms(https://arxiv.org/abs/2404.01330)
Keywords: generative
Abstract: Holography stands at the forefront of visual technology innovation, offering immersive, three-dimensional visualizations through the manipulation of light wave amplitude and phase. Contemporary research in hologram generation has predominantly focused on image-to-hologram conversion, producing holograms from existing images. These approaches, while effective, inherently limit the scope of innovation and creativity in hologram generation. In response to this limitation, we present Holo-VQVAE, a novel generative framework tailored for phase-only holograms (POHs). Holo-VQVAE leverages the architecture of Vector Quantized Variational AutoEncoders, enabling it to learn the complex distributions of POHs. Furthermore, it integrates the Angular Spectrum Method into the training process, facilitating learning in the image domain. This framework allows for the generation of unseen, diverse holographic content directly from its intricately learned latent space without requiring pre-existing images. This pioneering work paves the way for groundbreaking applications and methodologies in holographic content creation, opening a new era in the exploration of holographic content.

Title: LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

Authors: Musashi Hinck, Matthew L. Olson, David Cobbley, Shao-Yen Tseng, Vasudev Lal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.01331
Pdf URL: https://arxiv.org/pdf/2404.01331
Copy Paste: [[2404.01331]] LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model(https://arxiv.org/abs/2404.01331)
Keywords: foundation model
Abstract: We train a suite of multimodal foundation models (MMFM) using the popular LLaVA framework with the recently released Gemma family of large language models (LLMs). Of particular interest is the 2B parameter Gemma model, which provides opportunities to construct capable small-scale MMFMs. In line with findings from other papers in this space, we test the effect of ablating three design features: pretraining the connector, utilizing a more powerful image backbone, and increasing the size of the language backbone. The resulting models, which we call LLaVA-Gemma, exhibit moderate performance on an array of evaluations, but fail to improve past the current comparably sized SOTA models. Closer analysis of performance shows mixed effects; skipping pretraining tends to reduce performance, larger vision models sometimes improve performance, and increasing language model size has inconsistent effects. We publicly release training recipes, code and weights for our models for the LLaVA-Gemma models.

Title: Generative AI for Architectural Design: A Literature Review

Authors: Chengyuan Li, Tianyu Zhang, Xusheng Du, Ye Zhang, Haoran Xie
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2404.01335
Pdf URL: https://arxiv.org/pdf/2404.01335
Copy Paste: [[2404.01335]] Generative AI for Architectural Design: A Literature Review(https://arxiv.org/abs/2404.01335)
Keywords: generative
Abstract: Generative Artificial Intelligence (AI) has pioneered new methodological paradigms in architectural design, significantly expanding the innovative potential and efficiency of the design process. This paper explores the extensive applications of generative AI technologies in architectural design, a trend that has benefited from the rapid development of deep generative models. This article provides a comprehensive review of the basic principles of generative AI and large-scale models and highlights the applications in the generation of 2D images, videos, and 3D models. In addition, by reviewing the latest literature from 2020, this paper scrutinizes the impact of generative AI technologies at different stages of architectural design, from generating initial architectural 3D forms to producing final architectural imagery. The marked trend of research growth indicates an increasing inclination within the architectural design community towards embracing generative AI, thereby catalyzing a shared enthusiasm for research. These research cases and methodologies have not only proven to enhance efficiency and innovation significantly but have also posed challenges to the conventional boundaries of architectural creativity. Finally, we point out new directions for design innovation and articulate fresh trajectories for applying generative AI in the architectural domain. This article provides the first comprehensive literature review about generative AI for architectural design, and we believe this work can facilitate more research work on this significant topic in architecture.

Title: DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

Authors: Lirui Zhao, Yue Yang, Kaipeng Zhang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Rongrong Ji
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.01342
Pdf URL: https://arxiv.org/pdf/2404.01342
Copy Paste: [[2404.01342]] DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model(https://arxiv.org/abs/2404.01342)
Keywords: generative
Abstract: Text-to-image (T2I) generative models have attracted significant attention and found extensive applications within and beyond academic research. For example, the Civitai community, a platform for T2I innovation, currently hosts an impressive array of 74,492 distinct models. However, this diversity presents a formidable challenge in selecting the most appropriate model and parameters, a process that typically requires numerous trials. Drawing inspiration from the tool usage research of large language models (LLMs), we introduce DiffAgent, an LLM agent designed to screen the accurate selection in seconds via API calls. DiffAgent leverages a novel two-stage training framework, SFTA, enabling it to accurately align T2I API responses with user input in accordance with human preferences. To train and evaluate DiffAgent's capabilities, we present DABench, a comprehensive dataset encompassing an extensive range of T2I APIs from the community. Our evaluations reveal that DiffAgent not only excels in identifying the appropriate T2I API but also underscores the effectiveness of the SFTA training framework. Codes are available at https://github.com/OpenGVLab/DiffAgent.

Title: Bigger is not Always Better: Scaling Properties of Latent Diffusion Models

Authors: Kangfu Mei, Zhengzhong Tu, Mauricio Delbracio, Hossein Talebi, Vishal M. Patel, Peyman Milanfar
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2404.01367
Pdf URL: https://arxiv.org/pdf/2404.01367
Copy Paste: [[2404.01367]] Bigger is not Always Better: Scaling Properties of Latent Diffusion Models(https://arxiv.org/abs/2404.01367)
Keywords: diffusion, generative
Abstract: We study the scaling properties of latent diffusion models (LDMs) with an emphasis on their sampling efficiency. While improved network architecture and inference algorithms have shown to effectively boost sampling efficiency of diffusion models, the role of model size -- a critical determinant of sampling efficiency -- has not been thoroughly examined. Through empirical analysis of established text-to-image diffusion models, we conduct an in-depth investigation into how model size influences sampling efficiency across varying sampling steps. Our findings unveil a surprising trend: when operating under a given inference budget, smaller models frequently outperform their larger equivalents in generating high-quality results. Moreover, we extend our study to demonstrate the generalizability of the these findings by applying various diffusion samplers, exploring diverse downstream tasks, evaluating post-distilled models, as well as comparing performance relative to training compute. These findings open up new pathways for the development of LDM scaling strategies which can be employed to enhance generative capabilities within limited inference budgets.

Title: Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data

Authors: Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David L. Donoho, Sanmi Koyejo
Subjects: cs.LG, cs.AI, cs.CL, cs.ET, stat.ML
Abstract URL: https://arxiv.org/abs/2404.01413
Pdf URL: https://arxiv.org/pdf/2404.01413
Copy Paste: [[2404.01413]] Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data(https://arxiv.org/abs/2404.01413)
Keywords: diffusion, generative
Abstract: The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops discovered that such loops can lead to model collapse, a phenomenon where performance progressively degrades with each model-fitting iteration until the latest model becomes useless. However, several recent papers studying model collapse assumed that new data replace old data over time rather than assuming data accumulate over time. In this paper, we compare these two settings and show that accumulating data prevents model collapse. We begin by studying an analytically tractable setup in which a sequence of linear models are fit to the previous models' predictions. Previous work showed if data are replaced, the test error increases linearly with the number of model-fitting iterations; we extend this result by proving that if data instead accumulate, the test error has a finite upper bound independent of the number of iterations. We next empirically test whether accumulating data similarly prevents model collapse by pretraining sequences of language models on text corpora. We confirm that replacing data does indeed cause model collapse, then demonstrate that accumulating data prevents model collapse; these results hold across a range of model sizes, architectures and hyperparameters. We further show that similar results hold for other deep generative models on real data: diffusion models for molecule generation and variational autoencoders for image generation. Our work provides consistent theoretical and empirical evidence that data accumulation mitigates model collapse.

Title: DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery

Authors: Yixuan Zhu, Ao Li, Yansong Tang, Wenliang Zhao, Jie Zhou, Jiwen Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.01424
Pdf URL: https://arxiv.org/pdf/2404.01424
Copy Paste: [[2404.01424]] DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery(https://arxiv.org/abs/2404.01424)
Keywords: diffusion
Abstract: The recovery of occluded human meshes presents challenges for current methods due to the difficulty in extracting effective image features under severe occlusion. In this paper, we introduce DPMesh, an innovative framework for occluded human mesh recovery that capitalizes on the profound diffusion prior about object structure and spatial relationships embedded in a pre-trained text-to-image diffusion model. Unlike previous methods reliant on conventional backbones for vanilla feature extraction, DPMesh seamlessly integrates the pre-trained denoising U-Net with potent knowledge as its image backbone and performs a single-step inference to provide occlusion-aware information. To enhance the perception capability for occluded poses, DPMesh incorporates well-designed guidance via condition injection, which produces effective controls from 2D observations for the denoising U-Net. Furthermore, we explore a dedicated noisy key-point reasoning approach to mitigate disturbances arising from occlusion and crowded scenarios. This strategy fully unleashes the perceptual capability of the diffusion prior, thereby enhancing accuracy. Extensive experiments affirm the efficacy of our framework, as we outperform state-of-the-art methods on both occlusion-specific and standard datasets. The persuasive results underscore its ability to achieve precise and robust 3D human mesh recovery, particularly in challenging scenarios involving occlusion and crowded scenes.

Title: Predicting the Performance of Foundation Models via Agreement-on-the-Line

Authors: Aman Mehra, Rahul Saxena, Taeyoun Kim, Christina Baek, Zico Kolter, Aditi Raghunathan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2404.01542
Pdf URL: https://arxiv.org/pdf/2404.01542
Copy Paste: [[2404.01542]] Predicting the Performance of Foundation Models via Agreement-on-the-Line(https://arxiv.org/abs/2404.01542)
Keywords: foundation model
Abstract: Estimating the out-of-distribution performance in regimes where labels are scarce is critical to safely deploy foundation models. Recently, it was shown that ensembles of neural networks observe the phenomena ``agreement-on-the-line'', which can be leveraged to reliably predict OOD performance without labels. However, in contrast to classical neural networks that are trained on in-distribution data from scratch for numerous epochs, foundation models undergo minimal finetuning from heavily pretrained weights, which may reduce the ensemble diversity needed to observe agreement-on-the-line. In our work, we demonstrate that when lightly finetuning multiple runs from a $\textit{single}$ foundation model, the choice of randomness during training (linear head initialization, data ordering, and data subsetting) can lead to drastically different levels of agreement-on-the-line in the resulting ensemble. Surprisingly, only random head initialization is able to reliably induce agreement-on-the-line in finetuned foundation models across vision and language benchmarks. Second, we demonstrate that ensembles of $\textit{multiple}$ foundation models pretrained on different datasets but finetuned on the same task can also show agreement-on-the-line. In total, by careful construction of a diverse ensemble, we can utilize agreement-on-the-line-based methods to predict the OOD performance of foundation models with high precision.

Title: Diffusion Deepfake

Authors: Chaitali Bhattacharyya, Hanxiao Wang, Feng Zhang, Sungho Kim, Xiatian Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.01579
Pdf URL: https://arxiv.org/pdf/2404.01579
Copy Paste: [[2404.01579]] Diffusion Deepfake(https://arxiv.org/abs/2404.01579)
Keywords: diffusion, generative
Abstract: Recent progress in generative AI, primarily through diffusion models, presents significant challenges for real-world deepfake detection. The increased realism in image details, diverse content, and widespread accessibility to the general public complicates the identification of these sophisticated deepfakes. Acknowledging the urgency to address the vulnerability of current deepfake detectors to this evolving threat, our paper introduces two extensive deepfake datasets generated by state-of-the-art diffusion models as other datasets are less diverse and low in quality. Our extensive experiments also showed that our dataset is more challenging compared to the other face deepfake datasets. Our strategic dataset creation not only challenge the deepfake detectors but also sets a new benchmark for more evaluation. Our comprehensive evaluation reveals the struggle of existing detection methods, often optimized for specific image domains and manipulations, to effectively adapt to the intricate nature of diffusion deepfakes, limiting their practical utility. To address this critical issue, we investigate the impact of enhancing training data diversity on representative detection methods. This involves expanding the diversity of both manipulation techniques and image domains. Our findings underscore that increasing training data diversity results in improved generalizability. Moreover, we propose a novel momentum difficulty boosting strategy to tackle the additional challenge posed by training data heterogeneity. This strategy dynamically assigns appropriate sample weights based on learning difficulty, enhancing the model's adaptability to both easy and challenging samples. Extensive experiments on both existing and newly proposed benchmarks demonstrate that our model optimization approach surpasses prior alternatives significantly.

Title: Entity Disambiguation via Fusion Entity Decoding

Authors: Junxiong Wang, Ali Mousavi, Omar Attia, Saloni Potdar, Alexander M. Rush, Umar Farooq Minhas, Yunyao Li
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2404.01626
Pdf URL: https://arxiv.org/pdf/2404.01626
Copy Paste: [[2404.01626]] Entity Disambiguation via Fusion Entity Decoding(https://arxiv.org/abs/2404.01626)
Keywords: generative
Abstract: Entity disambiguation (ED), which links the mentions of ambiguous entities to their referent entities in a knowledge base, serves as a core component in entity linking (EL). Existing generative approaches demonstrate improved accuracy compared to classification approaches under the standardized ZELDA benchmark. Nevertheless, generative approaches suffer from the need for large-scale pre-training and inefficient generation. Most importantly, entity descriptions, which could contain crucial information to distinguish similar entities from each other, are often overlooked. We propose an encoder-decoder model to disambiguate entities with more detailed entity descriptions. Given text and candidate entities, the encoder learns interactions between the text and each candidate entity, producing representations for each entity candidate. The decoder then fuses the representations of entity candidates together and selects the correct entity. Our experiments, conducted on various entity disambiguation benchmarks, demonstrate the strong and robust performance of this model, particularly +1.5% in the ZELDA benchmark compared with GENRE. Furthermore, we integrate this approach into the retrieval/reader framework and observe +1.5% improvements in end-to-end entity linking in the GERBIL benchmark compared with EntQA.

Title: Enhancing Functional Safety in Automotive AMS Circuits through Unsupervised Machine Learning

Authors: Ayush Arunachalam, Ian Kintz, Suvadeep Banerjee, Arnab Raha, Xiankun Jin, Fei Su, Viswanathan Pillai Prasanth, Rubin A. Parekhji, Suriyaprakash Natarajan, Kanad Basu
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2404.01632
Pdf URL: https://arxiv.org/pdf/2404.01632
Copy Paste: [[2404.01632]] Enhancing Functional Safety in Automotive AMS Circuits through Unsupervised Machine Learning(https://arxiv.org/abs/2404.01632)
Keywords: anomaly
Abstract: Given the widespread use of safety-critical applications in the automotive field, it is crucial to ensure the Functional Safety (FuSa) of circuits and components within automotive systems. The Analog and Mixed-Signal (AMS) circuits prevalent in these systems are more vulnerable to faults induced by parametric perturbations, noise, environmental stress, and other factors, in comparison to their digital counterparts. However, their continuous signal characteristics present an opportunity for early anomaly detection, enabling the implementation of safety mechanisms to prevent system failure. To address this need, we propose a novel framework based on unsupervised machine learning for early anomaly detection in AMS circuits. The proposed approach involves injecting anomalies at various circuit locations and individual components to create a diverse and comprehensive anomaly dataset, followed by the extraction of features from the observed circuit signals. Subsequently, we employ clustering algorithms to facilitate anomaly detection. Finally, we propose a time series framework to enhance and expedite anomaly detection performance. Our approach encompasses a systematic analysis of anomaly abstraction at multiple levels pertaining to the automotive domain, from hardware- to block-level, where anomalies are injected to create diverse fault scenarios. By monitoring the system behavior under these anomalous conditions, we capture the propagation of anomalies and their effects at different abstraction levels, thereby potentially paving the way for the implementation of reliable safety mechanisms to ensure the FuSa of automotive SoCs. Our experimental findings indicate that our approach achieves 100% anomaly detection accuracy and significantly optimizes the associated latency by 5X, underscoring the effectiveness of our devised solution.

Title: AI WALKUP: A Computer-Vision Approach to Quantifying MDS-UPDRS in Parkinson's Disease

Authors: Xiang Xiang, Zihan Zhang, Jing Ma, Yao Deng
Subjects: cs.CV, cs.AI, eess.IV, eess.SP
Abstract URL: https://arxiv.org/abs/2404.01654
Pdf URL: https://arxiv.org/pdf/2404.01654
Copy Paste: [[2404.01654]] AI WALKUP: A Computer-Vision Approach to Quantifying MDS-UPDRS in Parkinson's Disease(https://arxiv.org/abs/2404.01654)
Keywords: generative
Abstract: Parkinson's Disease (PD) is the second most common neurodegenerative disorder. The existing assessment method for PD is usually the Movement Disorder Society - Unified Parkinson's Disease Rating Scale (MDS-UPDRS) to assess the severity of various types of motor symptoms and disease progression. However, manual assessment suffers from high subjectivity, lack of consistency, and high cost and low efficiency of manual communication. We want to use a computer vision based solution to capture human pose images based on a camera, reconstruct and perform motion analysis using algorithms, and extract the features of the amount of motion through feature engineering. The proposed approach can be deployed on different smartphones, and the video recording and artificial intelligence analysis can be done quickly and easily through our APP.

Title: FashionEngine: Interactive Generation and Editing of 3D Clothed Humans

Authors: Tao Hu, Fangzhou Hong, Zhaoxi Chen, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.01655
Pdf URL: https://arxiv.org/pdf/2404.01655
Copy Paste: [[2404.01655]] FashionEngine: Interactive Generation and Editing of 3D Clothed Humans(https://arxiv.org/abs/2404.01655)
Keywords: diffusion
Abstract: We present FashionEngine, an interactive 3D human generation and editing system that allows us to design 3D digital humans in a way that aligns with how humans interact with the world, such as natural languages, visual perceptions, and hand-drawing. FashionEngine automates the 3D human production with three key components: 1) A pre-trained 3D human diffusion model that learns to model 3D humans in a semantic UV latent space from 2D image training data, which provides strong priors for diverse generation and editing tasks. 2) Multimodality-UV Space encoding the texture appearance, shape topology, and textual semantics of human clothing in a canonical UV-aligned space, which faithfully aligns the user multimodal inputs with the implicit UV latent space for controllable 3D human editing. The multimodality-UV space is shared across different user inputs, such as texts, images, and sketches, which enables various joint multimodal editing tasks. 3) Multimodality-UV Aligned Sampler learns to sample high-quality and diverse 3D humans from the diffusion prior for multimodal user inputs. Extensive experiments validate FashionEngine's state-of-the-art performance for conditional generation/editing tasks. In addition, we present an interactive user interface for our FashionEngine that enables both conditional and unconditional generation tasks, and editing tasks including pose/view/shape control, text-, image-, and sketch-driven 3D human editing and 3D virtual try-on, in a unified framework. Our project page is at: https://taohuumd.github.io/projects/FashionEngine.

Title: Release of Pre-Trained Models for the Japanese Language

Authors: Kei Sawada, Tianyu Zhao, Makoto Shing, Kentaro Mitsui, Akio Kaga, Yukiya Hono, Toshiaki Wakatsuki, Koh Mitsuda
Subjects: cs.CL, cs.AI, cs.CV, cs.LG, eess.AS
Abstract URL: https://arxiv.org/abs/2404.01657
Pdf URL: https://arxiv.org/pdf/2404.01657
Copy Paste: [[2404.01657]] Release of Pre-Trained Models for the Japanese Language(https://arxiv.org/abs/2404.01657)
Keywords: diffusion, generative
Abstract: AI democratization aims to create a world in which the average person can utilize AI techniques. To achieve this goal, numerous research institutes have attempted to make their results accessible to the public. In particular, large pre-trained models trained on large-scale data have shown unprecedented potential, and their release has had a significant impact. However, most of the released models specialize in the English language, and thus, AI democratization in non-English-speaking communities is lagging significantly. To reduce this gap in AI access, we released Generative Pre-trained Transformer (GPT), Contrastive Language and Image Pre-training (CLIP), Stable Diffusion, and Hidden-unit Bidirectional Encoder Representations from Transformers (HuBERT) pre-trained in Japanese. By providing these models, users can freely interface with AI that aligns with Japanese cultural values and ensures the identity of Japanese culture, thus enhancing the democratization of AI. Additionally, experiments showed that pre-trained models specialized for Japanese can efficiently achieve high performance in Japanese tasks.

Title: MotionChain: Conversational Motion Controllers via Multimodal Prompts

Authors: Biao Jiang, Xin Chen, Chi Zhang, Fukun Yin, Zhuoyuan Li, Gang YU, Jiayuan Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.01700
Pdf URL: https://arxiv.org/pdf/2404.01700
Copy Paste: [[2404.01700]] MotionChain: Conversational Motion Controllers via Multimodal Prompts(https://arxiv.org/abs/2404.01700)
Keywords: generative
Abstract: Recent advancements in language models have demonstrated their adeptness in conducting multi-turn dialogues and retaining conversational context. However, this proficiency remains largely unexplored in other multimodal generative models, particularly in human motion models. By integrating multi-turn conversations in controlling continuous virtual human movements, generative human motion models can achieve an intuitive and step-by-step process of human task execution for humanoid robotics, game agents, or other embodied systems. In this work, we present MotionChain, a conversational human motion controller to generate continuous and long-term human motion through multimodal prompts. Specifically, MotionChain consists of multi-modal tokenizers that transform various data types such as text, image, and motion, into discrete tokens, coupled with a Vision-Motion-aware Language model. By leveraging large-scale language, vision-language, and vision-motion data to assist motion-related generation tasks, MotionChain thus comprehends each instruction in multi-turn conversation and generates human motions followed by these prompts. Extensive experiments validate the efficacy of MotionChain, demonstrating state-of-the-art performance in conversational motion generation, as well as more intuitive manners of controlling and interacting with virtual humans.

Title: Upsample Guidance: Scale Up Diffusion Models without Training

Authors: Juno Hwang, Yong-Hyun Park, Junghyo Jo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2404.01709
Pdf URL: https://arxiv.org/pdf/2404.01709
Copy Paste: [[2404.01709]] Upsample Guidance: Scale Up Diffusion Models without Training(https://arxiv.org/abs/2404.01709)
Keywords: diffusion, generative
Abstract: Diffusion models have demonstrated superior performance across various generative tasks including images, videos, and audio. However, they encounter difficulties in directly generating high-resolution samples. Previously proposed solutions to this issue involve modifying the architecture, further training, or partitioning the sampling process into multiple stages. These methods have the limitation of not being able to directly utilize pre-trained models as-is, requiring additional work. In this paper, we introduce upsample guidance, a technique that adapts pretrained diffusion model (e.g., $512^2$) to generate higher-resolution images (e.g., $1536^2$) by adding only a single term in the sampling process. Remarkably, this technique does not necessitate any additional training or relying on external models. We demonstrate that upsample guidance can be applied to various models, such as pixel-space, latent space, and video diffusion models. We also observed that the proper selection of guidance scale can improve image quality, fidelity, and prompt alignment.

Title: Generative AI for Immersive Communication: The Next Frontier in Internet-of-Senses Through 6G

Authors: Nassim Sehad, Lina Bariah, Wassim Hamidouche, Hamed Hellaoui, Riku Jäntti, Mérouane Debbah
Subjects: cs.CL, cs.AI, cs.HC, cs.MM, cs.NI
Abstract URL: https://arxiv.org/abs/2404.01713
Pdf URL: https://arxiv.org/pdf/2404.01713
Copy Paste: [[2404.01713]] Generative AI for Immersive Communication: The Next Frontier in Internet-of-Senses Through 6G(https://arxiv.org/abs/2404.01713)
Keywords: generative
Abstract: Over the past two decades, the Internet-of-Things (IoT) has been a transformative concept, and as we approach 2030, a new paradigm known as the Internet of Senses (IoS) is emerging. Unlike conventional Virtual Reality (VR), IoS seeks to provide multi-sensory experiences, acknowledging that in our physical reality, our perception extends far beyond just sight and sound; it encompasses a range of senses. This article explores existing technologies driving immersive multi-sensory media, delving into their capabilities and potential applications. This exploration includes a comparative analysis between conventional immersive media streaming and a proposed use case that lever- ages semantic communication empowered by generative Artificial Intelligence (AI). The focal point of this analysis is the substantial reduction in bandwidth consumption by 99.93% in the proposed scheme. Through this comparison, we aim to underscore the practical applications of generative AI for immersive media while addressing the challenges and outlining future trajectories.

Title: AddSR: Accelerating Diffusion-based Blind Super-Resolution with Adversarial Diffusion Distillation

Authors: Rui Xie, Ying Tai, Kai Zhang, Zhenyu Zhang, Jun Zhou, Jian Yang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2404.01717
Pdf URL: https://arxiv.org/pdf/2404.01717
Copy Paste: [[2404.01717]] AddSR: Accelerating Diffusion-based Blind Super-Resolution with Adversarial Diffusion Distillation(https://arxiv.org/abs/2404.01717)
Keywords: diffusion, generative
Abstract: Blind super-resolution methods based on stable diffusion showcase formidable generative capabilities in reconstructing clear high-resolution images with intricate details from low-resolution inputs. However, their practical applicability is often hampered by poor efficiency, stemming from the requirement of thousands or hundreds of sampling steps. Inspired by the efficient text-to-image approach adversarial diffusion distillation (ADD), we design AddSR to address this issue by incorporating the ideas of both distillation and ControlNet. Specifically, we first propose a prediction-based self-refinement strategy to provide high-frequency information in the student model output with marginal additional time cost. Furthermore, we refine the training process by employing HR images, rather than LR images, to regulate the teacher model, providing a more robust constraint for distillation. Second, we introduce a timestep-adapting loss to address the perception-distortion imbalance problem introduced by ADD. Extensive experiments demonstrate our AddSR generates better restoration results, while achieving faster speed than previous SD-based state-of-the-art models (e.g., 7x faster than SeeSR).

Title: Self-Improvement Programming for Temporal Knowledge Graph Question Answering

Authors: Zhuo Chen, Zhao Zhang, Zixuan Li, Fei Wang, Yutao Zeng, Xiaolong Jin, Yongjun Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.01720
Pdf URL: https://arxiv.org/pdf/2404.01720
Copy Paste: [[2404.01720]] Self-Improvement Programming for Temporal Knowledge Graph Question Answering(https://arxiv.org/abs/2404.01720)
Keywords: in-context
Abstract: Temporal Knowledge Graph Question Answering (TKGQA) aims to answer questions with temporal intent over Temporal Knowledge Graphs (TKGs). The core challenge of this task lies in understanding the complex semantic information regarding multiple types of time constraints (e.g., before, first) in questions. Existing end-to-end methods implicitly model the time constraints by learning time-aware embeddings of questions and candidate answers, which is far from understanding the question comprehensively. Motivated by semantic-parsing-based approaches that explicitly model constraints in questions by generating logical forms with symbolic operators, we design fundamental temporal operators for time constraints and introduce a novel self-improvement Programming method for TKGQA (Prog-TQA). Specifically, Prog-TQA leverages the in-context learning ability of Large Language Models (LLMs) to understand the combinatory time constraints in the questions and generate corresponding program drafts with a few examples given. Then, it aligns these drafts to TKGs with the linking module and subsequently executes them to generate the answers. To enhance the ability to understand questions, Prog-TQA is further equipped with a self-improvement strategy to effectively bootstrap LLMs using high-quality self-generated drafts. Extensive experiments demonstrate the superiority of the proposed Prog-TQA on MultiTQ and CronQuestions datasets, especially in the Hits@1 metric.

Title: Asymptotics of Language Model Alignment

Authors: Joy Qiping Yang, Salman Salamatian, Ziteng Sun, Ananda Theertha Suresh, Ahmad Beirami
Subjects: cs.LG, cs.IT, stat.ML
Abstract URL: https://arxiv.org/abs/2404.01730
Pdf URL: https://arxiv.org/pdf/2404.01730
Copy Paste: [[2404.01730]] Asymptotics of Language Model Alignment(https://arxiv.org/abs/2404.01730)
Keywords: generative
Abstract: Let $p$ denote a generative language model. Let $r$ denote a reward model that returns a scalar that captures the degree at which a draw from $p$ is preferred. The goal of language model alignment is to alter $p$ to a new distribution $\phi$ that results in a higher expected reward while keeping $\phi$ close to $p.$ A popular alignment method is the KL-constrained reinforcement learning (RL), which chooses a distribution $\phi_\Delta$ that maximizes $E_{\phi_{\Delta}} r(y)$ subject to a relative entropy constraint $KL(\phi_\Delta || p) \leq \Delta.$ Another simple alignment method is best-of-$N$, where $N$ samples are drawn from $p$ and one with highest reward is selected. In this paper, we offer a closed-form characterization of the optimal KL-constrained RL solution. We demonstrate that any alignment method that achieves a comparable trade-off between KL divergence and reward must approximate the optimal KL-constrained RL solution in terms of relative entropy. To further analyze the properties of alignment methods, we introduce two simplifying assumptions: we let the language model be memoryless, and the reward model be linear. Although these assumptions may not reflect complex real-world scenarios, they enable a precise characterization of the asymptotic behavior of both the best-of-$N$ alignment, and the KL-constrained RL method, in terms of information-theoretic quantities. We prove that the reward of the optimal KL-constrained RL solution satisfies a large deviation principle, and we fully characterize its rate function. We also show that the rate of growth of the scaled cumulants of the reward is characterized by a proper Renyi cross entropy. Finally, we show that best-of-$N$ is asymptotically equivalent to KL-constrained RL solution by proving that their expected rewards are asymptotically equal, and concluding that the two distributions must be close in KL divergence.

Title: T-VSL: Text-Guided Visual Sound Source Localization in Mixtures

Authors: Tanvir Mahmud, Yapeng Tian, Diana Marculescu
Subjects: cs.CV, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2404.01751
Pdf URL: https://arxiv.org/pdf/2404.01751
Copy Paste: [[2404.01751]] T-VSL: Text-Guided Visual Sound Source Localization in Mixtures(https://arxiv.org/abs/2404.01751)
Keywords: self-supervised
Abstract: Visual sound source localization poses a significant challenge in identifying the semantic region of each sounding source within a video. Existing self-supervised and weakly supervised source localization methods struggle to accurately distinguish the semantic regions of each sounding object, particularly in multi-source mixtures. These methods often rely on audio-visual correspondence as guidance, which can lead to substantial performance drops in complex multi-source localization scenarios. The lack of access to individual source sounds in multi-source mixtures during training exacerbates the difficulty of learning effective audio-visual correspondence for localization. To address this limitation, in this paper, we propose incorporating the text modality as an intermediate feature guide using tri-modal joint embedding models (e.g., AudioCLIP) to disentangle the semantic audio-visual source correspondence in multi-source mixtures. Our framework, dubbed T-VSL, begins by predicting the class of sounding entities in mixtures. Subsequently, the textual representation of each sounding source is employed as guidance to disentangle fine-grained audio-visual source correspondence from multi-source mixtures, leveraging the tri-modal AudioCLIP embedding. This approach enables our framework to handle a flexible number of sources and exhibits promising zero-shot transferability to unseen classes during test time. Extensive experiments conducted on the MUSIC, VGGSound, and VGGSound-Instruments datasets demonstrate significant performance improvements over state-of-the-art methods.

Title: Generative AI-Based Text Generation Methods Using Pre-Trained GPT-2 Model

Authors: Rohit Pandey, Hetvi Waghela, Sneha Rakshit, Aparna Rangari, Anjali Singh, Rahul Kumar, Ratnadeep Ghosal, Jaydip Sen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.01786
Pdf URL: https://arxiv.org/pdf/2404.01786
Copy Paste: [[2404.01786]] Generative AI-Based Text Generation Methods Using Pre-Trained GPT-2 Model(https://arxiv.org/abs/2404.01786)
Keywords: generative
Abstract: This work delved into the realm of automatic text generation, exploring a variety of techniques ranging from traditional deterministic approaches to more modern stochastic methods. Through analysis of greedy search, beam search, top-k sampling, top-p sampling, contrastive searching, and locally typical searching, this work has provided valuable insights into the strengths, weaknesses, and potential applications of each method. Each text-generating method is evaluated using several standard metrics and a comparative study has been made on the performance of the approaches. Finally, some future directions of research in the field of automatic text generation are also identified.

Title: Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Authors: Xu He, Qiaochu Huang, Zhensong Zhang, Zhiwei Lin, Zhiyong Wu, Sicheng Yang, Minglei Li, Zhiyi Chen, Songcen Xu, Xiaofei Wu
Subjects: cs.CV, cs.HC, cs.MM
Abstract URL: https://arxiv.org/abs/2404.01862
Pdf URL: https://arxiv.org/pdf/2404.01862
Copy Paste: [[2404.01862]] Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model(https://arxiv.org/abs/2404.01862)
Keywords: diffusion
Abstract: Co-speech gestures, if presented in the lively form of videos, can achieve superior visual effects in human-machine interaction. While previous works mostly generate structural human skeletons, resulting in the omission of appearance information, we focus on the direct generation of audio-driven co-speech gesture videos in this work. There are two main challenges: 1) A suitable motion feature is needed to describe complex human movements with crucial appearance information. 2) Gestures and speech exhibit inherent dependencies and should be temporally aligned even of arbitrary length. To solve these problems, we present a novel motion-decoupled framework to generate co-speech gesture videos. Specifically, we first introduce a well-designed nonlinear TPS transformation to obtain latent motion features preserving essential appearance information. Then a transformer-based diffusion model is proposed to learn the temporal correlation between gestures and speech, and performs generation in the latent motion space, followed by an optimal motion selection module to produce long-term coherent and consistent gesture videos. For better visual perception, we further design a refinement network focusing on missing details of certain areas. Extensive experimental results show that our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations. Our code, demos, and more resources are available at https://github.com/thuhcsi/S2G-MDDiffusion.

Title: Real, fake and synthetic faces - does the coin have three sides?

Authors: Shahzeb Naeem, Ramzi Al-Sharawi, Muhammad Riyyan Khan, Usman Tariq, Abhinav Dhall, Hasan Al-Nashash
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2404.01878
Pdf URL: https://arxiv.org/pdf/2404.01878
Copy Paste: [[2404.01878]] Real, fake and synthetic faces - does the coin have three sides?(https://arxiv.org/abs/2404.01878)
Keywords: generative
Abstract: With the ever-growing power of generative artificial intelligence, deepfake and artificially generated (synthetic) media have continued to spread online, which creates various ethical and moral concerns regarding their usage. To tackle this, we thus present a novel exploration of the trends and patterns observed in real, deepfake and synthetic facial images. The proposed analysis is done in two parts: firstly, we incorporate eight deep learning models and analyze their performances in distinguishing between the three classes of images. Next, we look to further delve into the similarities and differences between these three sets of images by investigating their image properties both in the context of the entire image as well as in the context of specific regions within the image. ANOVA test was also performed and provided further clarity amongst the patterns associated between the images of the three classes. From our findings, we observe that the investigated deeplearning models found it easier to detect synthetic facial images, with the ViT Patch-16 model performing best on this task with a class-averaged sensitivity, specificity, precision, and accuracy of 97.37%, 98.69%, 97.48%, and 98.25%, respectively. This observation was supported by further analysis of various image properties. We saw noticeable differences across the three category of images. This analysis can help us build better algorithms for facial image generation, and also shows that synthetic, deepfake and real face images are indeed three different classes.

Title: Bi-LORA: A Vision-Language Approach for Synthetic Image Detection

Authors: Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdenour Hadid, Abdelmalik Taleb-Ahmed
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2404.01959
Pdf URL: https://arxiv.org/pdf/2404.01959
Copy Paste: [[2404.01959]] Bi-LORA: A Vision-Language Approach for Synthetic Image Detection(https://arxiv.org/abs/2404.01959)
Keywords: diffusion, generative
Abstract: Advancements in deep image synthesis techniques, such as generative adversarial networks (GANs) and diffusion models (DMs), have ushered in an era of generating highly realistic images. While this technological progress has captured significant interest, it has also raised concerns about the potential difficulty in distinguishing real images from their synthetic counterparts. This paper takes inspiration from the potent convergence capabilities between vision and language, coupled with the zero-shot nature of vision-language models (VLMs). We introduce an innovative method called Bi-LORA that leverages VLMs, combined with low-rank adaptation (LORA) tuning techniques, to enhance the precision of synthetic image detection for unseen model-generated images. The pivotal conceptual shift in our methodology revolves around reframing binary classification as an image captioning task, leveraging the distinctive capabilities of cutting-edge VLM, notably bootstrapping language image pre-training (BLIP2). Rigorous and comprehensive experiments are conducted to validate the effectiveness of our proposed approach, particularly in detecting unseen diffusion-generated images from unknown diffusion-based generative models during training, showcasing robustness to noise, and demonstrating generalization capabilities to GANs. The obtained results showcase an impressive average accuracy of 93.41% in synthetic image detection on unseen generation models. The code and models associated with this research can be publicly accessed at https://github.com/Mamadou-Keita/VLM-DETECT.

Title: Team UTSA-NLP at SemEval 2024 Task 5: Prompt Ensembling for Argument Reasoning in Civil Procedures with GPT4

Authors: Dan Schumacher, Anthony Rios
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.01961
Pdf URL: https://arxiv.org/pdf/2404.01961
Copy Paste: [[2404.01961]] Team UTSA-NLP at SemEval 2024 Task 5: Prompt Ensembling for Argument Reasoning in Civil Procedures with GPT4(https://arxiv.org/abs/2404.01961)
Keywords: in-context
Abstract: In this paper, we present our system for the SemEval Task 5, The Legal Argument Reasoning Task in Civil Procedure Challenge. Legal argument reasoning is an essential skill that all law students must master. Moreover, it is important to develop natural language processing solutions that can reason about a question given terse domain-specific contextual information. Our system explores a prompt-based solution using GPT4 to reason over legal arguments. We also evaluate an ensemble of prompting strategies, including chain-of-thought reasoning and in-context learning. Overall, our system results in a Macro F1 of .8095 on the validation dataset and .7315 (5th out of 21 teams) on the final test set. Code for this project is available at https://github.com/danschumac1/CivilPromptReasoningGPT4.

Title: Fashion Style Editing with Generative Human Prior

Authors: Chaerin Kong, Seungyong Lee, Soohyeok Im, Wonsuk Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.01984
Pdf URL: https://arxiv.org/pdf/2404.01984
Copy Paste: [[2404.01984]] Fashion Style Editing with Generative Human Prior(https://arxiv.org/abs/2404.01984)
Keywords: generative
Abstract: Image editing has been a long-standing challenge in the research community with its far-reaching impact on numerous applications. Recently, text-driven methods started to deliver promising results in domains like human faces, but their applications to more complex domains have been relatively limited. In this work, we explore the task of fashion style editing, where we aim to manipulate the fashion style of human imagery using text descriptions. Specifically, we leverage a generative human prior and achieve fashion style editing by navigating its learned latent space. We first verify that the existing text-driven editing methods fall short for our problem due to their overly simplified guidance signal, and propose two directions to reinforce the guidance: textual augmentation and visual referencing. Combined with our empirical findings on the latent space structure, our Fashion Style Editing framework (FaSE) successfully projects abstract fashion concepts onto human images and introduces exciting new applications to the field.

Title: DELAN: Dual-Level Alignment for Vision-and-Language Navigation by Cross-Modal Contrastive Learning

Authors: Mengfei Du, Binhao Wu, Jiwen Zhang, Zhihao Fan, Zejun Li, Ruipu Luo, Xuanjing Huang, Zhongyu Wei
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2404.01994
Pdf URL: https://arxiv.org/pdf/2404.01994
Copy Paste: [[2404.01994]] DELAN: Dual-Level Alignment for Vision-and-Language Navigation by Cross-Modal Contrastive Learning(https://arxiv.org/abs/2404.01994)
Keywords: self-supervised
Abstract: Vision-and-Language navigation (VLN) requires an agent to navigate in unseen environment by following natural language instruction. For task completion, the agent needs to align and integrate various navigation modalities, including instruction, observation and navigation history. Existing works primarily concentrate on cross-modal attention at the fusion stage to achieve this objective. Nevertheless, modality features generated by disparate uni-encoders reside in their own spaces, leading to a decline in the quality of cross-modal fusion and decision. To address this problem, we propose a Dual-levEL AligNment (DELAN) framework by cross-modal contrastive learning. This framework is designed to align various navigation-related modalities before fusion, thereby enhancing cross-modal interaction and action decision-making. Specifically, we divide the pre-fusion alignment into dual levels: instruction-history level and landmark-observation level according to their semantic correlations. We also reconstruct a dual-level instruction for adaptation to the dual-level alignment. As the training signals for pre-fusion alignment are extremely limited, self-supervised contrastive learning strategies are employed to enforce the matching between different modalities. Our approach seamlessly integrates with the majority of existing models, resulting in improved navigation performance on various VLN benchmarks, including R2R, R4R, RxR and CVDN.

Title: Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context

Authors: Antoine Caubrière, Elodie Gauthier
Subjects: cs.CL, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2404.02000
Pdf URL: https://arxiv.org/pdf/2404.02000
Copy Paste: [[2404.02000]] Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context(https://arxiv.org/abs/2404.02000)
Keywords: self-supervised
Abstract: We present the first self-supervised multilingual speech model trained exclusively on African speech. The model learned from nearly 60 000 hours of unlabeled speech segments in 21 languages and dialects spoken in sub-Saharan Africa. On the SSA subset of the FLEURS-102 dataset, our approach based on a HuBERT$_{base}$ (0.09B) architecture shows competitive results, for ASR downstream task, compared to the w2v-bert-51 (0.6B) pre-trained model proposed in the FLEURS benchmark, while being more efficient by using 7x less data and 6x less parameters. Furthermore, in the context of a LID downstream task, our approach outperforms FLEURS baselines accuracy by over 22\%.

Title: AUTODIFF: Autoregressive Diffusion Modeling for Structure-based Drug Design

Authors: Xinze Li, Penglei Wang, Tianfan Fu, Wenhao Gao, Chengtao Li, Leilei Shi, Junhong Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2404.02003
Pdf URL: https://arxiv.org/pdf/2404.02003
Copy Paste: [[2404.02003]] AUTODIFF: Autoregressive Diffusion Modeling for Structure-based Drug Design(https://arxiv.org/abs/2404.02003)
Keywords: diffusion
Abstract: Structure-based drug design (SBDD), which aims to generate molecules that can bind tightly to the target protein, is an essential problem in drug discovery, and previous approaches have achieved initial success. However, most existing methods still suffer from invalid local structure or unrealistic conformation issues, which are mainly due to the poor leaning of bond angles or torsional angles. To alleviate these problems, we propose AUTODIFF, a diffusion-based fragment-wise autoregressive generation model. Specifically, we design a novel molecule assembly strategy named conformal motif that preserves the conformation of local structures of molecules first, then we encode the interaction of the protein-ligand complex with an SE(3)-equivariant convolutional network and generate molecules motif-by-motif with diffusion modeling. In addition, we also improve the evaluation framework of SBDD by constraining the molecular weights of the generated molecules in the same range, together with some new metrics, which make the evaluation more fair and practical. Extensive experiments on CrossDocked2020 demonstrate that our approach outperforms the existing models in generating realistic molecules with valid structures and conformations while maintaining high binding affinity.

Title: SelfPose3d: Self-Supervised Multi-Person Multi-View 3d Pose Estimation

Authors: Vinkle Srivastav, Keqi Chen, Nicolas Padoy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.02041
Pdf URL: https://arxiv.org/pdf/2404.02041
Copy Paste: [[2404.02041]] SelfPose3d: Self-Supervised Multi-Person Multi-View 3d Pose Estimation(https://arxiv.org/abs/2404.02041)
Keywords: self-supervised
Abstract: We present a new self-supervised approach, SelfPose3d, for estimating 3d poses of multiple persons from multiple camera views. Unlike current state-of-the-art fully-supervised methods, our approach does not require any 2d or 3d ground-truth poses and uses only the multi-view input images from a calibrated camera setup and 2d pseudo poses generated from an off-the-shelf 2d human pose estimator. We propose two self-supervised learning objectives: self-supervised person localization in 3d space and self-supervised 3d pose estimation. We achieve self-supervised 3d person localization by training the model on synthetically generated 3d points, serving as 3d person root positions, and on the projected root-heatmaps in all the views. We then model the 3d poses of all the localized persons with a bottleneck representation, map them onto all views obtaining 2d joints, and render them using 2d Gaussian heatmaps in an end-to-end differentiable manner. Afterwards, we use the corresponding 2d joints and heatmaps from the pseudo 2d poses for learning. To alleviate the intrinsic inaccuracy of the pseudo labels, we propose an adaptive supervision attention mechanism to guide the self-supervision. Our experiments and analysis on three public benchmark datasets, including Panoptic, Shelf, and Campus, show the effectiveness of our approach, which is comparable to fully-supervised methods. Code is available at \url{https://github.com/CAMMA-public/SelfPose3D}

Title: Universal representations for financial transactional data: embracing local, global, and external contexts

Authors: Alexandra Bazarova, Maria Kovaleva, Ilya Kuleshov, Evgenia Romanenkova, Alexander Stepikin, Alexandr Yugay, Dzhambulat Mollaev, Ivan Kireev, Andrey Savchenko, Alexey Zaytsev
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2404.02047
Pdf URL: https://arxiv.org/pdf/2404.02047
Copy Paste: [[2404.02047]] Universal representations for financial transactional data: embracing local, global, and external contexts(https://arxiv.org/abs/2404.02047)
Keywords: generative
Abstract: Effective processing of financial transactions is essential for banking data analysis. However, in this domain, most methods focus on specialized solutions to stand-alone problems instead of constructing universal representations suitable for many problems. We present a representation learning framework that addresses diverse business challenges. We also suggest novel generative models that account for data specifics, and a way to integrate external information into a client's representation, leveraging insights from other customers' actions. Finally, we offer a benchmark, describing representation quality globally, concerning the entire transaction history; locally, reflecting the client's current state; and dynamically, capturing representation evolution over time. Our generative approach demonstrates superior performance in local tasks, with an increase in ROC-AUC of up to 14\% for the next MCC prediction task and up to 46\% for downstream tasks from existing contrastive baselines. Incorporating external information improves the scores by an additional 20\%.

Title: Deconstructing In-Context Learning: Understanding Prompts via Corruption

Authors: Namrata Shivagunde, Vladislav Lialin, Sherin Muckatira, Anna Rumshisky
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.02054
Pdf URL: https://arxiv.org/pdf/2404.02054
Copy Paste: [[2404.02054]] Deconstructing In-Context Learning: Understanding Prompts via Corruption(https://arxiv.org/abs/2404.02054)
Keywords: in-context
Abstract: The ability of large language models (LLMs) to "learn in context" based on the provided prompt has led to an explosive growth in their use, culminating in the proliferation of AI assistants such as ChatGPT, Claude, and Bard. These AI assistants are known to be robust to minor prompt modifications, mostly due to alignment techniques that use human feedback. In contrast, the underlying pre-trained LLMs they use as a backbone are known to be brittle in this respect. Building high-quality backbone models remains a core challenge, and a common approach to assessing their quality is to conduct few-shot evaluation. Such evaluation is notorious for being highly sensitive to minor prompt modifications, as well as the choice of specific in-context examples. Prior work has examined how modifying different elements of the prompt can affect model performance. However, these earlier studies tended to concentrate on a limited number of specific prompt attributes and often produced contradictory results. Additionally, previous research either focused on models with fewer than 15 billion parameters or exclusively examined black-box models like GPT-3 or PaLM, making replication challenging. In the present study, we decompose the entire prompt into four components: task description, demonstration inputs, labels, and inline instructions provided for each demonstration. We investigate the effects of structural and semantic corruptions of these elements on model performance. We study models ranging from 1.5B to 70B in size, using ten datasets covering classification and generation tasks. We find that repeating text within the prompt boosts model performance, and bigger models ($\geq$30B) are more sensitive to the semantics of the prompt. Finally, we observe that adding task and inline instructions to the demonstrations enhances model performance even when the instructions are semantically corrupted.

Title: Long-context LLMs Struggle with Long In-context Learning

Authors: Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.02060
Pdf URL: https://arxiv.org/pdf/2404.02060
Copy Paste: [[2404.02060]] Long-context LLMs Struggle with Long In-context Learning(https://arxiv.org/abs/2404.02060)
Keywords: in-context
Abstract: Large Language Models (LLMs) have made significant strides in handling long sequences exceeding 32K tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their abilities in more nuanced, real-world scenarios. This study introduces a specialized benchmark (LIConBench) focusing on long in-context learning within the realm of extreme-label classification. We meticulously selected six datasets with a label range spanning 28 to 174 classes covering different input (few-shot demonstration) length from 2K to 50K. Our benchmark requires LLMs to comprehend the entire input to recognize the massive label spaces to make correct prediction. We evaluate 13 long-context LLMs on our benchmarks. We find that the long-context LLMs perform relatively well under the token length of 20K and the performance benefits from utilizing the long context window. However, after the context window exceeds 20K, most LLMs except GPT-4 will dip dramatically. This suggests a notable gap in current LLM capabilities for processing and understanding long, context-rich sequences. Further analysis revealed a tendency among models to favor predictions for labels presented towards the end at the sequence. Their ability to reason over multiple pieces in the long sequence is yet to be improved. Our study reveals that long context understanding and reasoning is still a challenging task for the existing LLMs. We believe LIConBench could serve as a more realistic evaluation for the future long context LLMs.

Title: Red-Teaming Segment Anything Model

Authors: Krzysztof Jankowski, Bartlomiej Sobieski, Mateusz Kwiatkowski, Jakub Szulc, Michal Janik, Hubert Baniecki, Przemyslaw Biecek
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2404.02067
Pdf URL: https://arxiv.org/pdf/2404.02067
Copy Paste: [[2404.02067]] Red-Teaming Segment Anything Model(https://arxiv.org/abs/2404.02067)
Keywords: foundation model
Abstract: Foundation models have emerged as pivotal tools, tackling many complex tasks through pre-training on vast datasets and subsequent fine-tuning for specific applications. The Segment Anything Model is one of the first and most well-known foundation models for computer vision segmentation tasks. This work presents a multi-faceted red-teaming analysis that tests the Segment Anything Model against challenging tasks: (1) We analyze the impact of style transfer on segmentation masks, demonstrating that applying adverse weather conditions and raindrops to dashboard images of city roads significantly distorts generated masks. (2) We focus on assessing whether the model can be used for attacks on privacy, such as recognizing celebrities' faces, and show that the model possesses some undesired knowledge in this task. (3) Finally, we check how robust the model is to adversarial attacks on segmentation masks under text prompts. We not only show the effectiveness of popular white-box attacks and resistance to black-box attacks but also introduce a novel approach - Focused Iterative Gradient Attack (FIGA) that combines white-box approaches to construct an efficient attack resulting in a smaller number of modified pixels. All of our testing methods and analyses indicate a need for enhanced safety measures in foundation models for image segmentation.

Title: WcDT: World-centric Diffusion Transformer for Traffic Scene Generation

Authors: Chen Yang, Aaron Xuxiang Tian, Dong Chen, Tianyu Shi, Arsalan Heydarian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.02082
Pdf URL: https://arxiv.org/pdf/2404.02082
Copy Paste: [[2404.02082]] WcDT: World-centric Diffusion Transformer for Traffic Scene Generation(https://arxiv.org/abs/2404.02082)
Keywords: diffusion
Abstract: In this paper, we introduce a novel approach for autonomous driving trajectory generation by harnessing the complementary strengths of diffusion probabilistic models (a.k.a., diffusion models) and transformers. Our proposed framework, termed the "World-Centric Diffusion Transformer" (WcDT), optimizes the entire trajectory generation process, from feature extraction to model inference. To enhance the scene diversity and stochasticity, the historical trajectory data is first preprocessed and encoded into latent space using Denoising Diffusion Probabilistic Models (DDPM) enhanced with Diffusion with Transformer (DiT) blocks. Then, the latent features, historical trajectories, HD map features, and historical traffic signal information are fused with various transformer-based encoders. The encoded traffic scenes are then decoded by a trajectory decoder to generate multimodal future trajectories. Comprehensive experimental results show that the proposed approach exhibits superior performance in generating both realistic and diverse trajectories, showing its potential for integration into automatic driving simulation systems.

Title: Adaptive Feature Fusion Neural Network for Glaucoma Segmentation on Unseen Fundus Images

Authors: Jiyuan Zhong, Hu Ke, Ming Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.02084
Pdf URL: https://arxiv.org/pdf/2404.02084
Copy Paste: [[2404.02084]] Adaptive Feature Fusion Neural Network for Glaucoma Segmentation on Unseen Fundus Images(https://arxiv.org/abs/2404.02084)
Keywords: self-supervised
Abstract: Fundus image segmentation on unseen domains is challenging, especially for the over-parameterized deep models trained on the small medical datasets. To address this challenge, we propose a method named Adaptive Feature-fusion Neural Network (AFNN) for glaucoma segmentation on unseen domains, which mainly consists of three modules: domain adaptor, feature-fusion network, and self-supervised multi-task learning. Specifically, the domain adaptor helps the pretrained-model fast adapt from other image domains to the medical fundus image domain. Feature-fusion network and self-supervised multi-task learning for the encoder and decoder are introduced to improve the domain generalization ability. In addition, we also design the weighted-dice-loss to improve model performance on complex optic-cup segmentation tasks. Our proposed method achieves a competitive performance over existing fundus segmentation methods on four public glaucoma datasets.

Title: BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition

Authors: Alexandros Haliassos, Andreas Zinonos, Rodrigo Mira, Stavros Petridis, Maja Pantic
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.02098
Pdf URL: https://arxiv.org/pdf/2404.02098
Copy Paste: [[2404.02098]] BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition(https://arxiv.org/abs/2404.02098)
Keywords: self-supervised
Abstract: Self-supervision has recently shown great promise for learning visual and auditory speech representations from unlabelled data. In this work, we propose BRAVEn, an extension to the recent RAVEn method, which learns speech representations entirely from raw audio-visual data. Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods in various settings. Moreover, we observe favourable scaling behaviour by increasing the amount of unlabelled data well beyond other self-supervised works. In particular, we achieve 20.0% / 1.7% word error rate for VSR / ASR on the LRS3 test set, with only 30 hours of labelled data and no external ASR models. Our results suggest that readily available unlabelled audio-visual data can largely replace costly transcribed data.

Title: Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language Models

Authors: Wanyong Feng, Jaewook Lee, Hunter McNichols, Alexander Scarlatos, Digory Smith, Simon Woodhead, Nancy Otero Ornelas, Andrew Lan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.02124
Pdf URL: https://arxiv.org/pdf/2404.02124
Copy Paste: [[2404.02124]] Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language Models(https://arxiv.org/abs/2404.02124)
Keywords: in-context
Abstract: Multiple-choice questions (MCQs) are ubiquitous in almost all levels of education since they are easy to administer, grade, and are a reliable format in assessments and practices. One of the most important aspects of MCQs is the distractors, i.e., incorrect options that are designed to target common errors or misconceptions among real students. To date, the task of crafting high-quality distractors largely remains a labor and time-intensive process for teachers and learning content designers, which has limited scalability. In this work, we study the task of automated distractor generation in the domain of math MCQs and explore a wide variety of large language model (LLM)-based approaches, from in-context learning to fine-tuning. We conduct extensive experiments using a real-world math MCQ dataset and find that although LLMs can generate some mathematically valid distractors, they are less adept at anticipating common errors or misconceptions among real students.

Title: 3D Congealing: 3D-Aware Image Alignment in the Wild

Authors: Yunzhi Zhang, Zizhang Li, Amit Raj, Andreas Engelhardt, Yuanzhen Li, Tingbo Hou, Jiajun Wu, Varun Jampani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.02125
Pdf URL: https://arxiv.org/pdf/2404.02125
Copy Paste: [[2404.02125]] 3D Congealing: 3D-Aware Image Alignment in the Wild(https://arxiv.org/abs/2404.02125)
Keywords: generative
Abstract: We propose 3D Congealing, a novel problem of 3D-aware alignment for 2D images capturing semantically similar objects. Given a collection of unlabeled Internet images, our goal is to associate the shared semantic parts from the inputs and aggregate the knowledge from 2D images to a shared 3D canonical space. We introduce a general framework that tackles the task without assuming shape templates, poses, or any camera parameters. At its core is a canonical 3D representation that encapsulates geometric and semantic information. The framework optimizes for the canonical representation together with the pose for each input image, and a per-image coordinate map that warps 2D pixel coordinates to the 3D canonical frame to account for the shape matching. The optimization procedure fuses prior knowledge from a pre-trained image generative model and semantic information from input images. The former provides strong knowledge guidance for this under-constraint task, while the latter provides the necessary information to mitigate the training data bias from the pre-trained model. Our framework can be used for various tasks such as correspondence matching, pose estimation, and image editing, achieving strong results on real-world image datasets under challenging illumination conditions and on in-the-wild online image collections.

Title: Diffusion$^2$: Dynamic 3D Content Generation via Score Composition of Orthogonal Diffusion Models

Authors: Zeyu Yang, Zijie Pan, Chun Gu, Li Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.02148
Pdf URL: https://arxiv.org/pdf/2404.02148
Copy Paste: [[2404.02148]] Diffusion$^2$: Dynamic 3D Content Generation via Score Composition of Orthogonal Diffusion Models(https://arxiv.org/abs/2404.02148)
Keywords: diffusion
Abstract: Recent advancements in 3D generation are predominantly propelled by improvements in 3D-aware image diffusion models which are pretrained on Internet-scale image data and fine-tuned on massive 3D data, offering the capability of producing highly consistent multi-view images. However, due to the scarcity of synchronized multi-view video data, it is impractical to adapt this paradigm to 4D generation directly. Despite that, the available video and 3D data are adequate for training video and multi-view diffusion models that can provide satisfactory dynamic and geometric priors respectively. In this paper, we present Diffusion$^2$, a novel framework for dynamic 3D content creation that leverages the knowledge about geometric consistency and temporal smoothness from these models to directly sample dense multi-view and multi-frame images which can be employed to optimize continuous 4D representation. Specifically, we design a simple yet effective denoising strategy via score composition of video and multi-view diffusion models based on the probability structure of the images to be generated. Owing to the high parallelism of the image generation and the efficiency of the modern 4D reconstruction pipeline, our framework can generate 4D content within few minutes. Furthermore, our method circumvents the reliance on 4D data, thereby having the potential to benefit from the scalability of the foundation video and multi-view diffusion models. Extensive experiments demonstrate the efficacy of our proposed framework and its capability to flexibly adapt to various types of prompts.

Title: Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

Authors: Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
Subjects: cs.CR, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2404.02151
Pdf URL: https://arxiv.org/pdf/2404.02151
Copy Paste: [[2404.02151]] Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks(https://arxiv.org/abs/2404.02151)
Keywords: in-context
Abstract: We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. First, we demonstrate how to successfully leverage access to logprobs for jailbreaking: we initially design an adversarial prompt template (sometimes adapted to the target LLM), and then we apply random search on a suffix to maximize the target logprob (e.g., of the token "Sure"), potentially with multiple restarts. In this way, we achieve nearly 100\% attack success rate -- according to GPT-4 as a judge -- on GPT-3.5/4, Llama-2-Chat-7B/13B/70B, Gemma-7B, and R2D2 from HarmBench that was adversarially trained against the GCG attack. We also show how to jailbreak all Claude models -- that do not expose logprobs -- via either a transfer or prefilling attack with 100\% success rate. In addition, we show how to use random search on a restricted set of tokens for finding trojan strings in poisoned models -- a task that shares many similarities with jailbreaking -- which is the algorithm that brought us the first place in the SaTML'24 Trojan Detection Competition. The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates (e.g., R2D2 is very sensitive to in-context learning prompts), some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection). We provide the code, prompts, and logs of the attacks at https://github.com/tml-epfl/llm-adaptive-attacks.

Title: GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image

Authors: Chong Bao, Yinda Zhang, Yuan Li, Xiyu Zhang, Bangbang Yang, Hujun Bao, Marc Pollefeys, Guofeng Zhang, Zhaopeng Cui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.02152
Pdf URL: https://arxiv.org/pdf/2404.02152
Copy Paste: [[2404.02152]] GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image(https://arxiv.org/abs/2404.02152)
Keywords: generative
Abstract: Recently, we have witnessed the explosive growth of various volumetric representations in modeling animatable head avatars. However, due to the diversity of frameworks, there is no practical method to support high-level applications like 3D head avatar editing across different representations. In this paper, we propose a generic avatar editing approach that can be universally applied to various 3DMM driving volumetric head avatars. To achieve this goal, we design a novel expression-aware modification generative model, which enables lift 2D editing from a single image to a consistent 3D modification field. To ensure the effectiveness of the generative modification process, we develop several techniques, including an expression-dependent modification distillation scheme to draw knowledge from the large-scale head avatar model and 2D facial texture editing tools, implicit latent space guidance to enhance model convergence, and a segmentation-based loss reweight strategy for fine-grained texture inversion. Extensive experiments demonstrate that our method delivers high-quality and consistent results across multiple expression and viewpoints. Project page: https://zju3dv.github.io/geneavatar/