2024-03-15

Title: Veagle: Advancements in Multimodal Representation Learning

Authors: Rajat Chawla, Arkajit Datta, Tushar Verma, Adarsh Jha, Anmol Gautam, Ayush Vatsal, Sukrit Chaterjee, Mukunda NS, Ishaan Bhola
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2403.08773
Pdf URL: https://arxiv.org/pdf/2403.08773
Copy Paste: [[2403.08773]] Veagle: Advancements in Multimodal Representation Learning(https://arxiv.org/abs/2403.08773)
Keywords: large language model
Abstract: Lately, researchers in artificial intelligence have been really interested in how language and vision come together, giving rise to the development of multimodal models that aim to seamlessly integrate textual and visual information. Multimodal models, an extension of Large Language Models (LLMs), have exhibited remarkable capabilities in addressing a diverse array of tasks, ranging from image captioning and visual question answering (VQA) to visual grounding. While these models have showcased significant advancements, challenges persist in accurately interpreting images and answering the question, a common occurrence in real-world scenarios. This paper introduces a novel approach to enhance the multimodal capabilities of existing models. In response to the limitations observed in current Vision Language Models (VLMs) and Multimodal Large Language Models (MLLMs), our proposed model, Veagle, incorporates a unique mechanism inspired by the successes and insights of previous works. Veagle leverages a dynamic mechanism to project encoded visual information directly into the language model. This dynamic approach allows for a more nuanced understanding of intricate details present in visual contexts. To validate the effectiveness of Veagle, we conduct comprehensive experiments on benchmark datasets, emphasizing tasks such as visual question answering and image understanding. Our results indicate an improvement of 5-6% in performance, with Veagle outperforming existing models by a notable margin. The outcomes underscore the model's versatility and applicability beyond traditional benchmarks.

Title: Procedural terrain generation with style transfer

Authors: Fabio Merizzi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.08782
Pdf URL: https://arxiv.org/pdf/2403.08782
Copy Paste: [[2403.08782]] Procedural terrain generation with style transfer(https://arxiv.org/abs/2403.08782)
Keywords: generative
Abstract: In this study we introduce a new technique for the generation of terrain maps, exploiting a combination of procedural generation and Neural Style Transfer. We consider our approach to be a viable alternative to competing generative models, with our technique achieving greater versatility, lower hardware requirements and greater integration in the creative process of designers and developers. Our method involves generating procedural noise maps using either multi-layered smoothed Gaussian noise or the Perlin algorithm. We then employ an enhanced Neural Style transfer technique, drawing style from real-world height maps. This fusion of algorithmic generation and neural processing holds the potential to produce terrains that are not only diverse but also closely aligned with the morphological characteristics of real-world landscapes, with our process yielding consistent terrain structures with low computational cost and offering the capability to create customized maps. Numerical evaluations further validate our model's enhanced ability to accurately replicate terrain morphology, surpassing traditional procedural methods.
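Example (not from the paper): a minimal sketch of the "multi-layered smoothed Gaussian noise" idea mentioned in the abstract, assuming each layer is white Gaussian noise smoothed by a wrap-around box blur and that layers are summed with weights equal to their blur radius. The function names, the radii, and the weighting are illustrative assumptions; the paper's actual noise generator and its style-transfer stage are not reproduced here.

```python
import random


def gaussian_grid(size, rng):
    """Return a size x size grid of independent standard-normal samples."""
    return [[rng.gauss(0.0, 1.0) for _ in range(size)] for _ in range(size)]


def box_blur(grid, radius):
    """Smooth a square grid with a (2r+1) x (2r+1) mean filter, wrapping at the edges."""
    n = len(grid)
    window = (2 * radius + 1) ** 2
    out = [[0.0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            total = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    total += grid[(y + dy) % n][(x + dx) % n]
            out[y][x] = total / window
    return out


def layered_noise(size=32, radii=(8, 4, 2, 1), seed=0):
    """Sum several smoothed noise layers into one height map (assumed weighting)."""
    rng = random.Random(seed)
    height = [[0.0] * size for _ in range(size)]
    for radius in radii:
        layer = box_blur(gaussian_grid(size, rng), radius)
        for y in range(size):
            for x in range(size):
                # Assumption: coarser layers (larger radius) get a larger weight.
                height[y][x] += radius * layer[y][x]
    return height
```

Calling layered_noise(size=32) returns a 32 x 32 list of floats that could serve as the procedural input map before any neural processing; fixing the seed makes the map reproducible.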

Title: Image-Text Out-Of-Context Detection Using Synthetic Multimodal Misinformation

Authors: Fatma Shalabi, Huy H. Nguyen, Hichem Felouat, Ching-Chun Chang, Isao Echizen
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2403.08783
Pdf URL: https://arxiv.org/pdf/2403.08783
Copy Paste: [[2403.08783]] Image-Text Out-Of-Context Detection Using Synthetic Multimodal Misinformation(https://arxiv.org/abs/2403.08783)
Keywords: robust
Abstract: Misinformation has become a major challenge in the era of increasing digital information, requiring the development of effective detection methods. We have investigated a novel approach to Out-Of-Context detection (OOCD) that uses synthetic data generation. We created a dataset specifically designed for OOCD and developed an efficient detector for accurate classification. Our experimental findings validate the use of synthetic data generation and demonstrate its efficacy in addressing the data limitations associated with OOCD. The dataset and detector should serve as valuable resources for future research and the development of robust misinformation detection systems.

Title: Verification for Object Detection -- IBP IoU

Authors: Noémie Cohen, Mélanie Ducoffe, Ryma Boumazouza (CRIL), Christophe Gabreau, Claire Pagetti, Xavier Pucel, Audrey Galametz
Subjects: cs.CV, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2403.08788
Pdf URL: https://arxiv.org/pdf/2403.08788
Copy Paste: [[2403.08788]] Verification for Object Detection -- IBP IoU(https://arxiv.org/abs/2403.08788)
Keywords: secure, robust
Abstract: We introduce a novel Interval Bound Propagation (IBP) approach for the formal verification of object detection models, specifically targeting the Intersection over Union (IoU) metric. The approach has been implemented in an open source code, named IBP IoU, compatible with popular abstract interpretation based verification tools. The resulting verifier is evaluated on landing approach runway detection and handwritten digit recognition case studies. Comparisons against a baseline (Vanilla IBP IoU) highlight the superior performance of IBP IoU in ensuring accuracy and stability, contributing to more secure and robust machine learning applications.
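Example (not from the paper): a minimal sketch of naive interval arithmetic applied to IoU, assuming axis-aligned boxes given as (x1, y1, x2, y2), an exact ground-truth box, and independent lower and upper bounds on each predicted coordinate. The function name iou_bounds is hypothetical. This bounds intersection and union separately and then divides, which is sound but loose; it is not the authors' IBP IoU implementation.

```python
def iou_bounds(pred_lo, pred_hi, gt):
    """Return (lower, upper) bounds on IoU between an uncertain box and an exact box.

    pred_lo, pred_hi: (x1, y1, x2, y2) lower and upper bounds on the predicted box.
    gt: (x1, y1, x2, y2) exact ground-truth box with positive area.
    """
    gx1, gy1, gx2, gy2 = gt

    def overlap(lo1, hi1, lo2, hi2, g1, g2):
        # Bounds on max(0, min(c2, g2) - max(c1, g1)) for c1 in [lo1, hi1], c2 in [lo2, hi2].
        low = max(0.0, min(lo2, g2) - max(hi1, g1))
        high = max(0.0, min(hi2, g2) - max(lo1, g1))
        return low, high

    w_lo, w_hi = overlap(pred_lo[0], pred_hi[0], pred_lo[2], pred_hi[2], gx1, gx2)
    h_lo, h_hi = overlap(pred_lo[1], pred_hi[1], pred_lo[3], pred_hi[3], gy1, gy2)
    inter_lo, inter_hi = w_lo * h_lo, w_hi * h_hi

    # Bounds on the predicted box area (side lengths clamped at zero).
    pw_lo = max(0.0, pred_lo[2] - pred_hi[0])
    pw_hi = max(0.0, pred_hi[2] - pred_lo[0])
    ph_lo = max(0.0, pred_lo[3] - pred_hi[1])
    ph_hi = max(0.0, pred_hi[3] - pred_lo[1])
    area_lo, area_hi = pw_lo * ph_lo, pw_hi * ph_hi

    gt_area = (gx2 - gx1) * (gy2 - gy1)
    # union = pred_area + gt_area - intersection, bounded term by term.
    union_lo = area_lo + gt_area - inter_hi
    union_hi = area_hi + gt_area - inter_lo

    iou_lo = inter_lo / union_hi if union_hi > 0 else 0.0
    iou_hi = min(1.0, inter_hi / union_lo) if union_lo > 0 else 1.0
    return iou_lo, iou_hi
```

With zero-width intervals the two bounds coincide with the exact IoU; widening the coordinate intervals widens the returned range, and a verifier would compare the lower bound against the required IoU threshold.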

Title: Bridging Human Concepts and Computer Vision for Explainable Face Verification

Authors: Miriam Doh (UMons, IRIDIA), Caroline Mazini Rodrigues (LRDE, LIGM), Nicolas Boutry (LRDE), Laurent Najman (LIGM), Matei Mancas (UMONS), Hugues Bersini (IRIDIA)
Subjects: cs.CV, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2403.08789
Pdf URL: https://arxiv.org/pdf/2403.08789
Copy Paste: [[2403.08789]] Bridging Human Concepts and Computer Vision for Explainable Face Verification(https://arxiv.org/abs/2403.08789)
Keywords: fair, interpretability, segmentation
Abstract: With Artificial Intelligence (AI) influencing the decision-making process of sensitive applications such as Face Verification, it is fundamental to ensure the transparency, fairness, and accountability of decisions. Although Explainable Artificial Intelligence (XAI) techniques exist to clarify AI decisions, it is equally important to provide interpretability of these decisions to humans. In this paper, we present an approach to combine computer and human vision to increase the interpretability of a face verification algorithm's explanations. In particular, we are inspired by the human perceptual process to understand how machines perceive the human-semantic areas of a face during face comparison tasks. We use Mediapipe, which provides a segmentation technique that identifies distinct human-semantic facial regions, enabling the machine's perception analysis. Additionally, we adapted two model-agnostic algorithms to provide human-interpretable insights into the decision-making processes.

Title: CoBra: Complementary Branch Fusing Class and Semantic Knowledge for Robust Weakly Supervised Semantic Segmentation

Authors: Woojung Han, Seil Kang, Kyobin Choo, Seong Jae Hwang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08801
Pdf URL: https://arxiv.org/pdf/2403.08801
Copy Paste: [[2403.08801]] CoBra: Complementary Branch Fusing Class and Semantic Knowledge for Robust Weakly Supervised Semantic Segmentation(https://arxiv.org/abs/2403.08801)
Keywords: robust, transformer, segmentation
Abstract: Leveraging semantically precise pseudo masks derived from image-level class knowledge for segmentation, namely image-level Weakly Supervised Semantic Segmentation (WSSS), still remains challenging. While Class Activation Maps (CAMs) using CNNs have steadily been contributing to the success of WSSS, the resulting activation maps often narrowly focus on class-specific parts (e.g., only face of human). On the other hand, recent works based on vision transformers (ViT) have shown promising results based on their self-attention mechanism to capture the semantic parts but fail in capturing complete class-specific details (e.g., entire body parts of human but also with a dog nearby). In this work, we propose Complementary Branch (CoBra), a novel dual branch framework consisting of two distinct architectures which provide valuable complementary knowledge of class (from CNN) and semantic (from ViT) to each branch. In particular, we learn Class-Aware Projection (CAP) for the CNN branch and Semantic-Aware Projection (SAP) for the ViT branch to explicitly fuse their complementary knowledge and facilitate a new type of extra patch-level supervision. Our model, through CoBra, fuses CNN and ViT's complementary outputs to create robust pseudo masks that integrate both class and semantic information effectively. Extensive experiments qualitatively and quantitatively investigate how CNN and ViT complement each other on the PASCAL VOC 2012 dataset, showing a state-of-the-art WSSS result. This includes not only the masks generated by our model, but also the segmentation results derived from utilizing these masks as pseudo labels.

Title: Adversarially Robust Deepfake Detection via Adversarial Feature Similarity Learning

Authors: Sarwar Khan
Subjects: cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2403.08806
Pdf URL: https://arxiv.org/pdf/2403.08806
Copy Paste: [[2403.08806]] Adversarially Robust Deepfake Detection via Adversarial Feature Similarity Learning(https://arxiv.org/abs/2403.08806)
Keywords: protect, defense, attack, robust
Abstract: Deepfake technology has raised concerns about the authenticity of digital content, necessitating the development of effective detection methods. However, the widespread availability of deepfakes has given rise to a new challenge in the form of adversarial attacks. Adversaries can manipulate deepfake videos with small, imperceptible perturbations that can deceive the detection models into producing incorrect outputs. To tackle this critical issue, we introduce Adversarial Feature Similarity Learning (AFSL), which integrates three fundamental deep feature learning paradigms. By optimizing the similarity between samples and weight vectors, our approach aims to distinguish between real and fake instances. Additionally, we aim to maximize the similarity between both adversarially perturbed examples and unperturbed examples, regardless of their real or fake nature. Moreover, we introduce a regularization technique that maximizes the dissimilarity between real and fake samples, ensuring a clear separation between these two categories. With extensive experiments on popular deepfake datasets, including FaceForensics++, FaceShifter, and DeeperForensics, the proposed method outperforms other standard adversarial training-based defense methods significantly. This further demonstrates the effectiveness of our approach to protecting deepfake detectors from adversarial attacks.

Title: Thermometer: Towards Universal Calibration for Large Language Models

Authors: Maohao Shen, Subhro Das, Kristjan Greenewald, Prasanna Sattigeri, Gregory Wornell, Soumya Ghosh
Subjects: cs.LG, cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2403.08819
Pdf URL: https://arxiv.org/pdf/2403.08819
Copy Paste: [[2403.08819]] Thermometer: Towards Universal Calibration for Large Language Models(https://arxiv.org/abs/2403.08819)
Keywords: large language model
Abstract: We consider the issue of calibration in large language models (LLMs). Recent studies have found that common interventions such as instruction tuning often result in poorly calibrated LLMs. Although calibration is well-explored in traditional applications, calibrating LLMs is uniquely challenging. These challenges stem as much from the severe computational requirements of LLMs as from their versatility, which allows them to be applied to diverse tasks. Addressing these challenges, we propose THERMOMETER, a calibration approach tailored to LLMs. THERMOMETER learns an auxiliary model, given data from multiple tasks, for calibrating an LLM. It is computationally efficient, preserves the accuracy of the LLM, and produces better-calibrated responses for new tasks. Extensive empirical evaluations across various benchmarks demonstrate the effectiveness of the proposed method.
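Example (not from the paper): the abstract does not spell out the calibration mechanism, so this sketch assumes standard temperature scaling, in which logits are divided by a scalar temperature before the softmax. The function name and the example logits are illustrative. Because dividing by a positive temperature never changes the ranking of the logits, the predicted class (and hence accuracy) is unchanged while confidence is softened or sharpened.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three answer options
sharp = softmax_with_temperature(logits, 1.0)
soft = softmax_with_temperature(logits, 2.0)
```

Here soft has a lower top probability than sharp but the same top-ranked option, which is the behaviour a calibration method needs when a model is overconfident.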

Title: Diet-ODIN: A Novel Framework for Opioid Misuse Detection with Interpretable Dietary Patterns

Authors: Zheyuan Zhang, Zehong Wang, Shifu Hou, Evan Hall, Landon Bachman, Vincent Galassi, Jasmine White, Nitesh V. Chawla, Chuxu Zhang, Yanfang Ye
Subjects: cs.LG, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2403.08820
Pdf URL: https://arxiv.org/pdf/2403.08820
Copy Paste: [[2403.08820]] Diet-ODIN: A Novel Framework for Opioid Misuse Detection with Interpretable Dietary Patterns(https://arxiv.org/abs/2403.08820)
Keywords: large language model
Abstract: The opioid crisis has been one of the most critical societal concerns in the United States. Although medication-assisted treatment (MAT) is recognized as the most effective treatment for opioid misuse and addiction, its various side effects can trigger opioid relapse. In addition to MAT, dietary nutrition intervention has demonstrated its importance in opioid misuse prevention and recovery. However, research on the alarming connections between dietary patterns and opioid misuse remains under-explored. In response to this gap, in this paper, we first establish, as a first attempt, a large-scale multifaceted dietary benchmark dataset related to opioid users and then develop a novel framework, Opioid Misuse Detection with Interpretable Dietary Patterns (Diet-ODIN), to bridge a heterogeneous graph (HG) and a large language model (LLM) for the identification of users with opioid misuse and the interpretation of their associated dietary patterns. Specifically, in Diet-ODIN, we first construct an HG to comprehensively incorporate both dietary and health-related information, and then we devise a holistic graph learning framework with noise reduction to fully capitalize on both users' individual dietary habits and shared dietary patterns for the detection of users with opioid misuse. To further delve into the intricate correlations between dietary patterns and opioid misuse, we exploit an LLM by utilizing the knowledge obtained from the graph learning model for interpretation. The extensive experimental results based on our established benchmark with quantitative and qualitative measures demonstrate the outstanding performance of Diet-ODIN in exploring the complex interplay between opioid misuse and dietary patterns, by comparison with state-of-the-art baseline methods.

Title: LoRA-SP: Streamlined Partial Parameter Adaptation for Resource-Efficient Fine-Tuning of Large Language Models

Authors: Yichao Wu, Yafei Xiang, Shuning Huo, Yulu Gong, Penghao Liang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2403.08822
Pdf URL: https://arxiv.org/pdf/2403.08822
Copy Paste: [[2403.08822]] LoRA-SP: Streamlined Partial Parameter Adaptation for Resource-Efficient Fine-Tuning of Large Language Models(https://arxiv.org/abs/2403.08822)
Keywords: large language model
Abstract: In addressing the computational and memory demands of fine-tuning Large Language Models (LLMs), we propose LoRA-SP (Streamlined Partial Parameter Adaptation), a novel approach utilizing randomized half-selective parameter freezing within the Low-Rank Adaptation (LoRA) framework. This method efficiently balances pre-trained knowledge retention and adaptability for task-specific optimizations. Through a randomized mechanism, LoRA-SP determines which parameters to update or freeze, significantly reducing computational and memory requirements without compromising model performance. We evaluated LoRA-SP across several benchmark NLP tasks, demonstrating its ability to achieve competitive performance with substantially lower resource consumption compared to traditional full-parameter fine-tuning and other parameter-efficient techniques. LoRA-SP's innovative approach not only facilitates the deployment of advanced NLP models in resource-limited settings but also opens new research avenues into effective and efficient model adaptation strategies.
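Example (not from the paper): a minimal sketch of randomly freezing half of a set of adapter parameters. The abstract does not state the granularity of the selection (whole matrices, rows, or individual entries), so this sketch assumes whole named tensors; the function name and the parameter names are hypothetical.

```python
import random


def choose_trainable(param_names, seed=0):
    """Map each parameter name to True (trainable) or False (frozen).

    Assumption: exactly half of the names (rounded down) are frozen,
    chosen uniformly at random with a fixed seed for reproducibility.
    """
    rng = random.Random(seed)
    names = list(param_names)
    frozen = set(rng.sample(names, len(names) // 2))
    return {name: name not in frozen for name in names}


# Hypothetical LoRA adapter tensors for four layers.
adapter_params = [f"layer{i}.lora_{m}" for i in range(4) for m in ("A", "B")]
trainable = choose_trainable(adapter_params, seed=42)
```

In a training loop, the resulting mapping would decide which adapter tensors receive gradient updates; frozen tensors need no gradients or optimizer state, which is where the memory saving would come from.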

Title: TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation

Authors: Dingbang Li, Wenzhou Chen, Xin Lin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.08833
Pdf URL: https://arxiv.org/pdf/2403.08833
Copy Paste: [[2403.08833]] TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation(https://arxiv.org/abs/2403.08833)
Keywords: explainability, large language model
Abstract: Zero-shot navigation is a critical challenge in Vision-Language Navigation (VLN) tasks, where the ability to adapt to unfamiliar instructions and to act in unknown environments is essential. Existing supervised learning-based models, trained using annotated data through reinforcement learning, exhibit limitations in generalization capabilities. Large Language Models (LLMs), with their extensive knowledge and emergent reasoning abilities, present a potential pathway for achieving zero-shot navigation. This paper presents a VLN agent based on LLMs, exploring approaches to the zero-shot navigation problem. To compensate for the shortcomings of LLMs in environmental perception, we propose the Thinking, Interacting, and Action (TINA) framework. TINA enables the agent to scrutinize perceptual information and autonomously query key clues within the environment through an introduced question-answering module, thereby aligning instructions with specific perceptual data. The navigation agent's perceptual abilities are enhanced through the TINA framework, while the explicit thought and query processes also improve the navigational procedure's explainability and transparency. We evaluate the performance of our method on the Room-to-Room dataset. The experiment results indicate that our approach improves the navigation performance of LLM-based agents. Our approach also outperformed some supervised learning-based methods, highlighting its efficacy in zero-shot navigation.

Title: Structural Positional Encoding for knowledge integration in transformer-based medical process monitoring

Authors: Christopher Irwin, Marco Dossena, Giorgio Leonardi, Stefania Montani
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.08836
Pdf URL: https://arxiv.org/pdf/2403.08836
Copy Paste: [[2403.08836]] Structural Positional Encoding for knowledge integration in transformer-based medical process monitoring(https://arxiv.org/abs/2403.08836)
Keywords: transformer
Abstract: Predictive process monitoring is a process mining task aimed at forecasting information about a running process trace, such as the most correct next activity to be executed. In medical domains, predictive process monitoring can provide valuable decision support in atypical and nontrivial situations. Decision support and quality assessment in medicine cannot ignore domain knowledge, in order to be grounded on all the available information (which is not limited to data) and to be really acceptable by end users. In this paper, we propose a predictive process monitoring approach relying on the use of a transformer, a deep learning architecture based on the attention mechanism. A major contribution of our work lies in the incorporation of ontological domain-specific knowledge, carried out through a graph positional encoding technique. The paper presents and discusses the encouraging experimental results we are collecting in the domain of stroke management.

Title: Predictive Clustering of Vessel Behavior Based on Hierarchical Trajectory Representation

Authors: Rui Zhang, Hanyue Wu, Zhenzhong Yin, Zhu Xiao, Yong Xiong, Kezhong Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.08838
Pdf URL: https://arxiv.org/pdf/2403.08838
Copy Paste: [[2403.08838]] Predictive Clustering of Vessel Behavior Based on Hierarchical Trajectory Representation(https://arxiv.org/abs/2403.08838)
Keywords: robust
Abstract: Vessel trajectory clustering, which aims to find similar trajectory patterns, has been widely leveraged in overwater applications. Most traditional methods use predefined rules and thresholds to identify discrete vessel behaviors. They aim for high-quality clustering and conduct clustering on entire sequences, whether the original trajectory or its sub-trajectories, failing to represent their evolution. To resolve this problem, we propose a Predictive Clustering of Hierarchical Vessel Behavior (PC-HiV). PC-HiV first uses hierarchical representations to transform every trajectory into a behavioral sequence. Then, it predicts evolution at each timestamp of the sequence based on the representations. By applying predictive clustering and latent encoding, PC-HiV improves clustering and predictions simultaneously. Experiments on real AIS datasets demonstrate PC-HiV's superiority over existing methods, showcasing its effectiveness in capturing behavioral evolution discrepancies between vessel types (tramp vs. liner) and within emission control areas. Results show that our method outperforms NN-Kmeans and Robust DAA by 3.9% and 6.4% of the purity score.
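Example (not from the paper): the abstract reports gains in purity score. As a reminder of how that metric is usually defined, this sketch assigns each cluster its majority ground-truth label and returns the fraction of items that match. The function name and the toy labels are illustrative.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Clustering purity: sum of each cluster's majority-label count, divided by N."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("inputs must have the same length")
    if not cluster_ids:
        raise ValueError("inputs must be non-empty")
    members = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster][label] += 1
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(cluster_ids)


# Cluster 0 is all "tramp"; cluster 1 is split, so one of its two items counts.
score = purity([0, 0, 1, 1], ["tramp", "tramp", "liner", "tramp"])  # 0.75
```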

Title: NoiseDiffusion: Correcting Noise for Image Interpolation with Diffusion Models beyond Spherical Linear Interpolation

Authors: PengFei Zheng, Yonggang Zhang, Zhen Fang, Tongliang Liu, Defu Lian, Bo Han
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.08840
Pdf URL: https://arxiv.org/pdf/2403.08840
Copy Paste: [[2403.08840]] NoiseDiffusion: Correcting Noise for Image Interpolation with Diffusion Models beyond Spherical Linear Interpolation(https://arxiv.org/abs/2403.08840)
Keywords: diffusion
Abstract: Image interpolation based on diffusion models is promising in creating fresh and interesting images. Advanced interpolation methods mainly focus on spherical linear interpolation, where images are encoded into the noise space and then interpolated for denoising to images. However, existing methods face challenges in effectively interpolating natural images (not generated by diffusion models), thereby restricting their practical applicability. Our experimental investigations reveal that these challenges stem from the invalidity of the encoding noise, which may no longer obey the expected noise distribution, e.g., a normal distribution. To address these challenges, we propose a novel approach to correct noise for image interpolation, NoiseDiffusion. Specifically, NoiseDiffusion approaches the invalid noise to the expected distribution by introducing subtle Gaussian noise and introduces a constraint to suppress noise with extreme values. In this context, promoting noise validity contributes to mitigating image artifacts, but the constraint and introduced exogenous noise typically lead to a reduction in signal-to-noise ratio, i.e., loss of original image information. Hence, NoiseDiffusion performs interpolation within the noisy image space and injects raw images into these noisy counterparts to address the challenge of information loss. Consequently, NoiseDiffusion enables us to interpolate natural images without causing artifacts or information loss, thus achieving the best interpolation results.
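Example (not from the paper): a minimal sketch of spherical linear interpolation (slerp), the baseline the abstract refers to, written for plain Python lists. It shows only the standard slerp formula between two noise vectors; NoiseDiffusion's noise correction and constraint are not reproduced. The function name is illustrative.

```python
import math


def slerp(v0, v1, t):
    """Spherically interpolate between vectors v0 and v1 for t in [0, 1]."""
    dot = sum(a * b for a, b in zip(v0, v1))
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    cos_theta = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    theta = math.acos(cos_theta)
    if theta < 1e-8:
        # Nearly parallel vectors: fall back to ordinary linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    sin_theta = math.sin(theta)
    w0 = math.sin((1 - t) * theta) / sin_theta
    w1 = math.sin(t * theta) / sin_theta
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

For two unit vectors the interpolated point stays on the unit sphere, which is why slerp is preferred over linear interpolation for Gaussian latents. Exactly antiparallel inputs are not handled by this sketch, since the interpolation path is then not unique.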

Title: DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation

Authors: Minbin Huang, Yanxin Long, Xinchi Deng, Ruihang Chu, Jiangfeng Xiong, Xiaodan Liang, Hong Cheng, Qinglin Lu, Wei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08857
Pdf URL: https://arxiv.org/pdf/2403.08857
Copy Paste: [[2403.08857]] DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation(https://arxiv.org/abs/2403.08857)
Keywords: fair, large language model
Abstract: Text-to-image (T2I) generation models have significantly advanced in recent years. However, effective interaction with these models is challenging for average users due to the need for specialized prompt engineering knowledge and the inability to perform multi-turn image generation, hindering a dynamic and iterative creation process. Recent attempts have tried to equip Multi-modal Large Language Models (MLLMs) with T2I models to bring the user's natural language instructions into reality. Hence, the output modality of MLLMs is extended, and the multi-turn generation quality of T2I models is enhanced thanks to the strong multi-modal comprehension ability of MLLMs. However, many of these works face challenges in identifying correct output modalities and generating coherent images accordingly as the number of output modalities increases and the conversations go deeper. Therefore, we propose DialogGen, an effective pipeline to align off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System (MIDS) for multi-turn Text-to-Image generation. It is composed of drawing prompt alignment, careful training data curation, and error correction. Moreover, as the field of MIDS flourishes, comprehensive benchmarks are urgently needed to evaluate MIDS fairly in terms of output modality correctness and multi-modal output coherence. To address this issue, we introduce the Multi-modal Dialogue Benchmark (DialogBen), a comprehensive bilingual benchmark designed to assess the ability of MLLMs to generate accurate and coherent multi-modal content that supports image editing. It contains two evaluation metrics to measure the model's ability to switch modalities and the coherence of the output images. Our extensive experiments on DialogBen and user study demonstrate the effectiveness of DialogGen compared with other State-of-the-Art models.

Title: ARtVista: Gateway To Empower Anyone Into Artist

Authors: Trong-Vu Hoang, Quang-Binh Nguyen, Duy-Nam Ly, Khanh-Duy Le, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08876
Pdf URL: https://arxiv.org/pdf/2403.08876
Copy Paste: [[2403.08876]] ARtVista: Gateway To Empower Anyone Into Artist(https://arxiv.org/abs/2403.08876)
Keywords: generative
Abstract: Drawing is an art that enables people to express their imagination and emotions. However, individuals usually face challenges in drawing, especially when translating conceptual ideas into visually coherent representations and bridging the gap between mental visualization and practical execution. In response, we propose ARtVista - a novel system integrating AR and generative AI technologies. ARtVista not only recommends reference images aligned with users' abstract ideas and generates sketches for users to draw but also goes beyond, crafting vibrant paintings in various painting styles. ARtVista also offers users an alternative approach to create striking paintings by simulating the paint-by-number concept on reference images, empowering users to create visually stunning artwork devoid of the necessity for advanced drawing skills. We perform a pilot study and reveal positive feedback on its usability, emphasizing its effectiveness in visualizing user ideas and aiding the painting process to achieve stunning pictures without requiring advanced drawing skills. The source code will be available at https://github.com/htrvu/ARtVista.

Title: REFRESH: Responsible and Efficient Feature Reselection Guided by SHAP Values

Authors: Shubham Sharma, Sanghamitra Dutta, Emanuele Albini, Freddy Lecue, Daniele Magazzeni, Manuela Veloso
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.08880
Pdf URL: https://arxiv.org/pdf/2403.08880
Copy Paste: [[2403.08880]] REFRESH: Responsible and Efficient Feature Reselection Guided by SHAP Values(https://arxiv.org/abs/2403.08880)
Keywords: robust, fair
Abstract: Feature selection is a crucial step in building machine learning models. This process is often achieved with accuracy as an objective, and can be cumbersome and computationally expensive for large-scale datasets. Several additional model performance characteristics such as fairness and robustness are of importance for model development. As regulations are driving the need for more trustworthy models, deployed models need to be corrected for model characteristics associated with responsible artificial intelligence. When feature selection is done with respect to one model performance characteristic (e.g., accuracy), feature selection with secondary model performance characteristics (e.g., fairness and robustness) as objectives would require going through the computationally expensive selection process from scratch. In this paper, we introduce the problem of feature reselection, so that features can be selected with respect to secondary model performance characteristics efficiently even after a feature selection process has been done with respect to a primary objective. To address this problem, we propose REFRESH, a method to reselect features so that additional constraints that are desirable towards model performance can be achieved without having to train several new models. REFRESH's underlying algorithm is a novel technique using SHAP values and correlation analysis that can approximate the predictions of a model without having to train these models. Empirical evaluations on three datasets, including a large-scale loan defaulting dataset, show that REFRESH can help find alternate models with better model characteristics efficiently. We also discuss the need for reselection and REFRESH based on regulation desiderata.

Title: Federated Data Model

Authors: Xiao Chen, Shunan Zhang, Eric Z. Chen, Yikang Liu, Lin Zhao, Terrence Chen, Shanhui Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08887
Pdf URL: https://arxiv.org/pdf/2403.08887
Copy Paste: [[2403.08887]] Federated Data Model(https://arxiv.org/abs/2403.08887)
Keywords: privacy, robust, federate, diffusion, segmentation
Abstract: In artificial intelligence (AI), especially deep learning, data diversity and volume play a pivotal role in model development. However, training a robust deep learning model often faces challenges due to data privacy, regulations, and the difficulty of sharing data between different locations, especially for medical applications. To address this, we developed a method called the Federated Data Model (FDM). This method uses diffusion models to learn the characteristics of data at one site and then creates synthetic data that can be used at another site without sharing the actual data. We tested this approach with a medical image segmentation task, focusing on cardiac magnetic resonance images from different hospitals. Our results show that models trained with this method perform well both on the data they were originally trained on and on data from other sites. This approach offers a promising way to train accurate and privacy-respecting AI models across different locations.

Title: From "um" to "yeah": Producing, predicting, and regulating information flow in human conversation

Authors: Claire Augusta Bergey, Simon DeDeo
Subjects: cs.CL, cs.IT, q-bio.NC
Abstract URL: https://arxiv.org/abs/2403.08890
Pdf URL: https://arxiv.org/pdf/2403.08890
Copy Paste: [[2403.08890]] From "um" to "yeah": Producing, predicting, and regulating information flow in human conversation(https://arxiv.org/abs/2403.08890)
Keywords: large language model
Abstract: Conversation demands attention. Speakers must call words to mind, listeners must make sense of them, and both together must negotiate this flow of information, all in fractions of a second. We used large language models to study how this works in a large-scale dataset of English-language conversation, the CANDOR corpus. We provide a new estimate of the information density of unstructured conversation, of approximately 13 bits/second, and find significant effects associated with the cognitive load of both retrieving, and presenting, that information. We also reveal a role for backchannels -- the brief yeahs, uh-huhs, and mhmms that listeners provide -- in regulating the production of novelty: the lead-up to a backchannel is associated with declining information rate, while speech downstream rebounds to previous rates. Our results provide new insights into long-standing theories of how we respond to fluctuating demands on cognitive resources, and how we negotiate those demands in partnership with others.
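Example (not from the paper): a minimal sketch of how an information rate in bits per second can be computed from language-model probabilities, assuming each token's information content is its surprisal, -log2(p), and that the rate is total surprisal divided by speaking time. The function name, the probabilities, and the duration are made up for illustration; the authors' exact estimation procedure is not described in the abstract.

```python
import math


def information_rate(token_probs, duration_seconds):
    """Total surprisal in bits divided by elapsed time in seconds."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    bits = sum(-math.log2(p) for p in token_probs)
    return bits / duration_seconds


# Three tokens with probabilities 1/2, 1/4, 1/8 carry 1 + 2 + 3 = 6 bits.
rate = information_rate([0.5, 0.25, 0.125], 2.0)  # 3.0 bits/second
```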

Title: Envision3D: One Image to 3D with Anchor Views Interpolation

Authors: Yatian Pang, Tanghui Jia, Yujun Shi, Zhenyu Tang, Junwu Zhang, Xinhua Cheng, Xing Zhou, Francis E.H. Tay, Li Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.08902
Pdf URL: https://arxiv.org/pdf/2403.08902
Copy Paste: [[2403.08902]] Envision3D: One Image to 3D with Anchor Views Interpolation(https://arxiv.org/abs/2403.08902)
Keywords: robust, extraction, diffusion
Abstract: We present Envision3D, a novel method for efficiently generating high-quality 3D content from a single image. Recent methods that extract 3D content from multi-view images generated by diffusion models show great potential. However, it is still challenging for diffusion models to generate dense multi-view consistent images, which is crucial for the quality of 3D content extraction. To address this issue, we propose a novel cascade diffusion framework, which decomposes the challenging dense views generation task into two tractable stages, namely anchor views generation and anchor views interpolation. In the first stage, we train the image diffusion model to generate globally consistent anchor views conditioned on image-normal pairs. Subsequently, leveraging our video diffusion model fine-tuned on consecutive multi-view images, we conduct interpolation on the previous anchor views to generate extra dense views. This framework yields dense, multi-view consistent images, providing comprehensive 3D information. To further enhance the overall generation quality, we introduce a coarse-to-fine sampling strategy for the reconstruction algorithm to robustly extract textured meshes from the generated dense images. Extensive experiments demonstrate that our method is capable of generating high-quality 3D content in terms of texture and geometry, surpassing previous image-to-3D baseline methods.

Title: Efficiently Computing Similarities to Private Datasets

Authors: Arturs Backurs, Zinan Lin, Sepideh Mahabadi, Sandeep Silwal, Jakub Tarnawski
Subjects: cs.CR, cs.DS, cs.LG
Abstract URL: https://arxiv.org/abs/2403.08917
Pdf URL: https://arxiv.org/pdf/2403.08917
Copy Paste: [[2403.08917]] Efficiently Computing Similarities to Private Datasets(https://arxiv.org/abs/2403.08917)
Keywords: privacy
Abstract: Many methods in differentially private model training rely on computing the similarity between a query point (such as public or synthetic data) and private data. We abstract out this common subroutine and study the following fundamental algorithmic problem: Given a similarity function $f$ and a large high-dimensional private dataset $X \subset \mathbb{R}^d$, output a differentially private (DP) data structure which approximates $\sum_{x \in X} f(x,y)$ for any query $y$. We consider the cases where $f$ is a kernel function, such as $f(x,y) = e^{-\|x-y\|_2^2/\sigma^2}$ (also known as DP kernel density estimation), or a distance function such as $f(x,y) = \|x-y\|_2$, among others. Our theoretical results improve upon prior work and give better privacy-utility trade-offs as well as faster query times for a wide range of kernels and distance functions. The unifying approach behind our results is leveraging `low-dimensional structures' present in the specific functions $f$ that we study, using tools such as provable dimensionality reduction, approximation theory, and one-dimensional decomposition of the functions. Our algorithms empirically exhibit improved query times and accuracy over prior state of the art. We also present an application to DP classification. Our experiments demonstrate that the simple methodology of classifying based on average similarity is orders of magnitude faster than prior DP-SGD based approaches for comparable accuracy.
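
Illustration (not from the paper): the quantity being approximated is $\sum_{x \in X} f(x,y)$. The sketch below computes it exactly for the Gaussian kernel and then adds Laplace noise as a naive single-query baseline. This is an assumption for illustration only: it is not the paper's data structure, and it gives $\varepsilon$-DP for one query only (each term lies in (0, 1], so changing one record moves the sum by at most 1). All function names are hypothetical.

```python
import math
import random


def gaussian_kernel_sum(X, y, sigma):
    """Exact (non-private) sum over x in X of exp(-||x - y||^2 / sigma^2)."""
    total = 0.0
    for x in X:
        sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
        total += math.exp(-sq_dist / sigma ** 2)
    return total


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_kernel_sum(X, y, sigma, epsilon, rng):
    """Naive one-query baseline: exact sum plus Laplace noise of scale 1/epsilon."""
    return gaussian_kernel_sum(X, y, sigma) + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    data = [[0.0, 0.0], [1.0, 0.0]]
    query = [0.0, 0.0]
    print(gaussian_kernel_sum(data, query, sigma=1.0))
    print(noisy_kernel_sum(data, query, 1.0, epsilon=1.0, rng=random.Random(0)))
```

Answering many queries this way would require splitting the privacy budget across them, which is the cost a reusable DP data structure is meant to avoid.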

Title: Unveiling the Truth: Exploring Human Gaze Patterns in Fake Images

Authors: Giuseppe Cartella, Vittorio Cuculo, Marcella Cornia, Rita Cucchiara
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.08933
Pdf URL: https://arxiv.org/pdf/2403.08933
Copy Paste: [[2403.08933]] Unveiling the Truth: Exploring Human Gaze Patterns in Fake Images(https://arxiv.org/abs/2403.08933)
Keywords: diffusion, generative
Abstract: Creating high-quality and realistic images is now possible thanks to the impressive advancements in image generation. A description in natural language of your desired output is all you need to obtain breathtaking results. However, as the use of generative models grows, so do concerns about the propagation of malicious content and misinformation. Consequently, the research community is actively working on the development of novel fake detection techniques, primarily focusing on low-level features and possible fingerprints left by generative models during the image generation process. In a different vein, in our work, we investigate the possibility of including human semantic knowledge in fake image detection frameworks. To achieve this, we collect a novel dataset of partially manipulated images using diffusion models and conduct an eye-tracking experiment to record the eye movements of different observers while viewing real and fake stimuli. A preliminary statistical analysis is conducted to explore the distinctive patterns in how humans perceive genuine and altered images. Statistical findings reveal that, when perceiving counterfeit samples, humans tend to focus on more confined regions of the image, in contrast to the more dispersed observational pattern observed when viewing genuine images. Our dataset is publicly available at: https://github.com/aimagelab/unveiling-the-truth.

Title: FogGuard: guarding YOLO against fog using perceptual loss

Authors: Soheil Gharatappeh, Sepideh Neshatfar, Salimeh Yasaei Sekeh, Vikas Dhiman
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.08939
Pdf URL: https://arxiv.org/pdf/2403.08939
Copy Paste: [[2403.08939]] FogGuard: guarding YOLO against fog using perceptual loss(https://arxiv.org/abs/2403.08939)
Keywords: robust
Abstract: In this paper, we present a novel fog-aware object detection network called FogGuard, designed to address the challenges posed by foggy weather conditions. Autonomous driving systems heavily rely on accurate object detection algorithms, but adverse weather conditions can significantly impact the reliability of deep neural networks (DNNs). Existing approaches fall into two main categories: 1) image enhancement, such as IA-YOLO, and 2) domain adaptation. Image enhancement based techniques attempt to generate a fog-free image. However, retrieving a fogless image from a foggy image is a much harder problem than detecting objects in a foggy image. Domain-adaptation based approaches, on the other hand, do not make use of labelled datasets in the target domain. Both categories of approaches are attempting to solve a harder version of the problem. Our approach builds on fine-tuning. Our framework is specifically designed to compensate for foggy conditions present in the scene, ensuring robust performance. We adopt YOLOv3 as the baseline object detection algorithm and introduce a novel Teacher-Student Perceptual loss to achieve high-accuracy object detection in foggy images. Through extensive evaluations on common datasets such as PASCAL VOC and RTTS, we demonstrate the improvement in performance achieved by our network. We demonstrate that FogGuard achieves 69.43\% mAP, as compared to 57.78\% for YOLOv3 on the RTTS dataset. Furthermore, we show that while our training method increases time complexity, it does not introduce any additional overhead during inference compared to the regular YOLO network.

Title: LMStyle Benchmark: Evaluating Text Style Transfer for Chatbots

Authors: Jianlin Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.08943
Pdf URL: https://arxiv.org/pdf/2403.08943
Copy Paste: [[2403.08943]] LMStyle Benchmark: Evaluating Text Style Transfer for Chatbots(https://arxiv.org/abs/2403.08943)
Keywords: large language model
Abstract: Since the breakthrough of ChatGPT, large language models (LLMs) have garnered significant attention in the research community. With the development of LLMs, the question of text style transfer for conversational models has emerged as a natural extension, where chatbots may possess their own styles or even characters. However, standard evaluation metrics have not yet been established for this new setting. This paper aims to address this issue by proposing the LMStyle Benchmark, a novel evaluation framework applicable to chat-style text style transfer (C-TST), that can measure the quality of style transfer for LLMs in an automated and scalable manner. In addition to conventional style strength metrics, LMStyle Benchmark further considers a novel aspect of metrics called appropriateness, a high-level metric that takes account of coherence, fluency and other implicit factors without the aid of reference samples. Our experiments demonstrate that the new evaluation methods introduced by LMStyle Benchmark have a higher correlation with human judgments in terms of appropriateness. Based on LMStyle Benchmark, we present a comprehensive list of evaluation results for popular LLMs, including LLaMA, Alpaca, and Vicuna, reflecting their stylistic properties, such as formality and sentiment strength, along with their appropriateness.

Title: Usable XAI: 10 Strategies Towards Exploiting Explainability in the LLM Era

Authors: Xuansheng Wu, Haiyan Zhao, Yaochen Zhu, Yucheng Shi, Fan Yang, Tianming Liu, Xiaoming Zhai, Wenlin Yao, Jundong Li, Mengnan Du, Ninghao Liu
Subjects: cs.LG, cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2403.08946
Pdf URL: https://arxiv.org/pdf/2403.08946
Copy Paste: [[2403.08946]] Usable XAI: 10 Strategies Towards Exploiting Explainability in the LLM Era(https://arxiv.org/abs/2403.08946)
Keywords: explainability, large language model
Abstract: Explainable AI (XAI) refers to techniques that provide human-understandable insights into the workings of AI models. Recently, the focus of XAI is being extended towards Large Language Models (LLMs) which are often criticized for their lack of transparency. This extension calls for a significant transformation in XAI methodologies for two reasons. First, many existing XAI methods cannot be directly applied to LLMs due to their complexity and advanced capabilities. Second, as LLMs are increasingly deployed across diverse industry applications, the role of XAI shifts from merely opening the "black box" to actively enhancing the productivity and applicability of LLMs in real-world settings. Meanwhile, unlike traditional machine learning models that are passive recipients of XAI insights, the distinct abilities of LLMs can reciprocally enhance XAI. Therefore, in this paper, we introduce Usable XAI in the context of LLMs by analyzing (1) how XAI can benefit LLMs and AI systems, and (2) how LLMs can contribute to the advancement of XAI. We introduce 10 strategies, introducing the key techniques for each and discussing their associated challenges. We also provide case studies to demonstrate how to obtain and leverage explanations. The code used in this paper can be found at: https://github.com/JacksonWuxs/UsableXAI_LLM.

Title: Towards Efficient Risk-Sensitive Policy Gradient: An Iteration Complexity Analysis

Authors: Rui Liu, Erfaun Noorani, Pratap Tokekar, John S. Baras
Subjects: cs.LG, cs.AI, math.OC
Abstract URL: https://arxiv.org/abs/2403.08955
Pdf URL: https://arxiv.org/pdf/2403.08955
Copy Paste: [[2403.08955]] Towards Efficient Risk-Sensitive Policy Gradient: An Iteration Complexity Analysis(https://arxiv.org/abs/2403.08955)
Keywords: robust
Abstract: Reinforcement Learning (RL) has shown exceptional performance across various applications, enabling autonomous agents to learn optimal policies through interaction with their environments. However, traditional RL frameworks often face challenges in terms of iteration complexity and robustness. Risk-sensitive RL, which balances expected return and risk, has been explored for its potential to yield probabilistically robust policies, yet its iteration complexity analysis remains underexplored. In this study, we conduct a thorough iteration complexity analysis for the risk-sensitive policy gradient method, focusing on the REINFORCE algorithm and employing the exponential utility function. We obtain an iteration complexity of $\mathcal{O}(\epsilon^{-2})$ to reach an $\epsilon$-approximate first-order stationary point (FOSP). We investigate whether risk-sensitive algorithms can achieve better iteration complexity compared to their risk-neutral counterparts. Our theoretical analysis demonstrates that risk-sensitive REINFORCE can have a reduced number of iterations required for convergence. This leads to improved iteration complexity, as employing the exponential utility does not entail additional computation per iteration. We characterize the conditions under which risk-sensitive algorithms can achieve better iteration complexity. Our simulation results also validate that risk-averse cases can converge and stabilize more quickly after approximately half of the episodes compared to their risk-neutral counterparts.
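
Illustration (not from the paper): one common form of the exponential-utility objective is $J_\beta = \frac{1}{\beta} \log \mathbb{E}[e^{\beta R}]$, where $R$ is the episode return and $\beta \neq 0$ is a risk parameter. The exact form and sign convention used in the paper are assumptions here. The sketch estimates this quantity from sampled returns; the function name is hypothetical.

```python
import math


def entropic_risk(returns, beta):
    """Estimate (1/beta) * log(mean(exp(beta * R))) from sampled returns.

    beta < 0 penalises variability (risk-averse); beta > 0 rewards it.
    Uses a log-sum-exp shift for numerical stability.
    """
    if beta == 0:
        raise ValueError("beta must be non-zero; the limit beta -> 0 is the plain mean")
    scaled = [beta * r for r in returns]
    shift = max(scaled)
    mean_exp = sum(math.exp(s - shift) for s in scaled) / len(scaled)
    return (shift + math.log(mean_exp)) / beta


if __name__ == "__main__":
    sampled_returns = [1.0, 3.0]
    print(entropic_risk(sampled_returns, beta=-1.0))  # below the mean of 2.0
    print(entropic_risk(sampled_returns, beta=1.0))   # above the mean of 2.0
```

By Jensen's inequality the risk-averse value never exceeds the sample mean, which is the sense in which the objective trades expected return against risk.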

Title: PathM3: A Multimodal Multi-Task Multiple Instance Learning Framework for Whole Slide Image Classification and Captioning

Authors: Qifeng Zhou, Wenliang Zhong, Yuzhi Guo, Michael Xiao, Hehuan Ma, Junzhou Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.08967
Pdf URL: https://arxiv.org/pdf/2403.08967
Copy Paste: [[2403.08967]] PathM3: A Multimodal Multi-Task Multiple Instance Learning Framework for Whole Slide Image Classification and Captioning(https://arxiv.org/abs/2403.08967)
Keywords: transformer
Abstract: In the field of computational histopathology, both whole slide images (WSIs) and diagnostic captions provide valuable insights for making diagnostic decisions. However, aligning WSIs with diagnostic captions presents a significant challenge. This difficulty arises from two main factors: 1) Gigapixel WSIs are unsuitable for direct input into deep learning models, and the redundancy and correlation among the patches demand more attention; and 2) Authentic WSI diagnostic captions are extremely limited, making it difficult to train an effective model. To overcome these obstacles, we present PathM3, a multimodal, multi-task, multiple instance learning (MIL) framework for WSI classification and captioning. PathM3 adapts a query-based transformer to effectively align WSIs with diagnostic captions. Given that histopathology visual patterns are redundantly distributed across WSIs, we aggregate each patch feature with a MIL method that considers the correlations among instances. Furthermore, our PathM3 overcomes data scarcity in WSI-level captions by leveraging limited WSI diagnostic caption data in the manner of multi-task joint learning. Extensive experiments with improved classification accuracy and caption generation demonstrate the effectiveness of our method on both WSI classification and captioning tasks.

Title: Representing Anatomical Trees by Denoising Diffusion of Implicit Neural Fields

Authors: Ashish Sinha, Ghassan Hamarneh
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.08974
Pdf URL: https://arxiv.org/pdf/2403.08974
Copy Paste: [[2403.08974]] Representing Anatomical Trees by Denoising Diffusion of Implicit Neural Fields(https://arxiv.org/abs/2403.08974)
Keywords: diffusion
Abstract: Anatomical trees play a central role in clinical diagnosis and treatment planning. However, accurately representing anatomical trees is challenging due to their varying and complex topology and geometry. Traditional methods for representing tree structures, captured using medical imaging, while invaluable for visualizing vascular and bronchial networks, exhibit drawbacks in terms of limited resolution, flexibility, and efficiency. Recently, implicit neural representations (INRs) have emerged as a powerful tool for representing shapes accurately and efficiently. We propose a novel approach for representing anatomical trees using INR, while also capturing the distribution of a set of trees via denoising diffusion in the space of INRs. We accurately capture the intricate geometries and topologies of anatomical trees at any desired resolution. Through extensive qualitative and quantitative evaluation, we demonstrate high-fidelity tree reconstruction with arbitrary resolution yet compact storage, and versatility across anatomical sites and tree complexities.

Title: AutoGuide: Automated Generation and Selection of State-Aware Guidelines for Large Language Model Agents

Authors: Yao Fu, Dong-Ki Kim, Jaekyeom Kim, Sungryull Sohn, Lajanugen Logeswaran, Kyunghoon Bae, Honglak Lee
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.08978
Pdf URL: https://arxiv.org/pdf/2403.08978
Copy Paste: [[2403.08978]] AutoGuide: Automated Generation and Selection of State-Aware Guidelines for Large Language Model Agents(https://arxiv.org/abs/2403.08978)
Keywords: large language model
Abstract: The primary limitation of large language models (LLMs) is their restricted understanding of the world. This poses significant difficulties for LLM-based agents, particularly in domains where pre-trained LLMs lack sufficient knowledge. In this paper, we introduce a novel framework, called AutoGuide, that bridges the knowledge gap in pre-trained LLMs by leveraging implicit knowledge in offline experiences. Specifically, AutoGuide effectively extracts knowledge embedded in offline data by extracting a set of state-aware guidelines. Importantly, each state-aware guideline is expressed in concise natural language and follows a conditional structure, clearly describing the state where it is applicable. As such, the resulting guidelines enable a principled way to provide helpful knowledge pertinent to an agent's current decision-making process. We show that our approach outperforms competitive LLM-based baselines by a large margin in sequential decision-making benchmarks.

Title: Ethos: Rectifying Language Models in Orthogonal Parameter Space

Authors: Lei Gao, Yue Niu, Tingting Tang, Salman Avestimehr, Murali Annavaram
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.08994
Pdf URL: https://arxiv.org/pdf/2403.08994
Copy Paste: [[2403.08994]] Ethos: Rectifying Language Models in Orthogonal Parameter Space(https://arxiv.org/abs/2403.08994)
Keywords: privacy
Abstract: Language models (LMs) have greatly propelled the research on natural language processing. However, LMs also raise concerns regarding the generation of biased or toxic content and the potential disclosure of private information from the training dataset. In this work, we present a new efficient approach, Ethos, that rectifies LMs to mitigate toxicity and bias in outputs and avoid privacy leakage. Ethos is built on task arithmetic. However, unlike current task arithmetic algorithms, Ethos distinguishes general beneficial knowledge from undesired knowledge when reconstructing task vectors. Specifically, Ethos first obtains a set of principal components from the pre-trained models using singular value decomposition. Then, by projecting the task vector onto principal components, Ethos identifies the principal components that encode general or undesired knowledge. Ethos performs negation using the task vector with undesired knowledge only, thereby minimizing collateral damage on general model utility. We demonstrate the efficacy of our approach on three different tasks: debiasing, detoxification, and memorization unlearning. Evaluations show Ethos is more effective in removing undesired knowledge and maintaining the overall model performance compared to current task arithmetic methods.

Title: CART: Caltech Aerial RGB-Thermal Dataset in the Wild

Authors: Connor Lee, Matthew Anderson, Nikhil Raganathan, Xingxing Zuo, Kevin Do, Georgia Gkioxari, Soon-Jo Chung
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2403.08997
Pdf URL: https://arxiv.org/pdf/2403.08997
Copy Paste: [[2403.08997]] CART: Caltech Aerial RGB-Thermal Dataset in the Wild(https://arxiv.org/abs/2403.08997)
Keywords: robust, segmentation
Abstract: We present the first publicly available RGB-thermal dataset designed for aerial robotics operating in natural environments. Our dataset captures a variety of terrains across the continental United States, including rivers, lakes, coastlines, deserts, and forests, and consists of synchronized RGB, long-wave thermal, global positioning, and inertial data. Furthermore, we provide semantic segmentation annotations for 10 classes commonly encountered in natural settings in order to facilitate the development of perception algorithms robust to adverse weather and nighttime conditions. Using this dataset, we propose new and challenging benchmarks for thermal and RGB-thermal semantic segmentation, RGB-to-thermal image translation, and visual-inertial odometry. We present extensive results using state-of-the-art methods and highlight the challenges posed by temporal and geographical domain shifts in our data. Dataset and accompanying code will be provided at https://github.com/aerorobotics/caltech-aerial-rgbt-dataset

Title: AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic

Authors: Emad A. Alghamdi, Reem I. Masoud, Deema Alnuhait, Afnan Y. Alomairi, Ahmed Ashraf, Mohamed Zaytoon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.09017
Pdf URL: https://arxiv.org/pdf/2403.09017
Copy Paste: [[2403.09017]] AraTrust: An Evaluation of Trustworthiness for LLMs in Arabic(https://arxiv.org/abs/2403.09017)
Keywords: privacy, fair, large language model
Abstract: The swift progress and widespread acceptance of artificial intelligence (AI) systems highlight a pressing requirement to comprehend both the capabilities and potential risks associated with AI. Given the linguistic complexity, cultural richness, and underrepresented status of Arabic in AI research, there is a pressing need to focus on the performance and safety of Large Language Models (LLMs) on Arabic-related tasks. Despite some progress in their development, there is a lack of comprehensive trustworthiness evaluation benchmarks, which presents a major challenge in accurately assessing and improving the safety of LLMs when prompted in Arabic. In this paper, we introduce AraTrust, the first comprehensive trustworthiness benchmark for LLMs in Arabic. AraTrust comprises 516 human-written multiple-choice questions addressing diverse dimensions related to truthfulness, ethics, safety, physical health, mental health, unfairness, illegal activities, privacy, and offensive language. By introducing AraTrust, we aim to promote collaborative efforts to create safer and more trustworthy LLMs for Arabic users. We evaluated a set of LLMs against our benchmark to assess their trustworthiness. GPT-4 proved to be the most trustworthy for the Arabic language.

Title: Semiparametric Token-Sequence Co-Supervision

Authors: Hyunji Lee, Doyoung Kim, Jihoon Jun, Sejune Joo, Joel Jang, Kyoung-Woon On, Minjoon Seo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.09024
Pdf URL: https://arxiv.org/pdf/2403.09024
Copy Paste: [[2403.09024]] Semiparametric Token-Sequence Co-Supervision(https://arxiv.org/abs/2403.09024)
Keywords: robust
Abstract: In this work, we introduce a semiparametric token-sequence co-supervision training method. It trains a language model by simultaneously leveraging supervision from the traditional next token prediction loss which is calculated over the parametric token embedding space and the next sequence prediction loss which is calculated over the nonparametric sequence embedding space. The nonparametric sequence embedding space is constructed by a separate language model tasked to condense an input text into a single representative embedding. Our experiments demonstrate that a model trained via both supervisions consistently surpasses models trained via each supervision independently. Analysis suggests that this co-supervision encourages a broader generalization capability across the model. Especially, the robustness of parametric token space which is established during the pretraining step tends to effectively enhance the stability of nonparametric sequence embedding space, a new space established by another language model.

Title: VDNA-PR: Using General Dataset Representations for Robust Sequential Visual Place Recognition

Authors: Benjamin Ramtoula, Daniele De Martini, Matthew Gadd, Paul Newman
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2403.09025
Pdf URL: https://arxiv.org/pdf/2403.09025
Copy Paste: [[2403.09025]] VDNA-PR: Using General Dataset Representations for Robust Sequential Visual Place Recognition(https://arxiv.org/abs/2403.09025)
Keywords: robust
Abstract: This paper adapts a general dataset representation technique to produce robust Visual Place Recognition (VPR) descriptors, crucial to enable real-world mobile robot localisation. Two parallel lines of work on VPR have shown, on one side, that general-purpose off-the-shelf feature representations can provide robustness to domain shifts, and, on the other, that fused information from sequences of images improves performance. In our recent work on measuring domain gaps between image datasets, we proposed a Visual Distribution of Neuron Activations (VDNA) representation to represent datasets of images. This representation can naturally handle image sequences and provides a general and granular feature representation derived from a general-purpose model. Moreover, our representation is based on tracking neuron activation values over the list of images to represent and is not limited to a particular neural network layer, therefore having access to high- and low-level concepts. This work shows how VDNAs can be used for VPR by learning a very lightweight and simple encoder to generate task-specific descriptors. Our experiments show that our representation can allow for better robustness than current solutions to serious domain shifts away from the training data distribution, such as to indoor environments and aerial imagery.

Title: VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

Authors: Chris Kelly, Luhui Hu, Bang Yang, Yu Tian, Deshun Yang, Cindy Yang, Zaoshan Huang, Zihao Li, Jiayin Hu, Yuexian Zou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09027
Pdf URL: https://arxiv.org/pdf/2403.09027
Copy Paste: [[2403.09027]] VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework(https://arxiv.org/abs/2403.09027)
Keywords: large language model
Abstract: With the emergence of large language models (LLMs) and vision foundation models, how to combine the intelligence and capacity of these open-sourced or API-available models to achieve open-world visual perception remains an open question. In this paper, we introduce VisionGPT to consolidate and automate the integration of state-of-the-art foundation models, thereby facilitating vision-language understanding and the development of vision-oriented AI. VisionGPT builds upon a generalized multimodal framework that distinguishes itself through three key features: (1) utilizing LLMs (e.g., LLaMA-2) as the pivot to break down users' requests into detailed action proposals to call suitable foundation models; (2) integrating multi-source outputs from foundation models automatically and generating comprehensive responses for users; (3) adapting to a wide range of applications such as text-conditioned image understanding/generation/editing and visual question answering. This paper outlines the architecture and capabilities of VisionGPT, demonstrating its potential to revolutionize the field of computer vision through enhanced efficiency, versatility, generalization, and performance. Our code and models will be made publicly available. Keywords: VisionGPT, Open-world visual perception, Vision-language understanding, Large language model, and Foundation model

Title: rFaceNet: An End-to-End Network for Enhanced Physiological Signal Extraction through Identity-Specific Facial Contours

Authors: Dali Zhu, Wenli Zhang, Hualin Zeng, Xiaohao Liu, Long Yang, Jiaqi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09034
Pdf URL: https://arxiv.org/pdf/2403.09034
Copy Paste: [[2403.09034]] rFaceNet: An End-to-End Network for Enhanced Physiological Signal Extraction through Identity-Specific Facial Contours(https://arxiv.org/abs/2403.09034)
Keywords: extraction, interpretability
Abstract: The remote photoplethysmography (rPPG) technique extracts blood volume pulse (BVP) signals from subtle pixel changes in video frames. This study introduces rFaceNet, an advanced rPPG method that enhances the extraction of facial BVP signals with a focus on facial contours. rFaceNet integrates identity-specific facial contour information and eliminates redundant data. It efficiently extracts facial contours from temporally normalized frame inputs through a Temporal Compressor Unit (TCU) and steers the model focus to relevant facial regions by using the Cross-Task Feature Combiner (CTFC). Through elaborate training, the quality and interpretability of facial physiological signals extracted by rFaceNet are greatly improved compared to previous methods. Moreover, our novel approach demonstrates superior performance to SOTA methods in various heart rate estimation benchmarks.

Title: The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?

Authors: Qinyu Zhao, Ming Xu, Kartik Gupta, Akshay Asthana, Liang Zheng, Stephen Gould
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2403.09037
Pdf URL: https://arxiv.org/pdf/2403.09037
Copy Paste: [[2403.09037]] The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?(https://arxiv.org/abs/2403.09037)
Keywords: attack
Abstract: Large vision-language models (LVLMs), designed to interpret and respond to human instructions, occasionally generate hallucinated or harmful content due to inappropriate instructions. This study uses linear probing to shed light on the hidden knowledge at the output layer of LVLMs. We demonstrate that the logit distributions of the first tokens contain sufficient information to determine whether to respond to the instructions, including recognizing unanswerable visual questions, defending against multi-modal jailbreaking attacks, and identifying deceptive questions. Such hidden knowledge is gradually lost in logits of subsequent tokens during response generation. Then, we illustrate a simple decoding strategy at the generation of the first token, effectively improving the generated content. In experiments, we find a few interesting insights: First, the CLIP model already contains a strong signal for solving these tasks, indicating potential bias in the existing datasets. Second, we observe performance improvement by utilizing the first logit distributions on three additional tasks, including indicating uncertainty in math solving, mitigating hallucination, and image classification. Last, with the same training data, simply finetuning LVLMs improves models' performance but is still inferior to linear probing on these tasks.

Title: Taming Cross-Domain Representation Variance in Federated Prototype Learning with Heterogeneous Data Domains

Authors: Lei Wang, Jieming Bian, Letian Zhang, Chen Chen, Jie Xu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2403.09048
Pdf URL: https://arxiv.org/pdf/2403.09048
Copy Paste: [[2403.09048]] Taming Cross-Domain Representation Variance in Federated Prototype Learning with Heterogeneous Data Domains(https://arxiv.org/abs/2403.09048)
Keywords: privacy, federate
Abstract: Federated learning (FL) allows collaborative machine learning training without sharing private data. While most FL methods assume identical data domains across clients, real-world scenarios often involve heterogeneous data domains. Federated Prototype Learning (FedPL) addresses this issue, using mean feature vectors as prototypes to enhance model generalization. However, existing FedPL methods create the same number of prototypes for each client, leading to cross-domain performance gaps and disparities for clients with varied data distributions. To mitigate cross-domain feature representation variance, we introduce FedPLVM, which establishes variance-aware dual-level prototypes clustering and employs a novel $\alpha$-sparsity prototype loss. The dual-level prototypes clustering strategy creates local clustered prototypes based on private data features, then performs global prototypes clustering to reduce communication complexity and preserve local data privacy. The $\alpha$-sparsity prototype loss aligns samples from underrepresented domains, enhancing intra-class similarity and reducing inter-class similarity. Evaluations on Digit-5, Office-10, and DomainNet datasets demonstrate our method's superiority over existing approaches.
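
Illustration (not from the paper): the basic FedPL building block named in the abstract is a prototype, the mean feature vector of a class. The sketch below computes one prototype per class on a single client. It does not implement FedPLVM's dual-level clustering or the $\alpha$-sparsity loss, and the function name and data are hypothetical.

```python
def class_prototypes(features, labels):
    """Return {label: mean feature vector} for one client's local features."""
    sums = {}
    counts = {}
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


if __name__ == "__main__":
    feats = [[0.0, 0.0], [2.0, 2.0], [1.0, 3.0]]
    labs = [0, 0, 1]
    print(class_prototypes(feats, labs))
```

Only these per-class averages, rather than raw samples, would be shared with the server in a prototype-based scheme.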

Title: Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference

Authors: Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath
Subjects: cs.LG, cs.AI, cs.AR, cs.CL
Abstract URL: https://arxiv.org/abs/2403.09054
Pdf URL: https://arxiv.org/pdf/2403.09054
Copy Paste: [[2403.09054]] Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference(https://arxiv.org/abs/2403.09054)
Keywords: transformer, generative, large language model
Abstract: Transformers have emerged as the underpinning architecture for Large Language Models (LLMs). In generative language models, the inference process involves two primary phases: prompt processing and token generation. Token generation, which constitutes the majority of the computational workload, primarily entails vector-matrix multiplications and interactions with the Key-Value (KV) Cache. This phase is constrained by memory bandwidth due to the overhead of transferring weights and KV cache values from the memory system to the computing units. This memory bottleneck becomes particularly pronounced in applications that require long-context and extensive text generation, both of which are increasingly crucial for LLMs. This paper introduces "Keyformer", an innovative inference-time approach, to mitigate the challenges associated with KV cache size and memory bandwidth utilization. Keyformer leverages the observation that approximately 90% of the attention weight in generative inference focuses on a specific subset of tokens, referred to as "key" tokens. Keyformer retains only the key tokens in the KV cache by identifying these crucial tokens using a novel score function. This approach effectively reduces both the KV cache size and memory bandwidth usage without compromising model accuracy. We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT, which employ various positional embedding algorithms. Our assessment encompasses a variety of tasks, with a particular emphasis on summarization and conversation tasks involving extended contexts. Keyformer's reduction of KV cache reduces inference latency by 2.1x and improves token generation throughput by 2.4x, while preserving the model's accuracy.
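
To make the cache-pruning idea concrete, here is a minimal sketch that keeps only the highest-scoring tokens in a KV cache. It assumes a simple stand-in score (attention weight accumulated per token); the abstract does not specify Keyformer's novel score function, so this is not that function. The names `select_key_tokens` and `prune_kv_cache` and the toy values are hypothetical.

```python
def select_key_tokens(accumulated_attention, budget):
    """Return sorted indices of the `budget` tokens with the highest score.

    Assumption: the score is accumulated attention weight per token, a
    stand-in for the score function proposed in the paper.
    """
    ranked = sorted(
        range(len(accumulated_attention)),
        key=lambda i: accumulated_attention[i],
        reverse=True,
    )
    # Keep original token order so positions stay monotonic in the cache.
    return sorted(ranked[:budget])


def prune_kv_cache(keys, values, accumulated_attention, budget):
    """Drop every cache entry whose token is not among the key tokens."""
    keep = select_key_tokens(accumulated_attention, budget)
    return [keys[i] for i in keep], [values[i] for i in keep], keep


# Toy cache with five tokens; the scores are made up for illustration.
keys = ["k0", "k1", "k2", "k3", "k4"]
values = ["v0", "v1", "v2", "v3", "v4"]
scores = [0.5, 0.05, 0.3, 0.05, 0.1]
pruned_keys, pruned_values, kept = prune_kv_cache(keys, values, scores, budget=2)
```

With these toy scores the two retained tokens carry most of the attention mass, which mirrors the observation in the abstract that a small subset of tokens receives the bulk of the attention weight.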

Title: StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control

Authors: Jaerin Lee, Daniel Sungho Jung, Kanggeon Lee, Kyoung Mu Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09055
Pdf URL: https://arxiv.org/pdf/2403.09055
Copy Paste: [[2403.09055]] StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control(https://arxiv.org/abs/2403.09055)
Keywords: diffusion
Abstract: The enormous success of diffusion models in text-to-image synthesis has made them promising candidates for the next generation of end-user applications for image generation and editing. Previous works have focused on improving the usability of diffusion models by reducing the inference time or increasing user interactivity by allowing new, fine-grained controls such as region-based text prompts. However, we empirically find that integrating both branches of works is nontrivial, limiting the potential of diffusion models. To solve this incompatibility, we present StreamMultiDiffusion, the first real-time region-based text-to-image generation framework. By stabilizing fast inference techniques and restructuring the model into a newly proposed multi-prompt stream batch architecture, we achieve $\times 10$ faster panorama generation than existing solutions, and the generation speed of 1.57 FPS in region-based text-to-image synthesis on a single RTX 2080 Ti GPU. Our solution opens up a new paradigm for interactive image generation named semantic palette, where high-quality images are generated in real-time from given multiple hand-drawn regions, encoding prescribed semantic meanings (e.g., eagle, girl). Our code and demo application are available at https://github.com/ironjr/StreamMultiDiffusion.

Title: LAMP: A Language Model on the Map

Authors: Pasquale Balsebre, Weiming Huang, Gao Cong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.09059
Pdf URL: https://arxiv.org/pdf/2403.09059
Copy Paste: [[2403.09059]] LAMP: A Language Model on the Map(https://arxiv.org/abs/2403.09059)
Keywords: large language model
Abstract: Large Language Models (LLMs) are poised to play an increasingly important role in our lives, providing assistance across a wide array of tasks. In the geospatial domain, LLMs have demonstrated the ability to answer generic questions, such as identifying a country's capital; nonetheless, their utility is hindered when it comes to answering fine-grained questions about specific places, such as grocery stores or restaurants, which constitute essential aspects of people's everyday lives. This is mainly because the places in our cities haven't been systematically fed into LLMs, so as to understand and memorize them. This study introduces a novel framework for fine-tuning a pre-trained model on city-specific data, to enable it to provide accurate recommendations, while minimizing hallucinations. We share our model, LAMP, and the data used to train it. We conduct experiments to analyze its ability to correctly retrieve spatial objects, and compare it to well-known open- and closed-source language models, such as GPT-4. Finally, we explore its emerging capabilities through a case study on day planning.

Title: Distribution and Depth-Aware Transformers for 3D Human Mesh Recovery

Authors: Jerrin Bright, Bavesh Balaji, Harish Prakash, Yuhao Chen, David A Clausi, John Zelek
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.09063
Pdf URL: https://arxiv.org/pdf/2403.09063
Copy Paste: [[2403.09063]] Distribution and Depth-Aware Transformers for 3D Human Mesh Recovery(https://arxiv.org/abs/2403.09063)
Keywords: robust, transformer
Abstract: Precise Human Mesh Recovery (HMR) with in-the-wild data is a formidable challenge and is often hindered by depth ambiguities and reduced precision. Existing works resort to either pose priors or multi-modal data such as multi-view or point cloud information, though their methods often overlook the valuable scene-depth information inherently present in a single image. Moreover, achieving robust HMR for out-of-distribution (OOD) data is exceedingly challenging due to inherent variations in pose, shape and depth. Consequently, understanding the underlying distribution becomes a vital subproblem in modeling human forms. Motivated by the need for unambiguous and robust human modeling, we introduce Distribution and depth-aware human mesh recovery (D2A-HMR), an end-to-end transformer architecture meticulously designed to minimize the disparity between distributions and incorporate scene-depth leveraging prior depth information. Our approach demonstrates superior performance in handling OOD data in certain scenarios while consistently achieving competitive results against state-of-the-art HMR methods on controlled datasets.

Title: When Semantic Segmentation Meets Frequency Aliasing

Authors: Linwei Chen, Lin Gu, Ying Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09065
Pdf URL: https://arxiv.org/pdf/2403.09065
Copy Paste: [[2403.09065]] When Semantic Segmentation Meets Frequency Aliasing(https://arxiv.org/abs/2403.09065)
Keywords: segmentation
Abstract: Despite recent advancements in semantic segmentation, where and what pixels are hard to segment remains largely unexplored. Existing research only separates an image into easy and hard regions and empirically observes the latter are associated with object boundaries. In this paper, we conduct a comprehensive analysis of hard pixel errors, categorizing them into three types: false responses, merging mistakes, and displacements. Our findings reveal a quantitative association between hard pixels and aliasing, which is distortion caused by the overlapping of frequency components in the Fourier domain during downsampling. To identify the frequencies responsible for aliasing, we propose using the equivalent sampling rate to calculate the Nyquist frequency, which marks the threshold for aliasing. Then, we introduce the aliasing score as a metric to quantify the extent of aliasing. While positively correlated with the proposed aliasing score, three types of hard pixels exhibit different patterns. Here, we propose two novel de-aliasing filter (DAF) and frequency mixing (FreqMix) modules to alleviate aliasing degradation by accurately removing or adjusting frequencies higher than the Nyquist frequency. The DAF precisely removes the frequencies responsible for aliasing before downsampling, while the FreqMix dynamically selects high-frequency components within the encoder block. Experimental results demonstrate consistent improvements in semantic segmentation and low-light instance segmentation tasks. The code is available at: https://github.com/Linwei-Chen/Seg-Aliasing.
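
The link between downsampling and the Nyquist threshold can be illustrated with a one-dimensional analogue. The sketch below assumes a signal sampled once per unit and then downsampled with a given stride, so the Nyquist frequency after downsampling is 1/(2 * stride) cycles per sample. The function `aliasing_ratio` (share of non-DC spectral power above that threshold) is a hypothetical stand-in and is not the aliasing score defined in the paper.

```python
import cmath
import math


def power_spectrum(signal):
    """One-sided power spectrum of a real signal via a direct DFT."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def aliasing_ratio(signal, stride):
    """Share of non-DC power above the post-downsampling Nyquist frequency.

    Assumption: frequency bin k corresponds to k / n cycles per sample, so
    the Nyquist frequency 1 / (2 * stride) falls at bin n / (2 * stride).
    """
    n = len(signal)
    spectrum = power_spectrum(signal)
    nyquist_bin = n / (2 * stride)
    total = sum(spectrum[1:])  # skip the DC component
    above = sum(p for k, p in enumerate(spectrum) if k > nyquist_bin)
    return above / total if total > 0 else 0.0


n = 32
low = [math.cos(2 * math.pi * 2 * t / n) for t in range(n)]    # bin 2
high = [math.cos(2 * math.pi * 12 * t / n) for t in range(n)]  # bin 12
low_ratio = aliasing_ratio(low, stride=2)    # Nyquist at bin 8, nothing above
high_ratio = aliasing_ratio(high, stride=2)  # bin 12 lies above bin 8
```

A filter in the spirit of the DAF would remove the components above the threshold before downsampling; the sketch only measures them.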

Title: UniCode: Learning a Unified Codebook for Multimodal Large Language Models

Authors: Sipeng Zheng, Bohan Zhou, Yicheng Feng, Ye Wang, Zongqing Lu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2403.09072
Pdf URL: https://arxiv.org/pdf/2403.09072
Copy Paste: [[2403.09072]] UniCode: Learning a Unified Codebook for Multimodal Large Language Models(https://arxiv.org/abs/2403.09072)
Keywords: large language model
Abstract: In this paper, we propose UniCode, a novel approach within the domain of multimodal large language models (MLLMs) that learns a unified codebook to efficiently tokenize visual, text, and potentially other types of signals. This innovation addresses a critical limitation in existing MLLMs: their reliance on a text-only codebook, which restricts MLLM's ability to generate images and texts in a multimodal context. Towards this end, we propose a language-driven iterative training paradigm, coupled with an in-context pre-training task we term "image decompression", enabling our model to interpret compressed visual data and generate high-quality images. The unified codebook empowers our model to extend visual instruction tuning to non-linguistic generation tasks. Moreover, UniCode is adaptable to diverse stacked quantization approaches in order to compress visual signals into a more compact token representation. Despite using significantly fewer parameters and less data during training, UniCode demonstrates promising capabilities in visual reconstruction and generation. It also achieves performances comparable to leading MLLMs across a spectrum of VQA benchmarks.

Title: Large Language Models are Parallel Multilingual Learners

Authors: Yongyu Mu, Peinan Feng, Zhiquan Cao, Yuzhang Wu, Bei Li, Chenglong Wang, Tong Xiao, Kai Song, Tongran Liu, Chunliang Zhang, Jingbo Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.09073
Pdf URL: https://arxiv.org/pdf/2403.09073
Copy Paste: [[2403.09073]] Large Language Models are Parallel Multilingual Learners(https://arxiv.org/abs/2403.09073)
Keywords: large language model
Abstract: In this study, we reveal an in-context learning (ICL) capability of multilingual large language models (LLMs): by translating the input to several languages, we provide Parallel Input in Multiple Languages (PiM) to LLMs, which significantly enhances their comprehension abilities. To test this capability, we design extensive experiments encompassing 8 typical datasets, 7 languages and 8 state-of-the-art multilingual LLMs. Experimental results show that (1) incorporating more languages helps PiM surpass the conventional ICL further; (2) even combining with translations that are inferior to baseline performance can also help. Moreover, by examining the activated neurons in LLMs, we discover a counterintuitive but interesting phenomenon. Contrary to the common thought that PiM would activate more neurons than monolingual input to leverage knowledge learned from diverse languages, PiM actually inhibits neurons and promotes more precise neuron activation especially when more languages are added. This phenomenon aligns with the neuroscience insight about synaptic pruning, which removes less used neural connections, strengthens remainders, and then enhances brain intelligence.

Title: Information Extraction: An application to the domain of hyper-local financial data on developing countries

Authors: Abuzar Royesh, Olamide Oladeji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.09077
Pdf URL: https://arxiv.org/pdf/2403.09077
Copy Paste: [[2403.09077]] Information Extraction: An application to the domain of hyper-local financial data on developing countries(https://arxiv.org/abs/2403.09077)
Keywords: extraction, transformer
Abstract: Despite the need for financial data on company activities in developing countries for development research and economic analysis, such data does not exist. In this project, we develop and evaluate two Natural Language Processing (NLP) based techniques to address this issue. First, we curate a custom dataset specific to the domain of financial text data on developing countries and explore multiple approaches for information extraction. We then explore a text-to-text approach with the transformer-based T5 model with the goal of undertaking simultaneous NER and relation extraction. We find that this model is able to learn the custom text structure output data corresponding to the entities and their relations, resulting in an accuracy of 92.44%, a precision of 68.25% and a recall of 54.20% from our best T5 model on the combined task. Secondly, we explore an approach with sequential NER and relation extraction. For the NER, we run pre-trained and fine-tuned models using SpaCy, and we develop a custom relation extraction model using SpaCy's Dependency Parser output and some heuristics to determine entity relationships. We obtain an accuracy of 84.72%, a precision of 6.06% and a recall of 5.57% on this sequential task.
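
The gap between high accuracy and low precision and recall reported above is typical when negatives dominate the evaluation set. The sketch below shows the arithmetic with purely hypothetical confusion-matrix counts. They are not taken from the paper and are chosen only to show how the three metrics can diverge.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts: 900 of 1000 cases are true negatives, so accuracy is
# high even though most predicted positives are wrong.
accuracy, precision, recall = classification_metrics(tp=5, fp=45, fn=50, tn=900)
```

Because true negatives enter only the accuracy formula, a model can score well on accuracy while finding few of the actual entities and relations.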

Title: Ciphertext-Only Attack on a Secure $k$-NN Computation on Cloud

Authors: Shyam Murthy, Santosh Kumar Upadhyaya, Srinivas Vivek
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.09080
Pdf URL: https://arxiv.org/pdf/2403.09080
Copy Paste: [[2403.09080]] Ciphertext-Only Attack on a Secure $k$-NN Computation on Cloud(https://arxiv.org/abs/2403.09080)
Keywords: secure, privacy, protect, attack
Abstract: The rise of cloud computing has spurred a trend of transferring data storage and computational tasks to the cloud. To protect confidential information such as customer data and business details, it is essential to encrypt this sensitive data before cloud storage. Implementing encryption can prevent unauthorized access, data breaches, and the resultant financial loss, reputation damage, and legal issues. Moreover, to facilitate the execution of data mining algorithms on the cloud-stored data, the encryption needs to be compatible with domain computation. The $k$-nearest neighbor ($k$-NN) computation for a specific query vector is widely used in fields like location-based services. Sanyashi et al. (ICISS 2023) proposed an encryption scheme to facilitate privacy-preserving $k$-NN computation on the cloud by utilizing Asymmetric Scalar-Product-Preserving Encryption (ASPE). In this work, we identify a significant vulnerability in the aforementioned encryption scheme of Sanyashi et al. Specifically, we give an efficient algorithm and also empirically demonstrate that their encryption scheme is vulnerable to the ciphertext-only attack (COA).

Title: Meaningful Learning: Advancing Abstract Reasoning in Large Language Models via Generic Fact Guidance

Authors: Kai Xiong, Xiao Ding, Ting Liu, Bing Qin, Dongliang Xu, Qing Yang, Hongtao Liu, Yixin Cao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.09085
Pdf URL: https://arxiv.org/pdf/2403.09085
Copy Paste: [[2403.09085]] Meaningful Learning: Advancing Abstract Reasoning in Large Language Models via Generic Fact Guidance(https://arxiv.org/abs/2403.09085)
Keywords: explainability, large language model
Abstract: Large language models (LLMs) have developed impressive performance and strong explainability across various reasoning scenarios, marking a significant stride towards mimicking human-like intelligence. Despite this, when tasked with simple questions supported by a generic fact, LLMs often fail to provide consistent and precise answers, indicating a deficiency in abstract reasoning abilities. This has sparked a vigorous debate about whether LLMs are genuinely reasoning or merely memorizing. In light of this, we design a preliminary study to quantify and delve into the abstract reasoning abilities of existing LLMs. Our findings reveal a substantial discrepancy between their general reasoning and abstract reasoning performances. To relieve this problem, we tailor an abstract reasoning dataset (AbsR) together with a meaningful learning paradigm to teach LLMs how to leverage generic facts for reasoning purposes. The results show that our approach not only boosts the general reasoning performance of LLMs but also makes considerable strides towards their capacity for abstract reasoning, moving beyond simple memorization or imitation to a more nuanced understanding and application of generic facts.

Title: Learning from straggler clients in federated learning

Authors: Andrew Hard, Antonious M. Girgis, Ehsan Amid, Sean Augenstein, Lara McConnaughey, Rajiv Mathews, Rohan Anil
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.09086
Pdf URL: https://arxiv.org/pdf/2403.09086
Copy Paste: [[2403.09086]] Learning from straggler clients in federated learning(https://arxiv.org/abs/2403.09086)
Keywords: federate
Abstract: How well do existing federated learning algorithms learn from client devices that return model updates with a significant time delay? Is it even possible to learn effectively from clients that report back minutes, hours, or days after being scheduled? We answer these questions by developing Monte Carlo simulations of client latency that are guided by real-world applications. We study synchronous optimization algorithms like FedAvg and FedAdam as well as the asynchronous FedBuff algorithm, and observe that all these existing approaches struggle to learn from severely delayed clients. To improve upon this situation, we experiment with modifications, including distillation regularization and exponential moving averages of model weights. Finally, we introduce two new algorithms, FARe-DUST and FeAST-on-MSG, based on distillation and averaging, respectively. Experiments with the EMNIST, CIFAR-100, and StackOverflow benchmark federated learning tasks demonstrate that our new algorithms outperform existing ones in terms of accuracy for straggler clients, while also providing better trade-offs between training time and total accuracy.
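
One of the modifications mentioned above, an exponential moving average (EMA) of model weights, can be sketched in a few lines. This is a generic EMA update under an assumed decay value, not the FeAST-on-MSG algorithm, whose details the abstract does not give. The name `ema_update` and the toy weights are hypothetical.

```python
def ema_update(ema_weights, new_weights, decay=0.9):
    """Blend the running average with the newest weights.

    `decay` close to 1.0 makes the average change slowly. The default is an
    illustrative assumption, not a value from the paper.
    """
    return [
        decay * e + (1.0 - decay) * w
        for e, w in zip(ema_weights, new_weights)
    ]


# Toy example: average two successive "model updates" with equal weighting.
ema = [1.0, 0.0]
ema = ema_update(ema, [0.0, 1.0], decay=0.5)
```

Averaging of this kind smooths out individual updates, which is one plausible reason it is studied alongside delayed client contributions.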

Title: Desigen: A Pipeline for Controllable Design Template Generation

Authors: Haohan Weng, Danqing Huang, Yu Qiao, Zheng Hu, Chin-Yew Lin, Tong Zhang, C. L. Philip Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09093
Pdf URL: https://arxiv.org/pdf/2403.09093
Copy Paste: [[2403.09093]] Desigen: A Pipeline for Controllable Design Template Generation(https://arxiv.org/abs/2403.09093)
Keywords: diffusion, transformer
Abstract: Templates serve as a good starting point to implement a design (e.g., banner, slide), but they take great effort for designers to create manually. In this paper, we present Desigen, an automatic template creation pipeline which generates background images as well as harmonious layout elements over the background. Different from natural images, a background image should preserve enough non-salient space for the overlaying layout elements. To equip existing advanced diffusion-based models with stronger spatial control, we propose two simple but effective techniques to constrain the saliency distribution and reduce the attention weight in desired regions during the background generation process. Then conditioned on the background, we synthesize the layout with a Transformer-based autoregressive generator. To achieve a more harmonious composition, we propose an iterative inference strategy to adjust the synthesized background and layout in multiple rounds. We constructed a design dataset with more than 40k advertisement banners to verify our approach. Extensive experiments demonstrate that the proposed pipeline generates high-quality templates comparable to human designers. More than a single-page design, we further show an application of presentation generation that outputs a set of theme-consistent slides. The data and code are available at https://whaohan.github.io/desigen.

Title: AI on AI: Exploring the Utility of GPT as an Expert Annotator of AI Publications

Authors: Autumn Toney-Wails, Christian Schoeberl, James Dunham
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.09097
Pdf URL: https://arxiv.org/pdf/2403.09097
Copy Paste: [[2403.09097]] AI on AI: Exploring the Utility of GPT as an Expert Annotator of AI Publications(https://arxiv.org/abs/2403.09097)
Keywords: transformer
Abstract: Identifying scientific publications that are within a dynamic field of research often requires costly annotation by subject-matter experts. Resources like widely-accepted classification criteria or field taxonomies are unavailable for a domain like artificial intelligence (AI), which spans emerging topics and technologies. We address these challenges by inferring a functional definition of AI research from existing expert labels, and then evaluating state-of-the-art chatbot models on the task of expert data annotation. Using the arXiv publication database as ground-truth, we experiment with prompt engineering for GPT chatbot models to identify an alternative, automated expert annotation pipeline that assigns AI labels with 94% accuracy. For comparison, we fine-tune SPECTER, a transformer language model pre-trained on scientific publications, that achieves 96% accuracy (only 2% higher than GPT) on classifying AI publications. Our results indicate that with effective prompt engineering, chatbots can be used as reliable data annotators even where subject-area expertise is required. To evaluate the utility of chatbot-annotated datasets on downstream classification tasks, we train a new classifier on GPT-labeled data and compare its performance to the arXiv-trained model. The classifier trained on GPT-labeled data outperforms the arXiv-trained model by nine percentage points, achieving 82% accuracy.

Title: Soften to Defend: Towards Adversarial Robustness via Self-Guided Label Refinement

Authors: Daiwei Yu, Zhuorong Li, Lina Wei, Canghong Jin, Yun Zhang, Sixian Chan
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2403.09101
Pdf URL: https://arxiv.org/pdf/2403.09101
Copy Paste: [[2403.09101]] Soften to Defend: Towards Adversarial Robustness via Self-Guided Label Refinement(https://arxiv.org/abs/2403.09101)
Keywords: attack, robust
Abstract: Adversarial training (AT) is currently one of the most effective ways to obtain the robustness of deep neural networks against adversarial attacks. However, most AT methods suffer from robust overfitting, i.e., a significant generalization gap in adversarial robustness between the training and testing curves. In this paper, we first identify a connection between robust overfitting and the excessive memorization of noisy labels in AT from a view of gradient norm. As such label noise is mainly caused by a distribution mismatch and improper label assignments, we are motivated to propose a label refinement approach for AT. Specifically, our Self-Guided Label Refinement first self-refines a more accurate and informative label distribution from over-confident hard labels, and then it calibrates the training by dynamically incorporating knowledge from self-distilled models into the current model and thus requiring no external teachers. Empirical results demonstrate that our method can simultaneously boost the standard accuracy and robust performance across multiple benchmark datasets, attack types, and architectures. In addition, we also provide a set of analyses from the perspectives of information theory to dive into our method and suggest the importance of soft labels for robust generalization.

Title: CardioCaps: Attention-based Capsule Network for Class-Imbalanced Echocardiogram Classification

Authors: Hyunkyung Han, Jihyeon Seong, Jaesik Choi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09108
Pdf URL: https://arxiv.org/pdf/2403.09108
Copy Paste: [[2403.09108]] CardioCaps: Attention-based Capsule Network for Class-Imbalanced Echocardiogram Classification(https://arxiv.org/abs/2403.09108)
Keywords: robust
Abstract: Capsule Neural Networks (CapsNets) are a novel architecture that utilizes vector-wise representations formed by multiple neurons. Specifically, the Dynamic Routing CapsNets (DR-CapsNets) employ an affine matrix and dynamic routing mechanism to train capsules and acquire translation-equivariance properties, enhancing their robustness compared to traditional Convolutional Neural Networks (CNNs). Echocardiograms, which capture moving images of the heart, present unique challenges for traditional image classification methods. In this paper, we explore the potential of DR-CapsNets and propose CardioCaps, a novel attention-based DR-CapsNet architecture for class-imbalanced echocardiogram classification. CardioCaps comprises two key components: a weighted margin loss incorporating a regression auxiliary loss and an attention mechanism. First, the weighted margin loss prioritizes positive cases, supplemented by an auxiliary loss function based on the Ejection Fraction (EF) regression task, a crucial measure of cardiac function. This approach enhances the model's resilience in the face of class imbalance. Second, recognizing the quadratic complexity of dynamic routing leading to training inefficiencies, we adopt the attention mechanism as a more computationally efficient alternative. Our results demonstrate that CardioCaps surpasses traditional machine learning baseline methods, including Logistic Regression, Random Forest, and XGBoost with sampling methods and a class weight matrix. Furthermore, CardioCaps outperforms other deep learning baseline methods such as CNNs, ResNets, U-Nets, and ViTs, as well as advanced CapsNets methods such as EM-CapsNets and Efficient-CapsNets. Notably, our model demonstrates robustness to class imbalance, achieving high precision even in datasets with a substantial proportion of negative cases.

Title: Graph-Based DDoS Attack Detection in IoT Systems with Lossy Network

Authors: Arvin Hekmati, Bhaskar Krishnamachari
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.09118
Pdf URL: https://arxiv.org/pdf/2403.09118
Copy Paste: [[2403.09118]] Graph-Based DDoS Attack Detection in IoT Systems with Lossy Network(https://arxiv.org/abs/2403.09118)
Keywords: security, attack, robust
Abstract: This study introduces a robust solution for the detection of Distributed Denial of Service (DDoS) attacks in Internet of Things (IoT) systems, leveraging the capabilities of Graph Convolutional Networks (GCN). By conceptualizing IoT devices as nodes within a graph structure, we present a detection mechanism capable of operating efficiently even in lossy network environments. We introduce various graph topologies for modeling IoT networks and evaluate them for detecting tunable futuristic DDoS attacks. By studying different levels of network connection loss and various attack situations, we demonstrate that the correlation-based hybrid graph structure is effective in spotting DDoS attacks, substantiating its good performance even in lossy network scenarios. The results indicate a remarkable performance of the GCN-based DDoS detection model with an F1 score of up to 91%. Furthermore, we observe at most a 2% drop in F1-score in environments with up to 50% connection loss. The findings from this study highlight the advantages of utilizing GCN for the security of IoT systems which benefit from high detection accuracy while being resilient to connection disruption.

Title: Single Domain Generalization for Crowd Counting

Authors: Zhuoxuan Peng, S.-H. Gary Chan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09124
Pdf URL: https://arxiv.org/pdf/2403.09124
Copy Paste: [[2403.09124]] Single Domain Generalization for Crowd Counting(https://arxiv.org/abs/2403.09124)
Keywords: robust, segmentation
Abstract: Current image-based crowd counting widely employs density map regression due to its promising results. However, the method often suffers from severe performance degradation when tested on data from unseen scenarios. To address this so-called "domain shift" problem, we investigate single domain generalization (SDG) for crowd counting. The existing SDG approaches are mainly for classification and segmentation, and can hardly be extended to our case due to its regression nature and label ambiguity (i.e., ambiguous pixel-level ground truths). We propose MPCount, a novel SDG approach effective even for narrow source distribution. Reconstructing diverse features for density map regression with a single memory bank, MPCount retains only domain-invariant representations using a content error mask and attention consistency loss. It further introduces patch-wise classification as an auxiliary task to boost the robustness of density prediction to achieve highly accurate labels. Through extensive experiments on different datasets, MPCount is shown to significantly improve counting accuracy compared to the state of the art under diverse scenarios unobserved in the training data of narrow source distribution. Code is available at https://github.com/Shimmer93/MPCount.

Title: Rethinking Referring Object Removal

Authors: Xiangtian Xue, Jiasong Wu, Youyong Kong, Lotfi Senhadji, Huazhong Shu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09128
Pdf URL: https://arxiv.org/pdf/2403.09128
Copy Paste: [[2403.09128]] Rethinking Referring Object Removal(https://arxiv.org/abs/2403.09128)
Keywords: diffusion, segmentation
Abstract: Referring object removal refers to removing the specific object in an image referred to by natural language expressions and filling the missing region with reasonable semantics. To address this task, we construct the ComCOCO, a synthetic dataset consisting of 136,495 referring expressions for 34,615 objects in 23,951 image pairs. Each pair contains an image with referring expressions and the ground truth after elimination. We further propose an end-to-end syntax-aware hybrid mapping network with an encoding-decoding structure. Linguistic features are hierarchically extracted at the syntactic level and fused in the downsampling process of visual features with multi-head attention. The feature-aligned pyramid network is leveraged to generate segmentation masks and replace internal pixels with region affinity learned from external semantics in high-level feature maps. Extensive experiments demonstrate that our model outperforms diffusion models and two-stage methods which process the segmentation and inpainting task separately by a significant margin.

Title: ProSwitch: Knowledge-Guided Language Model Fine-Tuning to Generate Professional and Non-Professional Styled Text

Authors: Chang Zong, Yuyan Chen, Weiming Lu, Jian Shao, Yueting Zhuang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.09131
Pdf URL: https://arxiv.org/pdf/2403.09131
Copy Paste: [[2403.09131]] ProSwitch: Knowledge-Guided Language Model Fine-Tuning to Generate Professional and Non-Professional Styled Text(https://arxiv.org/abs/2403.09131)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated efficacy in various linguistic applications, including text summarization and controlled text generation. However, their capacity to switch between styles via fine-tuning remains underexplored. This study concentrates on textual professionalism and introduces a novel methodology, named ProSwitch, which equips a language model with the ability to produce both professional and non-professional responses through knowledge-guided instruction tuning. ProSwitch unfolds across three phases: data preparation for gathering domain knowledge and training corpus; instruction tuning for optimizing language models with multiple levels of instruction formats; and comprehensive evaluation for assessing the professionalism discrimination and reference-based quality of generated text. Comparative analysis of ProSwitch against both general and specialized language models reveals that our approach outperforms baselines in switching between professional and non-professional text generation.

Title: Metadata-Driven Federated Learning of Connectional Brain Templates in Non-IID Multi-Domain Scenarios

Authors: Geng Chen, Qingyue Wang, Islem Rekik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09139
Pdf URL: https://arxiv.org/pdf/2403.09139
Copy Paste: [[2403.09139]] Metadata-Driven Federated Learning of Connectional Brain Templates in Non-IID Multi-Domain Scenarios(https://arxiv.org/abs/2403.09139)
Keywords: privacy, federate
Abstract: A connectional brain template (CBT) is a holistic representation of a population of multi-view brain connectivity graphs, encoding shared patterns and normalizing typical variations across individuals. The federation of CBT learning allows for an inclusive estimation of the representative center of multi-domain brain connectivity datasets in a fully data-preserving manner. However, existing methods overlook the non-independent and identically distributed (non-IID) issue stemming from multi-domain brain connectivity heterogeneity, in which data domains are drawn from different hospitals and imaging modalities. To overcome this limitation, we propose, for the first time, a metadata-driven federated learning framework, called MetaFedCBT, for cross-domain CBT learning. Given the data drawn from a specific domain (i.e., hospital), our model aims to learn metadata in a fully supervised manner by introducing a local client-based regressor network. The generated metadata is forced to meet the statistical attributes (e.g., mean) of other domains, while preserving their privacy. Our supervised metadata generation approach boosts the unsupervised learning of a more centered, representative, and holistic CBT of a particular brain state across diverse domains. As the federated learning progresses over multiple rounds, the learned metadata and associated generated connectivities are continuously updated to better approximate the target domain information. MetaFedCBT overcomes the non-IID issue of existing methods by generating informative brain connectivities for privacy-preserving holistic CBT learning with guidance using metadata. Extensive experiments on multi-view morphological brain networks of normal and patient subjects demonstrate that our MetaFedCBT is a superior federated CBT learning model and significantly advances the state-of-the-art performance.

Title: Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior

Authors: Cheng Chen, Xiaofeng Yang, Fan Yang, Chengzeng Feng, Zhoujie Fu, Chuan-Sheng Foo, Guosheng Lin, Fayao Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09140
Pdf URL: https://arxiv.org/pdf/2403.09140
Copy Paste: [[2403.09140]] Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior(https://arxiv.org/abs/2403.09140)
Keywords: diffusion
Abstract: Recent works on text-to-3D generation show that using only 2D diffusion supervision for 3D generation tends to produce results with inconsistent appearances (e.g., faces on the back view) and inaccurate shapes (e.g., animals with extra legs). Existing methods mainly address this issue by retraining diffusion models with images rendered from 3D data to ensure multi-view consistency while struggling to balance 2D generation quality with 3D consistency. In this paper, we present a new framework Sculpt3D that equips the current pipeline with explicit injection of 3D priors from retrieved reference objects without re-training the 2D diffusion model. Specifically, we demonstrate that high-quality and diverse 3D geometry can be guaranteed by keypoints supervision through a sparse ray sampling approach. Moreover, to ensure accurate appearances of different views, we further modulate the output of the 2D diffusion model to the correct patterns of the template views without altering the generated object's style. These two decoupled designs effectively harness 3D information from reference objects to generate 3D objects while preserving the generation quality of the 2D diffusion model. Extensive experiments show our method can largely improve the multi-view consistency while retaining fidelity and diversity. Our project page is available at: https://stellarcheng.github.io/Sculpt3D/.

Title: Evaluating LLMs for Gender Disparities in Notable Persons

Authors: Lauren Rhue, Sofie Goethals, Arun Sundararajan
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2403.09148
Pdf URL: https://arxiv.org/pdf/2403.09148
Copy Paste: [[2403.09148]] Evaluating LLMs for Gender Disparities in Notable Persons(https://arxiv.org/abs/2403.09148)
Keywords: fair, large language model
Abstract: This study examines the use of Large Language Models (LLMs) for retrieving factual information, addressing concerns over their propensity to produce factually incorrect "hallucinated" responses or to decline to answer a prompt at all. Specifically, it investigates the presence of gender-based biases in LLMs' responses to factual inquiries. This paper takes a multi-pronged approach to evaluating GPT models by evaluating fairness across multiple dimensions of recall, hallucinations and declinations. Our findings reveal discernible gender disparities in the responses generated by GPT-3.5. While advancements in GPT-4 have led to improvements in performance, they have not fully eradicated these gender disparities, notably in instances where responses are declined. The study further explores the origins of these disparities by examining the influence of gender associations in prompts and the homogeneity in the responses.

Title: Basque and Spanish Counter Narrative Generation: Data Creation and Evaluation

Authors: Jaione Bengoetxea, Yi-Ling Chung, Marco Guerini, Rodrigo Agerri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.09159
Pdf URL: https://arxiv.org/pdf/2403.09159
Copy Paste: [[2403.09159]] Basque and Spanish Counter Narrative Generation: Data Creation and Evaluation(https://arxiv.org/abs/2403.09159)
Keywords: generative
Abstract: Counter Narratives (CNs) are non-negative textual responses to Hate Speech (HS) aiming at defusing online hatred and mitigating its spreading across media. Despite the recent increase in HS content posted online, research on automatic CN generation has been relatively scarce and predominantly focused on English. In this paper, we present CONAN-EUS, a new Basque and Spanish dataset for CN generation developed by means of Machine Translation (MT) and professional post-editing. Being a parallel corpus, also with respect to the original English CONAN, it enables novel research on multilingual and crosslingual automatic generation of CNs. Our experiments on CN generation with mT5, a multilingual encoder-decoder model, show that generation greatly benefits from training on post-edited data, as opposed to relying on silver MT data only. These results are confirmed by their correlation with a qualitative manual evaluation, demonstrating that manually revised training data remains crucial for the quality of the generated CNs. Furthermore, multilingual data augmentation improves results over monolingual settings for structurally similar languages such as English and Spanish, while being detrimental for Basque, a language isolate. Similar findings occur in zero-shot crosslingual evaluations, where model transfer (fine-tuning in English and generating in a different target language) outperforms fine-tuning mT5 on machine translated data for Spanish but not for Basque. This provides an interesting insight into the asymmetry in the multilinguality of generative models, a challenging topic which is still open to research.

Title: Unveiling the Generalization Power of Fine-Tuned Large Language Models

Authors: Haoran Yang, Yumeng Zhang, Jiaqi Xu, Hongyuan Lu, Pheng Ann Heng, Wai Lam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.09162
Pdf URL: https://arxiv.org/pdf/2403.09162
Copy Paste: [[2403.09162]] Unveiling the Generalization Power of Fine-Tuned Large Language Models(https://arxiv.org/abs/2403.09162)
Keywords: large language model
Abstract: While Large Language Models (LLMs) have demonstrated exceptional multitasking abilities, fine-tuning these models on downstream, domain-specific datasets is often necessary to yield superior performance on test sets compared to their counterparts without fine-tuning. However, the comprehensive effects of fine-tuning on the LLMs' generalization ability are not fully understood. This paper delves into the differences between original, unmodified LLMs and their fine-tuned variants. Our primary investigation centers on whether fine-tuning affects the generalization ability intrinsic to LLMs. To elaborate on this, we conduct extensive experiments across five distinct language tasks on various datasets. Our main findings reveal that models fine-tuned on generation and classification tasks exhibit dissimilar behaviors in generalizing to different domains and tasks. Intriguingly, we observe that integrating the in-context learning strategy during fine-tuning on generation tasks can enhance the model's generalization ability. Through this systematic investigation, we aim to contribute valuable insights into the evolving landscape of fine-tuning practices for LLMs.

Title: Caveat Lector: Large Language Models in Legal Practice

Authors: Eliza Mik
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2403.09163
Pdf URL: https://arxiv.org/pdf/2403.09163
Copy Paste: [[2403.09163]] Caveat Lector: Large Language Models in Legal Practice(https://arxiv.org/abs/2403.09163)
Keywords: large language model
Abstract: The current fascination with large language models, or LLMs, derives from the fact that many users lack the expertise to evaluate the quality of the generated text. LLMs may therefore appear more capable than they actually are. The dangerous combination of fluency and superficial plausibility leads to the temptation to trust the generated text and creates the risk of overreliance. Who would not trust perfect legalese? Relying on recent findings in both technical and legal scholarship, this Article counterbalances the overly optimistic predictions as to the role of LLMs in legal practice. Integrating LLMs into legal workstreams without a better comprehension of their limitations will create inefficiencies if not outright risks. Notwithstanding their unprecedented ability to generate text, LLMs do not understand text. Without the ability to understand meaning, LLMs will remain unable to use language, to acquire knowledge and to perform complex reasoning tasks. Trained to model language on the basis of stochastic word predictions, LLMs cannot distinguish fact from fiction. Their knowledge of the law is limited to word strings memorized in their parameters. It is also incomplete and largely incorrect. LLMs operate at the level of word distributions, not at the level of verified facts. The resulting propensity to hallucinate, to produce statements that are incorrect but appear helpful and relevant, is alarming in high-risk areas like legal services. At present, lawyers should beware of relying on text generated by LLMs.

Title: Exploring the Comprehension of ChatGPT in Traditional Chinese Medicine Knowledge

Authors: Li Yizhen, Huang Shaohan, Qi Jiaxing, Quan Lei, Han Dongran, Luan Zhongzhi
Subjects: cs.CL, stat.AP
Abstract URL: https://arxiv.org/abs/2403.09164
Pdf URL: https://arxiv.org/pdf/2403.09164
Copy Paste: [[2403.09164]] Exploring the Comprehension of ChatGPT in Traditional Chinese Medicine Knowledge(https://arxiv.org/abs/2403.09164)
Keywords: large language model
Abstract: No previous work has studied the performance of Large Language Models (LLMs) in the context of Traditional Chinese Medicine (TCM), an essential and distinct branch of medical knowledge with a rich history. To bridge this gap, we present a TCM question dataset named TCM-QA, which comprises three question types: single choice, multiple choice, and true or false, to examine the LLM's capacity for knowledge recall and comprehensive reasoning within the TCM domain. In our study, we evaluate two settings of the LLM, zero-shot and few-shot settings, while concurrently discussing the differences between English and Chinese prompts. Our results indicate that ChatGPT performs best in true or false questions, achieving the highest precision of 0.688, while its lowest precision, 0.241, occurs in multiple-choice questions. Furthermore, we observed that Chinese prompts outperformed English prompts in our evaluations. Additionally, we assess the quality of explanations generated by ChatGPT and their potential contribution to TCM knowledge comprehension. This paper offers valuable insights into the applicability of LLMs in specialized domains and paves the way for future research in leveraging these powerful models to advance TCM.

Title: Dial-insight: Fine-tuning Large Language Models with High-Quality Domain-Specific Data Preventing Capability Collapse

Authors: Jianwei Sun, Chaoyang Mei, Linlin Wei, Kaiyu Zheng, Na Liu, Ming Cui, Tianyi Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.09167
Pdf URL: https://arxiv.org/pdf/2403.09167
Copy Paste: [[2403.09167]] Dial-insight: Fine-tuning Large Language Models with High-Quality Domain-Specific Data Preventing Capability Collapse(https://arxiv.org/abs/2403.09167)
Keywords: large language model
Abstract: The efficacy of large language models (LLMs) is heavily dependent on the quality of the underlying data, particularly within specialized domains. A common challenge when fine-tuning LLMs for domain-specific applications is the potential degradation of the model's generalization capabilities. To address these issues, we propose a two-stage approach for the construction of production prompts designed to yield high-quality data. This method involves the generation of a diverse array of prompts that encompass a broad spectrum of tasks and exhibit a rich variety of expressions. Furthermore, we introduce a cost-effective, multi-dimensional quality assessment framework to ensure the integrity of the generated labeling data. Utilizing a dataset comprised of service provider and customer interactions from the real estate sector, we demonstrate a positive correlation between data quality and model performance. Notably, our findings indicate that the domain-specific proficiency of general LLMs can be enhanced through fine-tuning with data produced via our proposed method, without compromising their overall generalization abilities, even when exclusively domain-specific data is employed for fine-tuning.

Title: ADEdgeDrop: Adversarial Edge Dropping for Robust Graph Neural Networks

Authors: Zhaoliang Chen, Zhihao Wu, Ylli Sadikaj, Claudia Plant, Hong-Ning Dai, Shiping Wang, Wenzhong Guo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2403.09171
Pdf URL: https://arxiv.org/pdf/2403.09171
Copy Paste: [[2403.09171]] ADEdgeDrop: Adversarial Edge Dropping for Robust Graph Neural Networks(https://arxiv.org/abs/2403.09171)
Keywords: robust, interpretability
Abstract: Although Graph Neural Networks (GNNs) have exhibited the powerful ability to gather graph-structured information from neighborhood nodes via various message-passing mechanisms, the performance of GNNs is limited by poor generalization and fragile robustness caused by noisy and redundant graph data. As a prominent solution, Graph Augmentation Learning (GAL) has recently received increasing attention. Among prior GAL approaches, edge-dropping methods that randomly remove edges from a graph during training are effective techniques to improve the robustness of GNNs. However, randomly dropping edges often results in bypassing critical edges, consequently weakening the effectiveness of message passing. In this paper, we propose a novel adversarial edge-dropping method (ADEdgeDrop) that leverages an adversarial edge predictor guiding the removal of edges, which can be flexibly incorporated into diverse GNN backbones. Employing an adversarial training framework, the edge predictor utilizes the line graph transformed from the original graph to estimate the edges to be dropped, which improves the interpretability of the edge-dropping method. The proposed ADEdgeDrop is optimized alternately by stochastic gradient descent and projected gradient descent. Comprehensive experiments on six graph benchmark datasets demonstrate that the proposed ADEdgeDrop outperforms state-of-the-art baselines across various GNN backbones, demonstrating improved generalization and robustness.
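
For orientation, the random edge-dropping baseline that the abstract contrasts against can be sketched as below. This is a minimal illustration of uniform random edge removal only, not the adversarial ADEdgeDrop method; the function name `drop_edges_randomly`, the edge-list representation, and the `drop_rate` parameter are assumptions made for this sketch.

```python
import random


def drop_edges_randomly(edges, drop_rate, seed=None):
    """Return a new edge list in which each edge is removed independently
    with probability `drop_rate` (hypothetical helper, baseline only)."""
    if not 0.0 <= drop_rate <= 1.0:
        raise ValueError("drop_rate must be in [0, 1]")
    rng = random.Random(seed)
    # Keep an edge when its random draw falls at or above the drop rate.
    return [edge for edge in edges if rng.random() >= drop_rate]


# Toy graph given as (source, target) pairs.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
kept = drop_edges_randomly(edges, drop_rate=0.4, seed=0)
```

Because every edge is treated alike, a draw of this kind can remove an edge that matters for message passing, which is the weakness the abstract attributes to random dropping.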

Title: SHAN: Object-Level Privacy Detection via Inference on Scene Heterogeneous Graph

Authors: Zhuohang Jiang, Bingkui Tong, Xia Du, Ahmed Alhammadi, Jizhe Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09172
Pdf URL: https://arxiv.org/pdf/2403.09172
Copy Paste: [[2403.09172]] SHAN: Object-Level Privacy Detection via Inference on Scene Heterogeneous Graph(https://arxiv.org/abs/2403.09172)
Keywords: privacy, protect, interpretability
Abstract: With the rise of social platforms, protecting privacy has become an important issue. Privacy object detection aims to accurately locate private objects in images. It is the foundation of safeguarding individuals' privacy rights and ensuring responsible data handling practices in the digital age. Since the privacy of an object is not shift-invariant, the essence of the privacy object detection task is inferring object privacy based on scene information. However, privacy object detection has long been studied as a subproblem of common object detection tasks. Therefore, existing methods suffer from serious deficiencies in accuracy, generalization, and interpretability. Moreover, creating large-scale privacy datasets is difficult due to legal constraints, and existing privacy datasets lack label granularity. The granularity of existing privacy detection methods remains limited to the image level. To address the above two issues, we introduce two benchmark datasets for object-level privacy detection and propose SHAN, Scene Heterogeneous graph Attention Network, a model that constructs a scene heterogeneous graph from an image and utilizes self-attention mechanisms for scene inference to obtain object privacy. Through experiments, we demonstrated that SHAN performs excellently in privacy object detection tasks, with all metrics surpassing those of the baseline model.

Title: Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts

Authors: Byeongjun Park, Hyojun Go, Jin-Young Kim, Sangmin Woo, Seokil Ham, Changick Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09176
Pdf URL: https://arxiv.org/pdf/2403.09176
Copy Paste: [[2403.09176]] Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts(https://arxiv.org/abs/2403.09176)
Keywords: diffusion, transformer, generative
Abstract: Diffusion models have achieved remarkable success across a range of generative tasks. Recent efforts to enhance diffusion model architectures have reimagined them as a form of multi-task learning, where each task corresponds to a denoising task at a specific noise level. While these efforts have focused on parameter isolation and task routing, they fall short of capturing detailed inter-task relationships and risk losing semantic information, respectively. In response, we introduce Switch Diffusion Transformer (Switch-DiT), which establishes inter-task relationships between conflicting tasks without compromising semantic information. To achieve this, we employ a sparse mixture-of-experts within each transformer block to utilize semantic information and facilitate handling conflicts in tasks through parameter isolation. Additionally, we propose a diffusion prior loss, encouraging similar tasks to share their denoising paths while isolating conflicting ones. Through these, each transformer block contains a shared expert across all tasks, where the common and task-specific denoising paths enable the diffusion model to construct its beneficial way of synergizing denoising tasks. Extensive experiments validate the effectiveness of our approach in improving both image quality and convergence rate, and further analysis demonstrates that Switch-DiT constructs tailored denoising paths across various generation scenarios.

Title: Generalized Relevance Learning Grassmann Quantization

Authors: M. Mohammadi, M. Babai, M.H.F. Wilkinson
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.09183
Pdf URL: https://arxiv.org/pdf/2403.09183
Copy Paste: [[2403.09183]] Generalized Relevance Learning Grassmann Quantization(https://arxiv.org/abs/2403.09183)
Keywords: robust
Abstract: Due to advancements in digital cameras, it is easy to gather multiple images (or videos) of an object under different conditions. Therefore, image-set classification has attracted more attention, and different solutions were proposed to model them. A popular way to model image sets is subspaces, which form a manifold called the Grassmann manifold. In this contribution, we extend the application of Generalized Relevance Learning Vector Quantization to deal with the Grassmann manifold. The proposed model returns a set of prototype subspaces and a relevance vector. While prototypes model typical behaviours within classes, the relevance factors specify the most discriminative principal vectors (or images) for the classification task. They both provide insights into the model's decisions by highlighting influential images and pixels for predictions. Moreover, due to learning prototypes, the model complexity of the new method during inference is independent of dataset size, unlike previous works. We applied it to several recognition tasks including handwritten digit recognition, face recognition, activity recognition, and object recognition. Experiments demonstrate that it outperforms previous works with lower complexity and can successfully model the variation, such as handwritten style or lighting conditions. Moreover, the presence of relevances makes the model robust to the selection of subspaces' dimensionality.

Title: Intention-aware Denoising Diffusion Model for Trajectory Prediction

Authors: Chen Liu, Shibo He, Haoyu Liu, Jiming Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.09190
Pdf URL: https://arxiv.org/pdf/2403.09190
Copy Paste: [[2403.09190]] Intention-aware Denoising Diffusion Model for Trajectory Prediction(https://arxiv.org/abs/2403.09190)
Keywords: diffusion, generative
Abstract: Trajectory prediction is an essential component in autonomous driving, particularly for collision avoidance systems. Considering the inherent uncertainty of the task, numerous studies have utilized generative models to produce multiple plausible future trajectories for each agent. However, most of them suffer from restricted representation ability or unstable training issues. To overcome these limitations, we propose utilizing the diffusion model to generate the distribution of future trajectories. Two cruxes are to be settled to realize such an idea. First, the diversity of intention is intertwined with the uncertain surroundings, making the true distribution hard to parameterize. Second, the diffusion process is time-consuming during the inference phase, rendering it unrealistic to implement in a real-time driving system. We propose an Intention-aware denoising Diffusion Model (IDM), which tackles the above two problems. We decouple the original uncertainty into intention uncertainty and action uncertainty and model them with two dependent diffusion processes. To decrease the inference time, we reduce the variable dimensions in the intention-aware diffusion process and restrict the initial distribution of the action-aware diffusion process, which leads to fewer diffusion steps. To validate our approach, we conduct experiments on the Stanford Drone Dataset (SDD) and ETH/UCY dataset. Our methods achieve state-of-the-art results, with an FDE of 13.83 pixels on the SDD dataset and 0.36 meters on the ETH/UCY dataset. Compared with the original diffusion model, IDM reduces inference time by two-thirds. Interestingly, our experiments further reveal that introducing intention information is beneficial in modeling the diffusion process of fewer steps.
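
The FDE figures quoted above refer to final displacement error, the Euclidean distance between the predicted and true end points of a trajectory. A minimal sketch follows; the function names are hypothetical, and the best-of-K variant is shown only because it is a common convention when a model produces several candidate futures. The abstract does not state which evaluation protocol the authors used.

```python
import math


def final_displacement_error(predicted, ground_truth):
    """Euclidean distance between the last predicted point and the last
    ground-truth point. Trajectories are sequences of (x, y) pairs."""
    px, py = predicted[-1]
    gx, gy = ground_truth[-1]
    return math.hypot(px - gx, py - gy)


def min_fde(candidates, ground_truth):
    """Best-of-K FDE: the smallest FDE among K candidate trajectories
    (an assumed protocol, not confirmed by the abstract)."""
    return min(final_displacement_error(c, ground_truth) for c in candidates)


truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
candidates = [
    [(0.0, 0.0), (1.0, 0.0), (5.0, 6.0)],  # ends 5 units from the true end point
    [(0.0, 0.0), (1.0, 1.0), (2.0, 3.0)],  # ends 1 unit from the true end point
]
best = min_fde(candidates, truth)
```

The unit of the result follows the coordinate system of the data, which is why the abstract reports pixels for SDD and meters for ETH/UCY.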

Title: PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation

Authors: Yizhe Xiong, Hui Chen, Tianxiang Hao, Zijia Lin, Jungong Han, Yuesong Zhang, Guoxin Wang, Yongjun Bao, Guiguang Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09192
Pdf URL: https://arxiv.org/pdf/2403.09192
Copy Paste: [[2403.09192]] PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation(https://arxiv.org/abs/2403.09192)
Keywords: transformer
Abstract: Recently, the scale of transformers has grown rapidly, which introduces considerable challenges in terms of training overhead and inference efficiency in the scope of task adaptation. Existing works, namely Parameter-Efficient Fine-Tuning (PEFT) and model compression, have separately investigated the challenges. However, PEFT cannot guarantee the inference efficiency of the original backbone, especially for large-scale models. Model compression requires significant training costs for structure searching and re-training. Consequently, a simple combination of them cannot guarantee accomplishing both training efficiency and inference efficiency with minimal costs. In this paper, we propose a novel Parallel Yielding Re-Activation (PYRA) method for such a challenge of training-inference efficient task adaptation. PYRA first utilizes parallel yielding adaptive weights to comprehensively perceive the data distribution in downstream tasks. A re-activation strategy for token modulation is then applied for tokens to be merged, leading to calibrated token features. Extensive experiments demonstrate that PYRA outperforms all competing methods under both low compression rate and high compression rate, demonstrating its effectiveness and superiority in maintaining both training efficiency and inference efficiency for large-scale foundation models. Our code will be released to the public.

Title: Intention-driven Ego-to-Exo Video Generation

Authors: Hongchen Luo, Kai Zhu, Wei Zhai, Yang Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09194
Pdf URL: https://arxiv.org/pdf/2403.09194
Copy Paste: [[2403.09194]] Intention-driven Ego-to-Exo Video Generation(https://arxiv.org/abs/2403.09194)
Keywords: diffusion
Abstract: Ego-to-exo video generation refers to generating the corresponding exocentric video according to the egocentric video, providing valuable applications in AR/VR and embodied AI. Benefiting from advancements in diffusion model techniques, notable progress has been achieved in video generation. However, existing methods build upon the spatiotemporal consistency assumptions between adjacent frames, which cannot be satisfied in the ego-to-exo scenarios due to drastic changes in views. To this end, this paper proposes an Intention-Driven Ego-to-exo video generation framework (IDE) that leverages action intention consisting of human movement and action description as view-independent representation to guide video generation, preserving the consistency of content and motion. Specifically, the egocentric head trajectory is first estimated through multi-view stereo matching. Then, a cross-view feature perception module is introduced to establish correspondences between exo- and ego- views, guiding the trajectory transformation module to infer human full-body movement from the head trajectory. Meanwhile, we present an action description unit that maps the action semantics into the feature space consistent with the exocentric image. Finally, the inferred human movement and high-level action descriptions jointly guide the generation of exocentric motion and interaction content (i.e., corresponding optical flow and occlusion maps) in the backward process of the diffusion model, ultimately warping them into the corresponding exocentric video. We conduct extensive experiments on the relevant dataset with diverse exo-ego video pairs, and our IDE outperforms state-of-the-art models in both subjective and objective assessments, demonstrating its efficacy in ego-to-exo video generation.

Title: SAM-Lightening: A Lightweight Segment Anything Model with Dilated Flash Attention to Achieve 30 times Acceleration

Authors: Yanfei Songa, Bangzheng Pua, Peng Wanga, Hongxu Jiang, Dong Donga, Yiqing Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09195
Pdf URL: https://arxiv.org/pdf/2403.09195
Copy Paste: [[2403.09195]] SAM-Lightening: A Lightweight Segment Anything Model with Dilated Flash Attention to Achieve 30 times Acceleration(https://arxiv.org/abs/2403.09195)
Keywords: segmentation
Abstract: The Segment Anything Model (SAM) has garnered significant attention in segmentation tasks due to its zero-shot generalization ability. However, a broader application of SAM to real-world practice has been restricted by its low inference speed and high computational memory demands, which mainly stem from the attention mechanism. Existing work has concentrated on optimizing the encoder, yet has not adequately addressed the inefficiency of the attention mechanism itself, even when distilled to a smaller model, which thus leaves space for further improvement. In response, we introduce SAM-Lightening, a variant of SAM that features a re-engineered attention mechanism, termed Dilated Flash Attention. It not only facilitates higher parallelism, enhancing processing efficiency, but also retains compatibility with the existing FlashAttention. Correspondingly, we propose a progressive distillation to enable an efficient knowledge transfer from the vanilla SAM without costly training from scratch. Experiments on COCO and LVIS reveal that SAM-Lightening significantly outperforms the state-of-the-art methods in both run-time efficiency and segmentation accuracy. Specifically, it can achieve an inference speed of 7 milliseconds (ms) per image, for images of size 1024×1024 pixels, which is 30.1 times faster than the vanilla SAM and 2.1 times faster than the state-of-the-art. Moreover, it takes only 244MB memory, which is 3.5% of the vanilla SAM. The code and weights are available at https://anonymous.4open.science/r/SAM-LIGHTENING-BC25/.

Title: Noise Dimension of GAN: An Image Compression Perspective

Authors: Ziran Zhu, Tongda Xu, Ling Li, Yan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09196
Pdf URL: https://arxiv.org/pdf/2403.09196
Copy Paste: [[2403.09196]] Noise Dimension of GAN: An Image Compression Perspective(https://arxiv.org/abs/2403.09196)
Keywords: generative
Abstract: A generative adversarial network (GAN) is a type of generative model that maps high-dimensional noise to samples in a target distribution. However, the dimension of noise required in GAN is not well understood. Previous approaches view GAN as a mapping from one continuous distribution to another continuous distribution. In this paper, we propose to view GAN as a discrete sampler instead. From this perspective, we build a connection between the minimum noise required and the bits needed to losslessly compress the images. Furthermore, to understand the behaviour of GAN when the noise dimension is limited, we propose a divergence-entropy trade-off. This trade-off depicts the best divergence we can achieve when noise is limited. Like the rate-distortion trade-off, it can be numerically solved when the source distribution is known. Finally, we verify our theory with experiments on image generation.

Title: Customizing Segmentation Foundation Model via Prompt Learning for Instance Segmentation

Authors: Hyung-Il Kim, Kimin Yun, Jun-Seok Yun, Yuseok Bae
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.09199
Pdf URL: https://arxiv.org/pdf/2403.09199
Copy Paste: [[2403.09199]] Customizing Segmentation Foundation Model via Prompt Learning for Instance Segmentation(https://arxiv.org/abs/2403.09199)
Keywords: segmentation
Abstract: Recently, foundation models trained on massive datasets to adapt to a wide range of domains have attracted considerable attention and are actively being explored within the computer vision community. Among these, the Segment Anything Model (SAM) stands out for its remarkable progress in generalizability and flexibility for image segmentation tasks, achieved through prompt-based object mask generation. However, despite its strength, SAM faces two key limitations when applied to customized instance segmentation that segments specific objects or those in unique environments not typically present in the training data: 1) the ambiguity inherent in input prompts and 2) the necessity for extensive additional training to achieve optimal segmentation. To address these challenges, we propose a novel method, customized instance segmentation via prompt learning tailored to SAM. Our method involves a prompt learning module (PLM), which adjusts input prompts into the embedding space to better align with user intentions, thereby enabling more efficient training. Furthermore, we introduce a point matching module (PMM) to enhance the feature representation for finer segmentation by ensuring detailed alignment with ground truth boundaries. Experimental results on various customized instance segmentation scenarios demonstrate the effectiveness of the proposed method.

Title: On the Laplace Approximation as Model Selection Criterion for Gaussian Processes

Authors: Andreas Besginow, Jan David Hüwel, Thomas Pawellek, Christian Beecks, Markus Lange-Hegermann
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2403.09215
Pdf URL: https://arxiv.org/pdf/2403.09215
Copy Paste: [[2403.09215]] On the Laplace Approximation as Model Selection Criterion for Gaussian Processes(https://arxiv.org/abs/2403.09215)
Keywords: interpretability
Abstract: Model selection aims to find the best model in terms of accuracy, interpretability or simplicity, preferably all at once. In this work, we focus on evaluating model performance of Gaussian process models, i.e. finding a metric that provides the best trade-off between all those criteria. While previous work considers metrics like the likelihood, AIC or dynamic nested sampling, they either lack performance or have significant runtime issues, which severely limits applicability. We address these challenges by introducing multiple metrics based on the Laplace approximation, where we overcome a severe inconsistency occurring during naive application of the Laplace approximation. Experiments show that our metrics are comparable in quality to the gold standard dynamic nested sampling without compromising computational speed. Our model selection criteria allow significantly faster and high quality model selection of Gaussian process models.

Title: MCformer: Multivariate Time Series Forecasting with Mixed-Channels Transformer

Authors: Wenyong Han, Tao Zhu, Liming Chen, Huansheng Ning, Yang Luo, Yaping Wan
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2403.09223
Pdf URL: https://arxiv.org/pdf/2403.09223
Copy Paste: [[2403.09223]] MCformer: Multivariate Time Series Forecasting with Mixed-Channels Transformer(https://arxiv.org/abs/2403.09223)
Keywords: transformer
Abstract: The massive generation of time-series data by large-scale Internet of Things (IoT) devices necessitates the exploration of more effective models for multivariate time-series forecasting. Previous models predominantly used the Channel Dependence (CD) strategy (where each channel represents a univariate sequence). Current state-of-the-art (SOTA) models primarily rely on the Channel Independence (CI) strategy. The CI strategy treats all channels as a single channel, expanding the dataset to improve generalization performance and avoiding inter-channel correlation that disrupts long-term features. However, the CI strategy faces the challenge of inter-channel correlation forgetting. To address this issue, we propose an innovative Mixed Channels strategy, combining the data expansion advantages of the CI strategy with the ability to counteract inter-channel correlation forgetting. Based on this strategy, we introduce MCformer, a multivariate time-series forecasting model with mixed channel features. The model blends a specific number of channels, leveraging an attention mechanism to effectively capture inter-channel correlation information when modeling long-term features. Experimental results demonstrate that the Mixed Channels strategy outperforms the pure CI strategy in multivariate time-series forecasting tasks.
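The difference between the CI strategy and a mixed-channel input described in the abstract above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `channel_independent` and `mixed_channels` are hypothetical, and picking partner channels at an even stride is an assumption, since the abstract only says that a specific number of channels is blended.

```python
def channel_independent(series):
    """Split a [time][channel] series into one univariate sample per channel."""
    n_channels = len(series[0])
    return [[row[c] for row in series] for c in range(n_channels)]


def mixed_channels(series, m):
    """Build one sample per channel, each stacking m channels.

    Assumption (not from the abstract): partner channels are chosen at an
    even stride starting from the target channel.
    """
    per_channel = channel_independent(series)
    n = len(per_channel)
    stride = max(1, n // m)
    samples = []
    for c in range(n):
        partners = [(c + k * stride) % n for k in range(m)]
        samples.append([per_channel[p] for p in partners])
    return samples


# Two time steps, four channels.
series = [[1, 2, 3, 4],
          [5, 6, 7, 8]]
ci_samples = channel_independent(series)   # four univariate samples
mc_samples = mixed_channels(series, m=2)   # four samples of two channels each
```

Both variants keep the data expansion of CI (one sample per channel), while the mixed variant also gives each sample a few other channels to attend to.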

Title: D-YOLO a robust framework for object detection in adverse weather conditions

Authors: Zihan Chu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2403.09233
Pdf URL: https://arxiv.org/pdf/2403.09233
Copy Paste: [[2403.09233]] D-YOLO a robust framework for object detection in adverse weather conditions(https://arxiv.org/abs/2403.09233)
Keywords: robust, extraction
Abstract: Adverse weather conditions, including haze, snow and rain, lead to a decline in image quality, which often causes a decline in performance for deep-learning-based detection networks. Most existing approaches attempt to rectify hazy images before performing object detection, which increases the complexity of the network and may result in a loss of latent information. To better integrate image restoration and object detection tasks, we designed a double-route network with an attention feature fusion module, taking both hazy and dehazed features into consideration. We also proposed a subnetwork to provide haze-free features to the detection network. Specifically, our D-YOLO improves the performance of the detection network by minimizing the distance between the clear feature extraction subnetwork and the detection network. Experiments on the RTTS and FoggyCityscapes datasets show that D-YOLO demonstrates better performance compared to the state-of-the-art methods. It is a robust detection framework for bridging the gap between low-level dehazing and high-level detection.

Title: WSI-SAM: Multi-resolution Segment Anything Model (SAM) for histopathology whole-slide images

Authors: Hong Liu, Haosen Yang, Paul J. van Diest, Josien P.W. Pluim, Mitko Veta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09257
Pdf URL: https://arxiv.org/pdf/2403.09257
Copy Paste: [[2403.09257]] WSI-SAM: Multi-resolution Segment Anything Model (SAM) for histopathology whole-slide images(https://arxiv.org/abs/2403.09257)
Keywords: segmentation
Abstract: The Segment Anything Model (SAM) marks a significant advancement in segmentation models, offering powerful zero-shot capabilities and dynamic prompting. However, existing medical SAMs are not suitable for the multi-scale nature of whole-slide images (WSIs), restricting their effectiveness. To resolve this drawback, we present WSI-SAM, enhancing SAM with precise object segmentation capabilities for histopathology images using multi-resolution patches, while preserving its original prompt-driven design, efficiency, and zero-shot adaptability. To fully exploit pretrained knowledge while minimizing training overhead, we keep SAM frozen, only introducing minimal additional parameters and computation. In particular, we introduce a High-Resolution (HR) token, a Low-Resolution (LR) token and a dual mask decoder. This decoder integrates the original SAM mask decoder with a lightweight fusion module that integrates features at multiple scales. Instead of predicting a mask independently, we integrate the HR and LR tokens at an intermediate layer to jointly learn features of the same object across multiple resolutions. Experiments show that our WSI-SAM outperforms state-of-the-art SAM and its variants. In particular, our model outperforms SAM by 4.1 and 2.5 percentage points on a ductal carcinoma in situ (DCIS) segmentation task and a breast cancer metastasis segmentation task (CAMELYON16 dataset), respectively. The code will be available at https://github.com/HongLiuuuuu/WSI-SAM.

Title: CLIP-EBC: CLIP Can Count Accurately through Enhanced Blockwise Classification

Authors: Yiming Ma, Victor Sanchez, Tanaya Guha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09281
Pdf URL: https://arxiv.org/pdf/2403.09281
Copy Paste: [[2403.09281]] CLIP-EBC: CLIP Can Count Accurately through Enhanced Blockwise Classification(https://arxiv.org/abs/2403.09281)
Keywords: robust
Abstract: The CLIP (Contrastive Language-Image Pretraining) model has exhibited outstanding performance in recognition problems, such as zero-shot image classification and object detection. However, its ability to count remains understudied due to the inherent challenges of transforming counting--a regression task--into a recognition task. In this paper, we investigate CLIP's potential in counting, focusing specifically on estimating crowd sizes. Existing classification-based crowd-counting methods have encountered issues, including inappropriate discretization strategies, which impede the application of CLIP and result in suboptimal performance. To address these challenges, we propose the Enhanced Blockwise Classification (EBC) framework. In contrast to previous methods, EBC relies on integer-valued bins that facilitate the learning of robust decision boundaries. Within our model-agnostic EBC framework, we introduce CLIP-EBC, the first fully CLIP-based crowd-counting model capable of generating density maps. Comprehensive evaluations across diverse crowd-counting datasets demonstrate the state-of-the-art performance of our methods. Particularly, EBC can improve existing models by up to 76.9%. Moreover, our CLIP-EBC model surpasses current crowd-counting methods, achieving mean absolute errors of 55.0 and 6.3 on ShanghaiTech part A and part B datasets, respectively. The code will be made publicly available.
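To make the idea of blockwise classification with integer-valued bins more concrete, here is a minimal Python sketch. It is not the authors' implementation: the helper names, the block size, and the bin layout `[(0, 0), (1, 1), (2, 2), (3, None)]` are hypothetical choices made for illustration.

```python
def block_counts(points, height, width, block):
    """Count annotated head points (x, y) inside each block x block cell."""
    rows, cols = height // block, width // block
    grid = [[0] * cols for _ in range(rows)]
    for x, y in points:
        r = min(int(y) // block, rows - 1)
        c = min(int(x) // block, cols - 1)
        grid[r][c] += 1
    return grid


def count_to_bin(count, bins):
    """Map an integer count to the index of the bin that contains it.

    Each bin is an inclusive (low, high) pair; high=None means open-ended.
    """
    for index, (low, high) in enumerate(bins):
        if count >= low and (high is None or count <= high):
            return index
    raise ValueError(f"count {count} is not covered by any bin")


# Hypothetical integer-valued bins: exact bins for 0, 1, 2 and one bin for 3+.
bins = [(0, 0), (1, 1), (2, 2), (3, None)]
points = [(1, 1), (2, 3), (9, 9), (12, 1), (13, 2), (14, 3), (15, 0)]
counts = block_counts(points, height=16, width=16, block=8)
labels = [[count_to_bin(n, bins) for n in row] for row in counts]
```

Because every bin boundary falls on an integer, each block count belongs to exactly one class, so the classification target is unambiguous.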

Title: DA-PFL: Dynamic Affinity Aggregation for Personalized Federated Learning

Authors: Xu Yang, Jiyuan Feng, Songyue Guo, Ye Wang, Ye Ding, Binxing Fang, Qing Liao
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2403.09284
Pdf URL: https://arxiv.org/pdf/2403.09284
Copy Paste: [[2403.09284]] DA-PFL: Dynamic Affinity Aggregation for Personalized Federated Learning(https://arxiv.org/abs/2403.09284)
Keywords: federate
Abstract: Personalized federated learning, which learns a personalized model for each client, has become a hot research topic. Existing personalized federated learning models prefer to aggregate clients with similar data distributions to improve the performance of learning models. However, similarity-based personalized federated learning methods may exacerbate the class imbalance problem. In this paper, we propose a novel Dynamic Affinity-based Personalized Federated Learning model (DA-PFL) to alleviate the class imbalance problem during federated learning. Specifically, we build an affinity metric from a complementary perspective to guide which clients should be aggregated. Then we design a dynamic aggregation strategy to dynamically aggregate clients based on the affinity metric in each round to reduce the class imbalance risk. Extensive experiments show that the proposed DA-PFL model can significantly improve the accuracy of each client on three real-world datasets compared with state-of-the-art methods.

Title: SELECTOR: Heterogeneous graph network with convolutional masked autoencoder for multimodal robust prediction of cancer survival

Authors: Liangrui Pan, Yijun Peng, Yan Li, Xiang Wang, Wenjuan Liu, Liwen Xu, Qingchun Liang, Shaoliang Peng
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.09290
Pdf URL: https://arxiv.org/pdf/2403.09290
Copy Paste: [[2403.09290]] SELECTOR: Heterogeneous graph network with convolutional masked autoencoder for multimodal robust prediction of cancer survival(https://arxiv.org/abs/2403.09290)
Keywords: robust
Abstract: Accurately predicting the survival rate of cancer patients is crucial for aiding clinicians in planning appropriate treatment, reducing cancer-related medical expenses, and significantly enhancing patients' quality of life. Multimodal prediction of cancer patient survival offers a more comprehensive and precise approach. However, existing methods still grapple with challenges related to missing multimodal data and information interaction within modalities. This paper introduces SELECTOR, a heterogeneous graph-aware network based on convolutional mask encoders for robust multimodal prediction of cancer patient survival. SELECTOR comprises feature edge reconstruction, convolutional mask encoder, feature cross-fusion, and multimodal survival prediction modules. Initially, we construct a multimodal heterogeneous graph and employ the meta-path method for feature edge reconstruction, ensuring comprehensive incorporation of feature information from graph edges and effective embedding of nodes. To mitigate the impact of missing features within the modality on prediction accuracy, we devised a convolutional masked autoencoder (CMAE) to process the heterogeneous graph post-feature reconstruction. Subsequently, the feature cross-fusion module facilitates communication between modalities, ensuring that output features encompass all features of the modality and relevant information from other modalities. Extensive experiments and analysis on six cancer datasets from TCGA demonstrate that our method significantly outperforms state-of-the-art methods in both modality-missing and intra-modality information-confirmed cases. Our codes are made available at https://github.com/panliangrui/Selector.

Title: Anatomical Structure-Guided Medical Vision-Language Pre-training

Authors: Qingqiu Li, Xiaohan Yan, Jilan Xu, Runtian Yuan, Yuejie Zhang, Rui Feng, Quanli Shen, Xiaobo Zhang, Shujun Wang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2403.09294
Pdf URL: https://arxiv.org/pdf/2403.09294
Copy Paste: [[2403.09294]] Anatomical Structure-Guided Medical Vision-Language Pre-training(https://arxiv.org/abs/2403.09294)
Keywords: interpretability
Abstract: Learning medical visual representations through vision-language pre-training has made remarkable progress. Despite the promising performance, it still faces challenges, i.e., local alignment lacks interpretability and clinical relevance, and the internal and external representation learning of image-report pairs is insufficient. To address these issues, we propose an Anatomical Structure-Guided (ASG) framework. Specifically, we parse raw reports into triplets (anatomical region, finding, existence), and fully utilize each element as supervision to enhance representation learning. For anatomical region, we design an automatic anatomical region-sentence alignment paradigm in collaboration with radiologists, considering them as the minimum semantic units to explore fine-grained local alignment. For finding and existence, we regard them as image tags, applying an image-tag recognition decoder to associate image features with their respective tags within each sample and constructing soft labels for contrastive learning to improve the semantic association of different image-report pairs. We evaluate the proposed ASG framework on two downstream tasks, including five public benchmarks. Experimental results demonstrate that our method outperforms the state-of-the-art methods.

Title: Annotation Free Semantic Segmentation with Vision Foundation Models

Authors: Soroush Seifi, Daniel Olmeda Reino, Fabien Despinoy, Rahaf Aljundi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09307
Pdf URL: https://arxiv.org/pdf/2403.09307
Copy Paste: [[2403.09307]] Annotation Free Semantic Segmentation with Vision Foundation Models(https://arxiv.org/abs/2403.09307)
Keywords: segmentation
Abstract: Semantic Segmentation is one of the most challenging vision tasks, usually requiring large amounts of training data with expensive pixel-level annotations. With the success of foundation models and especially vision-language models, recent works attempt to achieve zero-shot semantic segmentation while requiring either large scale training or additional image/pixel-level annotations. In this work, we build a lightweight module on top of a self-supervised pretrained vision encoder to align patch features with a pre-trained text encoder. Importantly, we generate free annotations for any semantic segmentation dataset using existing foundation models and train our alignment module cost free. We use CLIP to detect objects and SAM to generate high quality object masks. Our approach can bring language-based semantics to any pre-trained vision encoder with minimal training. Our module is lightweight, uses foundation models as a sole source of supervision and shows impressive generalization capability from little training data with no annotation.

Title: Knowledge Distillation in YOLOX-ViT for Side-Scan Sonar Object Detection

Authors: Martin Aubard, László Antal, Ana Madureira, Erika Ábrahám
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.09313
Pdf URL: https://arxiv.org/pdf/2403.09313
Copy Paste: [[2403.09313]] Knowledge Distillation in YOLOX-ViT for Side-Scan Sonar Object Detection(https://arxiv.org/abs/2403.09313)
Keywords: transformer
Abstract: In this paper we present YOLOX-ViT, a novel object detection model, and investigate the efficacy of knowledge distillation for model size reduction without sacrificing performance. Focused on underwater robotics, our research addresses key questions about the viability of smaller models and the impact of the visual transformer layer in YOLOX. Furthermore, we introduce a new side-scan sonar image dataset, and use it to evaluate our object detector's performance. Results show that knowledge distillation effectively reduces false positives in wall detection. Additionally, the introduced visual transformer layer significantly improves object detection accuracy in the underwater environment. The source code of the knowledge distillation in the YOLOX-ViT is at https://github.com/remaro-network/KD-YOLOX-ViT.

Title: Semi- and Weakly-Supervised Learning for Mammogram Mass Segmentation with Limited Annotations

Authors: Xinyu Xiong, Churan Wang, Wenxue Li, Guanbin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09315
Pdf URL: https://arxiv.org/pdf/2403.09315
Copy Paste: [[2403.09315]] Semi- and Weakly-Supervised Learning for Mammogram Mass Segmentation with Limited Annotations(https://arxiv.org/abs/2403.09315)
Keywords: segmentation
Abstract: Accurate identification of breast masses is crucial in diagnosing breast cancer; however, it can be challenging due to their small size and being camouflaged in surrounding normal glands. Worse still, it is also expensive in clinical practice to obtain adequate pixel-wise annotations for training deep neural networks. To address both difficulties at once, we propose a semi- and weakly-supervised learning framework for mass segmentation that utilizes limited strongly-labeled samples and sufficient weakly-labeled samples to achieve satisfactory performance. The framework consists of an auxiliary branch to exclude lesion-irrelevant background areas, a segmentation branch for final prediction, and a spatial prompting module to integrate the complementary information of the two branches. We further disentangle encoded obscure features into lesion-related and others to boost performance. Experiments on CBIS-DDSM and INbreast datasets demonstrate the effectiveness of our method.

Title: SD-Net: Symmetric-Aware Keypoint Prediction and Domain Adaptation for 6D Pose Estimation In Bin-picking Scenarios

Authors: Ding-Tao Huang, En-Te Lin, Lipeng Chen, Li-Fu Liu, Long Zeng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.09317
Pdf URL: https://arxiv.org/pdf/2403.09317
Copy Paste: [[2403.09317]] SD-Net: Symmetric-Aware Keypoint Prediction and Domain Adaptation for 6D Pose Estimation In Bin-picking Scenarios(https://arxiv.org/abs/2403.09317)
Keywords: robust
Abstract: Despite the success in 6D pose estimation in bin-picking scenarios, existing methods still struggle to produce accurate prediction results for symmetric objects and real-world scenarios. The primary bottlenecks include 1) the ambiguous keypoints caused by object symmetries; 2) the domain gap between real and synthetic data. To circumvent these problems, we propose a new 6D pose estimation network with symmetric-aware keypoint prediction and self-training domain adaptation (SD-Net). SD-Net builds on pointwise keypoint regression and deep Hough voting to perform reliable keypoint detection under clutter and occlusion. Specifically, at the keypoint prediction stage, we design a robust 3D keypoint selection strategy considering the symmetry class of objects and equivalent keypoints, which facilitates locating 3D keypoints even in highly occluded scenes. Additionally, we build an effective filtering algorithm on predicted keypoints to dynamically eliminate multiple ambiguous and outlier keypoint candidates. At the domain adaptation stage, we propose a self-training framework using a student-teacher training scheme. To carefully distinguish reliable predictions, we harness a tailored heuristic for 3D geometry pseudo-labelling based on semi-chamfer distance. On the public Siléane dataset, SD-Net achieves state-of-the-art results, obtaining an average precision of 96%. Testing learning and generalization abilities on the public Parametric datasets, SD-Net is 8% higher than the state-of-the-art method. The code is available at https://github.com/dingthuang/SD-Net.

Title: Privacy Preserving Anomaly Detection on Homomorphic Encrypted Data from IoT Sensors

Authors: Anca Hangan, Dragos Lazea, Tudor Cioara
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.09322
Pdf URL: https://arxiv.org/pdf/2403.09322
Copy Paste: [[2403.09322]] Privacy Preserving Anomaly Detection on Homomorphic Encrypted Data from IoT Sensors(https://arxiv.org/abs/2403.09322)
Keywords: privacy, protect, attack, robust
Abstract: IoT devices have become indispensable components of our lives, and the advancement of AI technologies will make them even more pervasive, increasing the vulnerability to malfunctions or cyberattacks and raising privacy concerns. Encryption can mitigate these challenges; however, most existing anomaly detection techniques decrypt the data to perform the analysis, potentially undermining the encryption protection provided during transit or storage. Homomorphic encryption schemes are promising solutions as they enable the processing and execution of operations on IoT data while still encrypted; however, these schemes offer only limited operations, which poses challenges to their practical usage. In this paper, we propose a novel privacy-preserving anomaly detection solution designed for homomorphically encrypted data generated by IoT devices that efficiently detects abnormal values without performing decryption. We have adapted the histogram-based anomaly detection technique to the TFHE scheme to address limitations related to the input size and the depth of computation by implementing vectorized support operations. These operations include addition, value placement in buckets, labeling abnormal buckets based on a threshold frequency, and labeling abnormal values based on their range and bucket labels. Evaluation results show that the solution effectively detects anomalies without requiring data decryption and achieves consistent results comparable to the mechanism operating on plain data. Also, it shows robustness and resilience against various challenges commonly encountered in IoT environments, such as noisy sensor data, adversarial attacks, communication failures, and device malfunctions. Moreover, the time and computational overheads determined for several solution configurations, despite being large, are reasonable compared to those reported in existing literature.
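The support operations listed in the abstract above (bucket placement, threshold-based bucket labeling, value labeling) can be sketched on plain, unencrypted data. The following Python snippet is only a plaintext illustration under stated assumptions; it does not use TFHE or any homomorphic encryption library, and the function name, equal-width buckets, and threshold semantics are hypothetical.

```python
def histogram_anomalies(values, low, high, n_buckets, min_frequency):
    """Flag values that are out of range or fall into rarely used buckets.

    Assumptions: equal-width buckets over [low, high]; a bucket is abnormal
    when its share of all readings is below min_frequency.
    """
    width = (high - low) / n_buckets
    counts = [0] * n_buckets
    assignments = []
    for v in values:
        if v < low or v > high:
            assignments.append(None)          # outside the expected range
            continue
        bucket = min(int((v - low) / width), n_buckets - 1)
        counts[bucket] += 1
        assignments.append(bucket)
    abnormal_bucket = [c / len(values) < min_frequency for c in counts]
    return [b is None or abnormal_bucket[b] for b in assignments]


# Hypothetical temperature readings with one rare value and one out of range.
readings = [20.1, 20.4, 20.3, 20.2, 20.5, 29.0, 20.6, 20.7, 35.0, 20.0]
flags = histogram_anomalies(readings, low=20.0, high=30.0,
                            n_buckets=5, min_frequency=0.2)
```

In an encrypted setting each comparison and count above would have to be expressed with the limited operations the scheme supports, which is the difficulty the abstract points to.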

Title: Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

Authors: Yufei Zhan, Yousong Zhu, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.09333
Pdf URL: https://arxiv.org/pdf/2403.09333
Copy Paste: [[2403.09333]] Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring(https://arxiv.org/abs/2403.09333)
Keywords: large language model
Abstract: Large Vision Language Models have achieved fine-grained object perception, but the limitation of image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios. Such a limitation further restricts the model's potential to achieve nuanced visual and language referring in domains such as GUI agents, counting, etc. To address this issue, we introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. To efficiently scale up image resolution, we design a simple and lightweight down-sampling projector to overcome the input token constraint in Large Language Models. This design inherently preserves the complete contexts and fine details, and significantly improves multimodal perception ability, especially for small objects. Building upon this, we further equip the model with visual-language co-referring capabilities through a plug-and-play visual tokenizer. It enables user-friendly interaction with flexible target images, free-form texts and even coordinates. Experiments demonstrate that Griffon v2 can localize any objects of interest with visual and textual referring, achieve state-of-the-art performance on REC, phrase grounding, and REG tasks, and outperform expert models in object detection and object counting. Data, codes and models will be released at https://github.com/jefferyZhan/Griffon.

Title: Video Editing via Factorized Diffusion Distillation

Authors: Uriel Singer, Amit Zohar, Yuval Kirstain, Shelly Sheynin, Adam Polyak, Devi Parikh, Yaniv Taigman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09334
Pdf URL: https://arxiv.org/pdf/2403.09334
Copy Paste: [[2403.09334]] Video Editing via Factorized Diffusion Distillation(https://arxiv.org/abs/2403.09334)
Keywords: diffusion
Abstract: We introduce Emu Video Edit (EVE), a model that establishes a new state-of-the-art in video editing without relying on any supervised video editing data. To develop EVE we separately train an image editing adapter and a video generation adapter, and attach both to the same text-to-image model. Then, to align the adapters towards video editing we introduce a new unsupervised distillation procedure, Factorized Diffusion Distillation. This procedure distills knowledge from one or more teachers simultaneously, without any supervised data. We utilize this procedure to teach EVE to edit videos by jointly distilling knowledge to (i) precisely edit each individual frame from the image editing adapter, and (ii) ensure temporal consistency among the edited frames using the video generation adapter. Finally, to demonstrate the potential of our approach in unlocking other capabilities, we align additional combinations of adapters

Title: LocalMamba: Visual State Space Model with Windowed Selective Scan

Authors: Tao Huang, Xiaohuan Pei, Shan You, Fei Wang, Chen Qian, Chang Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.09338
Pdf URL: https://arxiv.org/pdf/2403.09338
Copy Paste: [[2403.09338]] LocalMamba: Visual State Space Model with Windowed Selective Scan(https://arxiv.org/abs/2403.09338)
Keywords: transformer
Abstract: Recent advancements in state space models, notably Mamba, have demonstrated significant progress in modeling long sequences for tasks like language understanding. Yet, their application in vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling. Traditional ViM approaches, which flatten spatial tokens, overlook the preservation of local 2D dependencies, thereby elongating the distance between adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, acknowledging the varying preferences for scan patterns across different network layers, we propose a dynamic method to independently search for the optimal scan choices for each layer, substantially improving performance. Extensive experiments across both plain and hierarchical models underscore our approach's superiority in effectively capturing image representations. For example, our model significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs. Code is available at: https://github.com/hunto/LocalMamba.
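The effect of the local scanning strategy described above can be illustrated by comparing token orders. The Python sketch below is not taken from the LocalMamba code base; the function names and the 4x4 grid with 2x2 windows are assumptions chosen to show how a windowed scan shortens the sequence distance between vertically adjacent tokens compared with a plain row-major flatten.

```python
def row_major_order(height, width):
    """Flatten a token grid row by row."""
    return [(r, c) for r in range(height) for c in range(width)]


def windowed_order(height, width, window):
    """Visit the grid window by window, scanning row by row inside each window."""
    order = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            for r in range(top, min(top + window, height)):
                for c in range(left, min(left + window, width)):
                    order.append((r, c))
    return order


def sequence_gap(order, a, b):
    """Distance between two tokens in the flattened sequence."""
    return abs(order.index(a) - order.index(b))


flat = row_major_order(4, 4)
local = windowed_order(4, 4, window=2)
gap_flat = sequence_gap(flat, (0, 0), (1, 0))    # vertical neighbours, row-major
gap_local = sequence_gap(local, (0, 0), (1, 0))  # same neighbours, windowed scan
```

On this grid the two vertically adjacent tokens are four positions apart in the row-major sequence but only two positions apart in the windowed one, and the gap for the windowed scan stays bounded by the window size as the image grows.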

Title: AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Instructions

Authors: Hao Zhang, Wenqi Shao, Hong Liu, Yongqiang Ma, Ping Luo, Yu Qiao, Kaipeng Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.09346
Pdf URL: https://arxiv.org/pdf/2403.09346
Copy Paste: [[2403.09346]] AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Instructions(https://arxiv.org/abs/2403.09346)
Keywords: security, attack, robust, fair
Abstract: Large Vision-Language Models (LVLMs) have shown significant progress in responding well to visual-instructions from users. However, these instructions, encompassing images and text, are susceptible to both intentional and inadvertent attacks. Despite the critical importance of LVLMs' robustness against such threats, current research in this area remains limited. To bridge this gap, we introduce AVIBench, a framework designed to analyze the robustness of LVLMs when facing various adversarial visual-instructions (AVIs), including four types of image-based AVIs, ten types of text-based AVIs, and nine types of content bias AVIs (such as gender, violence, cultural, and racial biases, among others). We generate 260K AVIs encompassing five categories of multimodal capabilities (nine tasks) and content bias. We then conduct a comprehensive evaluation involving 14 open-source LVLMs to assess their performance. AVIBench also serves as a convenient tool for practitioners to evaluate the robustness of LVLMs against AVIs. Our findings and extensive experimental results shed light on the vulnerabilities of LVLMs, and highlight that inherent biases exist even in advanced closed-source LVLMs like GeminiProVision and GPT-4V. This underscores the importance of enhancing the robustness, security, and fairness of LVLMs. The source code and benchmark will be made publicly available.

Title: LDPRecover: Recovering Frequencies from Poisoning Attacks against Local Differential Privacy

Authors: Xinyue Sun, Qingqing Ye, Haibo Hu, Jiawei Duan, Tianyu Wo, Jie Xu, Renyu Yang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.09351
Pdf URL: https://arxiv.org/pdf/2403.09351
Copy Paste: [[2403.09351]] LDPRecover: Recovering Frequencies from Poisoning Attacks against Local Differential Privacy(https://arxiv.org/abs/2403.09351)
Keywords: privacy, protect, attack
Abstract: Local differential privacy (LDP), which enables an untrusted server to collect aggregated statistics from distributed users while protecting the privacy of those users, has been widely deployed in practice. However, LDP protocols for frequency estimation are vulnerable to poisoning attacks, in which an attacker can poison the aggregated frequencies by manipulating the data sent from malicious users. Therefore, it is an open challenge to recover the accurate aggregated frequencies from poisoned ones. In this work, we propose LDPRecover, a method that can recover accurate aggregated frequencies from poisoning attacks, even if the server does not learn the details of the attacks. In LDPRecover, we establish a genuine frequency estimator that theoretically guides the server to recover the frequencies aggregated from genuine users' data by eliminating the impact of malicious users' data in poisoned frequencies. Since the server has no idea of the attacks, we propose an adaptive attack to unify existing attacks and learn the statistics of the malicious data within this adaptive attack by exploiting the properties of LDP protocols. By taking the estimator and the learning statistics as constraints, we formulate the problem of recovering aggregated frequencies to approach the genuine ones as a constraint inference (CI) problem. Consequently, the server can obtain accurate aggregated frequencies by solving this problem optimally. Moreover, LDPRecover can serve as a frequency recovery paradigm that recovers more accurate aggregated frequencies by integrating attack details as new constraints in the CI problem. Our evaluation on two real-world datasets, three LDP protocols, and untargeted and targeted poisoning attacks shows that LDPRecover is both accurate and widely applicable against various poisoning attacks.

Title: REPQC: Reverse Engineering and Backdooring Hardware Accelerators for Post-quantum Cryptography

Authors: Samuel Pagliarini, Aikata Aikata, Malik Imran, Sujoy Sinha Roy
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.09352
Pdf URL: https://arxiv.org/pdf/2403.09352
Copy Paste: [[2403.09352]] REPQC: Reverse Engineering and Backdooring Hardware Accelerators for Post-quantum Cryptography(https://arxiv.org/abs/2403.09352)
Keywords: robust, steal
Abstract: Significant research efforts have been dedicated to designing cryptographic algorithms that are quantum-resistant. The motivation is clear: robust quantum computers, once available, will render current cryptographic standards vulnerable. Thus, we need new Post-Quantum Cryptography (PQC) algorithms, and, due to the inherent complexity of such algorithms, there is also a demand to accelerate them in hardware. In this paper, we show that PQC hardware accelerators can be backdoored by two different adversaries located in the chip supply chain. We propose REPQC, a sophisticated reverse engineering algorithm that can be employed to confidently identify hashing operations (i.e., Keccak) within the PQC accelerator - the location of which serves as an anchor for finding secret information to be leaked. Armed with REPQC, an adversary proceeds to insert malicious logic in the form of a stealthy Hardware Trojan Horse (HTH). Using Dilithium as a case study, our results demonstrate that HTHs that increase the accelerator's layout density by as little as 0.1% can be inserted without any impact on the performance of the circuit and with a marginal increase in power consumption. An essential aspect is that the entire reverse engineering in REPQC is automated, and so is the HTH insertion that follows it, empowering adversaries to explore multiple HTH designs and identify the most suitable one.

Title: Komodo: A Linguistic Expedition into Indonesia's Regional Languages

Authors: Louis Owen, Vishesh Tripathi, Abhay Kumar, Biddwan Ahmed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.09362
Pdf URL: https://arxiv.org/pdf/2403.09362
Copy Paste: [[2403.09362]] Komodo: A Linguistic Expedition into Indonesia's Regional Languages(https://arxiv.org/abs/2403.09362)
Keywords: large language model
Abstract: The recent breakthroughs in Large Language Models (LLMs) have mostly focused on languages with easily available and sufficient resources, such as English. However, there remains a significant gap for languages that lack sufficient linguistic resources in the public domain. Our work introduces Komodo-7B, a set of 7-billion-parameter Large Language Models designed to address this gap by seamlessly operating across Indonesian, English, and 11 regional languages in Indonesia. Komodo-7B is a family of LLMs that consists of Komodo-7B-Base and Komodo-7B-Instruct. Komodo-7B-Instruct stands out by achieving state-of-the-art performance in various tasks and languages, outperforming the benchmarks set by OpenAI's GPT-3.5, Cohere's Aya-101, Llama-2-Chat-13B, Mixtral-8x7B-Instruct-v0.1, Gemma-7B-it, and many more. This model not only demonstrates superior performance in both language-specific and overall assessments but also highlights its capability to excel in linguistic diversity. Our commitment to advancing language models extends beyond well-resourced languages, aiming to bridge the gap for those with limited linguistic assets. Additionally, Komodo-7B-Instruct's better cross-language understanding contributes to addressing educational disparities in Indonesia, offering direct translations from English to 11 regional languages, a significant improvement compared to existing language translation services. Komodo-7B represents a crucial step towards inclusivity and effectiveness in language models, catering to the linguistic needs of diverse communities.

Title: Sentinel-Guided Zero-Shot Learning: A Collaborative Paradigm without Real Data Exposure

Authors: Fan Wan, Xingyu Miao, Haoran Duan, Jingjing Deng, Rui Gao, Yang Long
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09363
Pdf URL: https://arxiv.org/pdf/2403.09363
Copy Paste: [[2403.09363]] Sentinel-Guided Zero-Shot Learning: A Collaborative Paradigm without Real Data Exposure(https://arxiv.org/abs/2403.09363)
Keywords: security, privacy, robust
Abstract: With increasing concerns over data privacy and model copyrights, especially in the context of collaborations between AI service providers and data owners, an innovative SG-ZSL paradigm is proposed in this work. SG-ZSL is designed to foster efficient collaboration without the need to exchange models or sensitive data. It consists of a teacher model, a student model and a generator that links both model entities. The teacher model serves as a sentinel on behalf of the data owner, replacing real data, to guide the student model at the AI service provider's end during training. Considering the disparity of knowledge space between the teacher and student, we introduce two variants of the teacher model: the omniscient and the quasi-omniscient teachers. Under these teachers' guidance, the student model seeks to match the teacher model's performance and explores domains that the teacher has not covered. To trade off between privacy and performance, we further introduce two distinct security-level training protocols: white-box and black-box, enhancing the paradigm's adaptability. Despite the inherent challenges of real data absence in the SG-ZSL paradigm, it consistently outperforms in ZSL and GZSL tasks, notably in the white-box protocol. Our comprehensive evaluation further attests to its robustness and efficiency across various setups, including the stringent black-box training protocol.

Title: DF4LCZ: A SAM-Empowered Data Fusion Framework for Scene-Level Local Climate Zone Classification

Authors: Qianqian Wu, Xianping Ma, Jialu Sui, Man-On Pun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09367
Pdf URL: https://arxiv.org/pdf/2403.09367
Copy Paste: [[2403.09367]] DF4LCZ: A SAM-Empowered Data Fusion Framework for Scene-Level Local Climate Zone Classification(https://arxiv.org/abs/2403.09367)
Keywords: extraction
Abstract: Recent advancements in remote sensing (RS) technologies have shown their potential in accurately classifying local climate zones (LCZs). However, traditional scene-level methods using convolutional neural networks (CNNs) often struggle to integrate prior knowledge of ground objects effectively. Moreover, commonly utilized data sources like Sentinel-2 encounter difficulties in capturing detailed ground object information. To tackle these challenges, we propose a data fusion method that integrates ground object priors extracted from high-resolution Google imagery with Sentinel-2 multispectral imagery. The proposed method introduces a novel Dual-stream Fusion framework for LCZ classification (DF4LCZ), integrating instance-based location features from Google imagery with the scene-level spatial-spectral features extracted from Sentinel-2 imagery. The framework incorporates a Graph Convolutional Network (GCN) module empowered by the Segment Anything Model (SAM) to enhance feature extraction from Google imagery. Simultaneously, the framework employs a 3D-CNN architecture to learn the spectral-spatial features of Sentinel-2 imagery. Experiments are conducted on a multi-source remote sensing image dataset specifically designed for LCZ classification, validating the effectiveness of the proposed DF4LCZ. The related code and dataset are available at https://github.com/ctrlovefly/DF4LCZ.

Title: Impact of Synthetic Images on Morphing Attack Detection Using a Siamese Network

Authors: Juan Tapia, Christoph Busch
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09380
Pdf URL: https://arxiv.org/pdf/2403.09380
Copy Paste: [[2403.09380]] Impact of Synthetic Images on Morphing Attack Detection Using a Siamese Network(https://arxiv.org/abs/2403.09380)
Keywords: attack
Abstract: This paper evaluated the impact of synthetic images on Morphing Attack Detection (MAD) using a Siamese network with a semi-hard loss function. Intra- and cross-dataset evaluations were performed to measure the generalisation capabilities of synthetic images. Three different pre-trained networks were used as feature extractors: MobileNetV2, MobileNetV3 and EfficientNetB0. Our results show that MAD trained with EfficientNetB0 on FERET, FRGCv2, and FRLL can reach a lower error rate in comparison with SOTA. Conversely, worse performance was reached when the system was trained only with synthetic images. A mixed (synthetic + digital) database may help to improve MAD and reduce the error rate. This shows that further effort is still needed to include synthetic images in the training process.
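
The abstract does not spell out its semi-hard loss. The sketch below assumes the widely used semi-hard triplet formulation, in which a negative is "semi-hard" when it lies farther from the anchor than the positive but still inside the margin. The embeddings, the margin, and the function names are illustrative assumptions, not taken from the paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def semi_hard_negatives(anchor, positive, negatives, margin):
    """Keep negatives n with d(a, p) < d(a, n) < d(a, p) + margin."""
    d_ap = euclidean(anchor, positive)
    return [n for n in negatives if d_ap < euclidean(anchor, n) < d_ap + margin]

def triplet_loss(anchor, positive, negative, margin):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy 2-D embeddings (assumed values, for illustration only).
anchor, positive = (0.0, 0.0), (1.0, 0.0)
negatives = [(0.5, 0.0),   # closer than the positive: a "hard" negative
             (1.2, 0.0),   # farther than the positive, inside the margin: semi-hard
             (3.0, 0.0)]   # beyond the margin: an "easy" negative with zero loss
selected = semi_hard_negatives(anchor, positive, negatives, margin=0.5)
loss = triplet_loss(anchor, positive, selected[0], margin=0.5)
```

Semi-hard negatives still give a non-zero loss, but they avoid the hardest negatives, which can make early training unstable.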

Title: GiT: Towards Generalist Vision Transformer through Universal Language Interface

Authors: Haiyang Wang, Hao Tang, Li Jiang, Shaoshuai Shi, Muhammad Ferjad Naeem, Hongsheng Li, Bernt Schiele, Liwei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09394
Pdf URL: https://arxiv.org/pdf/2403.09394
Copy Paste: [[2403.09394]] GiT: Towards Generalist Vision Transformer through Universal Language Interface(https://arxiv.org/abs/2403.09394)
Keywords: transformer, large language model, segmentation
Abstract: This paper proposes a simple, yet effective framework, called GiT, simultaneously applicable for various vision tasks only with a vanilla ViT. Motivated by the universality of the Multi-layer Transformer architecture (e.g., GPT) widely used in large language models (LLMs), we seek to broaden its scope to serve as a powerful vision foundation model (VFM). However, unlike language modeling, visual tasks typically require specific modules, such as bounding box heads for detection and pixel decoders for segmentation, greatly hindering the application of powerful multi-layer transformers in the vision domain. To solve this, we design a universal language interface that empowers the successful auto-regressive decoding to adeptly unify various visual tasks, from image-level understanding (e.g., captioning), over sparse perception (e.g., detection), to dense prediction (e.g., segmentation). Based on the above designs, the entire model is composed solely of a ViT, without any specific additions, offering a remarkable architectural simplification. GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning. Interestingly, our GiT builds a new benchmark in generalist performance, and fosters mutual enhancement across tasks, leading to significant improvements compared to isolated training. This reflects a similar impact observed in LLMs. Further enriching training with 27 datasets, GiT achieves strong zero-shot results over various tasks. Due to its simple design, this paradigm holds promise for narrowing the architectural gap between vision and language. Code and models will be available at https://github.com/Haiyang-W/GiT.

Title: ConDiSR: Contrastive Disentanglement and Style Regularization for Single Domain Generalization

Authors: Aleksandr Matsun, Numan Saeed, Fadillah Adamsyah Maani, Mohammad Yaqub
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09400
Pdf URL: https://arxiv.org/pdf/2403.09400
Copy Paste: [[2403.09400]] ConDiSR: Contrastive Disentanglement and Style Regularization for Single Domain Generalization(https://arxiv.org/abs/2403.09400)
Keywords: privacy, extraction, segmentation
Abstract: Medical data often exhibits distribution shifts, which cause test-time performance degradation for deep learning models trained using standard supervised learning pipelines. This challenge is addressed in the field of Domain Generalization (DG) with the sub-field of Single Domain Generalization (SDG) being specifically interesting due to the privacy- or logistics-related issues often associated with medical data. Existing disentanglement-based SDG methods heavily rely on structural information embedded in segmentation masks, however classification labels do not provide such dense information. This work introduces a novel SDG method aimed at medical image classification that leverages channel-wise contrastive disentanglement. It is further enhanced with reconstruction-based style regularization to ensure extraction of distinct style and structure feature representations. We evaluate our method on the complex task of multicenter histopathology image classification, comparing it against state-of-the-art (SOTA) SDG baselines. Results demonstrate that our method surpasses the SOTA by a margin of 1% in average accuracy while also showing more stable performance. This study highlights the importance and challenges of exploring SDG frameworks in the context of the classification task. The code is publicly available at https://github.com/BioMedIA-MBZUAI/ConDiSR

Title: XCoOp: Explainable Prompt Learning for Computer-Aided Diagnosis via Concept-guided Context Optimization

Authors: Yequan Bie, Luyang Luo, Zhixuan Chen, Hao Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.09410
Pdf URL: https://arxiv.org/pdf/2403.09410
Copy Paste: [[2403.09410]] XCoOp: Explainable Prompt Learning for Computer-Aided Diagnosis via Concept-guided Context Optimization(https://arxiv.org/abs/2403.09410)
Keywords: interpretability, explainability, large language model
Abstract: Utilizing potent representations of the large vision-language models (VLMs) to accomplish various downstream tasks has attracted increasing attention. Within this research field, soft prompt learning has become a representative approach for efficiently adapting VLMs such as CLIP, to tasks like image classification. However, most existing prompt learning methods learn text tokens that are unexplainable, which cannot satisfy the stringent interpretability requirements of Explainable Artificial Intelligence (XAI) in high-stakes scenarios like healthcare. To address this issue, we propose a novel explainable prompt learning framework that leverages medical knowledge by aligning the semantics of images, learnable prompts, and clinical concept-driven prompts at multiple granularities. Moreover, our framework addresses the lack of valuable concept annotations by eliciting knowledge from large language models and offers both visual and textual explanations for the prompts. Extensive experiments and explainability analyses conducted on various datasets, with and without concept labels, demonstrate that our method simultaneously achieves superior diagnostic performance, flexibility, and interpretability, shedding light on the effectiveness of foundation models in facilitating XAI. The code will be made publicly available.

Title: OpenGraph: Open-Vocabulary Hierarchical 3D Graph Representation in Large-Scale Outdoor Environments

Authors: Yinan Deng, Jiahui Wang, Jingyu Zhao, Xinyu Tian, Guangyan Chen, Yi Yang, Yufeng Yue
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2403.09412
Pdf URL: https://arxiv.org/pdf/2403.09412
Copy Paste: [[2403.09412]] OpenGraph: Open-Vocabulary Hierarchical 3D Graph Representation in Large-Scale Outdoor Environments(https://arxiv.org/abs/2403.09412)
Keywords: segmentation
Abstract: Environment maps endowed with sophisticated semantics are pivotal for facilitating seamless interaction between robots and humans, enabling them to effectively carry out various tasks. Open-vocabulary maps, powered by Visual-Language models (VLMs), possess inherent advantages, including multimodal retrieval and open-set classes. However, existing open-vocabulary maps are constrained to closed indoor scenarios and VLM features, thereby diminishing their usability and inference capabilities. Moreover, the absence of topological relationships further complicates the accurate querying of specific instances. In this work, we propose OpenGraph, a representation of open-vocabulary hierarchical graph structure designed for large-scale outdoor environments. OpenGraph initially extracts instances and their captions from visual images using 2D foundation models, encoding the captions with features to enhance textual reasoning. Subsequently, 3D incremental panoramic mapping with feature embedding is achieved by projecting images onto LiDAR point clouds. Finally, the environment is segmented based on lane graph connectivity to construct a hierarchical graph. Validation results from real public dataset SemanticKITTI demonstrate that, even without fine-tuning the models, OpenGraph exhibits the ability to generalize to novel semantic classes and achieve the highest segmentation and query accuracy. The source code of OpenGraph is publicly available at https://github.com/BIT-DYN/OpenGraph.

Title: RoDUS: Robust Decomposition of Static and Dynamic Elements in Urban Scenes

Authors: Thang-Anh-Quan Nguyen, Luis Roldão, Nathan Piasco, Moussab Bennehar, Dzmitry Tsishkou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09419
Pdf URL: https://arxiv.org/pdf/2403.09419
Copy Paste: [[2403.09419]] RoDUS: Robust Decomposition of Static and Dynamic Elements in Urban Scenes(https://arxiv.org/abs/2403.09419)
Keywords: robust
Abstract: The task of separating dynamic objects from static environments using NeRFs has been widely studied in recent years. However, capturing large-scale scenes still poses a challenge due to their complex geometric structures and unconstrained dynamics. Without the help of 3D motion cues, previous methods often require simplified setups with slow camera motion and only a few/single dynamic actors, leading to suboptimal solutions in most urban setups. To overcome such limitations, we present RoDUS, a pipeline for decomposing static and dynamic elements in urban scenes, with thoughtfully separated NeRF models for moving and non-moving components. Our approach utilizes a robust kernel-based initialization coupled with 4D semantic information to selectively guide the learning process. This strategy enables accurate capturing of the dynamics in the scene, resulting in reduced artifacts caused by NeRF on background reconstruction, all by using self-supervision. Notably, experimental evaluations on KITTI-360 and Pandaset datasets demonstrate the effectiveness of our method in decomposing challenging urban scenes into precise static and dynamic components.

Title: Mitigating attribute amplification in counterfactual image generation

Authors: Tian Xia, Mélanie Roschewitz, Fabio De Sousa Ribeiro, Charles Jones, Ben Glocker
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.09422
Pdf URL: https://arxiv.org/pdf/2403.09422
Copy Paste: [[2403.09422]] Mitigating attribute amplification in counterfactual image generation(https://arxiv.org/abs/2403.09422)
Keywords: protect, generative
Abstract: Causal generative modelling is gaining interest in medical imaging due to its ability to answer interventional and counterfactual queries. Most work focuses on generating counterfactual images that look plausible, using auxiliary classifiers to enforce effectiveness of simulated interventions. We investigate pitfalls in this approach, discovering the issue of attribute amplification, where unrelated attributes are spuriously affected during interventions, leading to biases across protected characteristics and disease status. We show that attribute amplification is caused by the use of hard labels in the counterfactual training process and propose soft counterfactual fine-tuning to mitigate this issue. Our method substantially reduces the amplification effect while maintaining effectiveness of generated images, demonstrated on a large chest X-ray dataset. Our work makes an important advancement towards more faithful and unbiased causal modelling in medical imaging.

Title: Borrowing Treasures from Neighbors: In-Context Learning for Multimodal Learning with Missing Modalities and Data Scarcity

Authors: Zhuo Zhi, Ziquan Liu, Moe Elbadawi, Adam Daneshmend, Mine Orlu, Abdul Basit, Andreas Demosthenous, Miguel Rodrigues
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.09428
Pdf URL: https://arxiv.org/pdf/2403.09428
Copy Paste: [[2403.09428]] Borrowing Treasures from Neighbors: In-Context Learning for Multimodal Learning with Missing Modalities and Data Scarcity(https://arxiv.org/abs/2403.09428)
Keywords: transformer
Abstract: Multimodal machine learning with missing modalities is an increasingly relevant challenge arising in various applications such as healthcare. This paper extends the current research into missing modalities to the low-data regime, i.e., a downstream task has both missing modalities and limited sample size issues. This problem setting is particularly challenging and also practical as it is often expensive to get full-modality data and sufficient annotated training samples. We propose to use retrieval-augmented in-context learning to address these two crucial issues by unleashing the potential of a transformer's in-context learning ability. Diverging from existing methods, which primarily belong to the parametric paradigm and often require sufficient training samples, our work exploits the value of the available full-modality data, offering a novel perspective on resolving the challenge. The proposed data-dependent framework exhibits a higher degree of sample efficiency and is empirically demonstrated to enhance the classification model's performance on both full- and missing-modality data in the low-data regime across various multimodal learning tasks. When only 1% of the training data are available, our proposed method demonstrates an average improvement of 6.1% over a recent strong baseline across various datasets and missing states. Notably, our method also reduces the performance gap between full-modality and missing-modality data compared with the baseline.
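
The retrieval step behind "retrieval-augmented in-context learning" can be pictured as a nearest-neighbor lookup over the available full-modality samples. The sketch below covers only that lookup. The cosine similarity measure, the feature vectors, and the function names are assumptions made for illustration, since the abstract does not specify how neighbors are chosen or how they are passed to the transformer.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_neighbors(query, pool, k):
    """Return the k pool items (features, label) most similar to the query."""
    ranked = sorted(pool, key=lambda item: cosine(query, item[0]), reverse=True)
    return ranked[:k]

# Toy pool of full-modality samples (assumed features and labels).
pool = [([1.0, 0.0], "a"), ([0.9, 0.1], "a"), ([0.0, 1.0], "b"), ([-1.0, 0.0], "b")]
query = [1.0, 0.05]  # features of a sample with a missing modality (assumed)
context = retrieve_neighbors(query, pool, k=2)
context_labels = [label for _, label in context]
```

The retrieved neighbors and their labels would then serve as the in-context examples for the downstream model.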

Title: Efficient Transferability Assessment for Selection of Pre-trained Detectors

Authors: Zhao Wang, Aoxue Li, Zhenguo Li, Qi Dou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09432
Pdf URL: https://arxiv.org/pdf/2403.09432
Copy Paste: [[2403.09432]] Efficient Transferability Assessment for Selection of Pre-trained Detectors(https://arxiv.org/abs/2403.09432)
Keywords: segmentation
Abstract: Large-scale pre-training followed by downstream fine-tuning is an effective solution for transferring deep-learning-based models. Since fine-tuning all possible pre-trained models is computationally costly, we aim to predict the transferability performance of these pre-trained models in a computationally efficient manner. Different from previous work that seeks out suitable models for downstream classification and segmentation tasks, this paper studies the efficient transferability assessment of pre-trained object detectors. To this end, we build up a detector transferability benchmark which contains a large and diverse zoo of pre-trained detectors with various architectures, source datasets and training schemes. Given this zoo, we adopt 7 target datasets from 5 diverse domains as the downstream target tasks for evaluation. Further, we propose to assess classification and regression sub-tasks simultaneously in a unified framework. Additionally, we design a complementary metric for evaluating tasks with varying objects. Experimental results demonstrate that our method outperforms other state-of-the-art approaches in assessing transferability under different target domains while reducing wall-clock time by 32× and requiring a mere 5.2% of the memory footprint compared to brute-force fine-tuning of all pre-trained detectors.

Title: 3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation

Authors: Frank Zhang, Yibo Zhang, Quan Zheng, Rui Ma, Wei Hua, Hujun Bao, Weiwei Xu, Changqing Zou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.09439
Pdf URL: https://arxiv.org/pdf/2403.09439
Copy Paste: [[2403.09439]] 3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation(https://arxiv.org/abs/2403.09439)
Keywords: diffusion, generative
Abstract: Text-driven 3D scene generation techniques have made rapid progress in recent years. Their success is mainly attributed to using existing generative models to iteratively perform image warping and inpainting to generate 3D scenes. However, these methods heavily rely on the outputs of existing models, leading to error accumulation in geometry and appearance that prevents the models from being used in various scenarios (e.g., outdoor and unreal scenarios). To address this limitation, we generatively refine the newly generated local views by querying and aggregating global 3D information, and then progressively generate the 3D scene. Specifically, we employ a tri-plane features-based NeRF as a unified representation of the 3D scene to constrain global 3D consistency, and propose a generative refinement network to synthesize new contents with higher quality by exploiting the natural image prior from a 2D diffusion model as well as the global 3D information of the current scene. Our extensive experiments demonstrate that, in comparison to previous methods, our approach supports a wide variety of scene generation and arbitrary camera trajectories with improved visual quality and 3D consistency.

Title: Adversarial Fine-tuning of Compressed Neural Networks for Joint Improvement of Robustness and Efficiency

Authors: Hallgrimur Thorsteinsson, Valdemar J Henriksen, Tong Chen, Raghavendra Selvan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2403.09441
Pdf URL: https://arxiv.org/pdf/2403.09441
Copy Paste: [[2403.09441]] Adversarial Fine-tuning of Compressed Neural Networks for Joint Improvement of Robustness and Efficiency(https://arxiv.org/abs/2403.09441)
Keywords: attack, robust
Abstract: As deep learning (DL) models are increasingly being integrated into our everyday lives, ensuring their safety by making them robust against adversarial attacks has become increasingly critical. DL models have been found to be susceptible to adversarial attacks which can be achieved by introducing small, targeted perturbations to disrupt the input data. Adversarial training has been presented as a mitigation strategy which can result in more robust models. This adversarial robustness comes with additional computational costs required to design adversarial attacks during training. The two objectives -- adversarial robustness and computational efficiency -- then appear to be in conflict with each other. In this work, we explore the effects of two different model compression methods -- structured weight pruning and quantization -- on adversarial robustness. We specifically explore the effects of fine-tuning on compressed models, and present the trade-off between standard fine-tuning and adversarial fine-tuning. Our results show that compression does not inherently lead to loss in model robustness and adversarial fine-tuning of a compressed model can yield large improvements to the robustness performance of models. We present experiments on two benchmark datasets showing that adversarial fine-tuning of compressed models can achieve robustness performance comparable to adversarially trained models, while also improving computational efficiency.
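
To make the two compression methods named in this abstract concrete, the sketch below applies structured pruning (zeroing whole rows of a weight matrix) followed by symmetric uniform quantization. The L1-norm pruning criterion, the 8-bit width, the toy weights, and the function names are assumptions made for illustration. The abstract does not give the paper's exact settings, and the adversarial fine-tuning step is not shown.

```python
def prune_rows(weights, keep_ratio):
    """Structured pruning: zero out entire rows with the smallest L1 norm."""
    n_keep = max(1, round(len(weights) * keep_ratio))
    order = sorted(range(len(weights)),
                   key=lambda i: sum(abs(w) for w in weights[i]),
                   reverse=True)
    keep = set(order[:n_keep])
    return [row[:] if i in keep else [0.0] * len(row)
            for i, row in enumerate(weights)]

def quantize(weights, bits=8):
    """Symmetric uniform quantization: snap each weight to a multiple of scale."""
    qmax = 2 ** (bits - 1) - 1
    largest = max(abs(w) for row in weights for w in row)
    scale = largest / qmax if largest > 0 else 1.0
    return [[round(w / scale) * scale for w in row] for row in weights], scale

# Toy weight matrix: one row per output unit (assumed values).
W = [[0.5, -1.0], [0.01, 0.02], [2.0, 0.3], [0.1, -0.1]]
W_pruned = prune_rows(W, keep_ratio=0.5)      # keeps the two rows with largest L1 norm
W_quant, scale = quantize(W_pruned, bits=8)   # rounding error is at most scale / 2
```

In a real pipeline the pruned rows would be removed rather than zeroed, so the layer actually gets smaller. Zeroing keeps the example short.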

Title: Shake to Leak: Fine-tuning Diffusion Models Can Amplify the Generative Privacy Risk

Authors: Zhangheng Li, Junyuan Hong, Bo Li, Zhangyang Wang
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2403.09450
Pdf URL: https://arxiv.org/pdf/2403.09450
Copy Paste: [[2403.09450]] Shake to Leak: Fine-tuning Diffusion Models Can Amplify the Generative Privacy Risk(https://arxiv.org/abs/2403.09450)
Keywords: privacy, attack, membership infer, diffusion, generative
Abstract: While diffusion models have recently demonstrated remarkable progress in generating realistic images, privacy risks also arise: published models or APIs could generate training images and thus leak privacy-sensitive training information. In this paper, we reveal a new risk, Shake-to-Leak (S2L), that fine-tuning the pre-trained models with manipulated data can amplify the existing privacy risks. We demonstrate that S2L could occur in various standard fine-tuning strategies for diffusion models, including concept-injection methods (DreamBooth and Textual Inversion) and parameter-efficient methods (LoRA and Hypernetwork), as well as their combinations. In the worst case, S2L can amplify the state-of-the-art membership inference attack (MIA) on diffusion models by 5.4% (absolute difference) AUC and can increase extracted private samples from almost 0 samples to 16.3 samples on average per target domain. This discovery underscores that the privacy risk with diffusion models is even more severe than previously recognized. Codes are available at https://github.com/VITA-Group/Shake-to-Leak.
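
The "absolute difference" in AUC quoted above is a plain subtraction of two AUC values. As a reminder of what each value measures, the sketch below computes the AUC of a membership inference attack as the probability that a randomly chosen training member gets a higher attack score than a randomly chosen non-member. The scores are toy numbers invented for the example and are not results from the paper. The function name is an assumption.

```python
def auc(member_scores, nonmember_scores):
    """AUC as the fraction of (member, non-member) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

# Toy attack scores (higher means "more likely a training member").
nonmembers = [0.5, 0.3, 0.65]
auc_before = auc([0.6, 0.4, 0.7], nonmembers)   # attack on a baseline model
auc_after = auc([0.8, 0.6, 0.9], nonmembers)    # attack after a change that leaks more
absolute_gain = auc_after - auc_before          # difference in AUC points
```

An AUC of 0.5 corresponds to random guessing, so gains are usually read relative to that floor.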

Title: Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing

Authors: Wonjun Kang, Kevin Galim, Hyung Il Koo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09468
Pdf URL: https://arxiv.org/pdf/2403.09468
Copy Paste: [[2403.09468]] Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing(https://arxiv.org/abs/2403.09468)
Keywords: diffusion
Abstract: Diffusion models have achieved remarkable success in the domain of text-guided image generation and, more recently, in text-guided image editing. A commonly adopted strategy for editing real images involves inverting the diffusion process to obtain a noisy representation of the original image, which is then denoised to achieve the desired edits. However, current methods for diffusion inversion often struggle to produce edits that are both faithful to the specified text prompt and closely resemble the source image. To overcome these limitations, we introduce a novel and adaptable diffusion inversion technique for real image editing, which is grounded in a theoretical analysis of the role of $\eta$ in the DDIM sampling equation for enhanced editability. By designing a universal diffusion inversion method with a time- and region-dependent $\eta$ function, we enable flexible control over the editing extent. Through a comprehensive series of quantitative and qualitative assessments, involving a comparison with a broad array of recent methods, we demonstrate the superiority of our approach. Our method not only sets a new benchmark in the field but also significantly outperforms existing strategies. Our code is available at https://github.com/furiosa-ai/eta-inversion
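
For context on the role of $\eta$, the sketch below writes out one step of the standard DDIM sampling equation with a constant scalar $\eta$. Here $\eta = 0$ gives the deterministic DDIM update and $\eta = 1$ matches the DDPM posterior variance. It does not reproduce the paper's time- and region-dependent $\eta$ function. The scalar inputs, the cumulative noise-schedule values, and the function names are assumptions made for illustration.

```python
import math

def ddim_sigma(eta, alpha_bar_t, alpha_bar_prev):
    """Noise scale sigma_t in DDIM sampling; eta in [0, 1] keeps the step well-defined."""
    return (eta
            * math.sqrt((1 - alpha_bar_prev) / (1 - alpha_bar_t))
            * math.sqrt(1 - alpha_bar_t / alpha_bar_prev))

def ddim_step(x_t, eps_pred, noise, eta, alpha_bar_t, alpha_bar_prev):
    """One DDIM update from x_t to x_{t-1} for scalar inputs."""
    sigma = ddim_sigma(eta, alpha_bar_t, alpha_bar_prev)
    # Predicted clean sample implied by the noise estimate.
    x0_pred = (x_t - math.sqrt(1 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Deterministic component that points back toward x_t.
    direction = math.sqrt(1 - alpha_bar_prev - sigma ** 2) * eps_pred
    # Fresh noise is injected only when eta > 0.
    return math.sqrt(alpha_bar_prev) * x0_pred + direction + sigma * noise

# Toy check with assumed schedule values: build x_t from a known x0 and noise,
# then take a deterministic (eta = 0) step with a perfect noise estimate.
x0, eps = 2.0, 0.5
a_t, a_prev = 0.5, 0.8
x_t = math.sqrt(a_t) * x0 + math.sqrt(1 - a_t) * eps
x_prev = ddim_step(x_t, eps, noise=1.0, eta=0.0, alpha_bar_t=a_t, alpha_bar_prev=a_prev)
```

With $\eta = 0$ and a perfect noise estimate, `x_prev` lands exactly on the less noisy sample for the same `x0` and `eps`. That determinism is what makes the step invertible for editing real images.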

Title: MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models

Authors: Zunnan Xu, Yukang Lin, Haonan Han, Sicheng Yang, Ronghui Li, Yachao Zhang, Xiu Li
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2403.09471
Pdf URL: https://arxiv.org/pdf/2403.09471
Copy Paste: [[2403.09471]] MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models(https://arxiv.org/abs/2403.09471)
Keywords: diffusion
Abstract: Gesture synthesis is a vital realm of human-computer interaction, with wide-ranging applications across various fields like film, robotics, and virtual reality. Recent advancements have utilized the diffusion model and attention mechanisms to improve gesture synthesis. However, due to the high computational complexity of these techniques, generating long and diverse sequences with low latency remains a challenge. We explore the potential of state space models (SSMs) to address the challenge, implementing a two-stage modeling strategy with discrete motion priors to enhance the quality of gestures. Leveraging the foundational Mamba block, we introduce MambaTalk, enhancing gesture diversity and rhythm through multimodal integration. Extensive experiments demonstrate that our method matches or exceeds the performance of state-of-the-art models.

Title: Covert Communication for Untrusted UAV-Assisted Wireless Systems

Authors: Chan Gao, Linying Tian, Dong Zheng
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.09475
Pdf URL: https://arxiv.org/pdf/2403.09475
Copy Paste: [[2403.09475]] Covert Communication for Untrusted UAV-Assisted Wireless Systems(https://arxiv.org/abs/2403.09475)
Keywords: security
Abstract: Wireless systems are of paramount importance for providing ubiquitous data transmission for smart cities. However, due to the broadcasting and openness of wireless channels, such systems face potential security challenges. UAV-assisted covert communication is a supporting technology for improving covert performance and has become a hot issue in the research of wireless communication security. This paper investigates the performance of joint covert and security communication in a two-hop UAV-assisted wireless system, where a source transmits the covert message to a destination with the help of an untrusted UAV. We first design a transmission scheme that uses UAVs to assist in covert communications while ensuring the security of covert messages. Then, we develop a theoretical model to derive the expressions for the detection error probability of the warden and the covert and security rate, and the maximum covert and security rate is optimized by power control under given covertness and security requirements. Finally, numerical results are provided to illustrate our theoretical analysis and the performance of covert and security communication in such systems.

Title: What Sketch Explainability Really Means for Downstream Tasks

Authors: Hmrishav Bandyopadhyay, Pinaki Nath Chowdhury, Ayan Kumar Bhunia, Aneeshan Sain, Tao Xiang, Yi-Zhe Song
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.09480
Pdf URL: https://arxiv.org/pdf/2403.09480
Copy Paste: [[2403.09480]] What Sketch Explainability Really Means for Downstream Tasks(https://arxiv.org/abs/2403.09480)
Keywords: attack, explainability
Abstract: In this paper, we explore the unique modality of sketch for explainability, emphasising the profound impact of human strokes compared to conventional pixel-oriented studies. Beyond explanations of network behavior, we discern the genuine implications of explainability across diverse downstream sketch-related tasks. We propose a lightweight and portable explainability solution -- a seamless plugin that integrates effortlessly with any pre-trained model, eliminating the need for re-training. Demonstrating its adaptability, we present four applications: highly studied retrieval and generation, and completely novel assisted drawing and sketch adversarial attacks. The centrepiece to our solution is a stroke-level attribution map that takes different forms when linked with downstream tasks. By addressing the inherent non-differentiability of rasterisation, we enable explanations at both coarse stroke level (SLA) and partial stroke level (P-SLA), each with its advantages for specific downstream tasks.

Title: Rectifying Demonstration Shortcut in In-Context Learning

Authors: Joonwon Jang, Sanghwan Jang, Wonbin Kweon, Minjin Jeon, Hwanjo Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.09488
Pdf URL: https://arxiv.org/pdf/2403.09488
Copy Paste: [[2403.09488]] Rectifying Demonstration Shortcut in In-Context Learning(https://arxiv.org/abs/2403.09488)
Keywords: large language model
Abstract: Large language models (LLMs) are able to solve various tasks with only a few demonstrations utilizing their in-context learning (ICL) abilities. However, LLMs often rely on their pre-trained semantic priors of demonstrations rather than on the input-label relationships to proceed with ICL prediction. In this work, we term this phenomenon the 'Demonstration Shortcut'. While previous works have primarily focused on improving ICL prediction results for predefined tasks, we aim to rectify the Demonstration Shortcut, thereby enabling the LLM to effectively learn new input-label relationships from demonstrations. To achieve this, we introduce In-Context Calibration, a demonstration-aware calibration method. We evaluate the effectiveness of the proposed method in two settings: (1) the Original ICL Task using the standard label space and (2) the Task Learning setting, where the label space is replaced with semantically unrelated tokens. In both settings, In-Context Calibration demonstrates substantial improvements, with results generalized across three LLM families (OPT, GPT, and Llama2) under various configurations.

Title: EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning

Authors: Jongsuk Kim, Hyeongkeun Lee, Kyeongha Rho, Junmo Kim, Joon Son Chung
Subjects: cs.LG, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2403.09502
Pdf URL: https://arxiv.org/pdf/2403.09502
Copy Paste: [[2403.09502]] EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning(https://arxiv.org/abs/2403.09502)
Keywords: robust
Abstract: Recent advancements in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations. However, despite the advantages of data augmentation verified in many learning methods, audio-visual learning has struggled to fully harness these benefits, as augmentations can easily disrupt the correspondence between input pairs. To address this limitation, we introduce EquiAV, a novel framework that leverages equivariance for audio-visual contrastive learning. Our approach begins with extending equivariance to audio-visual learning, facilitated by a shared attention-based transformation predictor. It enables the aggregation of features from diverse augmentations into a representative embedding, providing robust supervision. Notably, this is achieved with minimal computational overhead. Extensive ablation studies and qualitative results verify the effectiveness of our method. EquiAV outperforms previous works across various audio-visual benchmarks.

Title: SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition

Authors: Jeonghyeok Do, Munchurl Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09508
Pdf URL: https://arxiv.org/pdf/2403.09508
Copy Paste: [[2403.09508]] SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition(https://arxiv.org/abs/2403.09508)
Keywords: transformer
Abstract: Skeleton-based action recognition, which classifies human actions based on the coordinates of joints and their connectivity within skeleton data, is widely utilized in various scenarios. While Graph Convolutional Networks (GCNs) have been proposed for skeleton data represented as graphs, they suffer from limited receptive fields constrained by joint connectivity. To address this limitation, recent advancements have introduced transformer-based methods. However, capturing correlations between all joints in all frames requires substantial memory resources. To alleviate this, we propose a novel approach called Skeletal-Temporal Transformer (SkateFormer) that partitions joints and frames based on different types of skeletal-temporal relation (Skate-Type) and performs skeletal-temporal self-attention (Skate-MSA) within each partition. We categorize the key skeletal-temporal relations for action recognition into a total of four distinct types. These types combine (i) two skeletal relation types based on physically neighboring and distant joints, and (ii) two temporal relation types based on neighboring and distant frames. Through this partition-specific attention strategy, our SkateFormer can selectively focus on key joints and frames crucial for action recognition in an action-adaptive manner with efficient computation. Extensive experiments on various benchmark datasets validate that our SkateFormer outperforms recent state-of-the-art methods.

Title: AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting

Authors: Yu Wang, Xiaogeng Liu, Yu Li, Muhao Chen, Chaowei Xiao
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2403.09513
Pdf URL: https://arxiv.org/pdf/2403.09513
Copy Paste: [[2403.09513]] AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting(https://arxiv.org/abs/2403.09513)
Keywords: defense, attack, robust, large language model
Abstract: With the advent and widespread deployment of Multimodal Large Language Models (MLLMs), the imperative to ensure their safety has become increasingly pronounced. However, with the integration of additional modalities, MLLMs are exposed to new vulnerabilities, rendering them prone to structure-based jailbreak attacks, where semantic content (e.g., "harmful text") has been injected into the images to mislead MLLMs. In this work, we aim to defend against such threats. Specifically, we propose Adaptive Shield Prompting (AdaShield), which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks without fine-tuning MLLMs or training additional modules (e.g., post-stage content detector). Initially, we present a manually designed static defense prompt, which thoroughly examines the image and instruction content step by step and specifies response methods to malicious queries. Furthermore, we introduce an adaptive auto-refinement framework, consisting of a target MLLM and an LLM-based defense prompt generator (Defender). These components collaboratively and iteratively communicate to generate a defense prompt. Extensive experiments on the popular structure-based jailbreak attacks and benign datasets show that our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks without compromising the model's general capabilities evaluated on standard benign tasks. Our code is available at https://github.com/rain305f/AdaShield.

Title: Leveraging Prototypical Representations for Mitigating Social Bias without Demographic Information

Authors: Shadi Iskander, Kira Radinsky, Yonatan Belinkov
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2403.09516
Pdf URL: https://arxiv.org/pdf/2403.09516
Copy Paste: [[2403.09516]] Leveraging Prototypical Representations for Mitigating Social Bias without Demographic Information(https://arxiv.org/abs/2403.09516)
Keywords: fair
Abstract: Mitigating social biases typically requires identifying the social groups associated with each data sample. In this paper, we present DAFair, a novel approach to address social bias in language models. Unlike traditional methods that rely on explicit demographic labels, our approach does not require any such information. Instead, we leverage predefined prototypical demographic texts and incorporate a regularization term during the fine-tuning process to mitigate bias in the model's representations. Our empirical results across two tasks and two models demonstrate the effectiveness of our method compared to previous approaches that do not rely on labeled data. Moreover, with limited demographic-annotated data, our approach outperforms common debiasing approaches.

Title: MT-PATCHER: Selective and Extendable Knowledge Distillation from Large Language Models for Machine Translation

Authors: Jiahuan Li, Shanbo Cheng, Shujian Huang, Jiajun Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.09522
Pdf URL: https://arxiv.org/pdf/2403.09522
Copy Paste: [[2403.09522]] MT-PATCHER: Selective and Extendable Knowledge Distillation from Large Language Models for Machine Translation(https://arxiv.org/abs/2403.09522)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated their strong ability in the field of machine translation (MT), yet they suffer from high computational cost and latency. Therefore, transferring translation knowledge from giant LLMs to medium-sized machine translation models is a promising research direction. However, traditional knowledge distillation methods do not take the capability of student and teacher models into consideration, therefore repeatedly teaching student models knowledge they have already learned, and failing to extend to novel contexts and knowledge. In this paper, we propose a framework called MT-Patcher, which transfers knowledge from LLMs to existing MT models in a selective, comprehensive and proactive manner. Considering the current translation ability of student MT models, we only identify and correct their translation errors, instead of distilling the whole translation from the teacher. Leveraging the strong language abilities of LLMs, we instruct LLM teachers to synthesize diverse contexts and anticipate more potential errors for the student. Experiment results on translating both specific language phenomena and general MT benchmarks demonstrate that finetuning the student MT model on about 10% of the examples can achieve comparable results to the traditional knowledge distillation method, and synthesized potential errors and diverse contexts further improve translation performance on unseen contexts and words.

Title: VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

Authors: Chris Kelly, Luhui Hu, Jiayin Hu, Yu Tian, Deshun Yang, Bang Yang, Cindy Yang, Zihao Li, Zaoshan Huang, Yuexian Zou
Subjects: cs.CV, cs.AI, cs.CL, cs.GR
Abstract URL: https://arxiv.org/abs/2403.09530
Pdf URL: https://arxiv.org/pdf/2403.09530
Copy Paste: [[2403.09530]] VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding(https://arxiv.org/abs/2403.09530)
Keywords: large language model
Abstract: The evolution of text to visual components facilitates people's daily lives, such as generating images and videos from text and identifying the desired elements within images. Earlier computer vision models with multimodal abilities focused on image detection and classification based on well-defined objects. Large language models (LLMs) introduce the transformation from natural language to visual objects, which present the visual layout for text contexts. OpenAI GPT-4 has emerged as the pinnacle in LLMs, while the computer vision (CV) domain boasts a plethora of state-of-the-art (SOTA) models and algorithms to convert 2D images to their 3D representations. However, a mismatch between the algorithm and the problem could lead to undesired results. In response to this challenge, we propose a unified VisionGPT-3D framework to consolidate the state-of-the-art vision models, thereby facilitating the development of vision-oriented AI. VisionGPT-3D provides a versatile multimodal framework building upon the strengths of multimodal foundation models. It seamlessly integrates various SOTA vision models, automates the selection of SOTA vision models, identifies the suitable 3D mesh creation algorithms corresponding to 2D depth map analysis, and generates optimal results based on diverse multimodal inputs such as text prompts. Keywords: VisionGPT-3D, 3D vision understanding, Multimodal agent

Title: Logits of API-Protected LLMs Leak Proprietary Information

Authors: Matthew Finlayson, Swabha Swayamdipta, Xiang Ren
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2403.09539
Pdf URL: https://arxiv.org/pdf/2403.09539
Copy Paste: [[2403.09539]] Logits of API-Protected LLMs Leak Proprietary Information(https://arxiv.org/abs/2403.09539)
Keywords: protect, attack, large language model
Abstract: The commercialization of large language models (LLMs) has led to the common practice of high-level API-only access to proprietary models. In this work, we show that even with a conservative assumption about the model architecture, it is possible to learn a surprisingly large amount of non-public information about an API-protected LLM from a relatively small number of API queries (e.g., costing under $1,000 for OpenAI's gpt-3.5-turbo). Our findings are centered on one key observation: most modern LLMs suffer from a softmax bottleneck, which restricts the model outputs to a linear subspace of the full output space. We show that this lends itself to a model image or a model signature which unlocks several capabilities with affordable cost: efficiently discovering the LLM's hidden size, obtaining full-vocabulary outputs, detecting and disambiguating different model updates, identifying the source LLM given a single full LLM output, and even estimating the output layer parameters. Our empirical investigations show the effectiveness of our methods, which allow us to estimate the embedding size of OpenAI's gpt-3.5-turbo to be about 4,096. Lastly, we discuss ways that LLM providers can guard against these attacks, as well as how these capabilities can be viewed as a feature (rather than a bug) by allowing for greater transparency and accountability.
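
The key observation in the abstract above, that outputs are confined to a low-dimensional linear subspace, can be illustrated with a toy simulation. The sketch below is not the authors' code: it builds a hypothetical random output layer (`simulate_logits`, with made-up sizes), collects more output vectors than the hidden size, and recovers the hidden size as the numerical rank of the stacked outputs. It works on raw logits for simplicity; a real API returns probabilities or log-probabilities, which needs additional handling that is not shown.

```python
import random


def simulate_logits(n_queries, hidden, vocab, seed=0):
    """Toy model: logits = W @ h, with W (vocab x hidden) fixed and h random per query."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(hidden)] for _ in range(vocab)]
    outputs = []
    for _ in range(n_queries):
        h = [rng.gauss(0.0, 1.0) for _ in range(hidden)]
        outputs.append([sum(w * x for w, x in zip(row, h)) for row in weights])
    return outputs


def numerical_rank(matrix, tol=1e-8):
    """Rank via Gaussian elimination with full pivoting (pure Python, small inputs only)."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for step in range(min(n_rows, n_cols)):
        # Locate the largest remaining entry to use as the pivot.
        best, best_r, best_c = 0.0, step, step
        for r in range(step, n_rows):
            for c in range(step, n_cols):
                if abs(m[r][c]) > best:
                    best, best_r, best_c = abs(m[r][c]), r, c
        if best <= tol:
            break  # everything left is numerically zero
        m[step], m[best_r] = m[best_r], m[step]
        for row in m:
            row[step], row[best_c] = row[best_c], row[step]
        # Eliminate the pivot column from the rows below.
        for r in range(step + 1, n_rows):
            factor = m[r][step] / m[step][step]
            if factor != 0.0:
                for c in range(step, n_cols):
                    m[r][c] -= factor * m[step][c]
        rank += 1
    return rank


if __name__ == "__main__":
    outputs = simulate_logits(n_queries=20, hidden=8, vocab=50)
    print("estimated hidden size:", numerical_rank(outputs))
```

Although each output has 50 entries and 20 outputs are collected, the stacked matrix has rank equal to the toy hidden size, which is the signal the abstract describes exploiting.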

Title: RANDAO-based RNG: Last Revealer Attacks in Ethereum 2.0 Randomness and a Potential Solution

Authors: Do Hai Son, Tran Thi Thuy Quynh, Le Quang Minh
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.09541
Pdf URL: https://arxiv.org/pdf/2403.09541
Copy Paste: [[2403.09541]] RANDAO-based RNG: Last Revealer Attacks in Ethereum 2.0 Randomness and a Potential Solution(https://arxiv.org/abs/2403.09541)
Keywords: security, attack
Abstract: Ethereum 2.0 is a major upgrade of Ethereum intended to improve its scalability, throughput, and security. In this version, RANDAO is the scheme used to randomly select the users who propose blocks, confirm blocks, and get rewards. However, a vulnerability, referred to as the 'Last Revealer Attack' (LRA), compromises the randomness of this scheme by introducing bias into the Random Number Generator (RNG) process. This study first revisits and clarifies this vulnerability. We then propose a Shamir's Secret Sharing (SSS)-based RANDAO scheme to mitigate the LRA. Our analysis shows that the proposed method can prevent the LRA under favorable network conditions.
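
As background for the Shamir's Secret Sharing (SSS) building block named in the abstract above, here is a minimal textbook sketch of threshold sharing over a prime field. It is generic SSS, not the paper's RANDAO scheme; the function names, the field prime, and the example parameters are illustrative assumptions.

```python
import secrets

# Assumption: a Mersenne prime large enough for the toy secrets used here.
PRIME = 2**127 - 1


def split_secret(secret, n_shares, threshold):
    """Split `secret` into n_shares points; any `threshold` of them recover it."""
    # Random polynomial of degree threshold-1 with the secret as constant term.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, n_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation mod PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        inv_den = pow(den, PRIME - 2, PRIME)  # modular inverse via Fermat's little theorem
        secret = (secret + yi * num * inv_den) % PRIME
    return secret


if __name__ == "__main__":
    shares = split_secret(123456789, n_shares=5, threshold=3)
    print(recover_secret(shares[:3]) == 123456789)
```

Because any `threshold` shares suffice, a value committed this way can still be reconstructed when a single participant withholds a share, which is the general property that makes SSS a candidate for mitigating a last-revealer bias.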

Title: Explorations in Texture Learning

Authors: Blaine Hoak, Patrick McDaniel
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.09543
Pdf URL: https://arxiv.org/pdf/2403.09543
Copy Paste: [[2403.09543]] Explorations in Texture Learning(https://arxiv.org/abs/2403.09543)
Keywords: interpretability
Abstract: In this work, we investigate texture learning: the identification of textures learned by object classification models, and the extent to which they rely on these textures. We build texture-object associations that uncover new insights about the relationships between texture and object classes in CNNs and find three classes of results: associations that are strong and expected, strong and not expected, and expected but not present. Our analysis demonstrates that investigations in texture learning enable new methods for interpretability and have the potential to uncover unexpected biases.

Title: Breast Cancer Classification Using Gradient Boosting Algorithms Focusing on Reducing the False Negative and SHAP for Explainability

Authors: João Manoel Herrera Pinheiro, Marcelo Becker
Subjects: cs.LG, cs.CY, q-bio.QM
Abstract URL: https://arxiv.org/abs/2403.09548
Pdf URL: https://arxiv.org/pdf/2403.09548
Copy Paste: [[2403.09548]] Breast Cancer Classification Using Gradient Boosting Algorithms Focusing on Reducing the False Negative and SHAP for Explainability(https://arxiv.org/abs/2403.09548)
Keywords: interpretability, explainability
Abstract: Cancer is one of the diseases that kill the most women in the world, with breast cancer being responsible for the highest number of cancer cases and consequently deaths. However, it can be prevented by early detection and, consequently, early treatment. Any development in the detection or prediction of this kind of cancer is important for a healthier life. Many studies focus on a model with high accuracy in cancer prediction, but accuracy alone may not always be a reliable metric. This study employs an investigative approach to studying the performance of different boosting-based machine learning algorithms for predicting breast cancer, focusing on the recall metric. Boosting machine learning algorithms have been proven to be an effective tool for detecting medical diseases. The dataset of the University of California, Irvine (UCI) repository has been utilized to train and test the model classifier that contains their attributes. The main objective of this study is to use state-of-the-art boosting algorithms such as AdaBoost, XGBoost, CatBoost and LightGBM to predict and diagnose breast cancer and to find the most effective metric regarding recall, ROC-AUC, and confusion matrix. Furthermore, our study is the first to use these four boosting algorithms with Optuna, a library for hyperparameter optimization, and the SHAP method to improve the interpretability of our model, which can be used as a support to identify and predict breast cancer. We were able to improve AUC or recall for all the models and reduce the False Negative for AdaBoost and LightGBM; the final AUC was more than 99.41% for all models.
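
The abstract's point that accuracy alone can mislead, and that recall tracks false negatives, can be seen in a few lines. The labels below are invented toy data (1 = malignant, 0 = benign), not results from the paper, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Fraction of actual positives that were detected; low recall means many false negatives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy data: 10 malignant and 90 benign cases; the classifier finds only 4 malignant ones.
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 4 + [0] * 6 + [0] * 90

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(tp, fp, tn, fn))  # 0.94
print("recall:", recall(tp, fn))              # 0.4
```

A classifier that misses six of ten malignant cases still scores 94% accuracy on this imbalanced toy set, which is why a recall-focused evaluation is the more informative choice in this setting.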

Title: WeakSurg: Weakly supervised surgical instrument segmentation using temporal equivariance and semantic continuity

Authors: Qiyuan Wang, Yanzhe Liu, Shang Zhao, Rong Liu, S. Kevin Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09551
Pdf URL: https://arxiv.org/pdf/2403.09551
Copy Paste: [[2403.09551]] WeakSurg: Weakly supervised surgical instrument segmentation using temporal equivariance and semantic continuity(https://arxiv.org/abs/2403.09551)
Keywords: segmentation
Abstract: Weakly supervised surgical instrument segmentation with only instrument presence labels has rarely been explored in the surgical domain. To mitigate the highly under-constrained challenges, we extend a two-stage weakly supervised segmentation paradigm with temporal attributes from two perspectives. From a temporal equivariance perspective, we propose a prototype-based temporal equivariance regulation loss to enhance pixel-wise consistency between adjacent features. From a semantic continuity perspective, we propose a class-aware temporal semantic continuity loss to constrain the semantic consistency between a global view of the target frame and local non-discriminative regions of the adjacent reference frame. To the best of our knowledge, WeakSurg is the first instrument-presence-only weakly supervised segmentation architecture to take temporal information into account for surgical scenarios. Extensive experiments are conducted on Cholec80, an open benchmark for phase and instrument recognition. We annotate instance-wise instrument labels at fixed time-steps, which are double-checked by a clinician with 3 years of experience. Our results show that WeakSurg compares favorably with state-of-the-art methods not only on semantic segmentation metrics but also on instance segmentation metrics.

Title: Less is More: Data Value Estimation for Visual Instruction Tuning

Authors: Zikang Liu, Kun Zhou, Wayne Xin Zhao, Dawei Gao, Yaliang Li, Ji-Rong Wen
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2403.09559
Pdf URL: https://arxiv.org/pdf/2403.09559
Copy Paste: [[2403.09559]] Less is More: Data Value Estimation for Visual Instruction Tuning(https://arxiv.org/abs/2403.09559)
Keywords: large language model
Abstract: Visual instruction tuning is the key to building multimodal large language models (MLLMs), which greatly improves the reasoning capabilities of large language models (LLMs) in vision scenarios. However, existing MLLMs mostly rely on a mixture of multiple highly diverse visual instruction datasets for training (even more than a million instructions), which may introduce data redundancy. To investigate this issue, we conduct a series of empirical studies, which reveal a significant redundancy within the visual instruction datasets, and show that greatly reducing the size of several instruction datasets does not even affect the performance. Based on the findings, we propose a new data selection approach, TIVE, to eliminate redundancy within visual instruction data. TIVE first estimates the task-level and instance-level value of the visual instructions based on computed gradients. Then, according to the estimated values, TIVE determines the task proportion within the visual instructions, and selects representative instances to compose a smaller visual instruction subset for training. Experiments on LLaVA-1.5 show that our approach, using only about 7.5% of the data, can achieve performance comparable to the full-data fine-tuned model across seven benchmarks, even surpassing it on four of the benchmarks. Our code and data will be publicly released.

Title: PreCurious: How Innocent Pre-Trained Language Models Turn into Privacy Traps

Authors: Ruixuan Liu, Tianhao Wang, Yang Cao, Li Xiong
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2403.09562
Pdf URL: https://arxiv.org/pdf/2403.09562
Copy Paste: [[2403.09562]] PreCurious: How Innocent Pre-Trained Language Models Turn into Privacy Traps(https://arxiv.org/abs/2403.09562)
Keywords: privacy, defense, attack, steal, extraction, membership infer
Abstract: The pre-training and fine-tuning paradigm has demonstrated its effectiveness and has become the standard approach for tailoring language models to various tasks. Currently, community-based platforms offer easy access to various pre-trained models, as anyone can publish without strict validation processes. However, a released pre-trained model can be a privacy trap for fine-tuning datasets if it is carefully designed. In this work, we propose the PreCurious framework to reveal the new attack surface where the attacker releases the pre-trained model and gets black-box access to the final fine-tuned model. PreCurious aims to escalate the general privacy risk of both membership inference and data extraction. The key intuition behind PreCurious is to manipulate the memorization stage of the pre-trained model and guide fine-tuning with a seemingly legitimate configuration. The effectiveness of defending against privacy attacks on a fine-tuned model seems promising, as empirical and theoretical evidence suggests that parameter-efficient and differentially private fine-tuning techniques are invulnerable to privacy attacks. But PreCurious demonstrates the possibility of breaking this invulnerability in a stealthy manner compared to fine-tuning on a benign model. By further leveraging a sanitized dataset, PreCurious can extract originally unexposed secrets under differentially private fine-tuning. Thus, PreCurious raises warnings for users who download pre-trained models from unknown sources, rely solely on tutorials or common-sense defenses, or have previously released sanitized datasets, even after perfect scrubbing.

Title: Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation

Authors: Yunhao Gou, Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, James T. Kwok, Yu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09572
Pdf URL: https://arxiv.org/pdf/2403.09572
Copy Paste: [[2403.09572]] Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation(https://arxiv.org/abs/2403.09572)
Keywords: protect, attack, robust, large language model
Abstract: Multimodal large language models (MLLMs) have shown impressive reasoning abilities, but they are also more vulnerable to jailbreak attacks than their LLM predecessors. We observe that, although MLLMs are still capable of detecting unsafe responses, the safety mechanisms of the pre-aligned LLMs in MLLMs can be easily bypassed due to the introduction of image features. To construct robust MLLMs, we propose ECSO (Eyes Closed, Safety On), a novel training-free protecting approach that exploits the inherent safety awareness of MLLMs, and generates safer responses via adaptively transforming unsafe images into texts to activate the intrinsic safety mechanism of pre-aligned LLMs in MLLMs. Experiments on five state-of-the-art (SoTA) MLLMs demonstrate that our ECSO enhances model safety significantly (e.g., a 37.6% improvement on the MM-SafetyBench (SD+OCR), and 71.3% on VLSafe for the LLaVA-1.5-7B), while consistently maintaining utility results on common MLLM benchmarks. Furthermore, we show that ECSO can be used as a data engine to generate supervised-finetuning (SFT) data for MLLM alignment without extra human intervention.

Title: Renovating Names in Open-Vocabulary Segmentation Benchmarks

Authors: Haiwen Huang, Songyou Peng, Dan Zhang, Andreas Geiger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09593
Pdf URL: https://arxiv.org/pdf/2403.09593
Copy Paste: [[2403.09593]] Renovating Names in Open-Vocabulary Segmentation Benchmarks(https://arxiv.org/abs/2403.09593)
Keywords: segmentation
Abstract: Names are essential to both human cognition and vision-language models. Open-vocabulary models utilize class names as text prompts to generalize to categories unseen during training. However, name qualities are often overlooked and lack sufficient precision in existing datasets. In this paper, we address this underexplored problem by presenting a framework for "renovating" names in open-vocabulary segmentation benchmarks (RENOVATE). Through human study, we demonstrate that the names generated by our model are more precise descriptions of the visual segments and hence enhance the quality of existing datasets by means of simple renaming. We further demonstrate that using our renovated names enables training of stronger open-vocabulary segmentation models. Using open-vocabulary segmentation for name quality evaluation, we show that our renovated names lead to up to 16% relative improvement from the original names on various benchmarks across various state-of-the-art models. We provide our code and relabelings for several popular segmentation datasets (ADE20K, Cityscapes, PASCAL Context) to the research community.

Title: Optimistic Verifiable Training by Controlling Hardware Nondeterminism

Authors: Megha Srivastava, Simran Arora, Dan Boneh
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2403.09603
Pdf URL: https://arxiv.org/pdf/2403.09603
Copy Paste: [[2403.09603]] Optimistic Verifiable Training by Controlling Hardware Nondeterminism(https://arxiv.org/abs/2403.09603)
Keywords: attack, robust
Abstract: The increasing compute demands of AI systems have led to the emergence of services that train models on behalf of clients lacking necessary resources. However, ensuring correctness of training and guarding against potential training-time attacks, such as data poisoning, poses challenges. Existing works on verifiable training largely fall into two classes: proof-based systems, which struggle to scale due to requiring cryptographic techniques, and "optimistic" methods that consider a trusted third-party auditor who replicates the training process. A key challenge with the latter is that hardware nondeterminism between GPU types during training prevents an auditor from replicating the training process exactly, and such schemes are therefore non-robust. We propose a method that combines training in a higher precision than the target model, rounding after intermediate computation steps, and storing rounding decisions based on an adaptive thresholding procedure, to successfully control for nondeterminism. Across three different NVIDIA GPUs (A40, Titan XP, RTX 2080 Ti), we achieve exact training replication at FP32 precision for both full-training and fine-tuning of ResNet-50 (23M) and GPT-2 (117M) models. Our verifiable training scheme significantly decreases the storage and time costs compared to proof-based systems.
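
The idea of computing in higher precision and then rounding, as described in the abstract above, can be illustrated on the CPU with plain Python floats. This is only a toy analogy, not the paper's GPU procedure: two summation orders stand in for two hardware platforms, Python's 64-bit floats stand in for the higher precision, and `to_float32` is a hypothetical helper that rounds to 32-bit. The storing of rounding decisions via adaptive thresholding, needed when a value falls close to a rounding boundary, is not shown.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def sum_left_to_right(values):
    """Accumulate in 64-bit precision, in the given order."""
    total = 0.0
    for v in values:
        total += v
    return total


# Two accumulation orders, standing in for two devices that reduce in different orders.
forward = sum_left_to_right([0.1, 0.2, 0.3])
backward = sum_left_to_right([0.3, 0.2, 0.1])

print("64-bit results identical:", forward == backward)
print("after rounding to 32-bit:", to_float32(forward) == to_float32(backward))
```

The two 64-bit sums differ in their last bit, yet both round to the same 32-bit value. Rounding can only fail to reconcile such results when a value sits near a rounding boundary, which is the case the abstract's stored rounding decisions are meant to handle.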

Title: Counterfactual contrastive learning: robust representations via causal image synthesis

Authors: Melanie Roschewitz, Fabio De Sousa Ribeiro, Tian Xia, Galvin Khara, Ben Glocker
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2403.09605
Pdf URL: https://arxiv.org/pdf/2403.09605
Copy Paste: [[2403.09605]] Counterfactual contrastive learning: robust representations via causal image synthesis(https://arxiv.org/abs/2403.09605)
Keywords: robust
Abstract: Contrastive pretraining is well-known to improve downstream task performance and model generalisation, especially in limited label settings. However, it is sensitive to the choice of augmentation pipeline. Positive pairs should preserve semantic information while destroying domain-specific information. Standard augmentation pipelines emulate domain-specific changes with pre-defined photometric transformations, but what if we could simulate realistic domain changes instead? In this work, we show how to utilise recent progress in counterfactual image generation to this effect. We propose CF-SimCLR, a counterfactual contrastive learning approach which leverages approximate counterfactual inference for positive pair creation. Comprehensive evaluation across five datasets, on chest radiography and mammography, demonstrates that CF-SimCLR substantially improves robustness to acquisition shift with higher downstream performance on both in- and out-of-distribution data, particularly for domains which are under-represented during training.

Title: Large Language Models and Causal Inference in Collaboration: A Comprehensive Survey

Authors: Xiaoyu Liu, Paiheng Xu, Junda Wu, Jiaxin Yuan, Yifan Yang, Yuhang Zhou, Fuxiao Liu, Tianrui Guan, Haoliang Wang, Tong Yu, Julian McAuley, Wei Ai, Furong Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2403.09606
Pdf URL: https://arxiv.org/pdf/2403.09606
Copy Paste: [[2403.09606]] Large Language Models and Causal Inference in Collaboration: A Comprehensive Survey(https://arxiv.org/abs/2403.09606)
Keywords: robust, fair, explainability, generative, large language model
Abstract: Causal inference has shown potential in enhancing the predictive accuracy, fairness, robustness, and explainability of Natural Language Processing (NLP) models by capturing causal relationships among variables. The emergence of generative Large Language Models (LLMs) has significantly impacted various NLP domains, particularly through their advanced reasoning capabilities. This survey focuses on evaluating and improving LLMs from a causal view in the following areas: understanding and improving the LLMs' reasoning capacity, addressing fairness and safety issues in LLMs, complementing LLMs with explanations, and handling multimodality. Meanwhile, LLMs' strong reasoning capacities can in turn contribute to the field of causal inference by aiding causal relationship discovery and causal effect estimations. This review explores the interplay between causal inference frameworks and LLMs from both perspectives, emphasizing their collective potential to further the development of more advanced and equitable artificial intelligence systems.

Title: MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Authors: Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2403.09611
Pdf URL: https://arxiv.org/pdf/2403.09611
Copy Paste: [[2403.09611]] MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training(https://arxiv.org/abs/2403.09611)
Keywords: large language model
Abstract: In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, consisting of both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.

Title: Reawakening knowledge: Anticipatory recovery from catastrophic interference via structured training

Authors: Yanlai Yang, Matt Jones, Michael C. Mozer, Mengye Ren
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2403.09613
Pdf URL: https://arxiv.org/pdf/2403.09613
Copy Paste: [[2403.09613]] Reawakening knowledge: Anticipatory recovery from catastrophic interference via structured training(https://arxiv.org/abs/2403.09613)
Keywords: robust
Abstract: We explore the training dynamics of neural networks in a structured non-IID setting where documents are presented cyclically in a fixed, repeated sequence. Typically, networks suffer from catastrophic interference when training on a sequence of documents; however, we discover a curious and remarkable property of LLMs fine-tuned sequentially in this setting: they exhibit anticipatory behavior, recovering from the forgetting on documents before encountering them again. The behavior emerges and becomes more robust as the architecture scales up its number of parameters. Through comprehensive experiments and visualizations, we uncover new insights into training over-parameterized networks in structured environments.

Title: Explore In-Context Segmentation via Latent Diffusion Models

Authors: Chaoyang Wang, Xiangtai Li, Henghui Ding, Lu Qi, Jiangning Zhang, Yunhai Tong, Chen Change Loy, Shuicheng Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09616
Pdf URL: https://arxiv.org/pdf/2403.09616
Copy Paste: [[2403.09616]] Explore In-Context Segmentation via Latent Diffusion Models(https://arxiv.org/abs/2403.09616)
Keywords: fair, diffusion, segmentation
Abstract: In-context segmentation has drawn more attention with the introduction of vision foundation models. Most existing approaches adopt metric learning or masked image modeling to build the correlation between visual prompts and input image queries. In this work, we explore this problem from a new perspective, using one representative generation model, the latent diffusion model (LDM). We observe a task gap between generation and segmentation in diffusion models, but LDM is still an effective minimalist for in-context segmentation. In particular, we propose two meta-architectures and correspondingly design several output alignment and optimization strategies. We have conducted comprehensive ablation studies and empirically found that the segmentation quality depends on output alignment and in-context instructions. Moreover, we build a new and fair in-context segmentation benchmark that includes both image and video datasets. Experiments validate the efficiency of our approach, demonstrating comparable or even stronger results than previous specialist models or visual foundation models. Our study shows that LDMs can also achieve good enough results for challenging in-context segmentation tasks.

Title: PosSAM: Panoptic Open-vocabulary Segment Anything

Authors: Vibashan VS, Shubhankar Borse, Hyojin Park, Debasmit Das, Vishal Patel, Munawar Hayat, Fatih Porikli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09620
Pdf URL: https://arxiv.org/pdf/2403.09620
Copy Paste: [[2403.09620]] PosSAM: Panoptic Open-vocabulary Segment Anything(https://arxiv.org/abs/2403.09620)
Keywords: segmentation
Abstract: In this paper, we introduce an open-vocabulary panoptic segmentation model that effectively unifies the strengths of the Segment Anything Model (SAM) with the vision-language CLIP model in an end-to-end framework. While SAM excels in generating spatially-aware masks, its decoder falls short in recognizing object class information and tends to oversegment without additional guidance. Existing approaches address this limitation by using multi-stage techniques and employing separate models to generate class-aware prompts, such as bounding boxes or segmentation masks. Our proposed method, PosSAM, is an end-to-end model which leverages SAM's spatially rich features to produce instance-aware masks and harnesses CLIP's semantically discriminative features for effective instance classification. Specifically, we address the limitations of SAM and propose a novel Local Discriminative Pooling (LDP) module leveraging class-agnostic SAM and class-aware CLIP features for unbiased open-vocabulary classification. Furthermore, we introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image. We conducted extensive experiments to demonstrate our method's strong generalization properties across multiple datasets, achieving state-of-the-art performance with substantial improvements over SOTA open-vocabulary panoptic segmentation methods. In both COCO to ADE20K and ADE20K to COCO settings, PosSAM outperforms the previous state-of-the-art methods by a large margin, 2.4 PQ and 4.6 PQ, respectively. Project Website: https://vibashan.github.io/possam-web/.

Title: Minimax Optimal and Computationally Efficient Algorithms for Distributionally Robust Offline Reinforcement Learning

Authors: Zhishuai Liu, Pan Xu
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2403.09621
Pdf URL: https://arxiv.org/pdf/2403.09621
Copy Paste: [[2403.09621]] Minimax Optimal and Computationally Efficient Algorithms for Distributionally Robust Offline Reinforcement Learning(https://arxiv.org/abs/2403.09621)
Keywords: robust
Abstract: Distributionally robust offline reinforcement learning (RL), which seeks robust policy training against environment perturbation by modeling dynamics uncertainty, calls for function approximations when facing large state-action spaces. However, the consideration of dynamics uncertainty introduces essential nonlinearity and computational burden, posing unique challenges for analyzing and practically employing function approximation. Focusing on a basic setting where the nominal model and perturbed models are linearly parameterized, we propose minimax optimal and computationally efficient algorithms realizing function approximation and initiate the study on instance-dependent suboptimality analysis in the context of robust offline RL. Our results uncover that function approximation in robust offline RL is essentially distinct from and probably harder than that in standard offline RL. Our algorithms and theoretical results crucially depend on a variety of new techniques, involving a novel function approximation mechanism incorporating variance information, a new procedure of suboptimality and estimation uncertainty decomposition, a quantification of the robust value function shrinkage, and a meticulously designed family of hard instances, which might be of independent interest.

Title: Score-Guided Diffusion for 3D Human Recovery

Authors: Anastasis Stathopoulos, Ligong Han, Dimitris Metaxas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09623
Pdf URL: https://arxiv.org/pdf/2403.09623
Copy Paste: [[2403.09623]] Score-Guided Diffusion for 3D Human Recovery(https://arxiv.org/abs/2403.09623)
Keywords: diffusion
Abstract: We present Score-Guided Human Mesh Recovery (ScoreHMR), an approach for solving inverse problems for 3D human pose and shape reconstruction. These inverse problems involve fitting a human body model to image observations, traditionally solved through optimization techniques. ScoreHMR mimics model fitting approaches, but alignment with the image observation is achieved through score guidance in the latent space of a diffusion model. The diffusion model is trained to capture the conditional distribution of the human model parameters given an input image. By guiding its denoising process with a task-specific score, ScoreHMR effectively solves inverse problems for various applications without the need for retraining the task-agnostic diffusion model. We evaluate our approach on three settings/applications. These are: (i) single-frame model fitting; (ii) reconstruction from multiple uncalibrated views; (iii) reconstructing humans in video sequences. ScoreHMR consistently outperforms all optimization baselines on popular benchmarks across all settings. We make our code and models available at https://statho.github.io/ScoreHMR.

Title: Make-Your-3D: Fast and Consistent Subject-Driven 3D Content Generation

Authors: Fangfu Liu, Hanyang Wang, Weiliang Chen, Haowen Sun, Yueqi Duan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.09625
Pdf URL: https://arxiv.org/pdf/2403.09625
Copy Paste: [[2403.09625]] Make-Your-3D: Fast and Consistent Subject-Driven 3D Content Generation(https://arxiv.org/abs/2403.09625)
Keywords: diffusion, generative
Abstract: Recent years have witnessed the strong power of 3D generation models, which offer a new level of creative flexibility by allowing users to guide the 3D content generation process through a single image or natural language. However, it remains challenging for existing 3D generation methods to create subject-driven 3D content across diverse prompts. In this paper, we introduce a novel 3D customization method, dubbed Make-Your-3D, that can personalize high-fidelity and consistent 3D content from only a single image of a subject with a text description within 5 minutes. Our key insight is to harmonize the distributions of a multi-view diffusion model and an identity-specific 2D generative model, aligning them with the distribution of the desired 3D subject. Specifically, we design a co-evolution framework to reduce the variance of distributions, where each model undergoes a process of learning from the other through identity-aware optimization and subject-prior optimization, respectively. Extensive experiments demonstrate that our method can produce high-quality, consistent, and subject-specific 3D content with text-driven modifications that are unseen in the subject image.

Title: Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

Authors: Guo Chen, Yifei Huang, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, Limin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09626
Pdf URL: https://arxiv.org/pdf/2403.09626
Copy Paste: [[2403.09626]] Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding(https://arxiv.org/abs/2403.09626)
Keywords: transformer
Abstract: Understanding videos is one of the fundamental directions in computer vision research, with extensive efforts dedicated to exploring various architectures such as RNN, 3D CNN, and Transformers. The newly proposed architecture of state space model, e.g., Mamba, shows promising traits to extend its success in long sequence modeling to video modeling. To assess whether Mamba can be a viable alternative to Transformers in the video understanding domain, in this work, we conduct a comprehensive set of studies, probing different roles Mamba can play in modeling videos, while investigating diverse tasks where Mamba could exhibit superiority. We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks. Our extensive experiments reveal the strong potential of Mamba on both video-only and video-language tasks while showing promising efficiency-performance trade-offs. We hope this work could provide valuable data points and insights for future research on video understanding. Code is public: https://github.com/OpenGVLab/video-mamba-suite.

Title: Generalized Predictive Model for Autonomous Driving

Authors: Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, Jun Zhang, Andreas Geiger, Yu Qiao, Hongyang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09630
Pdf URL: https://arxiv.org/pdf/2403.09630
Copy Paste: [[2403.09630]] Generalized Predictive Model for Autonomous Driving(https://arxiv.org/abs/2403.09630)
Keywords: diffusion
Abstract: In this paper, we introduce the first large-scale video prediction model in the autonomous driving discipline. To eliminate the restriction of high-cost data collection and empower the generalization ability of our model, we acquire massive data from the web and pair it with diverse and high-quality text descriptions. The resultant dataset accumulates over 2000 hours of driving videos, spanning areas all over the world with diverse weather conditions and traffic scenarios. Inheriting the merits from recent latent diffusion models, our model, dubbed GenAD, handles the challenging dynamics in driving scenes with novel temporal reasoning blocks. We showcase that it can generalize to various unseen driving datasets in a zero-shot manner, surpassing general or driving-specific video prediction counterparts. Furthermore, GenAD can be adapted into an action-conditioned prediction model or a motion planner, holding great potential for real-world driving applications.

Title: 3D-VLA: A 3D Vision-Language-Action Generative World Model

Authors: Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, Chuang Gan
Subjects: cs.CV, cs.AI, cs.CL, cs.RO
Abstract URL: https://arxiv.org/abs/2403.09631
Pdf URL: https://arxiv.org/pdf/2403.09631
Copy Paste: [[2403.09631]] 3D-VLA: A 3D Vision-Language-Action Generative World Model(https://arxiv.org/abs/2403.09631)
Keywords: diffusion, generative, large language model
Abstract: Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from perception to action, neglecting the vast dynamics of the world and the relations between actions and dynamics. In contrast, human beings are endowed with world models that depict imagination about future scenarios to plan actions accordingly. To this end, we propose 3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action through a generative world model. Specifically, 3D-VLA is built on top of a 3D-based large language model (LLM), and a set of interaction tokens is introduced to engage with the embodied environment. Furthermore, to inject generation abilities into the model, we train a series of embodied diffusion models and align them into the LLM for predicting the goal images and point clouds. To train our 3D-VLA, we curate a large-scale 3D embodied instruction dataset by extracting vast 3D-related information from existing robotics datasets. Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments, showcasing its potential in real-world applications.

Title: Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models

Authors: Akhil Kedia, Mohd Abbas Zaidi, Sushil Khyalia, Jungho Jung, Harshith Goka, Haejun Lee
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2403.09635
Pdf URL: https://arxiv.org/pdf/2403.09635
Copy Paste: [[2403.09635]] Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models(https://arxiv.org/abs/2403.09635)
Keywords: robust, transformer
Abstract: In spite of their huge success, transformer models remain difficult to scale in depth. In this work, we develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model. Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores. We also propose DeepScaleLM, an initialization and scaling scheme that conserves unit output/gradient moments throughout the model, enabling the training of very deep models with 100s of layers. We find that transformer models could be much deeper - our deep models with fewer parameters outperform shallow models in Language Modeling, Speech Translation, and Image Classification, across Encoder-only, Decoder-only and Encoder-Decoder variants, for both Pre-LN and Post-LN transformers, for multiple datasets and model sizes. These improvements also translate into improved performance on downstream Question Answering tasks and improved robustness for image classification.

Title: Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

Authors: Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2403.09636
Pdf URL: https://arxiv.org/pdf/2403.09636
Copy Paste: [[2403.09636]] Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference(https://arxiv.org/abs/2403.09636)
Keywords: transformer, large language model
Abstract: Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for on-line key-value cache compression at inference time. Most importantly, the model learns to apply different compression rates in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to ~3.7x throughput increase in auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. We find that DMC preserves the original downstream performance with up to 4x cache compression, outperforming up-trained grouped-query attention (GQA). GQA and DMC can even be combined to obtain compounded gains. As a result, DMC fits longer contexts and larger batches within any given memory budget.
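
Note: the linear scaling of the key-value cache with sequence length and batch size can be checked with a small back-of-the-envelope calculation. The sketch below is illustrative only. The function name is hypothetical, the model shape (32 layers, 32 heads, head dimension 128, FP16 values) is an assumption based on commonly reported Llama 2 7B settings, and a single uniform compression ratio is a simplification, since the abstract states that DMC learns different rates per head and layer.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch,
                   bytes_per_value=2, compression=1.0):
    """Approximate KV cache size: keys and values (factor 2) per layer, head, and token."""
    full = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value
    return full / compression

GIB = 1024 ** 3

# Assumed shape: 32 layers, 32 KV heads, head dimension 128, FP16 (2 bytes per value).
base = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1)
compressed = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1, compression=4.0)

print(base / GIB)        # 2.0
print(compressed / GIB)  # 0.5
```

Under these assumptions, doubling either seq_len or batch doubles the cache size, which is why a 4x compression ratio frees room for roughly 4x longer contexts or 4x larger batches in the same memory budget.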

Title: SCP-Diff: Photo-Realistic Semantic Image Synthesis with Spatial-Categorical Joint Prior

Authors: Huan-ang Gao, Mingju Gao, Jiaju Li, Wenyi Li, Rong Zhi, Hao Tang, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2403.09638
Pdf URL: https://arxiv.org/pdf/2403.09638
Copy Paste: [[2403.09638]] SCP-Diff: Photo-Realistic Semantic Image Synthesis with Spatial-Categorical Joint Prior(https://arxiv.org/abs/2403.09638)
Keywords: diffusion
Abstract: Semantic image synthesis (SIS) shows good promise for sensor simulation. However, current best practices in this field, based on GANs, have not yet reached the desired level of quality. As latent diffusion models make significant strides in image generation, we are prompted to evaluate ControlNet, a notable method for its dense control capabilities. Our investigation uncovered two primary issues with its results: the presence of weird sub-structures within large semantic areas and the misalignment of content with the semantic mask. Through empirical study, we pinpointed the cause of these problems as a mismatch between the noised training data distribution and the standard normal prior applied at the inference stage. To address this challenge, we developed specific noise priors for SIS, encompassing spatial, categorical, and a novel spatial-categorical joint prior for inference. This approach, which we have named SCP-Diff, has yielded exceptional results, achieving an FID of 10.53 on Cityscapes and 12.66 on ADE20K. The code and models can be accessed via the project page.