2024-04-19

Title: SNP: Structured Neuron-level Pruning to Preserve Attention Scores

Authors: Kyunghwan Shim, Jaewoong Yun, Shinkook Choi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2404.11630
Pdf URL: https://arxiv.org/pdf/2404.11630
Copy Paste: [[2404.11630]] SNP: Structured Neuron-level Pruning to Preserve Attention Scores(https://arxiv.org/abs/2404.11630)
Keywords: transformer
Abstract: Multi-head self-attention (MSA) is a key component of Vision Transformers (ViTs), which have achieved great success in various vision tasks. However, their high computational cost and memory footprint hinder their deployment on resource-constrained devices. Conventional pruning approaches can only compress and accelerate the MSA module using head pruning, although the head is not an atomic unit. To address this issue, we propose a novel graph-aware neuron-level pruning method, Structured Neuron-level Pruning (SNP). SNP prunes neurons with less informative attention scores and eliminates redundancy among heads. Specifically, it prunes graphically connected query and key layers having the least informative attention scores while preserving the overall attention scores. Value layers, which can be pruned independently, are pruned to eliminate inter-head redundancy. Our proposed method effectively compresses and accelerates Transformer-based models for both edge devices and server processors. For instance, the DeiT-Small with SNP runs 3.1$\times$ faster than the original model and achieves performance that is 21.94\% faster and 1.12\% higher than the DeiT-Tiny. Additionally, SNP combines successfully with conventional head or block pruning approaches. SNP with head pruning could compress the DeiT-Base by 80\% of the parameters and computational costs and achieve 3.85$\times$ faster inference speed on RTX3090 and 4.93$\times$ on Jetson Nano.

Title: Exploring DNN Robustness Against Adversarial Attacks Using Approximate Multipliers

Authors: Mohammad Javad Askarizadeh, Ebrahim Farahmand, Jorge Castro-Godinez, Ali Mahani, Laura Cabrera-Quiros, Carlos Salazar-Garcia
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2404.11665
Pdf URL: https://arxiv.org/pdf/2404.11665
Copy Paste: [[2404.11665]] Exploring DNN Robustness Against Adversarial Attacks Using Approximate Multipliers(https://arxiv.org/abs/2404.11665)
Keywords: attack, robust
Abstract: Deep Neural Networks (DNNs) have advanced in many real-world applications, such as healthcare and autonomous driving. However, their high computational complexity and vulnerability to adversarial attacks are ongoing challenges. In this letter, approximate multipliers are used to explore DNN robustness improvement against adversarial attacks. By uniformly replacing accurate multipliers with state-of-the-art approximate ones in DNN layer models, we explore the DNNs' robustness against various adversarial attacks in a feasible time. Results show an accuracy drop of up to 7% due to approximations when no attack is present, and an improvement in robust accuracy of up to 10% when attacks are applied.

Title: MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory

Authors: Ali Modarressi, Abdullatif Köksal, Ayyoob Imani, Mohsen Fayyaz, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.11672
Pdf URL: https://arxiv.org/pdf/2404.11672
Copy Paste: [[2404.11672]] MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory(https://arxiv.org/abs/2404.11672)
Keywords: interpretability, large language model
Abstract: While current large language models (LLMs) demonstrate some capabilities in knowledge-intensive tasks, they are limited by relying on their parameters as an implicit storage mechanism. As a result, they struggle with infrequent knowledge and temporal degradation. In addition, the uninterpretable nature of parametric memorization makes it challenging to understand and prevent hallucination. Parametric memory pools and model editing are only partial solutions. Retrieval Augmented Generation (RAG) – though non-parametric – has its own limitations: it lacks structure, complicates interpretability and makes it hard to effectively manage stored knowledge. In this paper, we introduce MemLLM, a novel method of enhancing LLMs by integrating a structured and explicit read-and-write memory module. MemLLM tackles the aforementioned challenges by enabling dynamic interaction with the memory and improving the LLM's capabilities in using stored knowledge. Our experiments indicate that MemLLM enhances the LLM's performance and interpretability, in language modeling in general and knowledge-intensive tasks in particular. We see MemLLM as an important step towards making LLMs more grounded and factual through memory augmentation.

Title: How often are errors in natural language reasoning due to paraphrastic variability?

Authors: Neha Srikanth, Marine Carpuat, Rachel Rudinger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.11717
Pdf URL: https://arxiv.org/pdf/2404.11717
Copy Paste: [[2404.11717]] How often are errors in natural language reasoning due to paraphrastic variability?(https://arxiv.org/abs/2404.11717)
Keywords: large language model
Abstract: Large language models have been shown to behave inconsistently in response to meaning-preserving paraphrastic inputs. At the same time, researchers evaluate the knowledge and reasoning abilities of these models with test evaluations that do not disaggregate the effect of paraphrastic variability on performance. We propose a metric for evaluating the paraphrastic consistency of natural language reasoning models based on the probability of a model achieving the same correctness on two paraphrases of the same problem. We mathematically connect this metric to the proportion of a model's variance in correctness attributable to paraphrasing. To estimate paraphrastic consistency, we collect ParaNLU, a dataset of 7,782 human-written and validated paraphrased reasoning problems constructed on top of existing benchmark datasets for defeasible and abductive natural language inference. Using ParaNLU, we measure the paraphrastic consistency of several model classes and show that consistency dramatically increases with pretraining but not finetuning. All models tested exhibited room for improvement in paraphrastic consistency.
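Illustration: The metric described above, the probability that a model achieves the same correctness on two paraphrases of the same problem, can be estimated with a simple pairwise count. The following is a minimal sketch under stated assumptions, not the authors' estimator: the function name `paraphrastic_consistency` and the input layout (one list of booleans per problem, one boolean per paraphrase) are hypothetical, and problems are weighted equally.

```python
from math import comb


def paraphrastic_consistency(correctness_by_problem):
    """Estimate P(same correctness on two distinct paraphrases of one problem).

    `correctness_by_problem` is a list of lists of booleans (hypothetical
    layout): one inner list per original problem, one entry per paraphrase.
    """
    per_problem = []
    for outcomes in correctness_by_problem:
        n = len(outcomes)
        if n < 2:
            continue  # a single paraphrase yields no pairs to compare
        k = sum(1 for outcome in outcomes if outcome)  # correct paraphrases
        # Pairs that agree: both correct, or both incorrect.
        agreeing_pairs = comb(k, 2) + comb(n - k, 2)
        per_problem.append(agreeing_pairs / comb(n, 2))
    if not per_problem:
        raise ValueError("need at least one problem with two or more paraphrases")
    # Unweighted mean over problems (an assumption of this sketch).
    return sum(per_problem) / len(per_problem)
```

A model that is always right or always wrong on every paraphrase of a problem scores 1.0 for that problem; a model that is right on exactly half of the paraphrases scores lowest.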

Title: Missed Connections: Lateral Thinking Puzzles for Large Language Models

Authors: Graham Todd, Tim Merino, Sam Earle, Julian Togelius
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.11730
Pdf URL: https://arxiv.org/pdf/2404.11730
Copy Paste: [[2404.11730]] Missed Connections: Lateral Thinking Puzzles for Large Language Models(https://arxiv.org/abs/2404.11730)
Keywords: large language model
Abstract: The Connections puzzle published each day by the New York Times tasks players with dividing a bank of sixteen words into four groups of four words that each relate to a common theme. Solving the puzzle requires both common linguistic knowledge (i.e. definitions and typical usage) as well as, in many cases, lateral or abstract thinking. This is because the four categories ascend in complexity, with the most challenging category often requiring thinking about words in uncommon ways or as parts of larger phrases. We investigate the capacity for automated AI systems to play Connections and explore the game's potential as an automated benchmark for abstract reasoning and a way to measure the semantic information encoded by data-driven linguistic systems. In particular, we study both a sentence-embedding baseline and modern large language models (LLMs). We report their accuracy on the task, measure the impacts of chain-of-thought prompting, and discuss their failure modes. Overall, we find that the Connections task is challenging yet feasible, and a strong test-bed for future work.

Title: Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach

Authors: Mir Rayat Imtiaz Hossain, Mennatullah Siam, Leonid Sigal, James J. Little
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.11732
Pdf URL: https://arxiv.org/pdf/2404.11732
Copy Paste: [[2404.11732]] Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach(https://arxiv.org/abs/2404.11732)
Keywords: transformer, segmentation
Abstract: The emergence of attention-based transformer models has led to their extensive use in various tasks, due to their superior generalization and transfer properties. Recent research has demonstrated that such models, when prompted appropriately, are excellent for few-shot inference. However, such techniques are under-explored for dense prediction tasks like semantic segmentation. In this work, we examine the effectiveness of prompting a transformer-decoder with learned visual prompts for the generalized few-shot segmentation (GFSS) task. Our goal is to achieve strong performance not only on novel categories with limited examples, but also to retain performance on base categories. We propose an approach to learn visual prompts with limited examples. These learned visual prompts are used to prompt a multiscale transformer decoder to facilitate accurate dense predictions. Additionally, we introduce a unidirectional causal attention mechanism between the novel prompts, learned with limited examples, and the base prompts, learned with abundant data. This mechanism enriches the novel prompts without deteriorating the base class performance. Overall, this form of prompting helps us achieve state-of-the-art performance for GFSS on two different benchmark datasets: COCO-$20^i$ and Pascal-$5^i$, without the need for test-time optimization (or transduction). Furthermore, test-time optimization leveraging unlabelled test data can be used to improve the prompts, which we refer to as transductive prompt tuning.

Title: Equivariant Spatio-Temporal Self-Supervision for LiDAR Object Detection

Authors: Deepti Hegde, Suhas Lohit, Kuan-Chuan Peng, Michael J. Jones, Vishal M. Patel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.11737
Pdf URL: https://arxiv.org/pdf/2404.11737
Copy Paste: [[2404.11737]] Equivariant Spatio-Temporal Self-Supervision for LiDAR Object Detection(https://arxiv.org/abs/2404.11737)
Keywords: segmentation
Abstract: Popular representation learning methods encourage feature invariance under transformations applied at the input. However, in 3D perception tasks like object localization and segmentation, outputs are naturally equivariant to some transformations, such as rotation. Using pre-training loss functions that encourage equivariance of features under certain transformations provides a strong self-supervision signal while also retaining information of geometric relationships between transformed feature representations. This can enable improved performance in downstream tasks that are equivariant to such transformations. In this paper, we propose a spatio-temporal equivariant learning framework by considering both spatial and temporal augmentations jointly. Our experiments show that the best performance arises with a pre-training approach that encourages equivariance to translation, scaling, flip, rotation, and scene flow. For spatial augmentations, we find that depending on the transformation, either a contrastive objective or an equivariance-by-classification objective yields best results. To leverage real-world object deformations and motion, we consider sequential LiDAR scene pairs and develop a novel 3D scene flow-based equivariance objective that leads to improved performance overall. We demonstrate our pre-training method on 3D object detection, where it outperforms existing equivariant and invariant approaches in many settings.

Title: Improved Generalization Bounds for Communication Efficient Federated Learning

Authors: Peyman Gholami, Hulya Seferoglu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2404.11754
Pdf URL: https://arxiv.org/pdf/2404.11754
Copy Paste: [[2404.11754]] Improved Generalization Bounds for Communication Efficient Federated Learning(https://arxiv.org/abs/2404.11754)
Keywords: federate
Abstract: This paper focuses on reducing the communication cost of federated learning by exploring generalization bounds and representation learning. We first characterize a tighter generalization bound for one-round federated learning based on local clients' generalizations and heterogeneity of data distribution (non-iid scenario). We also characterize a generalization bound in R-round federated learning and its relation to the number of local updates (local stochastic gradient descents (SGDs)). Then, based on our generalization bound analysis and our representation learning interpretation of this analysis, we show for the first time that less frequent aggregations, hence more local updates, for the representation extractor (which usually corresponds to the initial layers) lead to the creation of more generalizable models, particularly for non-iid scenarios. We design a novel Federated Learning with Adaptive Local Steps (FedALS) algorithm based on our generalization bound and representation learning analysis. FedALS employs varying aggregation frequencies for different parts of the model, and so reduces the communication cost. The paper closes with experimental results showing the effectiveness of FedALS.
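Illustration: The idea of aggregating different parts of the model at different frequencies can be sketched as a small schedule. This is only an illustrative sketch, not the FedALS algorithm itself: the abstract does not give the schedule, so the names `parts_to_aggregate`, `head_period`, and `extractor_multiplier`, the label "head" for the non-extractor layers, and the choice to aggregate the extractor every `head_period * extractor_multiplier` local steps are all assumptions.

```python
def parts_to_aggregate(step, head_period, extractor_multiplier):
    """Return which model parts are averaged across clients after local step `step`.

    Steps are counted from 1. The "head" is averaged every `head_period` steps;
    the "extractor" is averaged `extractor_multiplier` times less often.
    """
    if step < 1 or head_period < 1 or extractor_multiplier < 1:
        raise ValueError("step, head_period and extractor_multiplier must be >= 1")
    parts = []
    if step % head_period == 0:
        parts.append("head")
    if step % (head_period * extractor_multiplier) == 0:
        parts.append("extractor")
    return parts


def count_aggregations(total_steps, head_period, extractor_multiplier):
    """Count how many times each part is communicated over `total_steps` local steps."""
    counts = {"head": 0, "extractor": 0}
    for step in range(1, total_steps + 1):
        for part in parts_to_aggregate(step, head_period, extractor_multiplier):
            counts[part] += 1
    return counts
```

With `head_period=5` and `extractor_multiplier=4`, twenty local steps communicate the head four times and the extractor once, which is where the saving in communication would come from under these assumptions.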

Title: Multimodal 3D Object Detection on Unseen Domains

Authors: Deepti Hegde, Suhas Lohit, Kuan-Chuan Peng, Michael J. Jones, Vishal M. Patel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.11764
Pdf URL: https://arxiv.org/pdf/2404.11764
Copy Paste: [[2404.11764]] Multimodal 3D Object Detection on Unseen Domains(https://arxiv.org/abs/2404.11764)
Keywords: robust
Abstract: LiDAR datasets for autonomous driving exhibit biases in properties such as point cloud density, range, and object dimensions. As a result, object detection networks trained and evaluated in different environments often experience performance degradation. Domain adaptation approaches assume access to unannotated samples from the test distribution to address this problem. However, in the real world, the exact conditions of deployment and access to samples representative of the test dataset may be unavailable while training. We argue that the more realistic and challenging formulation is to require robustness in performance to unseen target domains. We propose to address this problem in a two-pronged manner. First, we leverage paired LiDAR-image data present in most autonomous driving datasets to perform multimodal object detection. We suggest that working with multimodal features by leveraging both images and LiDAR point clouds for scene understanding tasks results in object detectors more robust to unseen domain shifts. Second, we train a 3D object detector to learn multimodal object features across different distributions and promote feature invariance across these source domains to improve generalizability to unseen target domains. To this end, we propose CLIX$^\text{3D}$, a multimodal fusion and supervised contrastive learning framework for 3D object detection that performs alignment of object features from same-class samples of different domains while pushing the features from different classes apart. We show that CLIX$^\text{3D}$ yields state-of-the-art domain generalization performance under multiple dataset shifts.

Title: QGen: On the Ability to Generalize in Quantization Aware Training

Authors: MohammadHossein AskariHemmat, Ahmadreza Jeddi, Reyhane Askari Hemmat, Ivan Lazarevich, Alexander Hoffman, Sudhakar Sah, Ehsan Saboori, Yvon Savaria, Jean-Pierre David
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2404.11769
Pdf URL: https://arxiv.org/pdf/2404.11769
Copy Paste: [[2404.11769]] QGen: On the Ability to Generalize in Quantization Aware Training(https://arxiv.org/abs/2404.11769)
Keywords: transformer
Abstract: Quantization lowers memory usage, computational requirements, and latency by utilizing fewer bits to represent model weights and activations. In this work, we investigate the generalization properties of quantized neural networks, a characteristic that has received little attention despite its implications on model performance. In particular, first, we develop a theoretical model for quantization in neural networks and demonstrate how quantization functions as a form of regularization. Second, motivated by recent work connecting the sharpness of the loss landscape and generalization, we derive an approximate bound for the generalization of quantized models conditioned on the amount of quantization noise. We then validate our hypothesis by experimenting with over 2000 models trained on CIFAR-10, CIFAR-100, and ImageNet datasets on convolutional and transformer-based models.
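Illustration: To make "fewer bits" and "quantization noise" concrete, the sketch below applies symmetric uniform quantization to a list of weights and measures the resulting mean squared error. It is a generic textbook scheme chosen for illustration, not the theoretical model developed in the paper; the function names `quantize_uniform` and `mean_squared_noise` are hypothetical.

```python
def quantize_uniform(values, num_bits):
    """Round each value to the nearest point of a symmetric uniform grid.

    The grid has 2**(num_bits - 1) - 1 positive levels and is scaled so that
    the largest magnitude in `values` is represented exactly.
    """
    if num_bits < 2:
        raise ValueError("num_bits must be at least 2 for a symmetric grid")
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0.0 for _ in values]  # all-zero input needs no grid
    levels = 2 ** (num_bits - 1) - 1
    scale = max_abs / levels  # distance between neighbouring grid points
    return [round(v / scale) * scale for v in values]


def mean_squared_noise(values, num_bits):
    """Mean squared difference between the values and their quantized versions."""
    quantized = quantize_uniform(values, num_bits)
    return sum((v - q) ** 2 for v, q in zip(values, quantized)) / len(values)
```

Each rounding error is at most half the grid spacing, so the noise shrinks as `num_bits` grows; comparing `mean_squared_noise(weights, 2)` with `mean_squared_noise(weights, 8)` on the same weights shows the effect.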

Title: CU-Mamba: Selective State Space Models with Channel Learning for Image Restoration

Authors: Rui Deng, Tianpei Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.11778
Pdf URL: https://arxiv.org/pdf/2404.11778
Copy Paste: [[2404.11778]] CU-Mamba: Selective State Space Models with Channel Learning for Image Restoration(https://arxiv.org/abs/2404.11778)
Keywords: transformer
Abstract: Reconstructing degraded images is a critical task in image processing. Although CNN and Transformer-based models are prevalent in this field, they exhibit inherent limitations, such as inadequate long-range dependency modeling and high computational costs. To overcome these issues, we introduce the Channel-Aware U-Shaped Mamba (CU-Mamba) model, which incorporates a dual State Space Model (SSM) framework into the U-Net architecture. CU-Mamba employs a Spatial SSM module for global context encoding and a Channel SSM component to preserve channel correlation features, both in linear computational complexity relative to the feature map size. Extensive experimental results validate CU-Mamba's superiority over existing state-of-the-art methods, underscoring the importance of integrating both spatial and channel contexts in image restoration.

Title: REQUAL-LM: Reliability and Equity through Aggregation in Large Language Models

Authors: Sana Ebrahimi, Nima Shahbazi, Abolfazl Asudeh
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2404.11782
Pdf URL: https://arxiv.org/pdf/2404.11782
Copy Paste: [[2404.11782]] REQUAL-LM: Reliability and Equity through Aggregation in Large Language Models(https://arxiv.org/abs/2404.11782)
Keywords: large language model
Abstract: The extensive scope of large language models (LLMs) across various domains underscores the critical importance of responsibility in their application, beyond natural language processing. In particular, the randomized nature of LLMs, coupled with inherent biases and historical stereotypes in data, raises critical concerns regarding reliability and equity. Addressing these challenges is necessary before using LLMs for applications with societal impact. Towards addressing this gap, we introduce REQUAL-LM, a novel method for finding reliable and equitable LLM outputs through aggregation. Specifically, we develop a Monte Carlo method based on repeated sampling to find a reliable output close to the mean of the underlying distribution of possible outputs. We formally define terms such as reliability and bias, and design an equity-aware aggregation to minimize harmful bias while finding a highly reliable output. REQUAL-LM does not require specialized hardware, does not impose a significant computing load, and uses LLMs as a blackbox. This design choice enables seamless scalability alongside the rapid advancement of LLM technologies. Our system does not require retraining the LLMs, which makes it deployment ready and easy to adapt. Our comprehensive experiments using various tasks and datasets demonstrate that REQUAL-LM effectively mitigates bias and selects a more equitable response, specifically the outputs that properly represent minority groups.
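Illustration: The "repeated sampling, then pick an output close to the mean" step can be sketched as follows. This is a simplified sketch and not the REQUAL-LM method: it assumes each sampled output has already been mapped to a numeric embedding vector (the abstract does not say how outputs are compared), it uses squared Euclidean distance, and it omits the equity-aware part of the aggregation. The name `closest_to_mean` is hypothetical.

```python
def closest_to_mean(vectors):
    """Return the index of the vector nearest to the centroid of all vectors.

    `vectors` is a non-empty list of equal-length lists of floats, one per
    sampled output (hypothetical embeddings).
    """
    if not vectors:
        raise ValueError("need at least one sampled output")
    count = len(vectors)
    dim = len(vectors[0])
    # Centroid of the sampled outputs, used as a stand-in for the mean output.
    mean = [sum(v[d] for v in vectors) / count for d in range(dim)]

    def squared_distance(v):
        return sum((a - b) ** 2 for a, b in zip(v, mean))

    # Ties are resolved in favour of the earliest sample.
    return min(range(count), key=lambda i: squared_distance(vectors[i]))
```

Given the embeddings of several sampled responses, the returned index selects the response that sits nearest the centre of the sample, so an outlying response is never chosen when a more central one exists.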

Title: Prompt-Driven Feature Diffusion for Open-World Semi-Supervised Learning

Authors: Marzi Heidari, Hanping Zhang, Yuhong Guo
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2404.11795
Pdf URL: https://arxiv.org/pdf/2404.11795
Copy Paste: [[2404.11795]] Prompt-Driven Feature Diffusion for Open-World Semi-Supervised Learning(https://arxiv.org/abs/2404.11795)
Keywords: diffusion
Abstract: In this paper, we present a novel approach termed Prompt-Driven Feature Diffusion (PDFD) within a semi-supervised learning framework for Open World Semi-Supervised Learning (OW-SSL). At its core, PDFD deploys an efficient feature-level diffusion model with the guidance of class-specific prompts to support discriminative feature representation learning and feature generation, tackling the challenge of the non-availability of labeled data for unseen classes in OW-SSL. In particular, PDFD utilizes class prototypes as prompts in the diffusion model, leveraging their class-discriminative and semantic generalization ability to condition and guide the diffusion process across all the seen and unseen classes. Furthermore, PDFD incorporates a class-conditional adversarial loss for diffusion model training, ensuring that the features generated via the diffusion process can be discriminatively aligned with the class-conditional features of the real data. Additionally, the class prototypes of the unseen classes are computed using only unlabeled instances with confident predictions within a semi-supervised learning framework. We conduct extensive experiments to evaluate the proposed PDFD. The empirical results show PDFD exhibits remarkable performance enhancements over many state-of-the-art existing methods.

Title: TempBEV: Improving Learned BEV Encoders with Combined Image and BEV Space Temporal Aggregation

Authors: Thomas Monninger, Vandana Dokkadi, Md Zafar Anwar, Steffen Staab
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2404.11803
Pdf URL: https://arxiv.org/pdf/2404.11803
Copy Paste: [[2404.11803]] TempBEV: Improving Learned BEV Encoders with Combined Image and BEV Space Temporal Aggregation(https://arxiv.org/abs/2404.11803)
Keywords: segmentation
Abstract: Autonomous driving requires an accurate representation of the environment. A strategy toward high accuracy is to fuse data from several sensors. Learned Bird's-Eye View (BEV) encoders can achieve this by mapping data from individual sensors into one joint latent space. For cost-efficient camera-only systems, this provides an effective mechanism to fuse data from multiple cameras with different views. Accuracy can further be improved by aggregating sensor information over time. This is especially important in monocular camera systems to account for the lack of explicit depth and velocity measurements. Thereby, the effectiveness of developed BEV encoders crucially depends on the operators used to aggregate temporal information and on the used latent representation spaces. We analyze BEV encoders proposed in the literature and compare their effectiveness, quantifying the effects of aggregation operators and latent representations. While most existing approaches aggregate temporal information either in image or in BEV latent space, our analyses and performance comparisons suggest that these latent representations exhibit complementary strengths. Therefore, we develop a novel temporal BEV encoder, TempBEV, which integrates aggregated temporal information from both latent spaces. We consider subsequent image frames as stereo through time and leverage methods from optical flow estimation for temporal stereo encoding. Empirical evaluation on the NuScenes dataset shows a significant improvement by TempBEV over the baseline for 3D object detection and BEV segmentation. The ablation uncovers a strong synergy of joint temporal aggregation in the image and BEV latent space. These results indicate the overall effectiveness of our approach and make a strong case for aggregating temporal information in both image and BEV latent spaces.

Title: Cross-model Mutual Learning for Exemplar-based Medical Image Segmentation

Authors: Qing En, Yuhong Guo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2404.11812
Pdf URL: https://arxiv.org/pdf/2404.11812
Copy Paste: [[2404.11812]] Cross-model Mutual Learning for Exemplar-based Medical Image Segmentation(https://arxiv.org/abs/2404.11812)
Keywords: robust, segmentation
Abstract: Medical image segmentation typically demands extensive dense annotations for model training, which is both time-consuming and skill-intensive. To mitigate this burden, exemplar-based medical image segmentation methods have been introduced to achieve effective training with only one annotated image. In this paper, we introduce a novel Cross-model Mutual learning framework for Exemplar-based Medical image Segmentation (CMEMS), which leverages two models to mutually excavate implicit information from unlabeled data at multiple granularities. CMEMS can eliminate confirmation bias and enable collaborative training to learn complementary information by enforcing consistency at different granularities across models. Concretely, cross-model image perturbation based mutual learning is devised by using weakly perturbed images to generate high-confidence pseudo-labels, supervising predictions of strongly perturbed images across models. This approach enables joint pursuit of prediction consistency at the image granularity. Moreover, cross-model multi-level feature perturbation based mutual learning is designed by letting pseudo-labels supervise predictions from perturbed multi-level features with different resolutions, which can broaden the perturbation space and enhance the robustness of our framework. CMEMS is jointly trained using exemplar data, synthetic data, and unlabeled data in an end-to-end manner. Experimental results on two medical image datasets indicate that the proposed CMEMS outperforms the state-of-the-art segmentation methods with extremely limited supervision.

Title: AquaSonic: Acoustic Manipulation of Underwater Data Center Operations and Resource Management

Authors: Jennifer Sheldon, Weidong Zhu, Adnan Abdullah, Sri Hrushikesh Varma Bhupathiraju, Takeshi Sugawara, Kevin R. B. Butler, Md Jahidul Islam, Sara Rampazzi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2404.11815
Pdf URL: https://arxiv.org/pdf/2404.11815
Copy Paste: [[2404.11815]] AquaSonic: Acoustic Manipulation of Underwater Data Center Operations and Resource Management(https://arxiv.org/abs/2404.11815)
Keywords: security, protect, defense, attack
Abstract: Underwater datacenters (UDCs) hold promise as next-generation data storage due to their energy efficiency and environmental sustainability benefits. While the natural cooling properties of water save power, the isolated aquatic environment and long-range sound propagation in water create unique vulnerabilities which differ from those of on-land data centers. Our research discovers the unique vulnerabilities of fault-tolerant storage devices, resource allocation software, and distributed file systems to acoustic injection attacks in UDCs. With a realistic testbed approximating UDC server operations, we empirically characterize the capabilities of acoustic injection underwater and find that an attacker can reduce fault-tolerant RAID 5 storage system throughput by 17% up to 100%. Our closed-water analyses reveal that attackers can (i) cause unresponsiveness and automatic node removal in a distributed filesystem with only 2.4 minutes of sustained acoustic injection, (ii) induce a distributed database's latency to increase by up to 92.7% to reduce system reliability, and (iii) induce load-balance managers to redirect up to 74% of resources to a target server to cause overload or force resource colocation. Furthermore, we perform open-water experiments in a lake and find that an attacker can cause controlled throughput degradation at a maximum allowable distance of 6.35 m using a commercial speaker. We also investigate and discuss the effectiveness of standard defenses against acoustic injection attacks. Finally, we formulate a novel machine learning-based detection system that reaches 0% False Positive Rate and 98.2% True Positive Rate trained on our dataset of profiled hard disk drives under 30-second FIO benchmark execution. With this work, we aim to help manufacturers proactively protect UDCs against acoustic injection attacks and ensure the security of subsea computing infrastructures.

Title: Tailoring Generative Adversarial Networks for Smooth Airfoil Design

Authors: Joyjit Chattoraj, Jian Cheng Wong, Zhang Zexuan, Manna Dai, Xia Yingzhi, Li Jichao, Xu Xinxing, Ooi Chin Chun, Yang Feng, Dao My Ha, Liu Yong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2404.11816
Pdf URL: https://arxiv.org/pdf/2404.11816
Copy Paste: [[2404.11816]] Tailoring Generative Adversarial Networks for Smooth Airfoil Design(https://arxiv.org/abs/2404.11816)
Keywords: generative
Abstract: In the realm of aerospace design, achieving smooth curves is paramount, particularly when crafting objects such as airfoils. Generative Adversarial Network (GAN), a widely employed generative AI technique, has proven instrumental in synthesizing airfoil designs. However, a common limitation of GAN is the inherent lack of smoothness in the generated airfoil surfaces. To address this issue, we present a GAN model featuring a customized loss function built to produce seamlessly contoured airfoil designs. Additionally, our model demonstrates a substantial increase in design diversity compared to a conventional GAN augmented with a post-processing smoothing filter.

Title: Utilizing Adversarial Examples for Bias Mitigation and Accuracy Enhancement

Authors: Pushkar Shukla, Dhruv Srikanth, Lee Cohen, Matthew Turk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.11819
Pdf URL: https://arxiv.org/pdf/2404.11819
Copy Paste: [[2404.11819]] Utilizing Adversarial Examples for Bias Mitigation and Accuracy Enhancement(https://arxiv.org/abs/2404.11819)
Keywords: fair, generative
Abstract: We propose a novel approach to mitigate biases in computer vision models by utilizing counterfactual generation and fine-tuning. While counterfactuals have been used to analyze and address biases in DNN models, the counterfactuals themselves are often generated from biased generative models, which can introduce additional biases or spurious correlations. To address this issue, we propose using adversarial images, that is images that deceive a deep neural network but not humans, as counterfactuals for fair model training. Our approach leverages a curriculum learning framework combined with a fine-grained adversarial loss to fine-tune the model using adversarial examples. By incorporating adversarial images into the training data, we aim to prevent biases from propagating through the pipeline. We validate our approach through both qualitative and quantitative assessments, demonstrating improved bias mitigation and accuracy compared to existing methods. Qualitatively, our results indicate that post-training, the decisions made by the model are less dependent on the sensitive attribute and our model better disentangles the relationship between sensitive attributes and classification variables.

Title: AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence

Authors: Minbeom Kim, Hwanhee Lee, Joonsuk Park, Hwaran Lee, Kyomin Jung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.11826
Pdf URL: https://arxiv.org/pdf/2404.11826
Copy Paste: [[2404.11826]] AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence(https://arxiv.org/abs/2404.11826)
Keywords: large language model
Abstract: As the integration of large language models into daily life is on the rise, there is a clear gap in benchmarks for advising on subjective and personal dilemmas. To address this, we introduce AdvisorQA, the first benchmark developed to assess LLMs' capability in offering advice for deeply personalized concerns, utilizing the LifeProTips subreddit forum. This forum features a dynamic interaction where users post advice-seeking questions, receiving an average of 8.9 pieces of advice per query, with 164.2 upvotes from hundreds of users, embodying a collective intelligence framework. Therefore, we've completed a benchmark encompassing daily life questions, diverse corresponding responses, and majority vote ranking to train our helpfulness metric. Baseline experiments validate the efficacy of AdvisorQA through our helpfulness metric, GPT-4, and human evaluation, analyzing phenomena beyond the trade-off between helpfulness and harmlessness. AdvisorQA marks a significant leap in enhancing QA systems for providing personalized, empathetic advice, showcasing LLMs' improved understanding of human subjectivity.

Title: Actor-Critic Reinforcement Learning with Phased Actor

Authors: Ruofan Wu, Junmin Zhong, Jennie Si
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2404.11834
Pdf URL: https://arxiv.org/pdf/2404.11834
Copy Paste: [[2404.11834]] Actor-Critic Reinforcement Learning with Phased Actor(https://arxiv.org/abs/2404.11834)
Keywords: robust
Abstract: Policy gradient methods in actor-critic reinforcement learning (RL) have become perhaps the most promising approaches to solving continuous optimal control problems. However, the trial-and-error nature of RL and the inherent randomness associated with solution approximations cause variations in the learned optimal values and policies. This has significantly hindered their successful deployment in real life applications where control responses need to meet dynamic performance criteria deterministically. Here we propose a novel phased actor in actor-critic (PAAC) method, aiming at improving policy gradient estimation and thus the quality of the control policy. Specifically, PAAC accounts for both $Q$ value and TD error in its actor update. We prove qualitative properties of PAAC for learning convergence of the value and policy, solution optimality, and stability of system dynamics. Additionally, we show variance reduction in policy gradient estimation. PAAC performance is systematically and quantitatively evaluated in this study using DeepMind Control Suite (DMC). Results show that PAAC leads to significant performance improvement measured by total cost, learning variance, robustness, learning speed and success rate. As PAAC can be piggybacked onto general policy gradient learning frameworks, we select well-known methods such as direct heuristic dynamic programming (dHDP), deep deterministic policy gradient (DDPG) and their variants to demonstrate the effectiveness of PAAC. Consequently we provide a unified view on these related policy gradient algorithms.
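
The abstract names two quantities that enter the actor update, the $Q$ value and the TD error. As a reminder of what the second one is, here is a minimal sketch of the standard one-step TD error. It is not PAAC's actor update, which the abstract does not specify; the function name, the default discount `gamma`, and the `terminal` flag are illustrative assumptions.

```python
def td_error(reward, q_current, q_next, gamma=0.99, terminal=False):
    """Standard one-step TD error: r + gamma * Q(s', a') - Q(s, a).

    All argument names are hypothetical; this is the textbook definition,
    not the update rule of any specific method in the abstract.
    """
    # No bootstrapping from the next state once the episode has ended.
    bootstrap = 0.0 if terminal else gamma * q_next
    return reward + bootstrap - q_current


# Example: reward 1.0, current estimate 2.0, next estimate 3.0, gamma 0.5.
delta = td_error(reward=1.0, q_current=2.0, q_next=3.0, gamma=0.5)
```

A positive `delta` means the outcome was better than the critic's current estimate; a negative one means it was worse.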

Title: Challenging Negative Gender Stereotypes: A Study on the Effectiveness of Automated Counter-Stereotypes

Authors: Isar Nejadgholi, Kathleen C. Fraser, Anna Kerkhof, Svetlana Kiritchenko
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2404.11845
Pdf URL: https://arxiv.org/pdf/2404.11845
Copy Paste: [[2404.11845]] Challenging Negative Gender Stereotypes: A Study on the Effectiveness of Automated Counter-Stereotypes(https://arxiv.org/abs/2404.11845)
Keywords: robust
Abstract: Gender stereotypes are pervasive beliefs about individuals based on their gender that play a significant role in shaping societal attitudes, behaviours, and even opportunities. Recognizing the negative implications of gender stereotypes, particularly in online communications, this study investigates eleven strategies to automatically counter-act and challenge these views. We present AI-generated gender-based counter-stereotypes to (self-identified) male and female study participants and ask them to assess their offensiveness, plausibility, and potential effectiveness. The strategies of counter-facts and broadening universals (i.e., stating that anyone can have a trait regardless of group membership) emerged as the most robust approaches, while humour, perspective-taking, counter-examples, and empathy for the speaker were perceived as less effective. Also, the differences in ratings were more pronounced for stereotypes about the different targets than between the genders of the raters. Alarmingly, many AI-generated counter-stereotypes were perceived as offensive and/or implausible. Our analysis and the collected dataset offer foundational insight into counter-stereotype generation, guiding future efforts to develop strategies that effectively challenge gender stereotypes in online interactions.

Title: Partial Large Kernel CNNs for Efficient Super-Resolution

Authors: Dongheon Lee, Seokju Yun, Youngmin Ro
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.11848
Pdf URL: https://arxiv.org/pdf/2404.11848
Copy Paste: [[2404.11848]] Partial Large Kernel CNNs for Efficient Super-Resolution(https://arxiv.org/abs/2404.11848)
Keywords: transformer
Abstract: Recently, in the super-resolution (SR) domain, transformers have outperformed CNNs with fewer FLOPs and fewer parameters since they can deal with long-range dependency and adaptively adjust weights based on instance. In this paper, we demonstrate that CNNs, although less focused on in the current SR domain, surpass Transformers in direct efficiency measures. By incorporating the advantages of Transformers into CNNs, we aim to achieve both computational efficiency and enhanced performance. However, using a large kernel in the SR domain, which mainly processes large images, incurs a large computational overhead. To overcome this, we propose novel approaches to employing the large kernel, which can reduce latency by 86\% compared to the naive large kernel, and leverage an Element-wise Attention module to imitate instance-dependent weights. As a result, we introduce Partial Large Kernel CNNs for Efficient Super-Resolution (PLKSR), which achieves state-of-the-art performance on four datasets at a scale of $\times$4, with reductions of 68.1\% in latency and 80.2\% in maximum GPU memory occupancy compared to SRFormer-light.

Title: Progressive Multi-modal Conditional Prompt Tuning

Authors: Xiaoyu Qiu, Hao Feng, Yuechen Wang, Wengang Zhou, Houqiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.11864
Pdf URL: https://arxiv.org/pdf/2404.11864
Copy Paste: [[2404.11864]] Progressive Multi-modal Conditional Prompt Tuning(https://arxiv.org/abs/2404.11864)
Keywords: robust
Abstract: Pre-trained vision-language models (VLMs) have shown remarkable generalization capabilities via prompting, which leverages VLMs as knowledge bases to extract information beneficial for downstream tasks. However, existing methods primarily employ uni-modal prompting, which only engages a uni-modal branch, failing to simultaneously adjust vision-language (V-L) features. Additionally, the one-pass forward pipeline in VLM encoding struggles to align V-L features that have a huge gap. Confronting these challenges, we propose a novel method, Progressive Multi-modal conditional Prompt Tuning (ProMPT). ProMPT exploits a recurrent structure, optimizing and aligning V-L features by iteratively utilizing image and current encoding information. It comprises an initialization and a multi-modal iterative evolution (MIE) module. Initialization is responsible for encoding image and text using a VLM, followed by a feature filter that selects text features similar to the image. MIE then facilitates multi-modal prompting through class-conditional vision prompting, instance-conditional text prompting, and feature filtering. In each MIE iteration, vision prompts are obtained from the filtered text features via a vision generator, promoting image features to focus more on the target object during vision prompting. The encoded image features are fed into a text generator to produce text prompts that are more robust to class shift. Thus, V-L features are progressively aligned, enabling an advance from coarse to exact classifications. Extensive experiments are conducted in three settings to evaluate the efficacy of ProMPT. The results indicate that ProMPT outperforms existing methods on average across all settings, demonstrating its superior generalization.

Title: From Image to Video, what do we need in multimodal LLMs?

Authors: Suyuan Huang, Haoxin Zhang, Yan Gao, Yao Hu, Zengchang Qin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.11865
Pdf URL: https://arxiv.org/pdf/2404.11865
Copy Paste: [[2404.11865]] From Image to Video, what do we need in multimodal LLMs?(https://arxiv.org/abs/2404.11865)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated profound capabilities in understanding multimodal information, covering from Image LLMs to the more complex Video LLMs. Numerous studies have illustrated their exceptional cross-modal comprehension. Recently, integrating video foundation models with large language models to build a comprehensive video understanding system has been proposed to overcome the limitations of specific pre-defined vision tasks. However, the current advancements in Video LLMs tend to overlook the foundational contributions of Image LLMs, often opting for more complicated structures and a wide variety of multimodal data for pre-training. This approach significantly increases the costs associated with these methods. In response to these challenges, this work introduces an efficient method that strategically leverages the priors of Image LLMs, facilitating a resource-efficient transition from Image to Video LLMs. We propose RED-VILLM, a Resource-Efficient Development pipeline for Video LLMs from Image LLMs, which utilizes a temporal adaptation plug-and-play structure within the image fusion module of Image LLMs. This adaptation extends their understanding capabilities to include temporal information, enabling the development of Video LLMs that not only surpass baseline performances but also do so with minimal instructional data and training resources. Our approach highlights the potential for a more cost-effective and scalable advancement in multimodal models, effectively building upon the foundational work of Image LLMs.

Title: Multi-view Graph Structural Representation Learning via Graph Coarsening

Authors: Xiaorui Qi, Qijie Bai, Yanlong Wen, Haiwei Zhang, Xiaojie Yuan
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2404.11869
Pdf URL: https://arxiv.org/pdf/2404.11869
Copy Paste: [[2404.11869]] Multi-view Graph Structural Representation Learning via Graph Coarsening(https://arxiv.org/abs/2404.11869)
Keywords: transformer
Abstract: Graph Transformers (GTs) have made remarkable achievements in graph-level tasks. However, most existing works regard graph structures as a form of guidance or bias for enhancing node representations, which focuses on node-central perspectives and lacks explicit representations of edges and structures. One natural question is, can we treat graph structures node-like as a whole to learn high-level features? Through experimental analysis, we explore the feasibility of this assumption. Based on our findings, we propose a novel multi-view graph structural representation learning model via graph coarsening (MSLgo) on GT architecture for graph classification. Specifically, we build three unique views, original, coarsening, and conversion, to learn a thorough structural representation. We compress loops and cliques via hierarchical heuristic graph coarsening and restrict them with well-designed constraints, which builds the coarsening view to learn high-level interactions between structures. We also introduce line graphs for edge embeddings and switch to edge-central perspective to construct the conversion view. Experiments on six real-world datasets demonstrate the improvements of MSLgo over 14 baselines from various architectures.
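
The conversion view relies on line graphs, in which every edge of the original graph becomes a node and two such nodes are connected when the original edges share an endpoint. The sketch below shows only that textbook construction for an undirected simple graph. It is not the MSLgo implementation, and the function name and the adjacency-dictionary representation are assumptions made for illustration.

```python
from itertools import combinations


def line_graph(edges):
    """Build the line graph of an undirected simple graph.

    `edges` is an iterable of (u, v) pairs. The result maps each original
    edge (stored as a sorted tuple) to the set of edges adjacent to it.
    """
    # Normalise so that (u, v) and (v, u) refer to the same line-graph node.
    nodes = sorted({tuple(sorted(e)) for e in edges})
    adjacency = {n: set() for n in nodes}
    for a, b in combinations(nodes, 2):
        # Two original edges are adjacent when they share an endpoint.
        if set(a) & set(b):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


# Path graph 0 - 1 - 2 - 3: its line graph is a path over the three edges.
path_lg = line_graph([(0, 1), (1, 2), (2, 3)])
```

With edges as first-class nodes, a model that computes node embeddings on the line graph is effectively computing edge embeddings for the original graph.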

Title: Enhancing Length Extrapolation in Sequential Models with Pointer-Augmented Neural Memory

Authors: Hung Le, Dung Nguyen, Kien Do, Svetha Venkatesh, Truyen Tran
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2404.11870
Pdf URL: https://arxiv.org/pdf/2404.11870
Copy Paste: [[2404.11870]] Enhancing Length Extrapolation in Sequential Models with Pointer-Augmented Neural Memory(https://arxiv.org/abs/2404.11870)
Keywords: transformer
Abstract: We propose Pointer-Augmented Neural Memory (PANM) to help neural networks understand and apply symbol processing to new, longer sequences of data. PANM integrates an external neural memory that uses novel physical addresses and pointer manipulation techniques to mimic human and computer symbol processing abilities. PANM facilitates pointer assignment, dereference, and arithmetic by explicitly using physical pointers to access memory content. Remarkably, it can learn to perform these operations through end-to-end training on sequence data, powering various sequential models. Our experiments demonstrate PANM's exceptional length extrapolating capabilities and improved performance in tasks that require symbol processing, such as algorithmic reasoning and Dyck language recognition. PANM helps Transformer achieve up to 100% generalization accuracy in compositional learning tasks and significantly better results in mathematical reasoning, question answering and machine translation tasks.

Title: Group-On: Boosting One-Shot Segmentation with Supportive Query

Authors: Hanjing Zhou, Mingze Yin, JinTai Chen, Danny Chen, Jian Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.11871
Pdf URL: https://arxiv.org/pdf/2404.11871
Copy Paste: [[2404.11871]] Group-On: Boosting One-Shot Segmentation with Supportive Query(https://arxiv.org/abs/2404.11871)
Keywords: segmentation
Abstract: One-shot semantic segmentation aims to segment query images given only ONE annotated support image of the same class. This task is challenging because target objects in the support and query images can be largely different in appearance and pose (i.e., intra-class variation). Prior works suggested that incorporating more annotated support images in few-shot settings boosts performances but increases costs due to additional manual labeling. In this paper, we propose a novel approach for ONE-shot semantic segmentation, called Group-On, which packs multiple query images in batches for the benefit of mutual knowledge support within the same category. Specifically, after coarse segmentation masks of the batch of queries are predicted, query-mask pairs act as pseudo support data to enhance mask predictions mutually, under the guidance of a simple Group-On Voting module. Comprehensive experiments on three standard benchmarks show that, in the ONE-shot setting, our Group-On approach significantly outperforms previous works by considerable margins. For example, on the COCO-20i dataset, we increase mIoU scores by 8.21% and 7.46% on ASNet and HSNet baselines, respectively. With only one support image, Group-On can be even competitive with the counterparts using 5 annotated support images.
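
The reported gains are in mIoU, so a small sketch of the underlying intersection-over-union computation may help. This is a simplified illustration on binary masks stored as nested lists. The benchmark protocol (for example, how classes and folds are averaged) is not described in the abstract, so averaging over mask pairs and returning 1.0 for two empty masks are assumptions of this sketch.

```python
def iou(pred, target):
    """Intersection over union of two equally sized binary masks (0/1 lists)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over a list of (pred, target) mask pairs."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)


pred = [[1, 1], [0, 0]]
target = [[1, 0], [1, 0]]
score = iou(pred, target)  # one overlapping pixel, three pixels in the union
```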

Title: Using a Local Surrogate Model to Interpret Temporal Shifts in Global Annual Data

Authors: Shou Nakano, Yang Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2404.11874
Pdf URL: https://arxiv.org/pdf/2404.11874
Copy Paste: [[2404.11874]] Using a Local Surrogate Model to Interpret Temporal Shifts in Global Annual Data(https://arxiv.org/abs/2404.11874)
Keywords: robust
Abstract: This paper focuses on explaining changes over time in globally-sourced, annual temporal data, with the specific objective of identifying pivotal factors that contribute to these temporal shifts. Leveraging such analytical frameworks can yield transformative impacts, including the informed refinement of public policy and the identification of key drivers affecting a country's economic evolution. We employ Local Interpretable Model-agnostic Explanations (LIME) to shed light on national happiness indices, economic freedom, and population metrics, spanning variable time frames. Acknowledging the presence of missing values, we employ three imputation approaches to generate robust multivariate time-series datasets apt for LIME's input requirements. Our methodology's efficacy is substantiated through a series of empirical evaluations involving multiple datasets. These evaluations include comparative analyses against random feature selection, correlation with real-world events as elucidated by LIME, and validation through Individual Conditional Expectation (ICE) plots, a state-of-the-art technique proficient in feature importance detection.

Title: The Dog Walking Theory: Rethinking Convergence in Federated Learning

Authors: Kun Zhai, Yifeng Gao, Xingjun Ma, Difan Zou, Guangnan Ye, Yu-Gang Jiang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2404.11888
Pdf URL: https://arxiv.org/pdf/2404.11888
Copy Paste: [[2404.11888]] The Dog Walking Theory: Rethinking Convergence in Federated Learning(https://arxiv.org/abs/2404.11888)
Keywords: federate
Abstract: Federated learning (FL) is a collaborative learning paradigm that allows different clients to train one powerful global model without sharing their private data. Although FL has demonstrated promising results in various applications, it is known to suffer from convergence issues caused by the data distribution shift across different clients, especially on non-independent and identically distributed (non-IID) data. In this paper, we study the convergence of FL on non-IID data and propose a novel \emph{Dog Walking Theory} to formulate and identify the missing element in existing research. The Dog Walking Theory describes the process of a dog walker leash walking multiple dogs from one side of the park to the other. The goal of the dog walker is to arrive at the right destination while giving the dogs enough exercise (i.e., space exploration). In FL, the server is analogous to the dog walker while the clients are analogous to the dogs. This analogy allows us to identify one crucial yet missing element in existing FL algorithms: the leash that guides the exploration of the clients. To address this gap, we propose a novel FL algorithm \emph{FedWalk} that leverages an external easy-to-converge task at the server side as a \emph{leash task} to guide the local training of the clients. We theoretically analyze the convergence of FedWalk with respect to data heterogeneity (between server and clients) and task discrepancy (between the leash and the original tasks). Experiments on multiple benchmark datasets demonstrate the superiority of FedWalk over state-of-the-art FL methods under both IID and non-IID settings.

Title: FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models

Authors: Wei Wu, Qingnan Fan, Shuai Qin, Hong Gu, Ruoyu Zhao, Antoni B. Chan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.11895
Pdf URL: https://arxiv.org/pdf/2404.11895
Copy Paste: [[2404.11895]] FreeDiff: Progressive Frequency Truncation for Image Editing with Diffusion Models(https://arxiv.org/abs/2404.11895)
Keywords: diffusion, generative
Abstract: Precise image editing with text-to-image models has attracted increasing interest due to their remarkable generative capabilities and user-friendly nature. However, such attempts face the pivotal challenge of misalignment between the intended precise editing target regions and the broader area impacted by the guidance in practice. Despite excellent methods leveraging attention mechanisms that have been developed to refine the editing guidance, these approaches necessitate modifications through complex network architecture and are limited to specific editing tasks. In this work, we re-examine the diffusion process and misalignment problem from a frequency perspective, revealing that, due to the power law of natural images and the decaying noise schedule, the denoising network primarily recovers low-frequency image components during the earlier timesteps and thus brings excessive low-frequency signals for editing. Leveraging this insight, we introduce a novel fine-tuning free approach that employs progressive $\textbf{Fre}$qu$\textbf{e}$ncy truncation to refine the guidance of $\textbf{Diff}$usion models for universal editing tasks ($\textbf{FreeDiff}$). Our method achieves comparable results with state-of-the-art methods across a variety of editing tasks and on a diverse set of images, highlighting its potential as a versatile tool in image editing applications.

Title: FedMID: A Data-Free Method for Using Intermediate Outputs as a Defense Mechanism Against Poisoning Attacks in Federated Learning

Authors: Sungwon Han, Hyeonho Song, Sungwon Park, Meeyoung Cha
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2404.11905
Pdf URL: https://arxiv.org/pdf/2404.11905
Copy Paste: [[2404.11905]] FedMID: A Data-Free Method for Using Intermediate Outputs as a Defense Mechanism Against Poisoning Attacks in Federated Learning(https://arxiv.org/abs/2404.11905)
Keywords: defense, attack, robust, federate, data-free
Abstract: Federated learning combines local updates from clients to produce a global model, which is susceptible to poisoning attacks. Most previous defense strategies relied on vectors derived from projections of local updates on a Euclidean space; however, these methods fail to accurately represent the functionality and structure of local models, resulting in inconsistent performance. Here, we present a new paradigm to defend against poisoning attacks in federated learning using functional mappings of local models based on intermediate outputs. Experiments show that our mechanism is robust under a broad range of computing conditions and advanced attack scenarios, enabling safer collaboration among data-sensitive participants via federated learning.
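
To make "combines local updates from clients" concrete, here is a minimal sketch of sample-weighted averaging in the style of FedAvg, which is one common aggregation rule. The abstract does not say which rule FedMID builds on, so the choice of FedAvg, the function name, and the flat-list weight format are all assumptions.

```python
def fed_avg(client_updates):
    """Sample-weighted average of client weight vectors.

    `client_updates` is a list of (weights, num_samples) pairs, where
    `weights` is a flat list of floats of the same length for every client.
    """
    total = sum(num_samples for _, num_samples in client_updates)
    dim = len(client_updates[0][0])
    global_weights = [0.0] * dim
    for weights, num_samples in client_updates:
        for i, w in enumerate(weights):
            # Each client contributes in proportion to its data size.
            global_weights[i] += w * num_samples / total
    return global_weights


honest = [([1.0, 2.0], 1), ([3.0, 4.0], 3)]
aggregated = fed_avg(honest)

# A single client sending extreme values shifts the plain average.
poisoned = fed_avg(honest + [([100.0, 100.0], 4)])
```

Because every update enters the average directly, one extreme client can move the global model noticeably, which is the weakness that poisoning defenses target.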

Title: TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

Authors: Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2404.11912
Pdf URL: https://arxiv.org/pdf/2404.11912
Copy Paste: [[2404.11912]] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding(https://arxiv.org/abs/2404.11912)
Keywords: robust, large language model
Abstract: With large language models (LLMs) widely deployed in long content generation recently, there has emerged an increasing demand for efficient long-sequence inference support. However, key-value (KV) cache, which is stored to avoid re-computation, has emerged as a critical bottleneck by growing linearly in size with the sequence length. Due to the auto-regressive nature of LLMs, the entire KV cache will be loaded for every generated token, resulting in low utilization of computational cores and high latency. While various compression methods for KV cache have been proposed to alleviate this issue, they suffer from degradation in generation quality. We introduce TriForce, a hierarchical speculative decoding system that is scalable to long sequence generation. This approach leverages the original model weights and dynamic sparse KV cache via retrieval as a draft model, which serves as an intermediate layer in the hierarchy and is further speculated by a smaller model to reduce its drafting latency. TriForce not only facilitates impressive speedups for Llama2-7B-128K, achieving up to 2.31$\times$ on an A100 GPU but also showcases scalability in handling even longer contexts. For the offloading setting on two RTX 4090 GPUs, TriForce achieves 0.108s/token, only half as slow as the auto-regressive baseline on an A100, which attains 7.78$\times$ on our optimized offloading system. Additionally, TriForce is 4.86$\times$ faster than DeepSpeed-Zero-Inference on a single RTX 4090 GPU. TriForce's robustness is highlighted by its consistently outstanding performance across various temperatures. The code is available at https://github.com/Infini-AI-Lab/TriForce.
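
The linear growth of the KV cache can be checked with back-of-the-envelope arithmetic. The sketch below assumes a Llama-2-7B-style configuration (32 layers, 32 key/value heads, head dimension 128, 16-bit values). These numbers are not stated in the abstract and are assumptions for illustration, as is the function name.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV cache size: one key and one value tensor per layer."""
    # The leading factor 2 accounts for keys plus values.
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_value)


# Assumed Llama-2-7B-style configuration with fp16 values.
config = dict(num_layers=32, num_kv_heads=32, head_dim=128)

per_token = kv_cache_bytes(seq_len=1, **config)
at_128k = kv_cache_bytes(seq_len=128 * 1024, **config)

per_token_mib = per_token / 2**20   # 0.5 MiB per token
at_128k_gib = at_128k / 2**30       # 64 GiB at 128K tokens
```

Under these assumptions the cache at a 128K-token context is several times larger than the fp16 weights of a 7B-parameter model (about 14 GB), which is consistent with the abstract's point that loading the cache, rather than computation, dominates per-token latency.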

Title: SKIP: Skill-Localized Prompt Tuning for Inference Speed Boost-Up

Authors: Nakyeong Yang, Junseok Kim, Jiwon Moon, Yunah Jang, Kyomin Jung
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.11916
Pdf URL: https://arxiv.org/pdf/2404.11916
Copy Paste: [[2404.11916]] SKIP: Skill-Localized Prompt Tuning for Inference Speed Boost-Up(https://arxiv.org/abs/2404.11916)
Keywords: transformer
Abstract: Prompt-tuning methods have shown performance comparable to parameter-efficient fine-tuning (PEFT) methods in various natural language understanding tasks. However, existing prompt tuning methods still utilize the entire model architecture; thus, they fail to accelerate inference speed in the application. In this paper, we propose a novel approach called SKIll-localized Prompt tuning (SKIP), which is extremely efficient in inference time. Our method significantly enhances inference efficiency by investigating and utilizing a skill-localized subnetwork in a language model. Surprisingly, our method improves inference speed by up to 160% while pruning 52% of the parameters. Furthermore, we demonstrate that our method is applicable across various transformer-based architectures, thereby confirming its practicality and scalability.

Title: EdgeFusion: On-Device Text-to-Image Generation

Authors: Thibault Castells, Hyoung-Kyu Song, Tairen Piao, Shinkook Choi, Bo-Kyeong Kim, Hanyoung Yim, Changgwun Lee, Jae Gon Kim, Tae-Ho Kim
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2404.11925
Pdf URL: https://arxiv.org/pdf/2404.11925
Copy Paste: [[2404.11925]] EdgeFusion: On-Device Text-to-Image Generation(https://arxiv.org/abs/2404.11925)
Keywords: diffusion, generative
Abstract: The intensive computational burden of Stable Diffusion (SD) for text-to-image generation poses a significant hurdle for its practical application. To tackle this challenge, recent research focuses on methods to reduce sampling steps, such as Latent Consistency Model (LCM), and on employing architectural optimizations, including pruning and knowledge distillation. Diverging from existing approaches, we uniquely start with a compact SD variant, BK-SDM. We observe that directly applying LCM to BK-SDM with commonly used crawled datasets yields unsatisfactory results. It leads us to develop two strategies: (1) leveraging high-quality image-text pairs from leading generative models and (2) designing an advanced distillation process tailored for LCM. Through our thorough exploration of quantization, profiling, and on-device deployment, we achieve rapid generation of photo-realistic, text-aligned images in just two steps, with latency under one second on resource-limited edge devices.

Title: CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment

Authors: Geyu Lin, Bin Wang, Zhengyuan Liu, Nancy F. Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.11932
Pdf URL: https://arxiv.org/pdf/2404.11932
Copy Paste: [[2404.11932]] CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment(https://arxiv.org/abs/2404.11932)
Keywords: large language model
Abstract: Multilingual proficiency presents a significant challenge for large language models (LLMs). English-centric models are usually suboptimal in other languages, particularly those that are linguistically distant from English. This performance discrepancy mainly stems from the imbalanced distribution of training data across languages during pre-training and instruction tuning stages. To address this problem, we propose a novel approach called CrossIn, which utilizes a mixed composition of cross-lingual instruction tuning data. Our method leverages the compressed representation shared by various languages to efficiently enhance the model's task-solving capabilities and multilingual proficiency within a single process. In addition, we introduce a multi-task and multi-faceted benchmark to evaluate the effectiveness of CrossIn. Experimental results demonstrate that our method substantially improves performance across tasks and languages, and we provide extensive insights into the impact of cross-lingual data volume and the integration of translation data on enhancing multilingual consistency and accuracy.

Title: LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights

Authors: Thibault Castells, Hyoung-Kyu Song, Bo-Kyeong Kim, Shinkook Choi
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2404.11936
Pdf URL: https://arxiv.org/pdf/2404.11936
Copy Paste: [[2404.11936]] LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights(https://arxiv.org/abs/2404.11936)
Keywords: diffusion, generative
Abstract: Latent Diffusion Models (LDMs) have emerged as powerful generative models, known for delivering remarkable results under constrained computational resources. However, deploying LDMs on resource-limited devices remains a complex issue, presenting challenges such as memory consumption and inference speed. To address this issue, we introduce LD-Pruner, a novel performance-preserving structured pruning method for compressing LDMs. Traditional pruning methods for deep neural networks are not tailored to the unique characteristics of LDMs, such as the high computational cost of training and the absence of a fast, straightforward and task-agnostic method for evaluating model performance. Our method tackles these challenges by leveraging the latent space during the pruning process, enabling us to effectively quantify the impact of pruning on model performance, independently of the task at hand. This targeted pruning of components with minimal impact on the output allows for faster convergence during training, as the model has less information to re-learn, thereby addressing the high computational cost of training. Consequently, our approach achieves a compressed model that offers improved inference speed and reduced parameter count, while maintaining minimal performance degradation. We demonstrate the effectiveness of our approach on three different tasks: text-to-image (T2I) generation, Unconditional Image Generation (UIG) and Unconditional Audio Generation (UAG). Notably, we reduce the inference time of Stable Diffusion (SD) by 34.9% while simultaneously improving its FID by 5.2% on MS-COCO T2I benchmark. This work paves the way for more efficient pruning methods for LDMs, enhancing their applicability.

Title: Trusted Multi-view Learning with Label Noise

Authors: Cai Xu, Yilin Zhang, Ziyu Guan, Wei Zhao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2404.11944
Pdf URL: https://arxiv.org/pdf/2404.11944
Copy Paste: [[2404.11944]] Trusted Multi-view Learning with Label Noise(https://arxiv.org/abs/2404.11944)
Keywords: robust, noise learning
Abstract: Multi-view learning methods often focus on improving decision accuracy while neglecting the decision uncertainty, which significantly restricts their use in safety-critical applications. To address this issue, researchers propose trusted multi-view methods that learn the class distribution for each instance, enabling the estimation of classification probabilities and uncertainty. However, these methods heavily rely on high-quality ground-truth labels. This motivates us to delve into a new generalized trusted multi-view learning problem: how to develop a reliable multi-view learning model under the guidance of noisy labels? We propose a trusted multi-view noise refining (TMNR) method to solve this problem. We first construct view-opinions using evidential deep neural networks, which consist of belief mass vectors and uncertainty estimates. Subsequently, we design view-specific noise correlation matrices that transform the original opinions into noisy opinions aligned with the noisy labels. Considering label noises originating from low-quality data features and easily-confused classes, we ensure that the diagonal elements of these matrices are inversely proportional to the uncertainty, while incorporating class relations into the off-diagonal elements. Finally, we aggregate the noisy opinions and employ a generalized maximum likelihood loss on the aggregated opinion for model training, guided by the noisy labels. We empirically compare TMNR with state-of-the-art trusted multi-view learning and label noise learning baselines on 5 publicly available datasets. Experiment results show that TMNR outperforms baseline methods on accuracy, reliability and robustness. We promise to release the code and all datasets on GitHub and show the link here.

Title: Sketch-guided Image Inpainting with Partial Discrete Diffusion Process

Authors: Nakul Sharma, Aditay Tripathi, Anirban Chakraborty, Anand Mishra
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2404.11949
Pdf URL: https://arxiv.org/pdf/2404.11949
Copy Paste: [[2404.11949]] Sketch-guided Image Inpainting with Partial Discrete Diffusion Process(https://arxiv.org/abs/2404.11949)
Keywords: diffusion, transformer
Abstract: In this work, we study the task of sketch-guided image inpainting. Unlike the well-explored natural language-guided image inpainting, which excels in capturing semantic details, the relatively less-studied sketch-guided inpainting offers greater user control in specifying the object's shape and pose to be inpainted. As one of the early solutions to this task, we introduce a novel partial discrete diffusion process (PDDP). The forward pass of the PDDP corrupts the masked regions of the image and the backward pass reconstructs these masked regions conditioned on hand-drawn sketches using our proposed sketch-guided bi-directional transformer. The proposed novel transformer module accepts two inputs -- the image containing the masked region to be inpainted and the query sketch to model the reverse diffusion process. This strategy effectively addresses the domain gap between sketches and natural images, thereby enhancing the quality of inpainting results. In the absence of a large-scale dataset specific to this task, we synthesize a dataset from MS-COCO to train and extensively evaluate our proposed framework against various competent approaches in the literature. The qualitative and quantitative results and user studies establish that the proposed method inpaints realistic objects that fit the context in terms of the visual appearance of the provided sketch. To aid further research, we have made our code publicly available at https://github.com/vl2g/Sketch-Inpainting .

Title: The devil is in the object boundary: towards annotation-free instance segmentation using Foundation Models

Authors: Cheng Shi, Sibei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.11957
Pdf URL: https://arxiv.org/pdf/2404.11957
Copy Paste: [[2404.11957]] The devil is in the object boundary: towards annotation-free instance segmentation using Foundation Models(https://arxiv.org/abs/2404.11957)
Keywords: segmentation
Abstract: Foundation models, pre-trained on a large amount of data, have demonstrated impressive zero-shot capabilities in various downstream tasks. However, in object detection and instance segmentation, two fundamental computer vision tasks heavily reliant on extensive human annotations, foundation models such as SAM and DINO struggle to achieve satisfactory performance. In this study, we reveal that the devil is in the object boundary, i.e., these foundation models fail to discern boundaries between individual objects. For the first time, we show that CLIP, which has never accessed any instance-level annotations, can provide a highly beneficial and strong instance-level boundary prior in the clustering results of its particular intermediate layer. Following this surprising observation, we propose Zip, which Zips up CLip and SAM in a novel classification-first-then-discovery pipeline, enabling annotation-free, complex-scene-capable, open-vocabulary object detection and instance segmentation. Our Zip significantly boosts SAM's mask AP on the COCO dataset by 12.5% and establishes state-of-the-art performance in various settings, including training-free, self-training, and label-efficient finetuning. Furthermore, annotation-free Zip even achieves comparable performance to the best-performing open-vocabulary object detectors using base annotations. Code is released at https://github.com/ChengShiest/Zip-Your-CLIP

Title: Aligning Language Models to Explicitly Handle Ambiguity

Authors: Hyuhng Joon Kim, Youna Kim, Cheonbok Park, Junyeob Kim, Choonghyun Park, Kang Min Yoo, Sang-goo Lee, Taeuk Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.11972
Pdf URL: https://arxiv.org/pdf/2404.11972
Copy Paste: [[2404.11972]] Aligning Language Models to Explicitly Handle Ambiguity(https://arxiv.org/abs/2404.11972)
Keywords: large language model
Abstract: In spoken languages, utterances are often shaped to be incomplete or vague for efficiency. This can lead to varying interpretations of the same input, based on different assumptions about the context. To ensure reliable user-model interactions in such scenarios, it is crucial for models to adeptly handle the inherent ambiguity in user queries. However, conversational agents built upon even the most recent large language models (LLMs) face challenges in processing ambiguous inputs, primarily due to the following two hurdles: (1) LLMs are not directly trained to handle inputs that are too ambiguous to be properly managed; (2) the degree of ambiguity in an input can vary according to the intrinsic knowledge of the LLMs, which is difficult to investigate. To address these issues, this paper proposes a method to align LLMs to explicitly handle ambiguous inputs. Specifically, we introduce a proxy task that guides LLMs to utilize their intrinsic knowledge to self-disambiguate a given input. We quantify the information gain from the disambiguation procedure as a measure of the extent to which the models perceive their inputs as ambiguous. This measure serves as a cue for selecting samples deemed ambiguous from the models' perspectives, which are then utilized for alignment. Experimental results from several question-answering datasets demonstrate that the LLMs fine-tuned with our approach are capable of handling ambiguous inputs while still performing competitively on clear questions within the task.

Title: EVIT: Event-Oriented Instruction Tuning for Event Reasoning

Authors: Zhengwei Tao, Xiancai Chen, Zhi Jin, Xiaoying Bai, Haiyan Zhao, Yiwei Lou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.11978
Pdf URL: https://arxiv.org/pdf/2404.11978
Copy Paste: [[2404.11978]] EVIT: Event-Oriented Instruction Tuning for Event Reasoning(https://arxiv.org/abs/2404.11978)
Keywords: large language model
Abstract: Events refer to specific occurrences, incidents, or happenings that take place under a particular background. Event reasoning aims to infer events according to certain relations and predict future events. The cutting-edge techniques for event reasoning play a crucial role in various natural language processing applications. Large language models (LLMs) have made significant advancements in event reasoning owing to their wealth of knowledge and reasoning capabilities. However, smaller instruction-tuned models currently in use do not consistently demonstrate exceptional proficiency in managing these tasks. This discrepancy arises from the absence of explicit modeling of events and their interconnections within the instruction data. Consequently, these models face challenges in comprehending event structures and semantics while struggling to bridge the gap between their interpretations and human understanding of events. Additionally, their limited grasp of event relations constrains their ability to deduce and incorporate pertinent event knowledge. In this paper, we propose Event-Oriented Instruction Tuning (EvIT) to train our LLM. Specifically, we first propose a novel structure named event quadruple which contains the structure and semantics of events and is complete in the event representation. We then design event-relation learning based on the structures. We encapsulate the learning into the instruction-tuning formulation to better stimulate the event reasoning capacity of our model. We design a heuristic unsupervised method to mine event quadruples from a large-scale corpus. Finally, we finetune a Llama model with our Event-Oriented Instruction Tuning. We conduct extensive experiments on event reasoning tasks on several datasets. Automatic and human evaluations demonstrate that EvIT achieves competitive performance on event reasoning.

Title: Tendency-driven Mutual Exclusivity for Weakly Supervised Incremental Semantic Segmentation

Authors: Chongjie Si, Xuehui Wang, Xiaokang Yang, Wei Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.11981
Pdf URL: https://arxiv.org/pdf/2404.11981
Copy Paste: [[2404.11981]] Tendency-driven Mutual Exclusivity for Weakly Supervised Incremental Semantic Segmentation(https://arxiv.org/abs/2404.11981)
Keywords: segmentation
Abstract: Weakly Incremental Learning for Semantic Segmentation (WILSS) leverages a pre-trained segmentation model to segment new classes using cost-effective and readily available image-level labels. A prevailing way to solve WILSS is the generation of seed areas for each new class, serving as a form of pixel-level supervision. However, a scenario usually arises where a pixel is concurrently predicted as an old class by the pre-trained segmentation model and a new class by the seed areas. Such a scenario becomes particularly problematic in WILSS, as the lack of pixel-level annotations on new classes makes it intractable to ascertain whether the pixel pertains to the new class or not. To surmount this issue, we propose an innovative, tendency-driven relationship of mutual exclusivity, meticulously tailored to govern the behavior of the seed areas and the predictions generated by the pre-trained segmentation model. This relationship stipulates that predictions for the new and old classes must not conflict whilst prioritizing the preservation of predictions for the old classes, which not only addresses the conflicting prediction issue but also effectively mitigates the inherent challenge of incremental learning - catastrophic forgetting. Furthermore, under the auspices of this tendency-driven mutual exclusivity relationship, we generate pseudo masks for the new classes, allowing for concurrent execution with model parameter updating via the resolution of a bi-level optimization problem. Extensive experiments substantiate the effectiveness of our framework, resulting in the establishment of new benchmarks and paving the way for further research in this field.

Title: MultiPhys: Multi-Person Physics-aware 3D Motion Estimation

Authors: Nicolas Ugrinovic, Boxiao Pan, Georgios Pavlakos, Despoina Paschalidou, Bokui Shen, Jordi Sanchez-Riera, Francesc Moreno-Noguer, Leonidas Guibas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.11987
Pdf URL: https://arxiv.org/pdf/2404.11987
Copy Paste: [[2404.11987]] MultiPhys: Multi-Person Physics-aware 3D Motion Estimation(https://arxiv.org/abs/2404.11987)
Keywords: robust
Abstract: We introduce MultiPhys, a method designed for recovering multi-person motion from monocular videos. Our focus lies in capturing coherent spatial placement between pairs of individuals across varying degrees of engagement. MultiPhys, being physically aware, exhibits robustness to jittering and occlusions, and effectively eliminates penetration issues between the two individuals. We devise a pipeline in which the motion estimated by a kinematic-based method is fed into a physics simulator in an autoregressive manner. We introduce distinct components that enable our model to harness the simulator's properties without compromising the accuracy of the kinematic estimates. This results in final motion estimates that are both kinematically coherent and physically compliant. Extensive evaluations on three challenging datasets characterized by substantial inter-person interaction show that our method significantly reduces errors associated with penetration and foot skating, while performing competitively with the state-of-the-art on motion accuracy and smoothness. Results and code can be found on our project page.

Title: Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation

Authors: Qiyuan Dai, Sibei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.11998
Pdf URL: https://arxiv.org/pdf/2404.11998
Copy Paste: [[2404.11998]] Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation(https://arxiv.org/abs/2404.11998)
Keywords: segmentation
Abstract: Referring image segmentation (RIS) aims to precisely segment referents in images through corresponding natural language expressions, yet it relies on cost-intensive mask annotations. Weakly supervised RIS thus learns pixel-level semantics from image-text pairs, which makes segmenting fine-grained masks challenging. A natural approach to enhancing segmentation precision is to empower weakly supervised RIS with the image segmentation foundation model SAM. Nevertheless, we observe that simply integrating SAM yields limited benefits and can even lead to performance regression due to inevitable noise and an excessive focus on object parts. In this paper, we present an innovative framework, Point PrompTing (PPT), incorporated with the proposed multi-source curriculum learning strategy to address these challenges. Specifically, the core of PPT is a point generator that not only harnesses CLIP's text-image alignment capability and SAM's powerful mask generation ability but also generates negative point prompts to address the noisy and excessive focus issues inherently and effectively. In addition, we introduce a curriculum learning strategy with object-centric images to help PPT gradually learn from simpler yet precise semantic alignment to more complex RIS. Experiments demonstrate that our PPT significantly and consistently outperforms prior weakly supervised techniques on mIoU by 11.34%, 14.14%, and 6.97% across RefCOCO, RefCOCO+, and G-Ref, respectively.

Title: Token-level Direct Preference Optimization

Authors: Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, Jun Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.11999
Pdf URL: https://arxiv.org/pdf/2404.11999
Copy Paste: [[2404.11999]] Token-level Direct Preference Optimization(https://arxiv.org/abs/2404.11999)
Keywords: large language model
Abstract: Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, the generation of these responses occurs at the token level, following a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence, while preserving simplicity without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO in the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at https://github.com/Vance0124/Token-level-Direct-Preference-Optimization.
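
Illustrative note (not from the paper): for context, the sketch below shows the standard sequence-level DPO loss that the abstract uses as a baseline, in which a Bradley-Terry preference probability is applied to the difference of policy-versus-reference log-ratios for a preferred and a dispreferred response. It is not the TDPO objective; the per-token forward KL terms that TDPO adds are not specified in the abstract and are omitted. The function names and the value of `beta` are assumptions chosen for illustration.

```python
import math


def sequence_logp(token_logps):
    """Log-probability of a full response as the sum of its per-token log-probs."""
    return sum(token_logps)


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard sequence-level DPO loss for one preference pair.

    logp_w / logp_l: policy log-probs of the preferred / dispreferred response.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    Returns -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))


# Hypothetical per-token log-probs for a preferred and a dispreferred response.
lw = sequence_logp([-0.5, -1.0, -0.5])
ll = sequence_logp([-1.5, -2.0, -1.5])
print(dpo_loss(lw, ll, ref_logp_w=-3.0, ref_logp_l=-3.0))
```

When the policy and the reference agree on both responses, the margin is zero and the loss equals log 2; the loss falls below that as the policy shifts probability toward the preferred response.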

Title: Variational Multi-Modal Hypergraph Attention Network for Multi-Modal Relation Extraction

Authors: Qian Li, Cheng Ji, Shu Guo, Yong Zhao, Qianren Mao, Shangguang Wang, Yuntao Wei, Jianxin Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12006
Pdf URL: https://arxiv.org/pdf/2404.12006
Copy Paste: [[2404.12006]] Variational Multi-Modal Hypergraph Attention Network for Multi-Modal Relation Extraction(https://arxiv.org/abs/2404.12006)
Keywords: extraction
Abstract: Multi-modal relation extraction (MMRE) is a challenging task that aims to identify relations between entities in text leveraging image information. Existing methods are limited by their neglect of the multiple entity pairs in one sentence sharing very similar contextual information (i.e., the same text and image), resulting in increased difficulty in the MMRE task. To address this limitation, we propose the Variational Multi-Modal Hypergraph Attention Network (VM-HAN) for multi-modal relation extraction. Specifically, we first construct a multi-modal hypergraph for each sentence with the corresponding image, to establish different high-order intra-/inter-modal correlations for different entity pairs in each sentence. We further design the Variational Hypergraph Attention Networks (V-HAN) to obtain representational diversity among different entity pairs using Gaussian distribution and learn a better hypergraph structure via variational attention. VM-HAN achieves state-of-the-art performance on the multi-modal relation extraction task, outperforming existing methods in terms of accuracy and efficiency.

Title: ParaFusion: A Large-Scale LLM-Driven English Paraphrase Dataset Infused with High-Quality Lexical and Syntactic Diversity

Authors: Lasal Jayawardena, Prasan Yapa
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2404.12010
Pdf URL: https://arxiv.org/pdf/2404.12010
Copy Paste: [[2404.12010]] ParaFusion: A Large-Scale LLM-Driven English Paraphrase Dataset Infused with High-Quality Lexical and Syntactic Diversity(https://arxiv.org/abs/2404.12010)
Keywords: large language model
Abstract: Paraphrase generation is a pivotal task in natural language processing (NLP). Existing datasets in the domain lack syntactic and lexical diversity, resulting in paraphrases that closely resemble the source sentences. Moreover, these datasets often contain hate speech and noise, and may unintentionally include non-English sentences. This research introduces ParaFusion, a large-scale, high-quality English paraphrase dataset developed using Large Language Models (LLMs) to address these challenges. ParaFusion augments existing datasets with high-quality data, significantly enhancing both lexical and syntactic diversity while maintaining close semantic similarity. It also mitigates the presence of hate speech and reduces noise, ensuring a cleaner and more focused English dataset. Results show that ParaFusion offers at least a 25% improvement in both syntactic and lexical diversity, measured across several metrics for each data source. The paper also aims to set a gold standard for paraphrase evaluation as it contains one of the most comprehensive evaluation strategies to date. The results underscore the potential of ParaFusion as a valuable resource for improving NLP applications.

Title: Pseudo-random generators using linear feedback shift registers with output extraction

Authors: Holger Nobach
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2404.12011
Pdf URL: https://arxiv.org/pdf/2404.12011
Copy Paste: [[2404.12011]] Pseudo-random generators using linear feedback shift registers with output extraction(https://arxiv.org/abs/2404.12011)
Keywords: extraction
Abstract: The use of three extractors, fed by linear feedback shift registers (LFSR), for generating pseudo-random bit streams is investigated. Specifically, three schemes are considered: a standard LFSR combined with a von Neumann extractor; a modified LFSR, extended by the all-zero state, combined with an output logic that translates every three bits from the LFSR into up to two output bits; and a run extraction of the input bit stream into single output bits. The latter two achieve better efficiency in using bits from the primary bit stream, with the last one reaching 50%. Compared to other generator logics, the three extractors investigated are less performant in terms of their cryptographic strength. However, the focus of this report is on the quality of the pseudo-random bit stream in comparison to truly random bits and on the efficiency of using the bits of the primary stream from the LFSR and generating valid output bits, while fulfilling only a minimum cryptographic strength, beyond that of the pure LFSR.
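
Illustrative note (not from the paper): a minimal sketch of the first scheme, a standard LFSR feeding a von Neumann extractor, is given below. The register width (4 bits), the tap positions (feedback polynomial x^4 + x + 1), the seed, and the convention of emitting the first bit of each unequal pair are all assumptions chosen for illustration; the abstract does not specify them. The modified LFSR and the run extraction are not shown.

```python
def lfsr_bits(state, taps, nbits, count):
    """Yield `count` output bits from a right-shifting Fibonacci LFSR.

    state: non-zero initial register contents (an nbits-wide integer).
    taps: bit positions (0 = least significant) XORed to form the feedback bit.
    """
    if state == 0:
        raise ValueError("a standard LFSR must not start in the all-zero state")
    mask = (1 << nbits) - 1
    for _ in range(count):
        out = state & 1
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        # Shift right and insert the feedback bit at the top of the register.
        state = ((state >> 1) | (feedback << (nbits - 1))) & mask
        yield out


def von_neumann(bits):
    """Von Neumann extractor over non-overlapping pairs.

    Unequal pairs emit their first bit (01 -> 0, 10 -> 1);
    equal pairs (00, 11) are discarded.
    """
    it = iter(bits)
    for a, b in zip(it, it):
        if a != b:
            yield a


# Assumed toy parameters: 4-bit register with taps at bits 0 and 1.
raw = list(lfsr_bits(state=0b1001, taps=(0, 1), nbits=4, count=30))
extracted = list(von_neumann(raw))
print(raw)
print(extracted)
```

The extractor discards every equal pair and emits at most one bit per input pair, so it can never use more than half of the primary bits, which is consistent with the abstract's emphasis on how efficiently each scheme uses the primary bit stream.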

Title: Sequential Compositional Generalization in Multimodal Models

Authors: Semih Yagcioglu, Osman Batur İnce, Aykut Erdem, Erkut Erdem, Desmond Elliott, Deniz Yuret
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12013
Pdf URL: https://arxiv.org/pdf/2404.12013
Copy Paste: [[2404.12013]] Sequential Compositional Generalization in Multimodal Models(https://arxiv.org/abs/2404.12013)
Keywords: generative
Abstract: The rise of large-scale multimodal models has paved the pathway for groundbreaking advances in generative modeling and reasoning, unlocking transformative applications in a variety of complex tasks. However, a pressing question that remains is their genuine capability for stronger forms of generalization, which has been largely underexplored in the multimodal setting. Our study aims to address this by examining sequential compositional generalization using CompAct (Compositional Activities), a carefully constructed, perceptually grounded dataset set within a rich backdrop of egocentric kitchen activity videos. Each instance in our dataset is represented with a combination of raw video footage, naturally occurring sound, and crowd-sourced step-by-step descriptions. More importantly, our setup ensures that the individual concepts are consistently distributed across training and evaluation sets, while their compositions are novel in the evaluation set. We conduct a comprehensive assessment of several unimodal and multimodal models. Our findings reveal that bi-modal and tri-modal models exhibit a clear edge over their text-only counterparts. This highlights the importance of multimodality while charting a trajectory for future research in this domain.

Title: Enhance Robustness of Language Models Against Variation Attack through Graph Integration

Authors: Zi Xiong, Lizhi Qing, Yangyang Kang, Jiawei Liu, Hongsong Li, Changlong Sun, Xiaozhong Liu, Wei Lu
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2404.12014
Pdf URL: https://arxiv.org/pdf/2404.12014
Copy Paste: [[2404.12014]] Enhance Robustness of Language Models Against Variation Attack through Graph Integration(https://arxiv.org/abs/2404.12014)
Keywords: attack, robust
Abstract: The widespread use of pre-trained language models (PLMs) in natural language processing (NLP) has greatly improved performance outcomes. However, these models' vulnerability to adversarial attacks (e.g., camouflaged hints from drug dealers), particularly in the Chinese language with its rich character diversity/variation and complex structures, raises serious concern. In this study, we propose a novel method, CHinese vAriatioN Graph Enhancement (CHANGE), to increase the robustness of PLMs against character variation attacks in Chinese content. CHANGE presents a novel approach for incorporating a Chinese character variation graph into the PLMs. Through designing different supplementary tasks utilizing the graph structure, CHANGE essentially enhances PLMs' interpretation of adversarially manipulated text. Experiments conducted on a multitude of NLP tasks show that CHANGE outperforms current language models in combating adversarial attacks and serves as a valuable contribution to robust language model research. These findings contribute to the groundwork on robust language models and highlight the substantial potential of graph-guided pre-training strategies for real-world applications.

Title: What does CLIP know about peeling a banana?

Authors: Claudia Cuttano, Gabriele Rosi, Gabriele Trivigno, Giuseppe Averta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12015
Pdf URL: https://arxiv.org/pdf/2404.12015
Copy Paste: [[2404.12015]] What does CLIP know about peeling a banana?(https://arxiv.org/abs/2404.12015)
Keywords: segmentation
Abstract: Humans show an innate capability to identify tools to support specific actions. The association between object parts and the actions they facilitate is usually named affordance. Being able to segment object parts depending on the tasks they afford is crucial to enable intelligent robots to use objects of daily living. Traditional supervised learning methods for affordance segmentation require costly pixel-level annotations, while weakly supervised approaches, though less demanding, still rely on object-interaction examples and support a closed set of actions. These limitations hinder scalability, may introduce biases, and usually restrict models to a limited set of predefined actions. This paper proposes AffordanceCLIP to overcome these limitations by leveraging the implicit affordance knowledge embedded within large pre-trained Vision-Language models like CLIP. We experimentally demonstrate that CLIP, although not explicitly trained for affordance detection, retains valuable information for the task. Our AffordanceCLIP achieves competitive zero-shot performance compared to methods with specialized training, while offering several advantages: i) it works with any action prompt, not just a predefined set; ii) it requires training only a small number of additional parameters compared to existing solutions; and iii) it eliminates the need for direct supervision on action-object pairs, opening new perspectives for functionality-based reasoning of models.

Title: Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering

Authors: Jie Ma, Min Hu, Pinghui Wang, Wangchun Sun, Lingyun Song, Hongbin Pei, Jun Liu, Youtian Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12020
Pdf URL: https://arxiv.org/pdf/2404.12020
Copy Paste: [[2404.12020]] Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering(https://arxiv.org/abs/2404.12020)
Keywords: robust
Abstract: Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning task, demanding intelligent systems to accurately respond to natural language queries based on audio-video input pairs. Nevertheless, prevalent AVQA approaches are prone to overlearning dataset biases, resulting in poor robustness. Furthermore, current datasets may not provide a precise diagnostic for these methods. To tackle these challenges, firstly, we propose a novel dataset, MUSIC-AVQA-R, crafted in two steps: rephrasing questions within the test split of a public dataset (MUSIC-AVQA) and subsequently introducing distribution shifts to split questions. The former leads to a large, diverse test space, while the latter results in a comprehensive robustness evaluation on rare, frequent, and overall questions. Secondly, we propose a robust architecture that utilizes a multifaceted cycle collaborative debiasing strategy to overcome bias learning. Experimental results show that this architecture achieves state-of-the-art performance on both datasets, especially obtaining a significant improvement of 9.68% on the proposed dataset. Extensive ablation experiments are conducted on these two datasets to validate the effectiveness of the debiasing strategy. Additionally, we highlight the limited robustness of existing multi-modal QA methods through the evaluation on our dataset.

Title: Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration

Authors: Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12022
Pdf URL: https://arxiv.org/pdf/2404.12022
Copy Paste: [[2404.12022]] Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration(https://arxiv.org/abs/2404.12022)
Keywords: transformer, large language model
Abstract: Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, the substantial number of parameters in LLMs contributes to significant latency during model inference. This is particularly evident when utilizing autoregressive decoding methods, which generate one token in a single forward process, thereby not fully capitalizing on the parallel computing capabilities of GPUs. In this paper, we propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass. The idea is to transfer the intermediate hidden states of the previous context to the pseudo hidden states of the future tokens to be generated; the pseudo hidden states then pass through the following transformer layers, thereby assimilating more semantic information and achieving superior predictive accuracy of the future tokens. Besides, we use a novel tree attention mechanism to simultaneously generate and verify multiple candidate output sequences, which ensures lossless generation and further improves the generation efficiency of our method. Experiments demonstrate the effectiveness of our method. We conduct extensive analytic experiments to support our motivation. In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.

Title: Meta-Auxiliary Learning for Micro-Expression Recognition

Authors: Jingyao Wang, Yunhan Tian, Yuxuan Yang, Xiaoxin Chen, Changwen Zheng, Wenwen Qiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12024
Pdf URL: https://arxiv.org/pdf/2404.12024
Copy Paste: [[2404.12024]] Meta-Auxiliary Learning for Micro-Expression Recognition(https://arxiv.org/abs/2404.12024)
Keywords: robust
Abstract: Micro-expressions (MEs) are involuntary movements revealing people's hidden feelings, which have attracted considerable interest for their objectivity in emotion detection. However, despite its wide applications in various scenarios, micro-expression recognition (MER) remains a challenging problem in real life for three reasons: (i) data-level: lack of data and imbalanced classes, (ii) feature-level: subtle, rapidly changing, and complex features of MEs, and (iii) decision-making-level: impact of individual differences. To address these issues, we propose a dual-branch meta-auxiliary learning method, called LightmanNet, for fast and robust micro-expression recognition. Specifically, LightmanNet learns general MER knowledge from limited data through a dual-branch bi-level optimization process: (i) In the first level, it obtains task-specific MER knowledge by learning in two branches, where the first branch is for learning MER features via primary MER tasks, while the other branch is for guiding the model to obtain discriminative features via auxiliary tasks, i.e., image alignment between micro-expressions and macro-expressions, given their resemblance in both spatial and temporal behavioral patterns. The two branches of learning jointly constrain the model to learn meaningful task-specific MER knowledge while avoiding learning noise or superficial connections between MEs and emotions that may damage its generalization ability. (ii) In the second level, LightmanNet further refines the learned task-specific knowledge, improving model generalization and efficiency. Extensive experiments on various benchmark datasets demonstrate the superior robustness and efficiency of LightmanNet.

Title: Data-free Knowledge Distillation for Fine-grained Visual Categorization

Authors: Renrong Shao, Wei Zhang, Jianhua Yin, Jun Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12037
Pdf URL: https://arxiv.org/pdf/2404.12037
Copy Paste: [[2404.12037]] Data-free Knowledge Distillation for Fine-grained Visual Categorization(https://arxiv.org/abs/2404.12037)
Keywords: security, privacy, data-free
Abstract: Data-free knowledge distillation (DFKD) is a promising approach for addressing issues related to model compression, security, privacy, and transmission restrictions. Although existing DFKD methods have achieved inspiring results in coarse-grained classification, they obtain sub-optimal results in practical applications involving fine-grained classification tasks that require more detailed distinctions between similar categories. To address this issue, we propose an approach called DFKD-FGVC that extends DFKD to fine-grained visual categorization (FGVC) tasks. Our approach utilizes an adversarial distillation framework with an attention generator, mixed high-order attention distillation, and semantic feature contrast learning. Specifically, we introduce a spatial-wise attention mechanism to the generator to synthesize fine-grained images with more details of discriminative parts. We also utilize the mixed high-order attention mechanism to capture complex interactions among parts and the subtle differences among discriminative features of the fine-grained categories, paying attention to both local features and semantic context relationships. Moreover, we leverage the teacher and student models of the distillation framework to contrast high-level semantic feature maps in the hyperspace, comparing variances of different categories. We evaluate our approach on three widely-used FGVC benchmarks (Aircraft, Cars196, and CUB200) and demonstrate its superior performance.

Title: Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector

Authors: Zhihao Xu, Ruixuan Huang, Xiting Wang, Fangzhao Wu, Jing Yao, Xing Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12038
Pdf URL: https://arxiv.org/pdf/2404.12038
Copy Paste: [[2404.12038]] Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector(https://arxiv.org/abs/2404.12038)
Keywords: attack, robust, large language model
Abstract: Current open-source large language models (LLMs) often undergo careful safety alignment before public release. Some attack methods have also been proposed that help check for safety vulnerabilities in LLMs to ensure alignment robustness. However, many of these methods have moderate attack success rates. Even when successful, the harmfulness of their outputs cannot be guaranteed, leading to suspicions that these methods have not accurately identified the safety vulnerabilities of LLMs. In this paper, we introduce an LLM attack method utilizing concept-based model explanation, where we extract safety concept activation vectors (SCAVs) from LLMs' activation space, enabling efficient attacks on well-aligned LLMs like LLaMA-2, achieving near 100% attack success rate as if LLMs are completely unaligned. This suggests that LLMs, even after thorough safety alignment, could still pose potential risks to society upon public release. To evaluate the harmfulness of outputs resulting from various attack methods, we propose a comprehensive evaluation method that reduces the potential inaccuracies of existing evaluations, and further validate that our method produces more harmful content. Additionally, we discover that the SCAVs show some transferability across different open-source LLMs.

Title: Can We Catch the Elephant? The Evolvement of Hallucination Evaluation on Natural Language Generation: A Survey

Authors: Siya Qi, Yulan He, Zheng Yuan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.12041
Pdf URL: https://arxiv.org/pdf/2404.12041
Copy Paste: [[2404.12041]] Can We Catch the Elephant? The Evolvement of Hallucination Evaluation on Natural Language Generation: A Survey(https://arxiv.org/abs/2404.12041)
Keywords: large language model
Abstract: Hallucination in Natural Language Generation (NLG) is like the elephant in the room, obvious but often overlooked until recent achievements significantly improved the fluency and grammatical accuracy of generated text. For Large Language Models (LLMs), hallucinations can happen in various downstream tasks and casual conversations, which need accurate assessment to enhance reliability and safety. However, current studies on hallucination evaluation vary greatly, and people still find it difficult to sort out and select the most appropriate evaluation methods. Moreover, as NLP research gradually shifts to the domain of LLMs, it brings new challenges to this direction. This paper provides a comprehensive survey on the evolvement of hallucination evaluation methods, aiming to address three key aspects: 1) Diverse definitions and granularity of facts; 2) The categories of automatic evaluators and their applicability; 3) Unresolved issues and future directions.

Title: Using Real-world Bug Bounty Programs in Secure Coding Course: Experience Report

Authors: Kamil Malinka, Anton Firc, Pavel Loutocký, Jakub Vostoupal, Andrej Krištofík, František Kasl
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2404.12043
Pdf URL: https://arxiv.org/pdf/2404.12043
Copy Paste: [[2404.12043]] Using Real-world Bug Bounty Programs in Secure Coding Course: Experience Report(https://arxiv.org/abs/2404.12043)
Keywords: secure, security, attack
Abstract: To keep up with the growing number of cyber-attacks and associated threats, there is an ever-increasing demand for cybersecurity professionals and new methods and technologies. Training new cybersecurity professionals is a challenging task due to the broad scope of the area. One particular field where there is a shortage of experts is Ethical Hacking. Due to its complexity, it often faces educational constraints. Recognizing these challenges, we propose a solution: integrating a real-world bug bounty programme into the cybersecurity curriculum. This innovative approach aims to fill the gap in practical cybersecurity education and also brings additional benefits. To evaluate our idea, we included the proposed solution in a secure coding course at an IT-oriented faculty. We let students choose to participate in a bug bounty programme as an option for the semester assignment in a secure coding course. We then collected responses from the students to evaluate the outcomes (improved skills, reported vulnerabilities, a better relationship with security, etc.). Evaluation of the assignment showed that students enjoyed solving such real-world problems, could find real vulnerabilities, and that it helped raise their skills and cybersecurity awareness. Participation in real bug bounty programmes also positively affects the security level of the tested products. We also discuss the potential risks of this approach and how to mitigate them.

Title: emrQA-msquad: A Medical Dataset Structured with the SQuAD V2.0 Framework, Enriched with emrQA Medical Information

Authors: Jimenez Eladio, Hao Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12050
Pdf URL: https://arxiv.org/pdf/2404.12050
Copy Paste: [[2404.12050]] emrQA-msquad: A Medical Dataset Structured with the SQuAD V2.0 Framework, Enriched with emrQA Medical Information(https://arxiv.org/abs/2404.12050)
Keywords: robust, extraction
Abstract: Machine Reading Comprehension (MRC) holds a pivotal role in shaping Medical Question Answering Systems (QAS) and transforming the landscape of accessing and applying medical information. However, the inherent challenges in the medical field, such as complex terminology and question ambiguity, necessitate innovative solutions. One key solution involves integrating specialized medical datasets and creating dedicated datasets. This strategic approach enhances the accuracy of QAS, contributing to advancements in clinical decision-making and medical research. To address the intricacies of medical terminology, a specialized dataset was integrated, exemplified by a novel Span extraction dataset derived from emrQA but restructured into 163,695 questions and 4,136 manually obtained answers; this new dataset is called the emrQA-msquad dataset. Additionally, for ambiguous questions, a dedicated medical dataset for the Span extraction task was introduced, reinforcing the system's robustness. The fine-tuning of models such as BERT, RoBERTa, and Tiny RoBERTa for medical contexts significantly improved response accuracy within the F1-score range of 0.75 to 1.00 from 10.1% to 37.4%, 18.7% to 44.7% and 16.0% to 46.8%, respectively. Finally, the emrQA-msquad dataset is publicly available at https://huggingface.co/datasets/Eladio/emrqa-msquad.

Title: RAGAR, Your Falsehood RADAR: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models

Authors: M. Abdul Khaliq, P. Chang, M. Ma, B. Pflugfelder, F. Miletić
Subjects: cs.CL, cs.AI, cs.CY, cs.ET, cs.MA
Abstract URL: https://arxiv.org/abs/2404.12065
Pdf URL: https://arxiv.org/pdf/2404.12065
Copy Paste: [[2404.12065]] RAGAR, Your Falsehood RADAR: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models(https://arxiv.org/abs/2404.12065)
Keywords: large language model
Abstract: The escalating challenge of misinformation, particularly in the context of political discourse, necessitates advanced solutions for fact-checking. We introduce innovative approaches to enhance the reliability and efficiency of multimodal fact-checking through the integration of Large Language Models (LLMs) with Retrieval-augmented Generation (RAG)-based advanced reasoning techniques. This work proposes two novel methodologies, Chain of RAG (CoRAG) and Tree of RAG (ToRAG). The approaches are designed to handle multimodal claims by reasoning about the next questions that need to be answered based on previous evidence. Our approaches improve the accuracy of veracity predictions and the generation of explanations over the traditional fact-checking approach of sub-question generation with chain of thought veracity prediction. By employing multimodal LLMs adept at analyzing both text and images, this research advances the capability of automated systems in identifying and countering misinformation.

Title: MaskCD: A Remote Sensing Change Detection Network Based on Mask Classification

Authors: Weikang Yu, Xiaokang Zhang, Samiran Das, Xiao Xiang Zhu, Pedram Ghamisi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12081
Pdf URL: https://arxiv.org/pdf/2404.12081
Copy Paste: [[2404.12081]] MaskCD: A Remote Sensing Change Detection Network Based on Mask Classification(https://arxiv.org/abs/2404.12081)
Keywords: transformer
Abstract: Change detection (CD) from remote sensing (RS) images using deep learning has been widely investigated in the literature. It is typically regarded as a pixel-wise labeling task that aims to classify each pixel as changed or unchanged. Although per-pixel classification networks in encoder-decoder structures have shown dominance, they still suffer from imprecise boundaries and incomplete object delineation in various scenes. For high-resolution RS images, partly or totally changed objects are more worthy of attention than a single pixel. Therefore, we revisit the CD task from the mask prediction and classification perspective and propose MaskCD to detect changed areas by adaptively generating categorized masks from input image pairs. Specifically, it utilizes a cross-level change representation perceiver (CLCRP) to learn multiscale change-aware representations and capture spatiotemporal relations from encoded features by exploiting deformable multihead self-attention (DeformMHSA). Subsequently, a masked-attention-based detection transformers (MA-DETR) decoder is developed to accurately locate and identify changed objects based on masked attention and self-attention mechanisms. It reconstructs the desired changed objects by decoding the pixel-wise representations into learnable mask proposals and making final predictions from these candidates. Experimental results on five benchmark datasets demonstrate that the proposed approach outperforms other state-of-the-art models. Codes and pretrained models are available online (https://github.com/EricYu97/MaskCD).

Title: MambaPupil: Bidirectional Selective Recurrent model for Event-based Eye tracking

Authors: Zhong Wang, Zengyu Wan, Han Han, Bohao Liao, Yuliang Wu, Wei Zhai, Yang Cao, Zheng-jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12083
Pdf URL: https://arxiv.org/pdf/2404.12083
Copy Paste: [[2404.12083]] MambaPupil: Bidirectional Selective Recurrent model for Event-based Eye tracking(https://arxiv.org/abs/2404.12083)
Keywords: secure, robust
Abstract: Event-based eye tracking has shown great promise with the high temporal resolution and low redundancy provided by the event camera. However, the diversity and abruptness of eye movement patterns, including blinking, fixating, saccades, and smooth pursuit, pose significant challenges for eye localization. To achieve a stable event-based eye-tracking system, this paper proposes a bidirectional long-term sequence modeling and time-varying state selection mechanism to fully utilize contextual temporal information in response to the variability of eye movements. Specifically, the MambaPupil network is proposed, which consists of a multi-layer convolutional encoder to extract features from the event representations, a bidirectional Gated Recurrent Unit (GRU), and a Linear Time-Varying State Space Module (LTV-SSM), to selectively capture contextual correlation from the forward and backward temporal relationship. Furthermore, the Bina-rep is utilized as a compact event representation, and a tailor-made data augmentation, called Event-Cutout, is proposed to enhance the model's robustness by applying spatial random masking to the event image. The evaluation on the ThreeET-plus benchmark shows the superior performance of the MambaPupil, which secured the 1st place in CVPR'2024 AIS Event-based Eye Tracking challenge.

Title: Harnessing Joint Rain-/Detail-aware Representations to Eliminate Intricate Rains

Authors: Wu Ran, Peirong Ma, Zhiquan He, Hao Ren, Hong Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12091
Pdf URL: https://arxiv.org/pdf/2404.12091
Copy Paste: [[2404.12091]] Harnessing Joint Rain-/Detail-aware Representations to Eliminate Intricate Rains(https://arxiv.org/abs/2404.12091)
Keywords: transformer
Abstract: Recent advances in image deraining have focused on training powerful models on mixed multiple datasets comprising diverse rain types and backgrounds. However, this approach tends to overlook the inherent differences among rainy images, leading to suboptimal results. To overcome this limitation, we focus on addressing various rainy images by delving into meaningful representations that encapsulate both the rain and background components. Leveraging these representations as instructive guidance, we put forth a Context-based Instance-level Modulation (CoI-M) mechanism adept at efficiently modulating CNN- or Transformer-based models. Furthermore, we devise a rain-/detail-aware contrastive learning strategy to help extract joint rain-/detail-aware representations. By integrating CoI-M with the rain-/detail-aware contrastive learning, we develop CoIC, an innovative and potent algorithm tailored for training models on mixed datasets. Moreover, CoIC offers insight into modeling relationships of datasets, quantitatively assessing the impact of rain and details on restoration, and unveiling distinct behaviors of models given diverse inputs. Extensive experiments validate the efficacy of CoIC in boosting the deraining ability of CNN and Transformer models. CoIC also enhances the deraining prowess remarkably when a real-world dataset is included.

Title: Evaluating the Security of Merkle Trees in the Internet of Things: An Analysis of Data Falsification Probabilities

Authors: Oleksandr Kuznetsov, Alex Rusnak, Anton Yezhov, Kateryna Kuznetsova, Dzianis Kanonik, Oleksandr Domin
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2404.12093
Pdf URL: https://arxiv.org/pdf/2404.12093
Copy Paste: [[2404.12093]] Evaluating the Security of Merkle Trees in the Internet of Things: An Analysis of Data Falsification Probabilities(https://arxiv.org/abs/2404.12093)
Keywords: security
Abstract: Addressing the critical challenge of ensuring data integrity in decentralized systems, this paper delves into the underexplored area of data falsification probabilities within Merkle Trees, which are pivotal in blockchain and Internet of Things (IoT) technologies. Despite their widespread use, a comprehensive understanding of the probabilistic aspects of data security in these structures remains a gap in current research. Our study aims to bridge this gap by developing a theoretical framework to calculate the probability of data falsification, taking into account various scenarios based on the length of the Merkle path and hash length. The research progresses from the derivation of an exact formula for falsification probability to an approximation suitable for cases with significantly large hash lengths. Empirical experiments validate the theoretical models, exploring simulations with diverse hash lengths and Merkle path lengths. The findings reveal a decrease in falsification probability with increasing hash length and an inverse relationship with longer Merkle paths. A numerical analysis quantifies the discrepancy between exact and approximate probabilities, underscoring the conditions for the effective application of the approximation. This work offers crucial insights into optimizing Merkle Tree structures for bolstering security in blockchain and IoT systems, achieving a balance between computational efficiency and data integrity.
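
Illustrative note (not from the paper): the two parameters the abstract varies, hash length and Merkle path length, can be made concrete with a minimal Python sketch of Merkle path verification. The function names (`h`, `build_levels`, `merkle_path`, `verify`), the use of truncated SHA-256 to model a shorter hash length, and the duplicate-last-node padding rule are all assumptions made for illustration. The sketch does not reproduce the paper's probability formulas.

```python
import hashlib


def h(data: bytes, nbytes: int = 32) -> bytes:
    """SHA-256 truncated to nbytes (assumption: truncation models hash length)."""
    return hashlib.sha256(data).digest()[:nbytes]


def build_levels(leaves, nbytes=32):
    """Return all tree levels, from leaf hashes (index 0) up to the root."""
    level = [h(leaf, nbytes) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]  # pad odd levels by duplicating the last node
        level = [h(level[i] + level[i + 1], nbytes) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def merkle_path(levels, index):
    """Collect (sibling_hash, node_is_left) pairs from the leaf up to the root."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):
            sibling = index  # last node of an odd level is paired with itself
        path.append((level[sibling], index % 2 == 0))
        index //= 2
    return path


def verify(leaf, path, root, nbytes=32):
    """Recompute the root from a leaf and its path, then compare with the known root."""
    node = h(leaf, nbytes)
    for sibling, node_is_left in path:
        node = h(node + sibling, nbytes) if node_is_left else h(sibling + node, nbytes)
    return node == root


leaves = [b"sensor-0", b"sensor-1", b"sensor-2", b"sensor-3"]
levels = build_levels(leaves)
root = levels[-1][0]
path = merkle_path(levels, 2)
print(len(path), verify(b"sensor-2", path, root), verify(b"tampered", path, root))
```

In this sketch, a falsified leaf is accepted only if the recomputed root happens to collide with the stored one. That is why the hash length (`nbytes`) and the number of hashing steps along the path (`len(path)`) are the natural quantities to vary.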

Title: Ethical-Lens: Curbing Malicious Usages of Open-Source Text-to-Image Models

Authors: Yuzhu Cai, Sheng Yin, Yuxi Wei, Chenxin Xu, Weibo Mao, Felix Juefei-Xu, Siheng Chen, Yanfeng Wang
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2404.12104
Pdf URL: https://arxiv.org/pdf/2404.12104
Copy Paste: [[2404.12104]] Ethical-Lens: Curbing Malicious Usages of Open-Source Text-to-Image Models(https://arxiv.org/abs/2404.12104)
Keywords: fair
Abstract: The burgeoning landscape of text-to-image models, exemplified by innovations such as Midjourney and DALLE 3, has revolutionized content creation across diverse sectors. However, these advancements bring forth critical ethical concerns, particularly with the misuse of open-source models to generate content that violates societal norms. Addressing this, we introduce Ethical-Lens, a framework designed to facilitate the value-aligned usage of text-to-image tools without necessitating internal model revision. Ethical-Lens ensures value alignment in text-to-image models across toxicity and bias dimensions by refining user commands and rectifying model outputs. Systematic evaluation metrics, combining GPT4-V, HEIM, and FairFace scores, assess alignment capability. Our experiments reveal that Ethical-Lens enhances alignment capabilities to levels comparable with or superior to commercial models like DALLE 3, ensuring user-generated content adheres to ethical standards while maintaining image quality. This study indicates the potential of Ethical-Lens to ensure the sustainable development of open-source text-to-image tools and their beneficial integration into society. Our code is available at https://github.com/yuzhu-cai/Ethical-Lens.

Title: Fortify the Guardian, Not the Treasure: Resilient Adversarial Detectors

Authors: Raz Lapid, Almog Dubin, Moshe Sipper
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2404.12120
Pdf URL: https://arxiv.org/pdf/2404.12120
Copy Paste: [[2404.12120]] Fortify the Guardian, Not the Treasure: Resilient Adversarial Detectors(https://arxiv.org/abs/2404.12120)
Keywords: defense, attack, robust
Abstract: This paper presents RADAR (Robust Adversarial Detection via Adversarial Retraining), an approach designed to enhance the robustness of adversarial detectors against adaptive attacks, while maintaining classifier performance. An adaptive attack is one where the attacker is aware of the defenses and adapts their strategy accordingly. Our proposed method leverages adversarial training to reinforce the ability to detect attacks, without compromising clean accuracy. During the training phase, we integrate into the dataset adversarial examples, which were optimized to fool both the classifier and the adversarial detector, enabling the adversarial detector to learn and adapt to potential attack scenarios. Experimental evaluations on the CIFAR-10 and SVHN datasets demonstrate that our proposed algorithm significantly improves a detector's ability to accurately identify adaptive adversarial attacks -- without sacrificing clean accuracy.

Title: One-Shot Sequential Federated Learning for Non-IID Data by Enhancing Local Model Diversity

Authors: Naibo Wang, Yuchen Deng, Wenjie Feng, Shichen Fan, Jianwei Yin, See-Kiong Ng
Subjects: cs.LG, cs.CV, cs.DC
Abstract URL: https://arxiv.org/abs/2404.12130
Pdf URL: https://arxiv.org/pdf/2404.12130
Copy Paste: [[2404.12130]] One-Shot Sequential Federated Learning for Non-IID Data by Enhancing Local Model Diversity(https://arxiv.org/abs/2404.12130)
Keywords: federate
Abstract: Traditional federated learning mainly focuses on parallel settings (PFL), which can suffer significant communication and computation costs. In contrast, one-shot and sequential federated learning (SFL) have emerged as innovative paradigms to alleviate these costs. However, the issue of non-IID (non-independent and identically distributed) data persists as a significant challenge in one-shot and SFL settings, exacerbated by the restricted communication between clients. In this paper, we improve one-shot sequential federated learning for non-IID data by proposing a local model diversity-enhancing strategy. Specifically, to leverage the potential of local model diversity for improving model performance, we introduce a local model pool for each client that comprises diverse models generated during local training, and propose two distance measurements to further enhance the model diversity and mitigate the effect of non-IID data. Consequently, our proposed framework can improve the global model performance while maintaining low communication costs. Extensive experiments demonstrate that our method exhibits superior performance to existing one-shot PFL methods and achieves better accuracy compared with state-of-the-art one-shot SFL methods on both label-skew and domain-shift tasks (e.g., 6%+ accuracy improvement on the CIFAR-10 dataset).

Title: Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models

Authors: Shouwei Ruan, Yinpeng Dong, Hanqing Liu, Yao Huang, Hang Su, Xingxing Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12139
Pdf URL: https://arxiv.org/pdf/2404.12139
Copy Paste: [[2404.12139]] Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models(https://arxiv.org/abs/2404.12139)
Keywords: robust
Abstract: Vision-Language Pre-training (VLP) models like CLIP have achieved remarkable success in computer vision and particularly demonstrated superior robustness to distribution shifts of 2D images. However, their robustness under 3D viewpoint variations is still limited, which can hinder the development for real-world applications. This paper successfully addresses this concern while keeping VLPs' original performance by breaking through two primary obstacles: 1) the scarcity of training data and 2) the suboptimal fine-tuning paradigms. To combat data scarcity, we build the Multi-View Caption (MVCap) dataset -- a comprehensive collection of over four million multi-view image-text pairs across more than 100K objects, providing more potential for VLP models to develop generalizable viewpoint-invariant representations. To address the limitations of existing paradigms in performance trade-offs and training efficiency, we design a novel fine-tuning framework named Omniview-Tuning (OVT). Specifically, OVT introduces a Cross-Viewpoint Alignment objective through a minimax-like optimization strategy, which effectively aligns representations of identical objects from diverse viewpoints without causing overfitting. Additionally, OVT fine-tunes VLP models in a parameter-efficient manner, leading to minimal computational cost. Extensive experiments on various VLP models with different architectures validate that OVT significantly improves the models' resilience to viewpoint shifts and keeps the original performance, establishing a pioneering standard for boosting the viewpoint invariance of VLP models.

Title: Mushroom Segmentation and 3D Pose Estimation from Point Clouds using Fully Convolutional Geometric Features and Implicit Pose Encoding

Authors: George Retsinas, Niki Efthymiou, Petros Maragos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12144
Pdf URL: https://arxiv.org/pdf/2404.12144
Copy Paste: [[2404.12144]] Mushroom Segmentation and 3D Pose Estimation from Point Clouds using Fully Convolutional Geometric Features and Implicit Pose Encoding(https://arxiv.org/abs/2404.12144)
Keywords: segmentation
Abstract: Modern agricultural applications rely more and more on deep learning solutions. However, training well-performing deep networks requires a large amount of annotated data that may not be available and in the case of 3D annotation may not even be feasible for human annotators. In this work, we develop a deep learning approach to segment mushrooms and estimate their pose on 3D data, in the form of point clouds acquired by depth sensors. To circumvent the annotation problem, we create a synthetic dataset of mushroom scenes, where we are fully aware of 3D information, such as the pose of each mushroom. The proposed network has a fully convolutional backbone that parses sparse 3D data and predicts pose information that implicitly defines both the instance segmentation and pose estimation tasks. We have validated the effectiveness of the proposed implicit-based approach on a synthetic test set, as well as provided qualitative results for a small set of real point clouds acquired with depth sensors. Code is publicly available at https://github.com/georgeretsi/mushroom-pose.

Title: From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency

Authors: Xenia Ohmer, Elia Bruni, Dieuwke Hupkes
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.12145
Pdf URL: https://arxiv.org/pdf/2404.12145
Copy Paste: [[2404.12145]] From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency(https://arxiv.org/abs/2404.12145)
Keywords: large language model
Abstract: The staggering pace with which the capabilities of large language models (LLMs) are increasing, as measured by a range of commonly used natural language understanding (NLU) benchmarks, raises many questions regarding what "understanding" means for a language model and how it compares to human understanding. This is especially true since many LLMs are exclusively trained on text, casting doubt on whether their stellar benchmark performances are reflective of a true understanding of the problems represented by these benchmarks, or whether LLMs simply excel at uttering textual forms that correlate with what someone who understands the problem would say. In this philosophically inspired work, we aim to create some separation between form and meaning, with a series of tests that leverage the idea that world understanding should be consistent across presentational modes - inspired by Fregean senses - of the same meaning. Specifically, we focus on consistency across languages as well as paraphrases. Taking GPT-3.5 as our object of study, we evaluate multisense consistency across five different languages and various tasks. We start the evaluation in a controlled setting, asking the model for simple facts, and then proceed with an evaluation on four popular NLU benchmarks. We find that the model's multisense consistency is lacking and run several follow-up analyses to verify that this lack of consistency is due to a sense-dependent task understanding. We conclude that, in this aspect, the understanding of LLMs is still quite far from being consistent and human-like, and deliberate on how this impacts their utility in the context of learning about human language and understanding.

Title: FecTek: Enhancing Term Weight in Lexicon-Based Retrieval with Feature Context and Term-level Knowledge

Authors: Zunran Wang, Zhonghua Li, Wei Shen, Qi Ye, Liqiang Nie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12152
Pdf URL: https://arxiv.org/pdf/2404.12152
Copy Paste: [[2404.12152]] FecTek: Enhancing Term Weight in Lexicon-Based Retrieval with Feature Context and Term-level Knowledge(https://arxiv.org/abs/2404.12152)
Keywords: robust
Abstract: Lexicon-based retrieval has gained significant popularity in text retrieval due to its efficient and robust performance. To further enhance performance of lexicon-based retrieval, researchers have been diligently incorporating state-of-the-art methodologies like Neural retrieval and text-level contrastive learning approaches. Nonetheless, despite the promising outcomes, current lexicon-based retrieval methods have paid limited attention to the potential benefits of feature context representations and term-level knowledge guidance. In this paper, we propose an innovative method that introduces FEature Context and TErm-level Knowledge modules (FecTek). To effectively enrich the feature context representations of term weight, the Feature Context Module (FCM) is introduced, which leverages the power of BERT's representation to determine dynamic weights for each element in the embedding. Additionally, we develop a term-level knowledge guidance module (TKGM) for effectively utilizing term-level knowledge to intelligently guide the modeling process of term weight. Evaluation of the proposed method on MS Marco benchmark demonstrates its superiority over the previous state-of-the-art approaches.

Title: StyleBooth: Image Style Editing with Multimodal Instruction

Authors: Zhen Han, Chaojie Mao, Zeyinzi Jiang, Yulin Pan, Jingfeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12154
Pdf URL: https://arxiv.org/pdf/2404.12154
Copy Paste: [[2404.12154]] StyleBooth: Image Style Editing with Multimodal Instruction(https://arxiv.org/abs/2404.12154)
Keywords: diffusion
Abstract: Given an original image, image editing aims to generate an image that aligns with the provided instruction. The challenges are to accept multimodal inputs as instructions and a scarcity of high-quality training data, including crucial triplets of source/target image pairs and multimodal (text and image) instructions. In this paper, we focus on image style editing and present StyleBooth, a method that proposes a comprehensive framework for image editing and a feasible strategy for building a high-quality style editing dataset. We integrate encoded textual instruction and image exemplar as a unified condition for a diffusion model, enabling the editing of the original image following multimodal instructions. Furthermore, by iterative style-destyle tuning and editing and usability filtering, the StyleBooth dataset provides content-consistent stylized/plain image pairs in various categories of styles. To show the flexibility of StyleBooth, we conduct experiments on diverse tasks, such as text-based style editing, exemplar-based style editing and compositional style editing. The results demonstrate that the quality and variety of training data significantly enhance the ability to preserve content and improve the overall quality of generated images in editing tasks. Project page can be found at https://ali-vilab.github.io/stylebooth-page/.

Title: Real-World Efficient Blind Motion Deblurring via Blur Pixel Discretization

Authors: Insoo Kim, Jae Seok Choi, Geonseok Seo, Kinam Kwon, Jinwoo Shin, Hyong-Euk Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2404.12168
Pdf URL: https://arxiv.org/pdf/2404.12168
Copy Paste: [[2404.12168]] Real-World Efficient Blind Motion Deblurring via Blur Pixel Discretization(https://arxiv.org/abs/2404.12168)
Keywords: segmentation
Abstract: As recent advances in mobile camera technology have enabled the capability to capture high-resolution images, such as 4K images, the demand for an efficient deblurring model handling large motion has increased. In this paper, we discover that the image residual errors, i.e., blur-sharp pixel differences, can be grouped into some categories according to their motion blur type and how complex their neighboring pixels are. Inspired by this, we decompose the deblurring (regression) task into blur pixel discretization (pixel-level blur classification) and discrete-to-continuous conversion (regression with blur class map) tasks. Specifically, we generate the discretized image residual errors by identifying the blur pixels and then transform them to a continuous form, which is computationally more efficient than naively solving the original regression problem with continuous values. Here, we found that the discretization result, i.e., blur segmentation map, remarkably exhibits visual similarity with the image residual errors. As a result, our efficient model shows comparable performance to state-of-the-art methods in realistic benchmarks, while our method is up to 10 times computationally more efficient.

Title: Stance Detection on Social Media with Fine-Tuned Large Language Models

Authors: İlker Gül, Rémi Lebret, Karl Aberer
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2404.12171
Pdf URL: https://arxiv.org/pdf/2404.12171
Copy Paste: [[2404.12171]] Stance Detection on Social Media with Fine-Tuned Large Language Models(https://arxiv.org/abs/2404.12171)
Keywords: large language model
Abstract: Stance detection, a key task in natural language processing, determines an author's viewpoint based on textual analysis. This study evaluates the evolution of stance detection methods, transitioning from early machine learning approaches to the groundbreaking BERT model, and eventually to modern Large Language Models (LLMs) such as ChatGPT, LLaMa-2, and Mistral-7B. While ChatGPT's closed-source nature and associated costs present challenges, open-source models like LLaMa-2 and Mistral-7B offer an encouraging alternative. Initially, our research focused on fine-tuning ChatGPT, LLaMa-2, and Mistral-7B using several publicly available datasets. Subsequently, to provide a comprehensive comparison, we assess the performance of these models in zero-shot and few-shot learning scenarios. The results underscore the exceptional ability of LLMs in accurately detecting stance, with all tested models surpassing existing benchmarks. Notably, LLaMa-2 and Mistral-7B demonstrate remarkable efficiency and potential for stance detection, despite their smaller sizes compared to ChatGPT. This study emphasizes the potential of LLMs in stance detection and calls for more extensive research in this field.

Title: How to Benchmark Vision Foundation Models for Semantic Segmentation?

Authors: Tommie Kerssies, Daan de Geus, Gijs Dubbelman
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2404.12172
Pdf URL: https://arxiv.org/pdf/2404.12172
Copy Paste: [[2404.12172]] How to Benchmark Vision Foundation Models for Semantic Segmentation?(https://arxiv.org/abs/2404.12172)
Keywords: segmentation
Abstract: Recent vision foundation models (VFMs) have demonstrated proficiency in various tasks but require supervised fine-tuning to perform the task of semantic segmentation effectively. Benchmarking their performance is essential for selecting current models and guiding future model developments for this task. The lack of a standardized benchmark complicates comparisons. Therefore, the primary objective of this paper is to study how VFMs should be benchmarked for semantic segmentation. To do so, various VFMs are fine-tuned under various settings, and the impact of individual settings on the performance ranking and training time is assessed. Based on the results, the recommendation is to fine-tune the ViT-B variants of VFMs with a 16x16 patch size and a linear decoder, as these settings are representative of using a larger model, more advanced decoder and smaller patch size, while reducing training time by more than 13 times. Using multiple datasets for training and evaluation is also recommended, as the performance ranking across datasets and domain shifts varies. Linear probing, a common practice for some VFMs, is not recommended, as it is not representative of end-to-end fine-tuning. The benchmarking setup recommended in this paper enables a performance analysis of VFMs for semantic segmentation. The findings of such an analysis reveal that pretraining with promptable segmentation is not beneficial, whereas masked image modeling (MIM) with abstract representations is crucial, even more important than the type of supervision used. The code for efficiently fine-tuning VFMs for semantic segmentation can be accessed through the project page at: https://tue-mps.github.io/benchmark-vfm-ss/.

Title: Gait Recognition from Highly Compressed Videos

Authors: Andrei Niculae, Andy Catruna, Adrian Cosma, Daniel Rosner, Emilian Radoi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12183
Pdf URL: https://arxiv.org/pdf/2404.12183
Copy Paste: [[2404.12183]] Gait Recognition from Highly Compressed Videos(https://arxiv.org/abs/2404.12183)
Keywords: robust
Abstract: Surveillance footage is a valuable resource for conducting gait analysis. However, the typical low quality and high noise levels in such footage can severely impact the accuracy of pose estimation algorithms, which are foundational for reliable gait analysis. Existing literature suggests a direct correlation between the efficacy of pose estimation and the subsequent gait analysis results. A common mitigation strategy involves fine-tuning pose estimation models on noisy data to improve robustness. However, this approach may degrade the downstream model's performance on the original high-quality data, leading to a trade-off that is undesirable in practice. We propose a processing pipeline that incorporates a task-targeted artifact correction model specifically designed to pre-process and enhance surveillance footage before pose estimation. Our artifact correction model is optimized to work alongside a state-of-the-art pose estimation network, HRNet, without requiring repeated fine-tuning of the pose estimation model. Furthermore, we propose a simple and robust method for automatically obtaining low-quality videos annotated with poses, for the purpose of training the artifact correction model. We systematically evaluate the performance of our artifact correction model against a range of noisy surveillance data and demonstrate that our approach not only achieves improved pose estimation on low-quality surveillance footage, but also preserves the integrity of the pose estimation on high-resolution footage. Our experiments show a clear enhancement in gait analysis performance, supporting the viability of the proposed method as a superior alternative to direct fine-tuning strategies. Our contributions pave the way for more reliable gait analysis using surveillance data in real-world applications, regardless of data quality.

Title: Privacy-Preserving UCB Decision Process Verification via zk-SNARKs

Authors: Xikun Jiang, He Lyu, Chenhao Ying, Yibin Xu, Boris Düdder, Yuan Luo
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2404.12186
Pdf URL: https://arxiv.org/pdf/2404.12186
Copy Paste: [[2404.12186]] Privacy-Preserving UCB Decision Process Verification via zk-SNARKs(https://arxiv.org/abs/2404.12186)
Keywords: security, privacy, protect
Abstract: With the increasingly widespread application of machine learning, how to strike a balance between protecting the privacy of data and algorithm parameters and ensuring the verifiability of machine learning has always been a challenge. This study explores the intersection of reinforcement learning and data privacy, specifically addressing the Multi-Armed Bandit (MAB) problem with the Upper Confidence Bound (UCB) algorithm. We introduce zkUCB, an innovative algorithm that employs the Zero-Knowledge Succinct Non-Interactive Argument of Knowledge (zk-SNARKs) to enhance UCB. zkUCB is carefully designed to safeguard the confidentiality of training data and algorithmic parameters, ensuring transparent UCB decision-making. Experiments highlight zkUCB's superior performance, attributing its enhanced reward to judicious quantization bit usage that reduces information entropy in the decision-making process. zkUCB's proof size and verification time scale linearly with the execution steps of zkUCB. This showcases zkUCB's adept balance between data security and operational efficiency. This approach contributes significantly to the ongoing discourse on reinforcing data privacy in complex decision-making processes, offering a promising solution for privacy-sensitive applications.
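
For readers unfamiliar with the baseline that zkUCB builds on, the following is a minimal sketch of the standard UCB1 rule for a multi-armed bandit. It is not the authors' zkUCB: it contains no zk-SNARK proof generation and no quantization, and the names `ucb1`, `pull`, and the Bernoulli reward probabilities are hypothetical choices made only for illustration.

```python
import math
import random


def ucb1(pull, n_arms, horizon):
    """Run standard UCB1 for `horizon` steps; `pull(arm)` returns a reward in [0, 1]."""
    counts = [0] * n_arms   # number of times each arm was played
    sums = [0.0] * n_arms   # cumulative reward per arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1     # play every arm once before using the bound
        else:
            # empirical mean plus the UCB1 exploration bonus
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts, sums


# Hypothetical Bernoulli bandit used only to exercise the sketch.
rng = random.Random(0)
probs = [0.2, 0.5, 0.8]
counts, sums = ucb1(lambda a: 1.0 if rng.random() < probs[a] else 0.0, 3, 2000)
print(counts)
```

The abstract's claim about proof size and verification time concerns how many such decision steps are proven, which this sketch does not model.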

Title: Estimating the Hessian Matrix of Ranking Objectives for Stochastic Learning to Rank with Gradient Boosted Trees

Authors: Jingwei Kang, Maarten de Rijke, Harrie Oosterhuis
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2404.12190
Pdf URL: https://arxiv.org/pdf/2404.12190
Copy Paste: [[2404.12190]] Estimating the Hessian Matrix of Ranking Objectives for Stochastic Learning to Rank with Gradient Boosted Trees(https://arxiv.org/abs/2404.12190)
Keywords: fair
Abstract: Stochastic learning to rank (LTR) is a recent branch in the LTR field that concerns the optimization of probabilistic ranking models. Their probabilistic behavior enables certain ranking qualities that are impossible with deterministic models. For example, they can increase the diversity of displayed documents, increase fairness of exposure over documents, and better balance exploitation and exploration through randomization. A core difficulty in LTR is gradient estimation; for this reason, existing stochastic LTR methods have been limited to differentiable ranking models (e.g., neural networks). This is in stark contrast with the general field of LTR, where Gradient Boosted Decision Trees (GBDTs) have long been considered the state-of-the-art. In this work, we address this gap by introducing the first stochastic LTR method for GBDTs. Our main contribution is a novel estimator for the second-order derivatives, i.e., the Hessian matrix, which is a requirement for effective GBDTs. To efficiently compute both the first and second-order derivatives simultaneously, we incorporate our estimator into the existing PL-Rank framework, which was originally designed for first-order derivatives only. Our experimental results indicate that stochastic LTR without the Hessian has extremely poor performance, whilst the performance is competitive with the current state-of-the-art with our estimated Hessian. Thus, through the contribution of our novel Hessian estimation method, we have successfully introduced GBDTs to stochastic LTR.

Title: Aligning Actions and Walking to LLM-Generated Textual Descriptions

Authors: Radu Chivereanu, Adrian Cosma, Andy Catruna, Razvan Rughinis, Emilian Radoi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12192
Pdf URL: https://arxiv.org/pdf/2404.12192
Copy Paste: [[2404.12192]] Aligning Actions and Walking to LLM-Generated Textual Descriptions(https://arxiv.org/abs/2404.12192)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various domains, including data augmentation and synthetic data generation. This work explores the use of LLMs to generate rich textual descriptions for motion sequences, encompassing both actions and walking patterns. We leverage the expressive power of LLMs to align motion representations with high-level linguistic cues, addressing two distinct tasks: action recognition and retrieval of walking sequences based on appearance attributes. For action recognition, we employ LLMs to generate textual descriptions of actions in the BABEL-60 dataset, facilitating the alignment of motion sequences with linguistic representations. In the domain of gait analysis, we investigate the impact of appearance attributes on walking patterns by generating textual descriptions of motion sequences from the DenseGait dataset using LLMs. These descriptions capture subtle variations in walking styles influenced by factors such as clothing choices and footwear. Our approach demonstrates the potential of LLMs in augmenting structured motion attributes and aligning multi-modal representations. The findings contribute to the advancement of comprehensive motion understanding and open up new avenues for leveraging LLMs in multi-modal alignment and data augmentation for motion analysis. We make the code publicly available at https://github.com/Radu1999/WalkAndText

Title: The Explicit values of the UBCT, the LBCT and the DBCT of the inverse function

Authors: Yuying Man, Nian Li, Zhen Liu, Xiangyong Zeng
Subjects: cs.CR, cs.IT
Abstract URL: https://arxiv.org/abs/2404.12208
Pdf URL: https://arxiv.org/pdf/2404.12208
Copy Paste: [[2404.12208]] The Explicit values of the UBCT, the LBCT and the DBCT of the inverse function(https://arxiv.org/abs/2404.12208)
Keywords: security, attack
Abstract: Substitution boxes (S-boxes) play a significant role in ensuring the resistance of block ciphers against various attacks. The Upper Boomerang Connectivity Table (UBCT), the Lower Boomerang Connectivity Table (LBCT) and the Double Boomerang Connectivity Table (DBCT) of a given S-box are crucial tools to analyze its security concerning specific attacks. However, there are currently no related results for this research. The inverse function is crucial for constructing S-boxes of block ciphers with good cryptographic properties in symmetric cryptography. Therefore, extensive research has been conducted on the inverse function, exploring various properties related to standard attacks. Thanks to the recent advancements in boomerang cryptanalysis, particularly the introduction of concepts such as UBCT, LBCT, and DBCT, this paper aims to further investigate the properties of the inverse function $F(x)=x^{2^n-2}$ over $\mathbb{F}_{2^n}$ for arbitrary $n$. As a consequence, by carrying out certain finer manipulations of solving specific equations over $\mathbb{F}_{2^n}$, we give all entries of the UBCT, LBCT of $F(x)$ over $\mathbb{F}_{2^n}$ for arbitrary $n$. Besides, based on the results of the UBCT and LBCT for the inverse function, we determine that $F(x)$ is hard when $n$ is odd. Furthermore, we completely compute all entries of the DBCT of $F(x)$ over $\mathbb{F}_{2^n}$ for arbitrary $n$. Additionally, we provide the precise number of elements with a given entry by means of the values of some Kloosterman sums. Further, we determine the double boomerang uniformity of $F(x)$ over $\mathbb{F}_{2^n}$ for arbitrary $n$. Our in-depth analysis of the DBCT of $F(x)$ contributes to a better evaluation of the S-box's resistance against boomerang attacks.
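
To make the object of study concrete, here is a small sketch of the inverse function $F(x)=x^{2^n-2}$ over a binary field. The choice $n=3$ with reduction polynomial $x^3+x+1$ and the helper names `gf_mul`, `gf_pow`, and `inverse_function` are assumptions made for illustration; the sketch does not compute any boomerang connectivity table.

```python
def gf_mul(a, b, n=3, poly=0b1011):
    """Multiply a and b in GF(2^n), reducing by the irreducible polynomial `poly`."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^n) is XOR
        b >>= 1
        a <<= 1
        if a & (1 << n):         # degree reached n: reduce modulo poly
            a ^= poly
    return result


def gf_pow(a, e, n=3, poly=0b1011):
    """Raise a to the power e in GF(2^n) by square-and-multiply."""
    result = 1
    while e:
        if e & 1:
            result = gf_mul(result, a, n, poly)
        a = gf_mul(a, a, n, poly)
        e >>= 1
    return result


def inverse_function(x, n=3, poly=0b1011):
    """F(x) = x^(2^n - 2): maps 0 to 0 and every nonzero x to its multiplicative inverse."""
    return gf_pow(x, 2**n - 2, n, poly)


table = [inverse_function(x) for x in range(8)]
print(table)
```

Because every nonzero element satisfies $x^{2^n-1}=1$, multiplying $x$ by $F(x)$ gives 1, which is why this power map is called the inverse function and why it is a permutation of the field.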

Title: Observation, Analysis, and Solution: Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training

Authors: Jin Gao, Shubo Lin, Shaoru Wang, Yutong Kou, Zeming Li, Liang Li, Congxuan Zhang, Xiaoqin Zhang, Yizheng Wang, Weiming Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12210
Pdf URL: https://arxiv.org/pdf/2404.12210
Copy Paste: [[2404.12210]] Observation, Analysis, and Solution: Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training(https://arxiv.org/abs/2404.12210)
Keywords: transformer, segmentation
Abstract: Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) in computer vision has enabled promising downstream performance on top of the learned self-supervised ViT features. In this paper, we question if the extremely simple ViTs' fine-tuning performance with a small-scale architecture can also benefit from this pre-training paradigm, which is considerably less studied yet in contrast to the well-established lightweight architecture design methodology with sophisticated components introduced. By carefully adapting various typical MIM pre-training methods to this lightweight regime and comparing them with the contrastive learning (CL) pre-training on various downstream image classification and dense prediction tasks, we systematically observe different behaviors between MIM and CL with respect to the downstream fine-tuning data scales. Furthermore, we analyze the frozen features under linear probing evaluation and also the layer representation similarities and attention maps across the obtained models, which clearly show the inferior learning of MIM pre-training on higher layers, leading to unsatisfactory fine-tuning performance on data-insufficient downstream tasks. This finding is naturally a guide to choosing appropriate distillation strategies during pre-training to solve the above deterioration problem. Extensive experiments on various vision tasks demonstrate the effectiveness of our observation-analysis-solution flow. In particular, our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design (5.7M/6.5M) can achieve 79.4%/78.9% top-1 accuracy on ImageNet-1K. It also enables SOTA performance on the ADE20K semantic segmentation task (42.8% mIoU) and LaSOT visual tracking task (66.1% AUC) in the lightweight regime. The latter even surpasses all the current SOTA lightweight CPU-realtime trackers.

Title: A Quadrature Approach for General-Purpose Batch Bayesian Optimization via Probabilistic Lifting

Authors: Masaki Adachi, Satoshi Hayakawa, Martin Jørgensen, Saad Hamid, Harald Oberhauser, Michael A. Osborne
Subjects: cs.LG, math.NA, stat.ML
Abstract URL: https://arxiv.org/abs/2404.12219
Pdf URL: https://arxiv.org/pdf/2404.12219
Copy Paste: [[2404.12219]] A Quadrature Approach for General-Purpose Batch Bayesian Optimization via Probabilistic Lifting(https://arxiv.org/abs/2404.12219)
Keywords: robust
Abstract: Parallelisation in Bayesian optimisation is a common strategy but faces several challenges: the need for flexibility in acquisition functions and kernel choices, flexibility dealing with discrete and continuous variables simultaneously, model misspecification, and lastly fast massive parallelisation. To address these challenges, we introduce a versatile and modular framework for batch Bayesian optimisation via probabilistic lifting with kernel quadrature, called SOBER, which we present as a Python library based on GPyTorch/BoTorch. Our framework offers the following unique benefits: (1) Versatility in downstream tasks under a unified approach. (2) A gradient-free sampler, which does not require the gradient of acquisition functions, offering domain-agnostic sampling (e.g., discrete and mixed variables, non-Euclidean space). (3) Flexibility in domain prior distribution. (4) Adaptive batch size (autonomous determination of the optimal batch size). (5) Robustness against a misspecified reproducing kernel Hilbert space. (6) Natural stopping criterion.

Title: Length Generalization of Causal Transformers without Position Encoding

Authors: Jie Wang, Tao Ji, Yuanbin Wu, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang, Xiaoling Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12224
Pdf URL: https://arxiv.org/pdf/2404.12224
Copy Paste: [[2404.12224]] Length Generalization of Causal Transformers without Position Encoding(https://arxiv.org/abs/2404.12224)
Keywords: transformer
Abstract: Generalizing to longer sentences is important for recent Transformer-based language models. Besides algorithms manipulating explicit position features, the success of Transformers without position encodings (NoPE) provides a new way to overcome the challenge. In this paper, we study the length generalization property of NoPE. We find that although NoPE can extend to longer sequences than the commonly used explicit position encodings, it still has a limited context length. We identify a connection between the failure of NoPE's generalization and the distraction of attention distributions. We propose a parameter-efficient tuning for searching attention heads' best temperature hyper-parameters, which substantially expands NoPE's context size. Experiments on long sequence language modeling, the synthetic passkey retrieval task and real-world long context tasks show that NoPE can achieve competitive performances with state-of-the-art length generalization algorithms. The source code is publicly accessible
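
The link between attention temperature and how spread out an attention distribution is can be illustrated with a toy softmax. This is only a sketch: the `scale` parameter, the example scores, and the use of entropy as a measure of "distraction" are assumptions made here, not the paper's parameterization or tuning procedure.

```python
import math


def softmax(scores, scale=1.0):
    """Softmax over attention logits multiplied by `scale` (an inverse temperature)."""
    scaled = [s * scale for s in scores]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(p):
    """Shannon entropy in nats; higher means attention is spread over more positions."""
    return -sum(x * math.log(x) for x in p if x > 0)


scores = [2.0, 1.0, 0.5, 0.1]            # hypothetical logits for one query
base = softmax(scores, scale=1.0)
sharp = softmax(scores, scale=2.0)       # larger scale = lower temperature
print(entropy(base), entropy(sharp))
```

A larger scale concentrates probability mass on the highest-scoring positions, so the entropy of the distribution drops.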

Title: Neural Networks with Causal Graph Constraints: A New Approach for Treatment Effects Estimation

Authors: Roger Pros, Jordi Vitrià
Subjects: cs.LG, stat.ME
Abstract URL: https://arxiv.org/abs/2404.12238
Pdf URL: https://arxiv.org/pdf/2404.12238
Copy Paste: [[2404.12238]] Neural Networks with Causal Graph Constraints: A New Approach for Treatment Effects Estimation(https://arxiv.org/abs/2404.12238)
Keywords: robust
Abstract: In recent years, there has been a growing interest in using machine learning techniques for the estimation of treatment effects. Most of the best-performing methods rely on representation learning strategies that encourage shared behavior among potential outcomes to increase the precision of treatment effect estimates. In this paper we discuss and classify these models in terms of their algorithmic inductive biases and present a new model, NN-CGC, that considers additional information from the causal graph. NN-CGC tackles bias resulting from spurious variable interactions by implementing novel constraints on models, and it can be integrated with other representation learning methods. We test the effectiveness of our method using three different base models on common benchmarks. Our results indicate that our model constraints lead to significant improvements, achieving new state-of-the-art results in treatment effects estimation. We also show that our method is robust to imperfect causal graphs and that using partial causal information is preferable to ignoring it.

Title: CMNEE: A Large-Scale Document-Level Event Extraction Dataset based on Open-Source Chinese Military News

Authors: Mengna Zhu, Zijie Xu, Kaisheng Zeng, Kaiming Xiao, Mao Wang, Wenjun Ke, Hongbin Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12242
Pdf URL: https://arxiv.org/pdf/2404.12242
Copy Paste: [[2404.12242]] CMNEE: A Large-Scale Document-Level Event Extraction Dataset based on Open-Source Chinese Military News(https://arxiv.org/abs/2404.12242)
Keywords: extraction
Abstract: Extracting structured event knowledge, including event triggers and corresponding arguments, from military texts is fundamental to many applications, such as intelligence analysis and decision assistance. However, event extraction in the military field faces the data scarcity problem, which impedes the research of event extraction models in this domain. To alleviate this problem, we propose CMNEE, a large-scale, document-level open-source Chinese Military News Event Extraction dataset. It contains 17,000 documents and 29,223 events, which are all manually annotated based on a pre-defined schema for the military domain including 8 event types and 11 argument role types. We designed a two-stage, multi-turn annotation strategy to ensure the quality of CMNEE and reproduced several state-of-the-art event extraction models with a systematic evaluation. The experimental results on CMNEE fall noticeably short of those on other domain datasets, which demonstrates that event extraction for the military domain poses unique challenges and requires further research efforts. Our code and data can be obtained from https://github.com/Mzzzhu/CMNEE.

Title: Deep Gaussian mixture model for unsupervised image segmentation

Authors: Matthias Schwab, Agnes Mayr, Markus Haltmeier
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12252
Pdf URL: https://arxiv.org/pdf/2404.12252
Copy Paste: [[2404.12252]] Deep Gaussian mixture model for unsupervised image segmentation(https://arxiv.org/abs/2404.12252)
Keywords: segmentation
Abstract: The recent emergence of deep learning has led to a great deal of work on designing supervised deep semantic segmentation algorithms. Since in many tasks sufficient pixel-level labels are very difficult to obtain, we propose a method which combines a Gaussian mixture model (GMM) with unsupervised deep learning techniques. In the standard GMM the pixel values within each sub-region are modelled by a Gaussian distribution. In order to identify the different regions, the parameter vector that minimizes the negative log-likelihood (NLL) function regarding the GMM has to be approximated. For this task, usually iterative optimization methods such as the expectation-maximization (EM) algorithm are used. In this paper, we propose to estimate these parameters directly from the image using a convolutional neural network (CNN). We thus change the iterative procedure in the EM algorithm, replacing the expectation-step by a gradient-step with regard to the network's parameters. This means that the network is trained to minimize the NLL function of the GMM, which comes with at least two advantages. First, once trained, the network is able to predict label probabilities very quickly compared with time-consuming iterative optimization methods. Second, due to the deep image prior, our method is able to partially overcome one of the main disadvantages of GMM, which is not taking into account correlation between neighboring pixels, as it assumes independence between them. We demonstrate the advantages of our method in various experiments on the example of myocardial infarct segmentation on multi-sequence MRI images.
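
The objective mentioned in the abstract, the negative log-likelihood of a Gaussian mixture over pixel values, can be sketched as follows. This is a plain one-dimensional GMM with fixed, hand-picked parameters; the function name `gmm_nll` and the toy pixel values are assumptions, and the sketch omits the CNN that the authors use to predict the parameters.

```python
import math


def gmm_nll(pixels, weights, means, stds):
    """Average negative log-likelihood of scalar pixel values under a 1-D Gaussian mixture."""
    total = 0.0
    for x in pixels:
        # mixture density: weighted sum of Gaussian densities
        likelihood = sum(
            w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
            for w, m, s in zip(weights, means, stds)
        )
        total -= math.log(likelihood)
    return total / len(pixels)


# Hypothetical intensities from two regions (dark and bright).
pixels = [0.10, 0.12, 0.09, 0.80, 0.82, 0.79]
good = gmm_nll(pixels, [0.5, 0.5], [0.1, 0.8], [0.05, 0.05])
bad = gmm_nll(pixels, [0.5, 0.5], [0.4, 0.5], [0.05, 0.05])
print(good, bad)
```

Parameters whose component means sit on the two intensity clusters give a lower NLL than mismatched ones, which is the signal a gradient step on this loss would follow.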

Title: Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing

Authors: Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, Dong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12253
Pdf URL: https://arxiv.org/pdf/2404.12253
Copy Paste: [[2404.12253]] Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing(https://arxiv.org/abs/2404.12253)
Keywords: large language model
Abstract: Despite the impressive capabilities of Large Language Models (LLMs) on various tasks, they still struggle with scenarios that involve complex reasoning and planning. Recent work proposed advanced prompting techniques and the necessity of fine-tuning with high-quality data to augment LLMs' reasoning abilities. However, these approaches are inherently constrained by data availability and quality. In light of this, self-correction and self-learning emerge as viable solutions, employing strategies that allow LLMs to refine their outputs and learn from self-assessed rewards. Yet, the efficacy of LLMs in self-refining their responses, particularly in complex reasoning and planning tasks, remains dubious. In this paper, we introduce AlphaLLM for the self-improvement of LLMs, which integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop, thereby enhancing the capabilities of LLMs without additional annotations. Drawing inspiration from the success of AlphaGo, AlphaLLM addresses the unique challenges of combining MCTS with LLMs for self-improvement, including data scarcity, the vast search spaces of language tasks, and the subjective nature of feedback in language tasks. AlphaLLM comprises a prompt synthesis component, an efficient MCTS approach tailored for language tasks, and a trio of critic models for precise feedback. Our experimental results in mathematical reasoning tasks demonstrate that AlphaLLM significantly enhances the performance of LLMs without additional annotations, showing the potential for self-improvement in LLMs.

Title: DeepLocalization: Using change point detection for Temporal Action Localization

Authors: Mohammed Shaiqur Rahman, Ibne Farabi Shihab, Lynna Chu, Anuj Sharma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12258
Pdf URL: https://arxiv.org/pdf/2404.12258
Copy Paste: [[2404.12258]] DeepLocalization: Using change point detection for Temporal Action Localization(https://arxiv.org/abs/2404.12258)
Keywords: large language model
Abstract: In this study, we introduce DeepLocalization, an innovative framework devised for the real-time localization of actions tailored explicitly for monitoring driver behavior. Utilizing the power of advanced deep learning methodologies, our objective is to tackle the critical issue of distracted driving, a significant factor contributing to road accidents. Our strategy employs a dual approach: leveraging Graph-Based Change-Point Detection for pinpointing actions in time alongside a Video Large Language Model (Video-LLM) for precisely categorizing activities. Through careful prompt engineering, we customize the Video-LLM to adeptly handle driving activities' nuances, ensuring its classification efficacy even with sparse data. Engineered to be lightweight, our framework is optimized for consumer-grade GPUs, making it widely applicable in practical scenarios. We subjected our method to rigorous testing on the SynDD2 dataset, a complex benchmark for distracted driving behaviors, where it demonstrated commendable performance, achieving 57.5% accuracy in event classification and 51% in event detection. These outcomes underscore the substantial promise of DeepLocalization in accurately identifying diverse driver behaviors and their temporal occurrences, all within the bounds of limited computational resources.

Title: Alleviating Catastrophic Forgetting in Facial Expression Recognition with Emotion-Centered Models

Authors: Israel A. Laurensi, Alceu de Souza Britto Jr., Jean Paul Barddal, Alessandro Lameiras Koerich
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2404.12260
Pdf URL: https://arxiv.org/pdf/2404.12260
Copy Paste: [[2404.12260]] Alleviating Catastrophic Forgetting in Facial Expression Recognition with Emotion-Centered Models(https://arxiv.org/abs/2404.12260)
Keywords: generative
Abstract: Facial expression recognition is a pivotal component in machine learning, facilitating various applications. However, convolutional neural networks (CNNs) are often plagued by catastrophic forgetting, impeding their adaptability. The proposed method, emotion-centered generative replay (ECgr), tackles this challenge by integrating synthetic images from generative adversarial networks. Moreover, ECgr incorporates a quality assurance algorithm to ensure the fidelity of generated images. This dual approach enables CNNs to retain past knowledge while learning new tasks, enhancing their performance in emotion recognition. The experimental results on four diverse facial expression datasets demonstrate that incorporating images generated by our pseudo-rehearsal method enhances training on the targeted dataset and the source dataset while making the CNN retain previously learned knowledge.

Title: Physics-integrated generative modeling using attentive planar normalizing flow based variational autoencoder

Authors: Sheikh Waqas Akhtar
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2404.12267
Pdf URL: https://arxiv.org/pdf/2404.12267
Copy Paste: [[2404.12267]] Physics-integrated generative modeling using attentive planar normalizing flow based variational autoencoder(https://arxiv.org/abs/2404.12267)
Keywords: robust, interpretability, generative
Abstract: Physics-integrated generative modeling is a class of hybrid or grey-box modeling in which we augment the data-driven model with the physics knowledge governing the data distribution. The use of physics knowledge allows the generative model to produce output in a controlled way, so that the output, by construction, complies with the physical laws. It imparts improved generalization ability to extrapolate beyond the training distribution as well as improved interpretability because the model is partly grounded in firm domain knowledge. In this work, we aim to improve the fidelity of reconstruction and robustness to noise in the physics-integrated generative model. To this end, we use a variational autoencoder as a generative model. To improve the reconstruction results of the decoder, we propose to learn the latent posterior distribution of both the physics as well as the trainable data-driven components using planar normalizing flow. A normalizing-flow-based posterior distribution harnesses the inherent dynamical structure of the data distribution, hence the learned model gets closer to the true underlying data distribution. To improve the robustness of the generative model against noise injected in the model, we propose a modification in the encoder part of the normalizing-flow-based VAE. We designed the encoder to incorporate scaled dot-product attention based contextual information in the noisy latent vector, which will mitigate the adverse effect of noise in the latent vector and make the model more robust. We empirically evaluated our models on a human locomotion dataset [33] and the results validate the efficacy of our proposed models in terms of improvement in reconstruction quality as well as robustness against noise injected in the model.

Title: Advancing the Robustness of Large Language Models through Self-Denoised Smoothing

Authors: Jiabao Ji, Bairu Hou, Zhen Zhang, Guanhua Zhang, Wenqi Fan, Qing Li, Yang Zhang, Gaowen Liu, Sijia Liu, Shiyu Chang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.12274
Pdf URL: https://arxiv.org/pdf/2404.12274
Copy Paste: [[2404.12274]] Advancing the Robustness of Large Language Models through Self-Denoised Smoothing(https://arxiv.org/abs/2404.12274)
Keywords: defense, attack, robust, large language model
Abstract: Although large language models (LLMs) have achieved significant success, their vulnerability to adversarial perturbations, including recent jailbreak attacks, has raised considerable concerns. However, the increasing size of these models and their limited access make improving their robustness a challenging task. Among various defense strategies, randomized smoothing has shown great potential for LLMs, as it does not require full access to the model's parameters or fine-tuning via adversarial training. However, randomized smoothing involves adding noise to the input before model prediction, and the final model's robustness largely depends on the model's performance on these noise corrupted data. Its effectiveness is often limited by the model's sub-optimal performance on noisy data. To address this issue, we propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions. We call this procedure self-denoised smoothing. Unlike previous denoised smoothing techniques in computer vision, which require training a separate model to enhance the robustness of LLMs, our method offers significantly better efficiency and flexibility. Our experimental results indicate that our method surpasses existing methods in both empirical and certified robustness in defending against adversarial attacks for both downstream tasks and human alignments (i.e., jailbreak attacks). Our code is publicly available at https://github.com/UCSB-NLP-Chang/SelfDenoise
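
The denoise-then-predict procedure described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation (see their repository for that): word masking is assumed as the noise, and `toy_denoise` and `toy_classify` are hypothetical stand-ins for calls to the same LLM.

```python
import random
from collections import Counter

MASK = "<mask>"

def mask_words(words, mask_rate, rng):
    """Randomly replace a fraction of words with a mask token (the noise)."""
    return [MASK if rng.random() < mask_rate else w for w in words]

def smoothed_predict(text, classify, denoise, n_samples=20, mask_rate=0.3, seed=0):
    """Majority vote over predictions made on denoised copies of noisy inputs."""
    rng = random.Random(seed)
    words = text.split()
    votes = Counter()
    for _ in range(n_samples):
        noisy = mask_words(words, mask_rate, rng)
        cleaned = denoise(noisy)       # step 1: denoise the corrupted input
        votes[classify(cleaned)] += 1  # step 2: predict on the denoised input
    return votes.most_common(1)[0][0]

# Hypothetical toy stand-ins; a real system would prompt the LLM for both steps.
def toy_denoise(words):
    return [w for w in words if w != MASK]  # simply drop masked tokens

def toy_classify(words):
    return "positive" if words.count("good") >= words.count("bad") else "negative"

print(smoothed_predict("good good good plot but bad ending", toy_classify, toy_denoise))
```

The vote over many noisy copies is what gives randomized smoothing its robustness; the denoising step only aims to make each individual prediction more reliable.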

Title: Enhancing Embedding Performance through Large Language Model-based Text Enrichment and Rewriting

Authors: Nicholas Harris, Anand Butani, Syed Hashmy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12283
Pdf URL: https://arxiv.org/pdf/2404.12283
Copy Paste: [[2404.12283]] Enhancing Embedding Performance through Large Language Model-based Text Enrichment and Rewriting(https://arxiv.org/abs/2404.12283)
Keywords: large language model
Abstract: Embedding models are crucial for various natural language processing tasks but can be limited by factors such as limited vocabulary, lack of context, and grammatical errors. This paper proposes a novel approach to improve embedding performance by leveraging large language models (LLMs) to enrich and rewrite input text before the embedding process. By utilizing ChatGPT 3.5 to provide additional context, correct inaccuracies, and incorporate metadata, the proposed method aims to enhance the utility and accuracy of embedding models. The effectiveness of this approach is evaluated on three datasets: Banking77Classification, TwitterSemEval 2015, and Amazon Counter-factual Classification. Results demonstrate significant improvements over the baseline model on the TwitterSemEval 2015 dataset, with the best-performing prompt achieving a score of 85.34 compared to the previous best of 81.52 on the Massive Text Embedding Benchmark (MTEB) Leaderboard. However, performance on the other two datasets was less impressive, highlighting the importance of considering domain-specific characteristics. The findings suggest that LLM-based text enrichment has shown promising results to improve embedding performance, particularly in certain domains. Hence, numerous limitations in the process of embedding can be avoided.

Title: Performance Evaluation of Segment Anything Model with Variational Prompting for Application to Non-Visible Spectrum Imagery

Authors: Yona Falinie A. Gaus, Neelanjan Bhowmik, Brian K. S. Isaac-Medina, Toby P. Breckon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12285
Pdf URL: https://arxiv.org/pdf/2404.12285
Copy Paste: [[2404.12285]] Performance Evaluation of Segment Anything Model with Variational Prompting for Application to Non-Visible Spectrum Imagery(https://arxiv.org/abs/2404.12285)
Keywords: security, segmentation
Abstract: The Segment Anything Model (SAM) is a deep neural network foundational model designed to perform instance segmentation which has gained significant popularity given its zero-shot segmentation ability. SAM operates by generating masks based on various input prompts such as text, bounding boxes, points, or masks, introducing a novel methodology to overcome the constraints posed by dataset-specific scarcity. While SAM is trained on an extensive dataset, comprising ~11M images, it mostly consists of natural photographic images with only very limited images from other modalities. Whilst the rapid progress in visual infrared surveillance and X-ray security screening imaging technologies, driven forward by advances in deep learning, has significantly enhanced the ability to detect, classify and segment objects with high accuracy, it is not evident if the SAM zero-shot capabilities can be transferred to such modalities. This work assesses SAM capabilities in segmenting objects of interest in the X-ray/infrared modalities. Our approach reuses the pre-trained SAM with three different prompts: bounding box, centroid and random points. We present quantitative/qualitative results to showcase the performance on selected datasets. Our results show that SAM can segment objects in the X-ray modality when given a box prompt, but its performance varies for point prompts. Specifically, SAM performs poorly in segmenting slender objects and organic materials, such as plastic bottles. We find that infrared objects are also challenging to segment with point prompts given the low-contrast nature of this modality. This study shows that while SAM demonstrates outstanding zero-shot capabilities with box prompts, its performance ranges from moderate to poor for point prompts, indicating that special consideration on the cross-modal generalisation of SAM is needed when considering use on X-ray/infrared imagery.

Title: Resilience through Scene Context in Visual Referring Expression Generation

Authors: Simeon Junker, Sina Zarrieß
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12289
Pdf URL: https://arxiv.org/pdf/2404.12289
Copy Paste: [[2404.12289]] Resilience through Scene Context in Visual Referring Expression Generation(https://arxiv.org/abs/2404.12289)
Keywords: transformer
Abstract: Scene context is well known to facilitate humans' perception of visible objects. In this paper, we investigate the role of context in Referring Expression Generation (REG) for objects in images, where existing research has often focused on distractor contexts that exert pressure on the generator. We take a new perspective on scene context in REG and hypothesize that contextual information can be conceived of as a resource that makes REG models more resilient and facilitates the generation of object descriptions, and object types in particular. We train and test Transformer-based REG models with target representations that have been artificially obscured with noise to varying degrees. We evaluate how properties of the models' visual context affect their processing and performance. Our results show that even simple scene contexts make models surprisingly resilient to perturbations, to the extent that they can identify referent types even when visual information about the target is completely missing.

Title: Augmenting emotion features in irony detection with Large language modeling

Authors: Yucheng Lin, Yuhan Xia, Yunfei Long
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2404.12291
Pdf URL: https://arxiv.org/pdf/2404.12291
Copy Paste: [[2404.12291]] Augmenting emotion features in irony detection with Large language modeling(https://arxiv.org/abs/2404.12291)
Keywords: large language model
Abstract: This study introduces a novel method for irony detection, applying Large Language Models (LLMs) with prompt-based learning to facilitate emotion-centric text augmentation. Traditional irony detection techniques typically fall short due to their reliance on static linguistic features and predefined knowledge bases, often overlooking the nuanced emotional dimensions integral to irony. In contrast, our methodology augments the detection process by integrating subtle emotional cues, augmented through LLMs, into three benchmark pre-trained NLP models - BERT, T5, and GPT-2 - which are widely recognized as foundational in irony detection. We assessed our method using the SemEval-2018 Task 3 dataset and observed substantial enhancements in irony detection capabilities.

Title: Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair

Authors: Yusuke Sakai, Mana Makinae, Hidetaka Kamigaito, Taro Watanabe
Subjects: cs.CL, cs.AI, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2404.12299
Pdf URL: https://arxiv.org/pdf/2404.12299
Copy Paste: [[2404.12299]] Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair(https://arxiv.org/abs/2404.12299)
Keywords: large language model
Abstract: In Simultaneous Machine Translation (SiMT) systems, training with a simultaneous interpretation (SI) corpus is an effective method for achieving high-quality yet low-latency systems. However, it is very challenging to curate such a corpus due to limitations in the abilities of annotators, and hence, existing SI corpora are limited. Therefore, we propose a method to convert existing speech translation corpora into interpretation-style data, maintaining the original word order and preserving the entire source content using Large Language Models (LLM-SI-Corpus). We demonstrate that fine-tuning SiMT models in text-to-text and speech-to-text settings with the LLM-SI-Corpus reduces latencies while maintaining the same level of quality as the models trained with offline datasets. The LLM-SI-Corpus is available at https://github.com/yusuke1997/LLM-SI-Corpus.

Title: Proactive Software Supply Chain Risk Management Framework (P-SSCRM) Version 1

Authors: Laurie Williams (North Carolina State University), Sammy Migues (Imbricate Security), Jamie Boote (Synopsys), Ben Hutchison (Synopsys)
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2404.12300
Pdf URL: https://arxiv.org/pdf/2404.12300
Copy Paste: [[2404.12300]] Proactive Software Supply Chain Risk Management Framework (P-SSCRM) Version 1(https://arxiv.org/abs/2404.12300)
Keywords: secure
Abstract: The Proactive Software Supply Chain Risk Management Framework (P-SSCRM) described in this document is designed to help you understand and plan a secure software supply chain risk management initiative. P-SSCRM was created through a process of understanding and analyzing real-world data from nine industry-leading software supply chain risk management initiatives as well as through the analysis and unification of ten government and industry documents, frameworks, and standards. Although individual methodologies and standards differ, many initiatives and standards share common ground. P-SSCRM describes this common ground and presents a model for understanding, quantifying, and developing a secure software supply chain risk management program and determining where your organization's existing efforts stand when contrasted with other real-world software supply chain risk management initiatives.

Title: iRAG: An Incremental Retrieval Augmented Generation System for Videos

Authors: Md Adnan Arefeen, Biplob Debnath, Md Yusuf Sarwar Uddin, Srimat Chakradhar
Subjects: cs.CV, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2404.12309
Pdf URL: https://arxiv.org/pdf/2404.12309
Copy Paste: [[2404.12309]] iRAG: An Incremental Retrieval Augmented Generation System for Videos(https://arxiv.org/abs/2404.12309)
Keywords: extraction
Abstract: Retrieval augmented generation (RAG) systems combine the strengths of language generation and information retrieval to power many real-world applications like chatbots. Use of RAG for combined understanding of multimodal data such as text, images and videos is appealing but two critical limitations exist: one-time, upfront capture of all content in large multimodal data as text descriptions entails high processing times, and not all information in the rich multimodal data is typically in the text descriptions. Since the user queries are not known a priori, developing a system for multimodal to text conversion and interactive querying of multimodal data is challenging. To address these limitations, we propose iRAG, which augments RAG with a novel incremental workflow to enable interactive querying of a large corpus of multimodal data. Unlike traditional RAG, iRAG quickly indexes large repositories of multimodal data, and in the incremental workflow, it uses the index to opportunistically extract more details from select portions of the multimodal data to retrieve context relevant to an interactive user query. Such an incremental workflow avoids long multimodal to text conversion times, overcomes information loss issues by doing on-demand query-specific extraction of details in multimodal data, and ensures high quality of responses to interactive user queries that are often not known a priori. To the best of our knowledge, iRAG is the first system to augment RAG with an incremental workflow to support efficient interactive querying of large, real-world multimodal data. Experimental results on real-world long videos demonstrate 23x to 25x faster video to text ingestion, while ensuring that quality of responses to interactive user queries is comparable to responses from a traditional RAG where all video data is converted to text upfront before any querying.

Title: Guided Discrete Diffusion for Electronic Health Record Generation

Authors: Zixiang Chen, Jun Han, Yongqian Li, Yiwen Kou, Eran Halperin, Robert E. Tillman, Quanquan Gu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2404.12314
Pdf URL: https://arxiv.org/pdf/2404.12314
Copy Paste: [[2404.12314]] Guided Discrete Diffusion for Electronic Health Record Generation(https://arxiv.org/abs/2404.12314)
Keywords: privacy, diffusion, generative
Abstract: Electronic health records (EHRs) are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research. Despite wide usability, their sensitive nature raises privacy and confidentiality concerns, which limit potential use cases. To tackle these challenges, we explore the use of generative models to synthesize artificial, yet realistic EHRs. While diffusion-based methods have recently demonstrated state-of-the-art performance in generating other data modalities and overcome the training instability and mode collapse issues that plague previous GAN-based approaches, their applications in EHR generation remain underexplored. The discrete nature of tabular medical code data in EHRs poses challenges for high-quality data generation, especially for continuous diffusion models. To this end, we introduce a novel tabular EHR generation method, EHR-D3PM, which enables both unconditional and conditional generation using the discrete diffusion model. Our experiments demonstrate that EHR-D3PM significantly outperforms existing generative baselines on comprehensive fidelity and utility metrics while maintaining lower membership vulnerability risk. Furthermore, we show EHR-D3PM is effective as a data augmentation method and enhances performance on downstream tasks when combined with real data.

Title: A Perspective on Deep Vision Performance with Standard Image and Video Codecs

Authors: Christoph Reich, Oliver Hahn, Daniel Cremers, Stefan Roth, Biplob Debnath
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2404.12330
Pdf URL: https://arxiv.org/pdf/2404.12330
Copy Paste: [[2404.12330]] A Perspective on Deep Vision Performance with Standard Image and Video Codecs(https://arxiv.org/abs/2404.12330)
Keywords: segmentation
Abstract: Resource-constrained hardware, such as edge devices or cell phones, often relies on cloud servers to provide the required computational resources for inference in deep vision models. However, transferring image and video data from an edge or mobile device to a cloud server requires coding to deal with network constraints. The use of standardized codecs, such as JPEG or H.264, is prevalent and required to ensure interoperability. This paper aims to examine the implications of employing standardized codecs within deep vision pipelines. We find that using JPEG and H.264 coding significantly deteriorates the accuracy across a broad range of vision tasks and models. For instance, strong compression rates reduce semantic segmentation accuracy by more than 80% in mIoU. In contrast to previous findings, our analysis extends beyond image and action classification to localization and dense prediction tasks, thus providing a more comprehensive perspective.

Title: Customizing Text-to-Image Diffusion with Camera Viewpoint Control

Authors: Nupur Kumari, Grace Su, Richard Zhang, Taesung Park, Eli Shechtman, Jun-Yan Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12333
Pdf URL: https://arxiv.org/pdf/2404.12333
Copy Paste: [[2404.12333]] Customizing Text-to-Image Diffusion with Camera Viewpoint Control(https://arxiv.org/abs/2404.12333)
Keywords: diffusion
Abstract: Model customization introduces new concepts to existing text-to-image models, enabling the generation of the new concept in novel contexts. However, such methods lack accurate camera view control w.r.t. the object, and users must resort to prompt engineering (e.g., adding "top-view") to achieve coarse view control. In this work, we introduce a new task -- enabling explicit control of camera viewpoint for model customization. This allows us to modify object properties amongst various background scenes via text prompts, all while incorporating the target camera pose as additional control. This new task presents significant challenges in merging a 3D representation from the multi-view images of the new concept with a general, 2D text-to-image model. To bridge this gap, we propose to condition the 2D diffusion process on rendered, view-dependent features of the new object. During training, we jointly adapt the 2D diffusion modules and 3D feature predictions to reconstruct the object's appearance and geometry while reducing overfitting to the input multi-view images. Our method outperforms existing image editing and model personalization baselines in preserving the custom object's identity while following the input text prompt and the object's camera pose.

Title: Measuring Feature Dependency of Neural Networks by Collapsing Feature Dimensions in the Data Manifold

Authors: Yinzhu Jin, Matthew B. Dwyer, P. Thomas Fletcher
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2404.12341
Pdf URL: https://arxiv.org/pdf/2404.12341
Copy Paste: [[2404.12341]] Measuring Feature Dependency of Neural Networks by Collapsing Feature Dimensions in the Data Manifold(https://arxiv.org/abs/2404.12341)
Keywords: generative, segmentation
Abstract: This paper introduces a new technique to measure the feature dependency of neural network models. The motivation is to better understand a model by querying whether it is using information from human-understandable features, e.g., anatomical shape, volume, or image texture. Our method is based on the principle that if a model is dependent on a feature, then removal of that feature should significantly harm its performance. A targeted feature is "removed" by collapsing the dimension in the data distribution that corresponds to that feature. We perform this by moving data points along the feature dimension to a baseline feature value while staying on the data manifold, as estimated by a deep generative model. Then we observe how the model's performance changes on the modified test data set, with the target feature dimension removed. We test our method on deep neural network models trained on synthetic image data with known ground truth, an Alzheimer's disease prediction task using MRI and hippocampus segmentations from the OASIS-3 dataset, and a cell nuclei classification task using the Lizard dataset.

Title: Large Language Models in Targeted Sentiment Analysis

Authors: Nicolay Rusnachenko, Anton Golubev, Natalia Loukachevitch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2404.12342
Pdf URL: https://arxiv.org/pdf/2404.12342
Copy Paste: [[2404.12342]] Large Language Models in Targeted Sentiment Analysis(https://arxiv.org/abs/2404.12342)
Keywords: transformer, generative, large language model
Abstract: In this paper we investigate the use of decoder-based generative transformers for extracting sentiment towards the named entities in Russian news articles. We study sentiment analysis capabilities of instruction-tuned large language models (LLMs). We consider the dataset of RuSentNE-2023 in our study. The first group of experiments was aimed at the evaluation of zero-shot capabilities of LLMs with closed and open transparencies. The second covers the fine-tuning of Flan-T5 using the "chain-of-thought" (CoT) three-hop reasoning framework (THoR). We found that the results of the zero-shot approaches are similar to the results achieved by baseline fine-tuned encoder-based transformers (BERT-base). Reasoning capabilities of the fine-tuned Flan-T5 models with THoR achieve at least 5% increment with the base-size model compared to the results of the zero-shot experiment. The best results of sentiment analysis on RuSentNE-2023 were achieved by fine-tuned Flan-T5-xl, which surpassed the results of previous state-of-the-art transformer-based classifiers. Our CoT application framework is publicly available: https://github.com/nicolay-r/Reasoning-for-Sentiment-Analysis-Framework

Title: AniClipart: Clipart Animation with Text-to-Video Priors

Authors: Ronghuan Wu, Wanchao Su, Kede Ma, Jing Liao
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2404.12347
Pdf URL: https://arxiv.org/pdf/2404.12347
Copy Paste: [[2404.12347]] AniClipart: Clipart Animation with Text-to-Video Priors(https://arxiv.org/abs/2404.12347)
Keywords: diffusion
Abstract: Clipart, a pre-made graphic art form, offers a convenient and efficient way of illustrating visual content. Traditional workflows to convert static clipart images into motion sequences are laborious and time-consuming, involving numerous intricate steps like rigging, key animation and in-betweening. Recent advancements in text-to-video generation hold great potential in resolving this problem. Nevertheless, direct application of text-to-video generation models often struggles to retain the visual identity of clipart images or generate cartoon-style motions, resulting in unsatisfactory animation outcomes. In this paper, we introduce AniClipart, a system that transforms static clipart images into high-quality motion sequences guided by text-to-video priors. To generate cartoon-style and smooth motion, we first define Bézier curves over keypoints of the clipart image as a form of motion regularization. We then align the motion trajectories of the keypoints with the provided text prompt by optimizing the Video Score Distillation Sampling (VSDS) loss, which encodes adequate knowledge of natural motion within a pretrained text-to-video diffusion model. With a differentiable As-Rigid-As-Possible shape deformation algorithm, our method can be end-to-end optimized while maintaining deformation rigidity. Experimental results show that the proposed AniClipart consistently outperforms existing image-to-video generation models, in terms of text-video alignment, visual identity preservation, and motion consistency. Furthermore, we showcase the versatility of AniClipart by adapting it to generate a broader array of animation formats, such as layered animation, which allows topological changes.

Title: Point-In-Context: Understanding Point Cloud via In-Context Learning

Authors: Mengyuan Liu, Zhongbin Fang, Xia Li, Joachim M. Buhmann, Xiangtai Li, Chen Change Loy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12352
Pdf URL: https://arxiv.org/pdf/2404.12352
Copy Paste: [[2404.12352]] Point-In-Context: Understanding Point Cloud via In-Context Learning(https://arxiv.org/abs/2404.12352)
Keywords: segmentation
Abstract: With the emergence of large-scale models trained on diverse datasets, in-context learning has emerged as a promising paradigm for multitasking, notably in natural language processing and image processing. However, its application in 3D point cloud tasks remains largely unexplored. In this work, we introduce Point-In-Context (PIC), a novel framework for 3D point cloud understanding via in-context learning. We address the technical challenge of effectively extending masked point modeling to 3D point clouds by introducing a Joint Sampling module and proposing a vanilla version of PIC called Point-In-Context-Generalist (PIC-G). PIC-G is designed as a generalist model for various 3D point cloud tasks, with inputs and outputs modeled as coordinates. In this paradigm, the challenging segmentation task is achieved by assigning label points with XYZ coordinates for each category; the final prediction is then chosen based on the label point closest to the predictions. To overcome the limitation of the fixed label-coordinate assignment, which generalizes poorly to novel classes, we propose two novel training strategies, In-Context Labeling and In-Context Enhancing, forming an extended version of PIC named Point-In-Context-Segmenter (PIC-S), targeting improving dynamic context labeling and model training. By utilizing dynamic in-context labels and extra in-context pairs, PIC-S achieves enhanced performance and generalization capability in and across part segmentation datasets. PIC is a general framework so that other tasks or datasets can be seamlessly introduced into our PIC through a unified data format. We conduct extensive experiments to validate the versatility and adaptability of our proposed methods in handling a wide range of tasks and segmenting multi-datasets. Our PIC-S is capable of generalizing to unseen datasets and performing novel part segmentation by customizing prompts.
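
The label-point decoding step mentioned above (each category gets an XYZ label point, and a prediction is assigned to the closest one) can be illustrated with a short sketch. The category names, coordinates, and the `nearest_label` helper are hypothetical and only show the nearest-point rule, not the paper's model.

```python
import math

def nearest_label(pred_xyz, label_points):
    """Return the category whose label point is closest to a predicted XYZ."""
    return min(label_points, key=lambda name: math.dist(pred_xyz, label_points[name]))

# Hypothetical fixed label-coordinate assignment, one point per part category.
label_points = {
    "wing": (1.0, 0.0, 0.0),
    "body": (0.0, 1.0, 0.0),
    "tail": (0.0, 0.0, 1.0),
}

# Hypothetical model outputs: one predicted coordinate per input point.
predictions = [(0.9, 0.1, 0.0), (0.2, 0.7, 0.1), (0.1, 0.2, 0.8)]
labels = [nearest_label(p, label_points) for p in predictions]
print(labels)
```

Because the mapping from categories to coordinates is fixed in this scheme, a class unseen at training time has no label point, which is the limitation the dynamic in-context labels are meant to address.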

Title: V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

Authors: Hang Hua, Yunlong Tang, Chenliang Xu, Jiebo Luo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2404.12353
Pdf URL: https://arxiv.org/pdf/2404.12353
Copy Paste: [[2404.12353]] V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning(https://arxiv.org/abs/2404.12353)
Keywords: large language model
Abstract: Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Despite the existence of various video summarization datasets, a notable limitation is their limited amount of source videos, which hampers the effective fine-tuning of advanced large vision-language models (VLMs). Additionally, most existing datasets are created for video-to-video summarization, overlooking the contemporary need for multimodal video content summarization. Recent efforts have been made to expand from unimodal to multimodal video summarization, categorizing the task into three sub-tasks based on the summary's modality: video-to-video (V2V), video-to-text (V2T), and a combination of video and text summarization (V2VT). However, the textual summaries in previous multimodal datasets are inadequate. To address these issues, we introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube, with lengths ranging from 40 to 940 seconds and an average summarization ratio of 16.39%. Each video summary in Instruct-V2Xum is paired with a textual summary that references specific frame indexes, facilitating the generation of aligned video and textual summaries. In addition, we propose a new video summarization framework named V2Xum-LLM. V2Xum-LLM, specifically V2Xum-LLaMA in this study, is the first framework that unifies different video summarization tasks into one large language model's (LLM) text decoder and achieves task-controllable video summarization with temporal prompts and task instructions. Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for V2V and V2VT summarization tasks.

Title: Towards a Foundation Model for Partial Differential Equation: Multi-Operator Learning and Extrapolation

Authors: Jingmin Sun, Yuxuan Liu, Zecheng Zhang, Hayden Schaeffer
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2404.12355
Pdf URL: https://arxiv.org/pdf/2404.12355
Copy Paste: [[2404.12355]] Towards a Foundation Model for Partial Differential Equation: Multi-Operator Learning and Extrapolation(https://arxiv.org/abs/2404.12355)
Keywords: robust, large language model
Abstract: Foundation models, such as large language models, have demonstrated success in addressing various language and image processing tasks. In this work, we introduce a multi-modal foundation model for scientific problems, named PROSE-PDE. Our model, designed for bi-modality to bi-modality learning, is a multi-operator learning approach which can predict future states of spatiotemporal systems while concurrently learning the underlying governing equations of the physical system. Specifically, we focus on multi-operator learning by training distinct one-dimensional time-dependent nonlinear constant coefficient partial differential equations, with potential applications in many fields, including physics, geology, and biology. More importantly, we provide three extrapolation studies to demonstrate that PROSE-PDE can generalize physical features through the robust training of multiple operators and that the proposed model can extrapolate to predict PDE solutions whose models or data were unseen during the training. Furthermore, we show through systematic numerical experiments that the utilization of the symbolic modality in our model effectively resolves the well-posedness problems with training multiple operators and thus enhances our model's predictive capabilities.

Title: From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function

Authors: Rafael Rafailov, Joey Hejna, Ryan Park, Chelsea Finn
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2404.12358
Pdf URL: https://arxiv.org/pdf/2404.12358
Copy Paste: [[2404.12358]] From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function(https://arxiv.org/abs/2404.12358)
Keywords: generative
Abstract: Reinforcement Learning From Human Feedback (RLHF) has been critical to the success of the latest generation of generative AI models. In response to the complex nature of the classical RLHF pipeline, direct alignment algorithms such as Direct Preference Optimization (DPO) have emerged as an alternative approach. Although DPO solves the same objective as the standard RLHF setup, there is a mismatch between the two approaches. Standard RLHF deploys reinforcement learning in a specific token-level MDP, while DPO is derived as a bandit problem in which the whole response of the model is treated as a single arm. In this work we rectify this difference. First, we theoretically show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation. Using our theoretical results, we provide three concrete empirical insights. First, we show that because of its token-level interpretation, DPO is able to perform some type of credit assignment. Next, we prove that under the token-level formulation, classical search-based algorithms, such as MCTS, which have recently been applied to the language generation space, are equivalent to likelihood-based search on a DPO policy. Empirically we show that a simple beam search yields meaningful improvement over the base DPO policy. Finally, we show how the choice of reference policy causes implicit rewards to decline during training. We conclude by discussing applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.
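
To make the bandit-versus-token-level contrast above concrete, here is a minimal sketch of the standard DPO objective, with the sequence-level log-ratio written as a sum of per-token terms. It illustrates the well-known DPO loss under assumed inputs (lists of per-token log-probabilities) and is not the paper's derivation; all function names and numbers are hypothetical.

```python
import math

def token_rewards(policy_logps, ref_logps, beta=0.1):
    """Per-token implicit reward: beta * (log pi - log pi_ref) for each token."""
    return [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]

def dpo_loss(chosen_pol, chosen_ref, rejected_pol, rejected_ref, beta=0.1):
    """Standard DPO loss: -log sigmoid(reward(chosen) - reward(rejected))."""
    margin = (sum(token_rewards(chosen_pol, chosen_ref, beta))
              - sum(token_rewards(rejected_pol, rejected_ref, beta)))
    # -log sigmoid(m) == log(1 + exp(-m)); adequate for the small margins used here
    return math.log1p(math.exp(-margin))

# Hypothetical per-token log-probabilities under the policy and the reference model.
chosen_pol, chosen_ref = [-0.5, -1.0], [-1.0, -2.0]
rejected_pol, rejected_ref = [-0.5], [-0.5]

print(token_rewards(chosen_pol, chosen_ref))
print(dpo_loss(chosen_pol, chosen_ref, rejected_pol, rejected_ref))
```

The per-token terms are what a token-level reading can inspect for credit assignment, while the loss itself only ever sees their sum over the whole response.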

Title: Inverse Neural Rendering for Explainable Multi-Object Tracking

Authors: Julian Ost, Tanushree Banerjee, Mario Bijelic, Felix Heide
Subjects: cs.CV, cs.GR, cs.RO
Abstract URL: https://arxiv.org/abs/2404.12359
Pdf URL: https://arxiv.org/pdf/2404.12359
Copy Paste: [[2404.12359]] Inverse Neural Rendering for Explainable Multi-Object Tracking(https://arxiv.org/abs/2404.12359)
Keywords: generative
Abstract: Today, most methods for image understanding tasks rely on feed-forward neural networks. While this approach has allowed for empirical accuracy, efficiency, and task adaptation via fine-tuning, it also comes with fundamental disadvantages. Existing networks often struggle to generalize across different datasets, even on the same task. By design, these networks ultimately reason about high-dimensional scene features, which are challenging to analyze. This is true especially when attempting to predict 3D information based on 2D images. We propose to recast 3D multi-object tracking from RGB cameras as an \emph{Inverse Rendering (IR)} problem, by optimizing via a differentiable rendering pipeline over the latent space of pre-trained 3D object representations and retrieving the latents that best represent object instances in a given input image. To this end, we optimize an image loss over generative latent spaces that inherently disentangle shape and appearance properties. Our method not only offers an alternate take on tracking but also enables examining the generated objects, reasoning about failure situations, and resolving ambiguous cases. We validate the generalization and scaling capabilities of our method by learning the generative prior exclusively from synthetic data and assessing camera-based 3D tracking on the nuScenes and Waymo datasets. Both these datasets are completely unseen to our method and do not require fine-tuning. Videos and code are available at https://light.princeton.edu/inverse-rendering-tracking/.

Title: Transformer tricks: Removing weights for skipless transformers

Authors: Nils Graef
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2404.12362
Pdf URL: https://arxiv.org/pdf/2404.12362
Copy Paste: [[2404.12362]] Transformer tricks: Removing weights for skipless transformers(https://arxiv.org/abs/2404.12362)
Keywords: transformer
Abstract: He and Hofmann (arXiv:2311.01906) detailed a skipless transformer without the V and P (post-attention projection) linear layers, which reduces the total number of weights. However, this scheme is applicable only to MHA (multi-head attention), not to MQA (multi-query attention) or GQA (grouped-query attention). The latter schemes are used by many popular LLMs such as Llama 2, Mistral, Mixtral, PaLM, and Gemma. Therefore, this micro-paper proposes mathematically equivalent versions that are suitable for MQA and GQA. For example, removing Q and P from a skipless version of Mistral-7B would remove 15% of its weights (and thus reduce its compute and memory complexity). See arXiv:2402.13388 and https://github.com/OpenMachine-ai/transformer-tricks for code and more transformer tricks.
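The 15% figure can be sanity-checked with a rough weight count. The sketch below assumes the publicly documented Mistral-7B shape (hidden size 4096, 32 layers, 32 query heads, 8 key/value heads, head dimension 128, FFN width 14336, vocabulary 32000, untied input and output embeddings). None of these values appear in the abstract, and the small normalization weights are ignored.

```python
# Assumed Mistral-7B configuration (not stated in the abstract).
D_MODEL = 4096
N_LAYERS = 32
N_HEADS = 32      # query heads
N_KV_HEADS = 8    # key/value heads shared across query heads (GQA)
HEAD_DIM = 128
D_FFN = 14336
VOCAB = 32000

def layer_weights():
    """Weight counts of the linear layers in one transformer block."""
    return {
        "Q": D_MODEL * N_HEADS * HEAD_DIM,
        "K": D_MODEL * N_KV_HEADS * HEAD_DIM,
        "V": D_MODEL * N_KV_HEADS * HEAD_DIM,
        "P": N_HEADS * HEAD_DIM * D_MODEL,   # post-attention projection
        "FFN": 3 * D_MODEL * D_FFN,          # gate, up, and down projections
    }

per_layer = layer_weights()
# Input embedding plus output head; norm weights are omitted as negligible.
total = N_LAYERS * sum(per_layer.values()) + 2 * VOCAB * D_MODEL
removed = N_LAYERS * (per_layer["Q"] + per_layer["P"])
fraction = removed / total

print(f"total weights:   {total:,}")
print(f"removed (Q + P): {removed:,}")
print(f"fraction:        {fraction:.1%}")
```

Under these assumptions Q and P together account for a little under 15% of the weights, in line with the abstract's rounded figure. With GQA the K and V projections are four times smaller than Q, so removing Q and P saves considerably more than removing V and P would.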

Title: When LLMs are Unfit Use FastFit: Fast and Effective Text Classification with Many Classes

Authors: Asaf Yehudai, Elron Bendel
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2404.12365
Pdf URL: https://arxiv.org/pdf/2404.12365
Copy Paste: [[2404.12365]] When LLMs are Unfit Use FastFit: Fast and Effective Text Classification with Many Classes(https://arxiv.org/abs/2404.12365)
Keywords: transformer, large language model
Abstract: We present FastFit, a method and a Python package designed to provide fast and accurate few-shot classification, especially for scenarios with many semantically similar classes. FastFit utilizes a novel approach integrating batch contrastive learning and token-level similarity score. Compared to existing few-shot learning packages, such as SetFit, Transformers, or few-shot prompting of large language models via API calls, FastFit significantly improves multiclass classification performance in speed and accuracy across FewMany, our newly curated English benchmark, and Multilingual datasets. FastFit demonstrates a 3-20x improvement in training speed, completing training in just a few seconds. The FastFit package is now available on GitHub and PyPI, presenting a user-friendly solution for NLP practitioners.

Title: Gradient-Regularized Out-of-Distribution Detection

Authors: Sina Sharifi, Taha Entesari, Bardia Safaei, Vishal M. Patel, Mahyar Fazlyab
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2404.12368
Pdf URL: https://arxiv.org/pdf/2404.12368
Copy Paste: [[2404.12368]] Gradient-Regularized Out-of-Distribution Detection(https://arxiv.org/abs/2404.12368)
Keywords: robust
Abstract: One of the challenges for neural networks in real-life applications is the overconfident errors these models make when the data is not from the original training distribution. Addressing this issue is known as Out-of-Distribution (OOD) detection. Many state-of-the-art OOD methods employ an auxiliary dataset as a surrogate for OOD data during training to achieve improved performance. However, these methods fail to fully exploit the local information embedded in the auxiliary dataset. In this work, we propose the idea of leveraging the information embedded in the gradient of the loss function during training to enable the network to not only learn a desired OOD score for each sample but also to exhibit similar behavior in a local neighborhood around each sample. We also develop a novel energy-based sampling method to allow the network to be exposed to more informative OOD samples during the training phase. This is especially important when the auxiliary dataset is large. We demonstrate the effectiveness of our method through extensive experiments on several OOD benchmarks, improving the existing state-of-the-art FPR95 by 4% on our ImageNet experiment. We further provide a theoretical analysis through the lens of certified robustness and Lipschitz analysis to showcase the theoretical foundation of our work. We will publicly release our code after the review process.

Title: KDk: A Defense Mechanism Against Label Inference Attacks in Vertical Federated Learning

Authors: Marco Arazzi, Serena Nicolazzo, Antonino Nocera
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2404.12369
Pdf URL: https://arxiv.org/pdf/2404.12369
Copy Paste: [[2404.12369]] KDk: A Defense Mechanism Against Label Inference Attacks in Vertical Federated Learning(https://arxiv.org/abs/2404.12369)
Keywords: defense, attack, federate
Abstract: Vertical Federated Learning (VFL) is a category of Federated Learning in which models are trained collaboratively among parties with vertically partitioned data. Typically, in a VFL scenario, the labels of the samples are kept private from all the parties except for the aggregating server, which is the label owner. Nevertheless, recent works discovered that by exploiting gradient information returned by the server to bottom models, with the knowledge of only a small set of auxiliary labels on a very limited subset of training data points, an adversary can infer the private labels. These attacks are known as label inference attacks in VFL. In our work, we propose a novel framework called KDk, which combines Knowledge Distillation and k-anonymity to provide a defense mechanism against potential label inference attacks in a VFL scenario. Through an exhaustive experimental campaign we demonstrate that by applying our approach, the performance of the analyzed label inference attacks decreases consistently, even by more than 60%, while the accuracy of the whole VFL remains almost unaltered.

Title: MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale

Authors: Xiaotang Gai, Chenyi Zhou, Jiaxiang Liu, Yang Feng, Jian Wu, Zuozhu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12372
Pdf URL: https://arxiv.org/pdf/2404.12372
Copy Paste: [[2404.12372]] MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale(https://arxiv.org/abs/2404.12372)
Keywords: interpretability, generative, large language model
Abstract: Medical Visual Question Answering (MedVQA), which offers language responses to image-based medical inquiries, represents a challenging task and significant advancement in healthcare. It assists medical experts to swiftly interpret medical images, thereby enabling faster and more accurate diagnoses. However, the model interpretability and transparency of existing MedVQA solutions are often limited, posing challenges in understanding their decision-making processes. To address this issue, we devise a semi-automated annotation process to streamline data preparation and build new benchmark MedVQA datasets R-RAD and R-SLAKE. The R-RAD and R-SLAKE datasets provide intermediate medical decision-making rationales generated by multimodal large language models and human annotations for question-answering pairs in existing MedVQA datasets, i.e., VQA-RAD and SLAKE. Moreover, we design a novel framework which finetunes lightweight pretrained generative models by incorporating medical decision-making rationales into the training process. The framework includes three distinct strategies to generate decision outcomes and corresponding rationales, thereby clearly showcasing the medical decision-making process during reasoning. Extensive experiments demonstrate that our method can achieve an accuracy of 83.5% on R-RAD and 86.3% on R-SLAKE, significantly outperforming existing state-of-the-art baselines. Dataset and code will be released.

Title: 6Img-to-3D: Few-Image Large-Scale Outdoor Driving Scene Reconstruction

Authors: Théo Gieruc, Marius Kästingschäfer, Sebastian Bernhard, Mathieu Salzmann
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2404.12378
Pdf URL: https://arxiv.org/pdf/2404.12378
Copy Paste: [[2404.12378]] 6Img-to-3D: Few-Image Large-Scale Outdoor Driving Scene Reconstruction(https://arxiv.org/abs/2404.12378)
Keywords: transformer
Abstract: Current 3D reconstruction techniques struggle to infer unbounded scenes from a few images faithfully. Specifically, existing methods have high computational demands, require detailed pose information, and cannot reconstruct occluded regions reliably. We introduce 6Img-to-3D, an efficient, scalable transformer-based encoder-renderer method for single-shot image to 3D reconstruction. Our method outputs a 3D-consistent parameterized triplane from only six outward-facing input images for large-scale, unbounded outdoor driving scenarios. We take a step towards resolving existing shortcomings by combining contracted custom cross- and self-attention mechanisms for triplane parameterization, differentiable volume rendering, scene contraction, and image feature projection. We showcase that six surround-view vehicle images from a single timestamp without global pose information are enough to reconstruct 360$^{\circ}$ scenes during inference time, taking 395 ms. Our method allows, for example, rendering third-person images and bird's-eye views. Our code is available at https://github.com/continental/6Img-to-3D, and more examples can be found on our website at https://6Img-to-3D.GitHub.io/.

Title: Lazy Diffusion Transformer for Interactive Image Editing

Authors: Yotam Nitzan, Zongze Wu, Richard Zhang, Eli Shechtman, Daniel Cohen-Or, Taesung Park, Michaël Gharbi
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2404.12382
Pdf URL: https://arxiv.org/pdf/2404.12382
Copy Paste: [[2404.12382]] Lazy Diffusion Transformer for Interactive Image Editing(https://arxiv.org/abs/2404.12382)
Keywords: diffusion, transformer
Abstract: We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently. Our approach targets interactive image editing applications in which, starting from a blank canvas or an image, a user specifies a sequence of localized image modifications using binary masks and text prompts. Our generator operates in two phases. First, a context encoder processes the current canvas and user mask to produce a compact global context tailored to the region to generate. Second, conditioned on this context, a diffusion-based transformer decoder synthesizes the masked pixels in a "lazy" fashion, i.e., it only generates the masked region. This contrasts with previous works that either regenerate the full canvas, wasting time and computation, or confine processing to a tight rectangular crop around the mask, ignoring the global image context altogether. Our decoder's runtime scales with the mask size, which is typically small, while our encoder introduces negligible overhead. We demonstrate that our approach is competitive with state-of-the-art inpainting methods in terms of quality and fidelity while providing a 10x speedup for typical user interactions, where the editing mask represents 10% of the image.

Title: G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis

Authors: Yufei Ye, Abhinav Gupta, Kris Kitani, Shubham Tulsiani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12383
Pdf URL: https://arxiv.org/pdf/2404.12383
Copy Paste: [[2404.12383]] G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis(https://arxiv.org/abs/2404.12383)
Keywords: diffusion, generative
Abstract: We propose G-HOP, a denoising diffusion based generative prior for hand-object interactions that allows modeling both the 3D object and a human hand, conditioned on the object category. To learn a 3D spatial diffusion model that can capture this joint distribution, we represent the human hand via a skeletal distance field to obtain a representation aligned with the (latent) signed distance field for the object. We show that this hand-object prior can then serve as generic guidance to facilitate other tasks like reconstruction from interaction clips and human grasp synthesis. We believe that our model, trained by aggregating seven diverse real-world interaction datasets spanning 155 categories, represents a first approach that allows jointly generating both hand and object. Our empirical evaluations demonstrate the benefit of this joint prior in video-based reconstruction and human grasp synthesis, outperforming current task-specific baselines. Project website: https://judyye.github.io/ghop-www

Title: MeshLRM: Large Reconstruction Model for High-Quality Mesh

Authors: Xinyue Wei, Kai Zhang, Sai Bi, Hao Tan, Fujun Luan, Valentin Deschaintre, Kalyan Sunkavalli, Hao Su, Zexiang Xu
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2404.12385
Pdf URL: https://arxiv.org/pdf/2404.12385
Copy Paste: [[2404.12385]] MeshLRM: Large Reconstruction Model for High-Quality Mesh(https://arxiv.org/abs/2404.12385)
Keywords: extraction
Abstract: We propose MeshLRM, a novel LRM-based approach that can reconstruct a high-quality mesh from merely four input images in less than one second. Different from previous large reconstruction models (LRMs) that focus on NeRF-based reconstruction, MeshLRM incorporates differentiable mesh extraction and rendering within the LRM framework. This allows for end-to-end mesh reconstruction by fine-tuning a pre-trained NeRF LRM with mesh rendering. Moreover, we improve the LRM architecture by simplifying several complex designs in previous LRMs. MeshLRM's NeRF initialization is sequentially trained with low- and high-resolution images; this new LRM training strategy enables significantly faster convergence and thereby leads to better quality with less compute. Our approach achieves state-of-the-art mesh reconstruction from sparse-view inputs and also allows for many downstream applications, including text-to-3D and single-image-to-3D generation. Project page: https://sarahweiii.github.io/meshlrm/

Title: SOHES: Self-supervised Open-world Hierarchical Entity Segmentation

Authors: Shengcao Cao, Jiuxiang Gu, Jason Kuen, Hao Tan, Ruiyi Zhang, Handong Zhao, Ani Nenkova, Liang-Yan Gui, Tong Sun, Yu-Xiong Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2404.12386
Pdf URL: https://arxiv.org/pdf/2404.12386
Copy Paste: [[2404.12386]] SOHES: Self-supervised Open-world Hierarchical Entity Segmentation(https://arxiv.org/abs/2404.12386)
Keywords: segmentation
Abstract: Open-world entity segmentation, as an emerging computer vision task, aims at segmenting entities in images without being restricted by pre-defined classes, offering impressive generalization capabilities on unseen images and concepts. Despite its promise, existing entity segmentation methods like Segment Anything Model (SAM) rely heavily on costly expert annotators. This work presents Self-supervised Open-world Hierarchical Entity Segmentation (SOHES), a novel approach that eliminates the need for human annotations. SOHES operates in three phases: self-exploration, self-instruction, and self-correction. Given a pre-trained self-supervised representation, we produce abundant high-quality pseudo-labels through visual feature clustering. Then, we train a segmentation model on the pseudo-labels, and rectify the noise in the pseudo-labels via a teacher-student mutual-learning procedure. Beyond segmenting entities, SOHES also captures their constituent parts, providing a hierarchical understanding of visual entities. Using raw images as the sole training data, our method achieves unprecedented performance in self-supervised open-world segmentation, marking a significant milestone towards high-quality open-world entity segmentation in the absence of human-annotated masks. Project page: https://SOHES.github.io.

Title: VideoGigaGAN: Towards Detail-rich Video Super-Resolution

Authors: Yiran Xu, Taesung Park, Richard Zhang, Yang Zhou, Eli Shechtman, Feng Liu, Jia-Bin Huang, Difan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12388
Pdf URL: https://arxiv.org/pdf/2404.12388
Copy Paste: [[2404.12388]] VideoGigaGAN: Towards Detail-rich Video Super-Resolution(https://arxiv.org/abs/2404.12388)
Keywords: generative
Abstract: Video super-resolution (VSR) approaches have shown impressive temporal consistency in upsampled videos. However, these approaches tend to generate blurrier results than their image counterparts as they are limited in their generative capability. This raises a fundamental question: can we extend the success of a generative image upsampler to the VSR task while preserving the temporal consistency? We introduce VideoGigaGAN, a new generative VSR model that can produce videos with high-frequency details and temporal consistency. VideoGigaGAN builds upon a large-scale image upsampler -- GigaGAN. Simply inflating GigaGAN to a video model by adding temporal modules produces severe temporal flickering. We identify several key issues and propose techniques that significantly improve the temporal consistency of upsampled videos. Our experiments show that, unlike previous VSR methods, VideoGigaGAN generates temporally consistent videos with more fine-grained appearance details. We validate the effectiveness of VideoGigaGAN by comparing it with state-of-the-art VSR models on public datasets and showcasing video results with $8\times$ super-resolution.

Title: Moving Object Segmentation: All You Need Is SAM (and Flow)

Authors: Junyu Xie, Charig Yang, Weidi Xie, Andrew Zisserman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2404.12389
Pdf URL: https://arxiv.org/pdf/2404.12389
Copy Paste: [[2404.12389]] Moving Object Segmentation: All You Need Is SAM (and Flow)(https://arxiv.org/abs/2404.12389)
Keywords: segmentation
Abstract: The objective of this paper is motion segmentation -- discovering and segmenting the moving objects in a video. This is a much studied area with numerous careful, and sometimes complex, approaches and training schemes including: self-supervised learning, learning from synthetic datasets, object-centric representations, amodal representations, and many more. Our interest in this paper is to determine if the Segment Anything model (SAM) can contribute to this task. We investigate two models for combining SAM with optical flow that harness the segmentation power of SAM with the ability of flow to discover and group moving objects. In the first model, we adapt SAM to take optical flow, rather than RGB, as an input. In the second, SAM takes RGB as an input, and flow is used as a segmentation prompt. These surprisingly simple methods, without any further modifications, outperform all previous approaches by a considerable margin in both single and multi-object benchmarks. We also extend these frame-level segmentations to sequence-level segmentations that maintain object identity. Again, this simple model outperforms previous methods on multiple video object segmentation benchmarks.

Title: BLINK: Multimodal Large Language Models Can See but Not Perceive

Authors: Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, Ranjay Krishna
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2404.12390
Pdf URL: https://arxiv.org/pdf/2404.12390
Copy Paste: [[2404.12390]] BLINK: Multimodal Large Language Models Can See but Not Perceive(https://arxiv.org/abs/2404.12390)
Keywords: large language model
Abstract: We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks pose significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans get 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvements. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception.
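The reported accuracies and margins imply a random-guessing baseline that the abstract does not state directly. The short check below recovers it from the figures quoted above; the variable names are illustrative, and the numbers are taken only from the abstract.

```python
# (accuracy %, margin over random guessing %) as quoted in the abstract.
reported = {
    "GPT-4V": (51.26, 13.17),
    "Gemini": (45.72, 7.63),
}
HUMAN_ACCURACY = 95.70

# Implied random-guessing accuracy: model accuracy minus its reported margin.
baselines = {name: round(acc - gain, 2) for name, (acc, gain) in reported.items()}

# Remaining gap between each model and average human accuracy.
human_gaps = {name: round(HUMAN_ACCURACY - acc, 2) for name, (acc, _) in reported.items()}

for name in reported:
    print(f"{name}: implied random baseline {baselines[name]}%, "
          f"gap to humans {human_gaps[name]} points")
```

Both models imply the same baseline, so the two quoted margins are mutually consistent, and each model's gap to human accuracy is several times larger than its margin over chance.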