Copy Paste: [[2410.14690]] Rethinking VLMs and LLMs for Image Classification(https://arxiv.org/abs/2410.14690)
Keywords: large language model
Abstract: Visual Language Models (VLMs) are now increasingly being merged with Large Language Models (LLMs) to enable new capabilities, particularly in terms of improved interactivity and open-ended responsiveness. While these are remarkable capabilities, the contribution of LLMs to enhancing the longstanding key problem of classifying an image among a set of choices remains unclear. Through extensive experiments involving seven models, ten visual understanding datasets, and multiple prompt variations per dataset, we find that, for object and scene recognition, VLMs that do not leverage LLMs can achieve better performance than VLMs that do. Yet at the same time, leveraging LLMs can improve performance on tasks requiring reasoning and outside knowledge. In response to these challenges, we propose a pragmatic solution: a lightweight fix involving a relatively small LLM that efficiently routes visual tasks to the most suitable model for the task. The LLM router undergoes training using a dataset constructed from more than 2.5 million examples of pairs of visual task and model accuracy. Our results reveal that this lightweight fix surpasses or matches the accuracy of state-of-the-art alternatives, including GPT-4V and HuggingGPT, while improving cost-effectiveness.
Title: Deep Domain Isolation and Sample Clustered Federated Learning for Semantic Segmentation
Copy Paste: [[2410.14693]] Deep Domain Isolation and Sample Clustered Federated Learning for Semantic Segmentation(https://arxiv.org/abs/2410.14693)
Keywords: federate, segmentation
Abstract: Empirical studies show that federated learning exhibits convergence issues in Non Independent and Identically Distributed (IID) setups. However, these studies only focus on label distribution shifts, or concept shifts (e.g. ambiguous tasks). In this paper, we explore for the first time the effect of covariate shifts between participants' data in 2D segmentation tasks, showing an impact way less serious than label shifts but still present on convergence. Moreover, current Personalized (PFL) and Clustered (CFL) Federated Learning methods intrinsically assume the homogeneity of the dataset of each participant and its consistency with future test samples by operating at the client level. We introduce a more general and realistic framework where each participant owns a mixture of multiple underlying feature domain distributions. To diagnose such pathological feature distributions affecting a model being trained in a federated fashion, we develop Deep Domain Isolation (DDI) to isolate image domains directly in the gradient space of the model. A federated Gaussian Mixture Model is fit to the sample gradients of each class, while the results are combined with spectral clustering on the server side to isolate decentralized sample-level domains. We leverage this clustering algorithm through a Sample Clustered Federated Learning (SCFL) framework, performing standard federated learning of several independent models, one for each decentralized image domain. Finally, we train a classifier enabling to associate a test sample to its corresponding domain cluster at inference time, offering a final set of models that are agnostic to any assumptions on the test distribution of each participant. We validate our approach on a toy segmentation dataset as well as different partitionings of a combination of Cityscapes and GTA5 datasets using an EfficientVIT-B0 model, showing a significant performance gain compared to other approaches. Our code is available at this https URL .
Title: Optimizing Parking Space Classification: Distilling Ensembles into Lightweight Classifiers
Authors: Paulo Luza Alves, André Hochuli, Luiz Eduardo de Oliveira, Paulo Lisboa de Almeida
Copy Paste: [[2410.14705]] Optimizing Parking Space Classification: Distilling Ensembles into Lightweight Classifiers(https://arxiv.org/abs/2410.14705)
Keywords: robust
Abstract: When deploying large-scale machine learning models for smart city applications, such as image-based parking lot monitoring, data often must be sent to a central server to perform classification tasks. This is challenging for the city's infrastructure, where image-based applications require transmitting large volumes of data, necessitating complex network and hardware infrastructures to process the data. To address this issue in image-based parking space classification, we propose creating a robust ensemble of classifiers to serve as Teacher models. These Teacher models are distilled into lightweight and specialized Student models that can be deployed directly on edge devices. The knowledge is distilled to the Student models through pseudo-labeled samples generated by the Teacher model, which are utilized to fine-tune the Student models on the target scenario. Our results show that the Student models, with 26 times fewer parameters than the Teacher models, achieved an average accuracy of 96.6% on the target test datasets, surpassing the Teacher models, which attained an average accuracy of 95.3%.
Title: FACMIC: Federated Adaptative CLIP Model for Medical Image Classification
Authors: Yihang Wu, Christian Desrosiers, Ahmad Chaddad
Copy Paste: [[2410.14707]] FACMIC: Federated Adaptative CLIP Model for Medical Image Classification(https://arxiv.org/abs/2410.14707)
Keywords: privacy, federate
Abstract: Federated learning (FL) has emerged as a promising approach to medical image analysis that allows deep model training using decentralized data while ensuring data privacy. However, in the field of FL, communication cost plays a critical role in evaluating the performance of the model. Thus, transferring vision foundation models can be particularly challenging due to the significant resource costs involved. In this paper, we introduce a federated adaptive Contrastive Language Image Pretraining CLIP model designed for classification tasks. We employ a light-weight and efficient feature attention module for CLIP that selects suitable features for each client's data. Additionally, we propose a domain adaptation technique to reduce differences in data distribution between clients. Experimental results on four publicly available datasets demonstrate the superior performance of FACMIC in dealing with real-world and multisource medical imaging data. Our codes are available at this https URL.
Title: G2D2: Gradient-guided Discrete Diffusion for image inverse problem solving
Copy Paste: [[2410.14710]] G2D2: Gradient-guided Discrete Diffusion for image inverse problem solving(https://arxiv.org/abs/2410.14710)
Keywords: diffusion
Abstract: Recent literature has effectively utilized diffusion models trained on continuous variables as priors for solving inverse problems. Notably, discrete diffusion models with discrete latent codes have shown strong performance, particularly in modalities suited for discrete compressed representations, such as image and motion generation. However, their discrete and non-differentiable nature has limited their application to inverse problems formulated in continuous spaces. This paper presents a novel method for addressing linear inverse problems by leveraging image-generation models based on discrete diffusion as priors. We overcome these limitations by approximating the true posterior distribution with a variational distribution constructed from categorical distributions and continuous relaxation techniques. Furthermore, we employ a star-shaped noise process to mitigate the drawbacks of traditional discrete diffusion models with absorbing states, demonstrating that our method performs comparably to continuous diffusion techniques. To the best of our knowledge, this is the first approach to use discrete diffusion model-based priors for solving image inverse problems.
Title: QuAILoRA: Quantization-Aware Initialization for LoRA
Authors: Neal Lawton, Aishwarya Padmakumar, Judith Gaspers, Jack FitzGerald, Anoop Kumar, Greg Ver Steeg, Aram Galstyan
Copy Paste: [[2410.14713]] QuAILoRA: Quantization-Aware Initialization for LoRA(https://arxiv.org/abs/2410.14713)
Keywords: large language model
Abstract: QLoRA reduces the memory-cost of fine-tuning a large language model (LLM) with LoRA by quantizing the base LLM. However, quantization introduces quantization errors that negatively impact model performance after fine-tuning. In this paper we introduce QuAILoRA, a quantization-aware initialization for LoRA that mitigates this negative impact by decreasing quantization errors at initialization. Our method spends a small amount of computational overhead to compute this quantization-aware initialization, without increasing the memory-cost of fine-tuning. We evaluate our method on several causal language modeling and downstream evaluation tasks using several different model sizes and families. We observe that almost all LLMs fined-tuned with QuAILoRA achieve better validation perplexity. When evaluated on downstream tasks, we find that QuAILoRA yields improvements proportional to the negative effect of quantization error. On average, applying QuAILoRA to 4-bit QLoRA models yields 75% of the validation perplexity decrease and 86% of the downstream task accuracy increase as doubling the quantization precision to 8-bit, without increasing GPU memory utilization during fine-tuning.
Title: Animating the Past: Reconstruct Trilobite via Video Generation
Copy Paste: [[2410.14715]] Animating the Past: Reconstruct Trilobite via Video Generation(https://arxiv.org/abs/2410.14715)
Keywords: large language model
Abstract: Paleontology, the study of past life, fundamentally relies on fossils to reconstruct ancient ecosystems and understand evolutionary dynamics. Trilobites, as an important group of extinct marine arthropods, offer valuable insights into Paleozoic environments through their well-preserved fossil records. Reconstructing trilobite behaviour from static fossils will set new standards for dynamic reconstructions in scientific research and education. Despite the potential, current computational methods for this purpose like text-to-video (T2V) face significant challenges, such as maintaining visual realism and consistency, which hinder their application in science contexts. To overcome these obstacles, we introduce an automatic T2V prompt learning method. Within this framework, prompts for a fine-tuned video generation model are generated by a large language model, which is trained using rewards that quantify the visual realism and smoothness of the generated video. The fine-tuning of the video generation model, along with the reward calculations make use of a collected dataset of 9,088 Eoredlichia intermedia fossil images, which provides a common representative of visual details of all class of trilobites. Qualitative and quantitative experiments show that our method can generate trilobite videos with significantly higher visual realism compared to powerful baselines, promising to boost both scientific understanding and public engagement.
Title: A Systematic Survey on Large Language Models for Algorithm Design
Copy Paste: [[2410.14716]] A Systematic Survey on Large Language Models for Algorithm Design(https://arxiv.org/abs/2410.14716)
Keywords: large language model
Abstract: Algorithm Design (AD) is crucial for effective problem-solving across various domains. The advent of Large Language Models (LLMs) has notably enhanced the automation and innovation within this field, offering new perspectives and superior solutions. Over the past three years, the integration of LLMs into AD (LLM4AD) has progressed significantly, finding applications in diverse areas such as optimization, machine learning, mathematical reasoning, and scientific exploration. Given the rapid development and broadening scope of this field, a systematic review is both timely and essential. This paper provides a systematic review of the works on LLM4AD. First, we present an overview and summary of existing studies. Then, we present a systematic categorization, and a review of existing works along four dimensions including the role of LLMs, search techniques, prompt strategies, and application fields. We also discuss the achievements and challenges in each area and the capabilities of LLM4AD in addressing them. Finally, we explore current limitations and propose several open questions and promising directions for future research.
Title: SGLP: A Similarity Guided Fast Layer Partition Pruning for Compressing Large Deep Models
Copy Paste: [[2410.14720]] SGLP: A Similarity Guided Fast Layer Partition Pruning for Compressing Large Deep Models(https://arxiv.org/abs/2410.14720)
Keywords: large language model, segmentation
Abstract: The deployment of Deep Neural Network (DNN)-based networks on resource-constrained devices remains a significant challenge due to their high computational and parameter requirements. To solve this problem, layer pruning has emerged as a potent approach to reduce network size and improve computational efficiency. However, existing layer pruning methods mostly overlook the intrinsic connections and inter-dependencies between different layers within complicated deep neural networks. This oversight can result in pruned models that do not preserve the essential characteristics of the pre-trained network as effectively as desired. To address this limitations, we propose a Similarity Guided fast Layer Partition pruning for compressing large deep models (SGLP), which focuses on pruning layers from network segments partitioned via representation similarity. Specifically, our presented method first leverages Centered Kernel Alignment (CKA) to indicate the internal representations among the layers of the pre-trained network, which provides us with a potent basis for layer pruning. Based on similarity matrix derived from CKA, we employ Fisher Optimal Segmentation to partition the network into multiple segments, which provides a basis for removing the layers in a segment-wise manner. In addition, our method innovatively adopts GradNorm for segment-wise layer importance evaluation, eliminating the need for extensive fine-tuning, and finally prunes the unimportant layers to obtain a compact network. Experimental results in image classification and for large language models (LLMs) demonstrate that our proposed SGLP outperforms the state-of-the-art methods in both accuracy and computational efficiency, presenting a more effective solution for deploying DNNs on resource-limited platforms. Our codes are available at this https URL.
Title: BeniFul: Backdoor Defense via Middle Feature Analysis for Deep Neural Networks
Copy Paste: [[2410.14723]] BeniFul: Backdoor Defense via Middle Feature Analysis for Deep Neural Networks(https://arxiv.org/abs/2410.14723)
Keywords: defense, attack
Abstract: Backdoor defenses have recently become important in resisting backdoor attacks in deep neural networks (DNNs), where attackers implant backdoors into the DNN model by injecting backdoor samples into the training dataset. Although there are many defense methods to achieve backdoor detection for DNN inputs and backdoor elimination for DNN models, they still have not presented a clear explanation of the relationship between these two missions. In this paper, we use the features from the middle layer of the DNN model to analyze the difference between backdoor and benign samples and propose Backdoor Consistency, which indicates that at least one backdoor exists in the DNN model if the backdoor trigger is detected exactly on input. By analyzing the middle features, we design an effective and comprehensive backdoor defense method named BeniFul, which consists of two parts: a gray-box backdoor input detection and a white-box backdoor elimination. Specifically, we use the reconstruction distance from the Variational Auto-Encoder and model inference results to implement backdoor input detection and a feature distance loss to achieve backdoor elimination. Experimental results on CIFAR-10 and Tiny ImageNet against five state-of-the-art attacks demonstrate that our BeniFul exhibits a great defense capability in backdoor input detection and backdoor elimination.
Title: Security Threats in Agentic AI System
Authors: Raihan Khan, Sayak Sarkar, Sainik Kumar Mahata, Edwin Jose
Copy Paste: [[2410.14728]] Security Threats in Agentic AI System(https://arxiv.org/abs/2410.14728)
Keywords: security, privacy
Abstract: This research paper explores the privacy and security threats posed to an Agentic AI system with direct access to database systems. Such access introduces significant risks, including unauthorized retrieval of sensitive information, potential exploitation of system vulnerabilities, and misuse of personal or confidential data. The complexity of AI systems combined with their ability to process and analyze large volumes of data increases the chances of data leaks or breaches, which could occur unintentionally or through adversarial manipulation. Furthermore, as AI agents evolve with greater autonomy, their capacity to bypass or exploit security measures becomes a growing concern, heightening the need to address these critical vulnerabilities in agentic systems.
Title: On the Relation Between Linear Diffusion and Power Iteration
Authors: Dana Weitzner, Mauricio Delbracio, Peyman Milanfar, Raja Giryes
Copy Paste: [[2410.14730]] On the Relation Between Linear Diffusion and Power Iteration(https://arxiv.org/abs/2410.14730)
Keywords: diffusion, generative
Abstract: Recently, diffusion models have gained popularity due to their impressive generative abilities. These models learn the implicit distribution given by the training dataset, and sample new data by transforming random noise through the reverse process, which can be thought of as gradual denoising. In this work, we examine the generation process as a ``correlation machine'', where random noise is repeatedly enhanced in correlation with the implicit given distribution. To this end, we explore the linear case, where the optimal denoiser in the MSE sense is known to be the PCA projection. This enables us to connect the theory of diffusion models to the spiked covariance model, where the dependence of the denoiser on the noise level and the amount of training data can be expressed analytically, in the rank-1 case. In a series of numerical experiments, we extend this result to general low rank data, and show that low frequencies emerge earlier in the generation process, where the denoising basis vectors are more aligned to the true data with a rate depending on their eigenvalues. This model allows us to show that the linear diffusion model converges in mean to the leading eigenvector of the underlying data, similarly to the prevalent power iteration method. Finally, we empirically demonstrate the applicability of our findings beyond the linear case, in the Jacobians of a deep, non-linear denoiser, used in general image generation tasks.
Title: MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection
Abstract: KV cache has become a de facto technique for the inference of large language models (LLMs), where tensors of shape (layer number, head number, sequence length, feature dimension) are introduced to cache historical information for self-attention. As the size of the model and data grows, the KV cache can quickly become a bottleneck within the system in both storage and memory transfer. To address this, prior studies usually focus on the first three axes of the cache tensors for compression. This paper supplements them, focusing on the feature dimension axis, by utilizing low-rank projection matrices to transform the cache features into spaces with reduced dimensions. We begin by investigating the canonical orthogonal projection method for data compression through principal component analysis (PCA). We observe the issue with PCA projection where significant performance degradation is observed at low compression rates. To bridge the gap, we propose to directly tune the orthogonal projection matrices with a distillation objective using an elaborate Matryoshka training strategy. After training, we adaptively search for the optimal compression rates for various layers and heads given varying compression budgets. Compared to previous works, our method can easily embrace pre-trained LLMs and hold a smooth tradeoff between performance and compression rate. We empirically witness the high data efficiency of our training procedure and find that our method can sustain over 90% performance with an average KV cache compression rate of 60% (and up to 75% in certain extreme scenarios) for popular LLMs like LLaMA2-7B-base and Mistral-7B-v0.3-base.
Title: Agent Skill Acquisition for Large Language Models via CycleQD
Authors: So Kuroki, Taishi Nakamura, Takuya Akiba, Yujin Tang
Copy Paste: [[2410.14735]] Agent Skill Acquisition for Large Language Models via CycleQD(https://arxiv.org/abs/2410.14735)
Keywords: robust, large language model, segmentation
Abstract: Training large language models to acquire specific skills remains a challenging endeavor. Conventional training approaches often struggle with data distribution imbalances and inadequacies in objective functions that do not align well with task-specific performance. To address these challenges, we introduce CycleQD, a novel approach that leverages the Quality Diversity framework through a cyclic adaptation of the algorithm, along with a model merging based crossover and an SVD-based mutation. In CycleQD, each task's performance metric is alternated as the quality measure while the others serve as the behavioral characteristics. This cyclic focus on individual tasks allows for concentrated effort on one task at a time, eliminating the need for data ratio tuning and simplifying the design of the objective function. Empirical results from AgentBench indicate that applying CycleQD to LLAMA3-8B-INSTRUCT based models not only enables them to surpass traditional fine-tuning methods in coding, operating systems, and database tasks, but also achieves performance on par with GPT-3.5-TURBO, which potentially contains much more parameters, across these domains. Crucially, this enhanced performance is achieved while retaining robust language capabilities, as evidenced by its performance on widely adopted language benchmark tasks. We highlight the key design choices in CycleQD, detailing how these contribute to its effectiveness. Furthermore, our method is general and can be applied to image segmentation models, highlighting its applicability across different domains.
Title: Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching
Copy Paste: [[2410.14740]] Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching(https://arxiv.org/abs/2410.14740)
Keywords: large language model
Abstract: Although Large Language Models (LLMs) have demonstrated remarkable capabilities, their massive parameter counts and associated extensive computing make LLMs' deployment the main part of carbon emission from nowadays AI applications. Compared to modern GPUs like H$100$, it would be significantly carbon-sustainable if we could leverage old-fashioned GPUs such as M$40$ (as shown in Figure~\ref{fig:tisser}, M$40$ only has one third carbon emission of H$100$'s) for LLM servings. However, the limited High Bandwidth Memory (HBM) available on such GPU often cannot support the loading of LLMs due to the gigantic model size and intermediate activation data, making their serving challenging. For instance, a LLaMA2 model with $70$B parameters typically requires $128$GB for inference, which substantially surpasses $24$GB HBM in a $3090$ GPU and remains infeasible even considering the additional $64$GB DRAM. To address this challenge, this paper proposes a mixed-precision with a model modularization algorithm to enable LLM inference on outdated hardware with resource constraints. (The precision denotes the numerical precision like FP16, INT8, INT4) and multi-level caching (M2Cache).) Specifically, our M2Cache first modulizes neurons in LLM and creates their importance ranking. Then, it adopts a dynamic sparse mixed-precision quantization mechanism in weight space to reduce computational demands and communication overhead at each decoding step. It collectively lowers the operational carbon emissions associated with LLM inference. Moreover, M2Cache introduces a three-level cache management system with HBM, DRAM, and SSDs that complements the dynamic sparse mixed-precision inference. To enhance communication efficiency, M2Cache maintains a neuron-level mixed-precision LRU cache in HBM, a larger layer-aware cache in DRAM, and a full model in SSD.
Title: Eliciting Uncertainty in Chain-of-Thought to Mitigate Bias against Forecasting Harmful User Behaviors
Copy Paste: [[2410.14744]] Eliciting Uncertainty in Chain-of-Thought to Mitigate Bias against Forecasting Harmful User Behaviors(https://arxiv.org/abs/2410.14744)
Keywords: large language model
Abstract: Conversation forecasting tasks a model with predicting the outcome of an unfolding conversation. For instance, it can be applied in social media moderation to predict harmful user behaviors before they occur, allowing for preventative interventions. While large language models (LLMs) have recently been proposed as an effective tool for conversation forecasting, it's unclear what biases they may have, especially against forecasting the (potentially harmful) outcomes we request them to predict during moderation. This paper explores to what extent model uncertainty can be used as a tool to mitigate potential biases. Specifically, we ask three primary research questions: 1) how does LLM forecasting accuracy change when we ask models to represent their uncertainty; 2) how does LLM bias change when we ask models to represent their uncertainty; 3) how can we use uncertainty representations to reduce or completely mitigate biases without many training data points. We address these questions for 5 open-source language models tested on 2 datasets designed to evaluate conversation forecasting for social media moderation.
Title: SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation
Copy Paste: [[2410.14745]] SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation(https://arxiv.org/abs/2410.14745)
Keywords: large language model
Abstract: Supervised fine-tuning (SFT) is crucial in adapting large language models (LLMs) to a specific domain or task. However, only a limited amount of labeled data is available in practical applications, which poses a severe challenge for SFT in yielding satisfactory results. Therefore, a data-efficient framework that can fully exploit labeled and unlabeled data for LLM fine-tuning is highly anticipated. Towards this end, we introduce a semi-supervised fine-tuning framework named SemiEvol for LLM adaptation from a propagate-and-select manner. For knowledge propagation, SemiEvol adopts a bi-level approach, propagating knowledge from labeled data to unlabeled data through both in-weight and in-context methods. For knowledge selection, SemiEvol incorporates a collaborative learning mechanism, selecting higher-quality pseudo-response samples. We conducted experiments using GPT-4o-mini and Llama-3.1 on seven general or domain-specific datasets, demonstrating significant improvements in model performance on target data. Furthermore, we compared SemiEvol with SFT and self-evolution methods, highlighting its practicality in hybrid data scenarios.
Title: CFTS-GAN: Continual Few-Shot Teacher Student for Generative Adversarial Networks
Abstract: Few-shot and continual learning face two well-known challenges in GANs: overfitting and catastrophic forgetting. Learning new tasks results in catastrophic forgetting in deep learning models. In the case of a few-shot setting, the model learns from a very limited number of samples (e.g. 10 samples), which can lead to overfitting and mode collapse. So, this paper proposes a Continual Few-shot Teacher-Student technique for the generative adversarial network (CFTS-GAN) that considers both challenges together. Our CFTS-GAN uses an adapter module as a student to learn a new task without affecting the previous knowledge. To make the student model efficient in learning new tasks, the knowledge from a teacher model is distilled to the student. In addition, the Cross-Domain Correspondence (CDC) loss is used by both teacher and student to promote diversity and to avoid mode collapse. Moreover, an effective strategy of freezing the discriminator is also utilized for enhancing performance. Qualitative and quantitative results demonstrate more diverse image synthesis and produce qualitative samples comparatively good to very stronger state-of-the-art models.
Title: Mitigating Embedding Collapse in Diffusion Models for Categorical Data
Copy Paste: [[2410.14758]] Mitigating Embedding Collapse in Diffusion Models for Categorical Data(https://arxiv.org/abs/2410.14758)
Keywords: diffusion
Abstract: Latent diffusion models have enabled continuous-state diffusion models to handle a variety of datasets, including categorical data. However, most methods rely on fixed pretrained embeddings, limiting the benefits of joint training with the diffusion model. While jointly learning the embedding (via reconstruction loss) and the latent diffusion model (via score matching loss) could enhance performance, our analysis shows that end-to-end training risks embedding collapse, degrading generation quality. To address this issue, we introduce CATDM, a continuous diffusion framework within the embedding space that stabilizes training. We propose a novel objective combining the joint embedding-diffusion variational lower bound with a Consistency-Matching (CM) regularizer, alongside a shifted cosine noise schedule and random dropping strategy. The CM regularizer ensures the recovery of the true data distribution. Experiments on benchmarks show that CATDM mitigates embedding collapse, yielding superior results on FFHQ, LSUN Churches, and LSUN Bedrooms. In particular, CATDM achieves an FID of 6.81 on ImageNet $256\times256$ with 50 steps. It outperforms non-autoregressive models in machine translation and is on a par with previous methods in text generation.
Title: Constrained Recurrent Bayesian Forecasting for Crack Propagation
Authors: Sara Yasmine Ouerk, Olivier Vo Van, Mouadh Yagoubi
Copy Paste: [[2410.14761]] Constrained Recurrent Bayesian Forecasting for Crack Propagation(https://arxiv.org/abs/2410.14761)
Keywords: robust
Abstract: Predictive maintenance of railway infrastructure, especially railroads, is essential to ensure safety. However, accurate prediction of crack evolution represents a major challenge due to the complex interactions between intrinsic and external factors, as well as measurement uncertainties. Effective modeling requires a multidimensional approach and a comprehensive understanding of these dynamics and uncertainties. Motivated by an industrial use case based on collected real data containing measured crack lengths, this paper introduces a robust Bayesian multi-horizon approach for predicting the temporal evolution of crack lengths on rails. This model captures the intricate interplay between various factors influencing crack growth. Additionally, the Bayesian approach quantifies both epistemic and aleatoric uncertainties, providing a confidence interval around predictions. To enhance the model's reliability for railroad maintenance, specific constraints are incorporated. These constraints limit non-physical crack propagation behavior and prioritize safety. The findings reveal a trade-off between prediction accuracy and constraint compliance, highlighting the nuanced decision-making process in model training. This study offers insights into advanced predictive modeling for dynamic temporal forecasting, particularly in railway maintenance, with potential applications in other domains.
Title: Enabling Scalable Evaluation of Bias Patterns in Medical LLMs
Copy Paste: [[2410.14763]] Enabling Scalable Evaluation of Bias Patterns in Medical LLMs(https://arxiv.org/abs/2410.14763)
Keywords: fair, generative, large language model
Abstract: Large language models (LLMs) have shown impressive potential in helping with numerous medical challenges. Deploying LLMs in high-stakes applications such as medicine, however, brings in many concerns. One major area of concern relates to biased behaviors of LLMs in medical applications, leading to unfair treatment of individuals. To pave the way for the responsible and impactful deployment of Med LLMs, rigorous evaluation is a key prerequisite. Due to the huge complexity and variability of different medical scenarios, existing work in this domain has primarily relied on using manually crafted datasets for bias evaluation. In this study, we present a new method to scale up such bias evaluations by automatically generating test cases based on rigorous medical evidence. We specifically target the challenges of a) domain-specificity of bias characterization, b) hallucinating while generating the test cases, and c) various dependencies between the health outcomes and sensitive attributes. To that end, we offer new methods to address these challenges integrated with our generative pipeline, using medical knowledge graphs, medical ontologies, and customized general LLM evaluation frameworks in our method. Through a series of extensive experiments, we show that the test cases generated by our proposed method can effectively reveal bias patterns in Med LLMs at larger and more flexible scales than human-crafted datasets. We publish a large bias evaluation dataset using our pipeline, which is dedicated to a few medical case studies. A live demo of our application for vignette generation is available at this https URL. Our code is also available at this https URL.
Title: Multifidelity Kolmogorov-Arnold Networks
Authors: Amanda A. Howard, Bruno Jacob, Panos Stinis
Abstract: We develop a method for multifidelity Kolmogorov-Arnold networks (KANs), which use a low-fidelity model along with a small amount of high-fidelity data to train a model for the high-fidelity data accurately. Multifidelity KANs (MFKANs) reduce the amount of expensive high-fidelity data needed to accurately train a KAN by exploiting the correlations between the low- and high-fidelity data to give accurate and robust predictions in the absence of a large high-fidelity dataset. In addition, we show that multifidelity KANs can be used to increase the accuracy of physics-informed KANs (PIKANs), without the use of training data.
Title: What's New in My Data? Novelty Exploration via Contrastive Generation
Copy Paste: [[2410.14765]] What's New in My Data? Novelty Exploration via Contrastive Generation(https://arxiv.org/abs/2410.14765)
Keywords: privacy, generative
Abstract: Fine-tuning is widely used to adapt language models for specific goals, often leveraging real-world data such as patient records, customer-service interactions, or web content in languages not covered in pre-training. These datasets are typically massive, noisy, and often confidential, making their direct inspection challenging. However, understanding them is essential for guiding model deployment and informing decisions about data cleaning or suppressing any harmful behaviors learned during fine-tuning. In this study, we introduce the task of novelty discovery through generation, which aims to identify novel properties of a fine-tuning dataset by generating examples that illustrate these properties. Our approach, Contrastive Generative Exploration (CGE), assumes no direct access to the data but instead relies on a pre-trained model and the same model after fine-tuning. By contrasting the predictions of these two models, CGE can generate examples that highlight novel characteristics of the fine-tuning data. However, this simple approach may produce examples that are too similar to one another, failing to capture the full range of novel phenomena present in the dataset. We address this by introducing an iterative version of CGE, where the previously generated examples are used to update the pre-trained model, and this updated model is then contrasted with the fully fine-tuned model to generate the next example, promoting diversity in the generated outputs. Our experiments demonstrate the effectiveness of CGE in detecting novel content, such as toxic language, as well as new natural and programming languages. Furthermore, we show that CGE remains effective even when models are fine-tuned using differential privacy techniques.
Title: A Survey on Computational Solutions for Reconstructing Complete Objects by Reassembling Their Fractured Parts
Copy Paste: [[2410.14770]] A Survey on Computational Solutions for Reconstructing Complete Objects by Reassembling Their Fractured Parts(https://arxiv.org/abs/2410.14770)
Keywords: segmentation
Abstract: Reconstructing a complete object from its parts is a fundamental problem in many scientific domains. The purpose of this article is to provide a systematic survey on this topic. The reassembly problem requires understanding the attributes of individual pieces and establishing matches between different pieces. Many approaches also model priors of the underlying complete object. Existing approaches are tightly connected problems of shape segmentation, shape matching, and learning shape priors. We provide existing algorithms in this context and emphasize their similarities and differences to general-purpose approaches. We also survey the trends from early non-deep learning approaches to more recent deep learning approaches. In addition to algorithms, this survey will also describe existing datasets, open-source software packages, and applications. To the best of our knowledge, this is the first comprehensive survey on this topic in computer graphics.
Title: Cross-Document Event-Keyed Summarization
Authors: William Walden, Pavlo Kuchmiichuk, Alexander Martin, Chihsheng Jin, Angela Cao, Claire Sun, Curisia Allen, Aaron Steven White
Abstract: Event-keyed summarization (EKS) requires generating a summary about a specific event described in a document, given the document and an event representation extracted from it. In this work, we extend EKS to the cross-document setting (CDEKS), in which summaries must synthesize information from accounts of the same event given by multiple sources. We introduce SEAMUS (Summaries of Events Across Multiple Sources), a high-quality dataset for CDEKS based on an expert reannotation of the FAMUS dataset for cross-document argument extraction. We present a suite of baselines on SEAMUS, covering both smaller, fine-tuned models, as well as zero- and few-shot prompted LLMs, along with detailed ablations, and a human evaluation study, showing SEAMUS to be a valuable benchmark for this new task.
Title: DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents
Authors: Taiyi Wang, Zhihao Wu, Jianheng Liu, Jianye Hao, Jun Wang, Kun Shao
Copy Paste: [[2410.14803]] DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents(https://arxiv.org/abs/2410.14803)
Keywords: robust, large language model
Abstract: On-device control agents, especially on mobile devices, are responsible for operating mobile devices to fulfill users' requests, enabling seamless and intuitive interactions. Integrating Multimodal Large Language Models (MLLMs) into these agents enhances their ability to understand and execute complex commands, thereby improving user experience. However, fine-tuning MLLMs for on-device control presents significant challenges due to limited data availability and inefficient online training processes. This paper introduces DistRL, a novel framework designed to enhance the efficiency of online RL fine-tuning for mobile device control agents. DistRL employs centralized training and decentralized data acquisition to ensure efficient fine-tuning in the context of dynamic online interactions. Additionally, the framework is backed by our tailor-made RL algorithm, which effectively balances exploration with the prioritized utilization of collected data to ensure stable and robust training. Our experiments show that, on average, DistRL delivers a 3X improvement in training efficiency and enables training data collection 2.4X faster than the leading synchronous multi-machine methods. Notably, after training, DistRL achieves a 20% relative improvement in success rate compared to state-of-the-art methods on general Android tasks from an open benchmark, significantly outperforming existing approaches while maintaining the same training time. These results validate DistRL as a scalable and efficient solution, offering substantial improvements in both training efficiency and agent performance for real-world, in-the-wild device control tasks.
Title: Tackling domain generalization for out-of-distribution endoscopic imaging
Authors: Mansoor Ali Teevno, Gilberto Ochoa-Ruiz, Sharib Ali
Copy Paste: [[2410.14821]] Tackling domain generalization for out-of-distribution endoscopic imaging(https://arxiv.org/abs/2410.14821)
Keywords: robust, segmentation
Abstract: While recent advances in deep learning (DL) for surgical scene segmentation have yielded promising results on single-center and single-imaging modality data, these methods usually do not generalize well to unseen distributions or modalities. Even though human experts can identify visual appearances, DL methods often fail to do so when data samples do not follow a similar distribution. Current literature addressing domain gaps in modality changes has focused primarily on natural scene data. However, these methods cannot be directly applied to endoscopic data, as visual cues in such data are more limited compared to natural scenes. In this work, we exploit both style and content information in images by performing instance normalization and feature covariance mapping techniques to preserve robust and generalizable feature representations. Additionally, to avoid the risk of removing salient feature representations associated with objects of interest, we introduce a restitution module within the feature-learning ResNet backbone that retains useful task-relevant features. Our proposed method shows a 13.7% improvement over the baseline DeepLabv3+ and nearly an 8% improvement over recent state-of-the-art (SOTA) methods for the target (different modality) set of the EndoUDA polyp dataset. Similarly, our method achieved a 19% improvement over the baseline and 6% over the best-performing SOTA method on the EndoUDA Barrett's esophagus (BE) dataset.
Title: SPRIG: Improving Large Language Model Performance by System Prompt Optimization
Copy Paste: [[2410.14826]] SPRIG: Improving Large Language Model Performance by System Prompt Optimization(https://arxiv.org/abs/2410.14826)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown impressive capabilities in many scenarios, but their performance depends, in part, on the choice of prompt. Past research has focused on optimizing prompts specific to a task. However, much less attention has been given to optimizing the general instructions included in a prompt, known as a system prompt. To address this gap, we propose SPRIG, an edit-based genetic algorithm that iteratively constructs prompts from prespecified components to maximize the model's performance in general scenarios. We evaluate the performance of system prompts on a collection of 47 different types of tasks to ensure generalizability. Our study finds that a single optimized system prompt performs on par with task prompts optimized for each individual task. Moreover, combining system and task-level optimizations leads to further improvement, which showcases their complementary nature. Experiments also reveal that the optimized system prompts generalize effectively across model families, parameter sizes, and languages. This study provides insights into the role of system-level instructions in maximizing LLM potential.
Title: Making LLMs Vulnerable to Prompt Injection via Poisoning Alignment
Authors: Zedian Shao, Hongbin Liu, Jaden Mu, Neil Zhenqiang Gong
Copy Paste: [[2410.14827]] Making LLMs Vulnerable to Prompt Injection via Poisoning Alignment(https://arxiv.org/abs/2410.14827)
Keywords: attack
Abstract: In a prompt injection attack, an attacker injects a prompt into the original one, aiming to make the LLM follow the injected prompt and perform a task chosen by the attacker. Existing prompt injection attacks primarily focus on how to blend the injected prompt into the original prompt without altering the LLM itself. Our experiments show that these attacks achieve some success, but there is still significant room for improvement. In this work, we show that an attacker can boost the success of prompt injection attacks by poisoning the LLM's alignment process. Specifically, we propose PoisonedAlign, a method to strategically create poisoned alignment samples. When even a small fraction of the alignment data is poisoned using our method, the aligned LLM becomes more vulnerable to prompt injection while maintaining its foundational capabilities. The code is available at this https URL
Title: Automated Road Extraction from Satellite Imagery Integrating Dense Depthwise Dilated Separable Spatial Pyramid Pooling with DeepLabV3+
Authors: Arpan Mahara, Md Rezaul Karim Khan, Naphtali D. Rishe, Wenjia Wang, Seyed Masoud Sadjadi
Copy Paste: [[2410.14836]] Automated Road Extraction from Satellite Imagery Integrating Dense Depthwise Dilated Separable Spatial Pyramid Pooling with DeepLabV3+(https://arxiv.org/abs/2410.14836)
Keywords: extraction, segmentation
Abstract: Road Extraction is a sub-domain of Remote Sensing applications; it is a subject of extensive and ongoing research. The procedure of automatically extracting roads from satellite imagery encounters significant challenges due to the multi-scale and diverse structures of roads; improvement in this field is needed. The DeepLab series, known for its proficiency in semantic segmentation due to its efficiency in interpreting multi-scale objects' features, addresses some of these challenges caused by the varying nature of roads. The present work proposes the utilization of DeepLabV3+, the latest version of the DeepLab series, by introducing an innovative Dense Depthwise Dilated Separable Spatial Pyramid Pooling (DenseDDSSPP) module and integrating it in place of the conventional Atrous Spatial Pyramid Pooling (ASPP) module. This modification enhances the extraction of complex road structures from satellite images. This study hypothesizes that the integration of DenseDDSSPP, combined with an appropriately selected backbone network and a Squeeze-and-Excitation block, will generate an efficient dense feature map by focusing on relevant features, leading to more precise and accurate road extraction from Remote Sensing images. The results section presents a comparison of our model's performance against state-of-the-art models, demonstrating better results that highlight the effectiveness and success of the proposed approach.
Title: SYNOSIS: Image synthesis pipeline for machine vision in metal surface inspection
Authors: Juraj Fulir, Natascha Jeziorski, Lovro Bosnar, Hans Hagen, Claudia Redenbach, Petra Gospodnetić, Tobias Herrfurth, Marcus Trost, Thomas Gischkat
Copy Paste: [[2410.14844]] SYNOSIS: Image synthesis pipeline for machine vision in metal surface inspection(https://arxiv.org/abs/2410.14844)
Keywords: robust, segmentation
Abstract: The use of machine learning (ML) methods for development of robust and flexible visual inspection system has shown promising. However their performance is highly dependent on the amount and diversity of training data. This is often restricted not only due to costs but also due to a wide variety of defects and product surfaces which occur with varying frequency. As such, one can not guarantee that the acquired dataset contains enough defect and product surface occurrences which are needed to develop a robust model. Using parametric synthetic dataset generation, it is possible to avoid these issues. In this work, we introduce a complete pipeline which describes in detail how to approach image synthesis for surface inspection - from first acquisition, to texture and defect modeling, data generation, comparison to real data and finally use of the synthetic data to train a defect segmentation model. The pipeline is in detail evaluated for milled and sandblasted aluminum surfaces. In addition to providing an in-depth view into each step, discussion of chosen methods, and presentation of ML results, we provide a comprehensive dual dataset containing both real and synthetic images.
Title: FedSpaLLM: Federated Pruning of Large Language Models
Copy Paste: [[2410.14852]] FedSpaLLM: Federated Pruning of Large Language Models(https://arxiv.org/abs/2410.14852)
Keywords: privacy, federate, large language model
Abstract: Large Language Models (LLMs) achieve state-of-the-art performance but are challenging to deploy due to their high computational and storage demands. Pruning can reduce model size, yet existing methods assume public access to calibration data, which is impractical for privacy-sensitive applications. To address the challenge of pruning LLMs in privacy-preserving settings, we propose FedSpaLLM, the first federated learning framework designed specifically for pruning LLMs. FedSpaLLM enables clients to prune their models locally based on private data while accounting for system heterogeneity and maintaining communication efficiency. Our framework introduces several key innovations: (1) a novel $\ell_0$-norm aggregation function that ensures only non-zero weights are averaged across clients, preserving important model parameters; (2) an adaptive mask expansion technique that meets global sparsity targets while accommodating client-specific pruning decisions; and (3) a layer sampling strategy that reduces communication overhead and personalizes the pruning process based on client resources. Extensive experiments show that FedSpaLLM improves pruning performance in diverse federated settings. The source code will be released upon publication.
Title: DFlow: Diverse Dialogue Flow Simulation with Large Language Models
Authors: Wanyu Du, Song Feng, James Gung, Lijia Sun, Yi Zhang, Saab Mansour, Yanjun Qi
Copy Paste: [[2410.14853]] DFlow: Diverse Dialogue Flow Simulation with Large Language Models(https://arxiv.org/abs/2410.14853)
Keywords: large language model
Abstract: Developing language model-based dialogue agents requires effective data to train models that can follow specific task logic. However, most existing data augmentation methods focus on increasing diversity in language, topics, or dialogue acts at the utterance level, largely neglecting a critical aspect of task logic diversity at the dialogue level. This paper proposes a novel data augmentation method designed to enhance the diversity of synthetic dialogues by focusing on task execution logic. Our method uses LLMs to generate decision tree-structured task plans, which enables the derivation of diverse dialogue trajectories for a given task. Each trajectory, referred to as a "dialog flow", guides the generation of a multi-turn dialogue that follows a unique trajectory. We apply this method to generate a task-oriented dialogue dataset comprising 3,886 dialogue flows across 15 different domains. We validate the effectiveness of this dataset using the next action prediction task, where models fine-tuned on our dataset outperform strong baselines, including GPT-4. Upon acceptance of this paper, we plan to release the code and data publicly.
Title: Improving Vision Transformers by Overlapping Heads in Multi-Head Self-Attention
Copy Paste: [[2410.14874]] Improving Vision Transformers by Overlapping Heads in Multi-Head Self-Attention(https://arxiv.org/abs/2410.14874)
Keywords: transformer
Abstract: Vision Transformers have made remarkable progress in recent years, achieving state-of-the-art performance in most vision tasks. A key component of this success is due to the introduction of the Multi-Head Self-Attention (MHSA) module, which enables each head to learn different representations by applying the attention mechanism independently. In this paper, we empirically demonstrate that Vision Transformers can be further enhanced by overlapping the heads in MHSA. We introduce Multi-Overlapped-Head Self-Attention (MOHSA), where heads are overlapped with their two adjacent heads for queries, keys, and values, while zero-padding is employed for the first and last heads, which have only one neighboring head. Various paradigms for overlapping ratios are proposed to fully investigate the optimal performance of our approach. The proposed approach is evaluated using five Transformer models on four benchmark datasets and yields a significant performance boost. The source code will be made publicly available upon publication.
Title: On the Influence of Shape, Texture and Color for Learning Semantic Segmentation
Authors: Annika Mütze, Natalie Grabowsky, Edgar Heinert, Matthias Rottmann, Hanno Gottschalk
Copy Paste: [[2410.14878]] On the Influence of Shape, Texture and Color for Learning Semantic Segmentation(https://arxiv.org/abs/2410.14878)
Keywords: transformer, segmentation
Abstract: In recent years, a body of works has emerged, studying shape and texture biases of off-the-shelf pre-trained deep neural networks (DNN) for image classification. These works study how much a trained DNN relies on image cues, predominantly shape and texture. In this work, we switch the perspective, posing the following questions: What can a DNN learn from each of the image cues, i.e., shape, texture and color, respectively? How much does each cue influence the learning success? And what are the synergy effects between different cues? Studying these questions sheds light upon cue influences on learning and thus the learning capabilities of DNNs. We study these questions on semantic segmentation which allows us to address our questions on pixel level. To conduct this study, we develop a generic procedure to decompose a given dataset into multiple ones, each of them only containing either a single cue or a chosen mixture. This framework is then applied to two real-world datasets, Cityscapes and PASCAL Context, and a synthetic data set based on the CARLA simulator. We learn the given semantic segmentation task from these cue datasets, creating cue experts. Early fusion of cues is performed by constructing appropriate datasets. This is complemented by a late fusion of experts which allows us to study cue influence location-dependent on pixel level. Our study on three datasets reveals that neither texture nor shape clearly dominate the learning success, however a combination of shape and color but without texture achieves surprisingly strong results. Our findings hold for convolutional and transformer backbones. In particular, qualitatively there is almost no difference in how both of the architecture types extract information from the different cues.
Title: Zero-shot Generalist Graph Anomaly Detection with Unified Neighborhood Prompts
Abstract: Graph anomaly detection (GAD), which aims to identify nodes in a graph that significantly deviate from normal patterns, plays a crucial role in broad application domains. Existing GAD methods, whether supervised or unsupervised, are one-model-for-one-dataset approaches, i.e., training a separate model for each graph dataset. This limits their applicability in real-world scenarios where training on the target graph data is not possible due to issues like data privacy. To overcome this limitation, we propose a novel zero-shot generalist GAD approach UNPrompt that trains a one-for-all detection model, requiring the training of one GAD model on a single graph dataset and then effectively generalizing to detect anomalies in other graph datasets without any retraining or fine-tuning. The key insight in UNPrompt is that i) the predictability of latent node attributes can serve as a generalized anomaly measure and ii) highly generalized normal and abnormal graph patterns can be learned via latent node attribute prediction in a properly normalized node attribute space. UNPrompt achieves generalist GAD through two main modules: one module aligns the dimensionality and semantics of node attributes across different graphs via coordinate-wise normalization in a projected space, while another module learns generalized neighborhood prompts that support the use of latent node attribute predictability as an anomaly score across different datasets. Extensive experiments on real-world GAD datasets show that UNPrompt significantly outperforms diverse competing methods under the generalist GAD setting, and it also has strong superiority under the one-model-for-one-dataset setting.
Title: Self-Satisfied: An end-to-end framework for SAT generation and prediction
Authors: Christopher R. Serrano, Jonathan Gallagher, Kenji Yamada, Alexei Kopylov, Michael A. Warren
Copy Paste: [[2410.14888]] Self-Satisfied: An end-to-end framework for SAT generation and prediction(https://arxiv.org/abs/2410.14888)
Keywords: transformer
Abstract: The boolean satisfiability (SAT) problem asks whether there exists an assignment of boolean values to the variables of an arbitrary boolean formula making the formula evaluate to True. It is well-known that all NP-problems can be coded as SAT problems and therefore SAT is important both practically and theoretically. From both of these perspectives, better understanding the patterns and structure implicit in SAT data is of significant value. In this paper, we describe several advances that we believe will help open the door to such understanding: we introduce hardware accelerated algorithms for fast SAT problem generation, a geometric SAT encoding that enables the use of transformer architectures typically applied to vision tasks, and a simple yet effective technique we term head slicing for reducing sequence length representation inside transformer architectures. These advances allow us to scale our approach to SAT problems with thousands of variables and tens of thousands of clauses. We validate our architecture, termed Satisfiability Transformer (SaT), on the SAT prediction task with data from the SAT Competition (SATComp) 2022 problem sets. Prior related work either leveraged a pure machine learning approach, but could not handle SATComp-sized problems, or was hybrid in the sense of integrating a machine learning component in a standard SAT solving tool. Our pure machine learning approach achieves prediction accuracies comparable to recent work, but on problems that are an order of magnitude larger than previously demonstrated. A fundamental aspect of our work concerns the very nature of SAT data and its suitability for training machine learning models. We both describe experimental results that probe the landscape of where SAT data can be successfully used for learning and position these results within the broader context of complexity and learning.
Title: Truncated Consistency Models
Authors: Sangyun Lee, Yilun Xu, Tomas Geffner, Giulia Fanti, Karsten Kreis, Arash Vahdat, Weili Nie
Abstract: Consistency models have recently been introduced to accelerate sampling from diffusion models by directly predicting the solution (i.e., data) of the probability flow ODE (PF ODE) from initial noise. However, the training of consistency models requires learning to map all intermediate points along PF ODE trajectories to their corresponding endpoints. This task is much more challenging than the ultimate objective of one-step generation, which only concerns the PF ODE's noise-to-data mapping. We empirically find that this training paradigm limits the one-step generation performance of consistency models. To address this issue, we generalize consistency training to the truncated time range, which allows the model to ignore denoising tasks at earlier time steps and focus its capacity on generation. We propose a new parameterization of the consistency function and a two-stage training procedure that prevents the truncated-time training from collapsing to a trivial solution. Experiments on CIFAR-10 and ImageNet $64\times64$ datasets show that our method achieves better one-step and two-step FIDs than the state-of-the-art consistency models such as iCT-deep, using more than 2$\times$ smaller networks. Project page: this https URL
Title: DRACO: Differentiable Reconstruction for Arbitrary CBCT Orbits
Authors: Chengze Ye, Linda-Sophie Schneider, Yipeng Sun, Mareike Thies, Siyuan Mei, Andreas Maier
Copy Paste: [[2410.14900]] DRACO: Differentiable Reconstruction for Arbitrary CBCT Orbits(https://arxiv.org/abs/2410.14900)
Keywords: robust, interpretability
Abstract: This paper introduces a novel method for reconstructing cone beam computed tomography (CBCT) images for arbitrary orbits using a differentiable shift-variant filtered backprojection (FBP) neural network. Traditional CBCT reconstruction methods for arbitrary orbits, like iterative reconstruction algorithms, are computationally expensive and memory-intensive. The proposed method addresses these challenges by employing a shift-variant FBP algorithm optimized for arbitrary trajectories through a deep learning approach that adapts to a specific orbit geometry. This approach overcomes the limitations of existing techniques by integrating known operators into the learning model, minimizing the number of parameters, and improving the interpretability of the model. The proposed method is a significant advancement in interventional medical imaging, particularly for robotic C-arm CT systems, enabling faster and more accurate CBCT reconstructions with customized orbits. Especially this method can also be used for the analytical reconstruction of non-continuous orbits like circular plus arc. The experimental results demonstrate that the proposed method significantly accelerates the reconstruction process compared to conventional iterative algorithms. It achieves comparable or superior image quality, as evidenced by metrics such as the mean squared error (MSE), the peak signal-to-noise ratio (PSNR), and the structural similarity index measure (SSIM). The validation experiments show that the method can handle data from different trajectories, demonstrating its flexibility and robustness across different scan geometries. Our method demonstrates a significant improvement, particularly for the sinusoidal trajectory, achieving a 38.6% reduction in MSE, a 7.7% increase in PSNR, and a 5.0% improvement in SSIM. Furthermore, the computation time for reconstruction was reduced by more than 97%.
Title: A Hybrid Defense Strategy for Boosting Adversarial Robustness in Vision-Language Models
Copy Paste: [[2410.14911]] A Hybrid Defense Strategy for Boosting Adversarial Robustness in Vision-Language Models(https://arxiv.org/abs/2410.14911)
Keywords: security, defense, attack, robust
Abstract: The robustness of Vision-Language Models (VLMs) such as CLIP is critical for their deployment in safety-critical applications like autonomous driving, healthcare diagnostics, and security systems, where accurate interpretation of visual and textual data is essential. However, these models are highly susceptible to adversarial attacks, which can severely compromise their performance and reliability in real-world scenarios. Previous methods have primarily focused on improving robustness through adversarial training and generating adversarial examples using models like FGSM, AutoAttack, and DeepFool. However, these approaches often rely on strong assumptions, such as fixed perturbation norms or predefined attack patterns, and involve high computational complexity, making them challenging to implement in practical settings. In this paper, we propose a novel adversarial training framework that integrates multiple attack strategies and advanced machine learning techniques to significantly enhance the robustness of VLMs against a broad range of adversarial attacks. Experiments conducted on real-world datasets, including CIFAR-10 and CIFAR-100, demonstrate that the proposed method significantly enhances model robustness. The fine-tuned CLIP model achieved an accuracy of 43.5% on adversarially perturbed images, compared to only 4% for the baseline model. The neural network model achieved a high accuracy of 98% in these challenging classification tasks, while the XGBoost model reached a success rate of 85.26% in prediction tasks.
Title: Adversarial Score identity Distillation: Rapidly Surpassing the Teacher in One Step
Authors: Mingyuan Zhou, Huangjie Zheng, Yi Gu, Zhendong Wang, Hai Huang
Copy Paste: [[2410.14919]] Adversarial Score identity Distillation: Rapidly Surpassing the Teacher in One Step(https://arxiv.org/abs/2410.14919)
Keywords: diffusion, data-free
Abstract: Score identity Distillation (SiD) is a data-free method that has achieved state-of-the-art performance in image generation by leveraging only a pretrained diffusion model, without requiring any training data. However, the ultimate performance of SiD is constrained by the accuracy with which the pretrained model captures the true data scores at different stages of the diffusion process. In this paper, we introduce SiDA (SiD with Adversarial Loss), which not only enhances generation quality but also improves distillation efficiency by incorporating real images and adversarial loss. SiDA utilizes the encoder from the generator's score network as a discriminator, boosting its ability to distinguish between real images and those generated by SiD. The adversarial loss is batch-normalized within each GPU and then combined with the original SiD loss. This integration effectively incorporates the average "fakeness" per GPU batch into the pixel-based SiD loss, enabling SiDA to distill a single-step generator either from scratch or by fine-tuning an existing one. SiDA converges significantly faster than its predecessor when trained from scratch, and swiftly improves upon the original model's performance after an initial warmup period during fine-tuning from a pre-distilled SiD generator. This one-step adversarial distillation method has set new benchmarks for generation performance when distilling EDM diffusion models pretrained on CIFAR-10 (32x32) and ImageNet (64x64), achieving FID scores of $\mathbf{1.499}$ on CIFAR-10 unconditional, $\mathbf{1.396}$ on CIFAR-10 conditional, and $\mathbf{1.110}$ on ImageNet 64x64. Our open-source code will be integrated into the SiD codebase on GitHub.
Title: Imprompter: Tricking LLM Agents into Improper Tool Use
Authors: Xiaohan Fu, Shuheng Li, Zihan Wang, Yihao Liu, Rajesh K. Gupta, Taylor Berg-Kirkpatrick, Earlence Fernandes
Keywords: security, attack, generative, large language model
Abstract: Large Language Model (LLM) Agents are an emerging computing paradigm that blends generative machine learning with tools such as code interpreters, web browsing, email, and more generally, external resources. These agent-based systems represent an emerging shift in personal computing. We contribute to the security foundations of agent-based systems and surface a new class of automatically computed obfuscated adversarial prompt attacks that violate the confidentiality and integrity of user resources connected to an LLM agent. We show how prompt optimization techniques can find such prompts automatically given the weights of a model. We demonstrate that such attacks transfer to production-level agents. For example, we show an information exfiltration attack on Mistral's LeChat agent that analyzes a user's conversation, picks out personally identifiable information, and formats it into a valid markdown command that results in leaking that data to the attacker's server. This attack shows a nearly 80% success rate in an end-to-end evaluation. We conduct a range of experiments to characterize the efficacy of these attacks and find that they reliably work on emerging agent-based systems like Mistral's LeChat, ChatGLM, and Meta's Llama. These attacks are multimodal, and we show variants in the text-only and image domains.
Title: Securing the Web: Analysis of HTTP Security Headers in Popular Global Websites
Copy Paste: [[2410.14924]] Securing the Web: Analysis of HTTP Security Headers in Popular Global Websites(https://arxiv.org/abs/2410.14924)
Keywords: secure, security, attack, robust
Abstract: The surge in website attacks, including Denial of Service (DoS), Cross-Site Scripting (XSS), and Clickjacking, underscores the critical need for robust HTTPS implementation-a practice that, alarmingly, remains inadequately adopted. Regarding this, we analyzed HTTP security headers across N=3,195 globally popular websites. Initially, we employed automated categorization using Google NLP to organize these websites into functional categories and validated this categorization through manual verification using Symantec Sitereview. Subsequently, we assessed HTTPS implementation across these websites by analyzing security factors, including compliance with HTTP Strict Transport Security (HSTS) policies, Certificate Pinning practices, and other security postures using the Mozilla Observatory. Our analysis revealed over half of the websites examined (55.66%) received a dismal security grade of 'F' and most websites scored low for various metrics, which is indicative of weak HTTP header implementation. These low scores expose multiple issues such as weak implementation of Content Security Policies (CSP), neglect of HSTS guidelines, and insufficient application of Subresource Integrity (SRI). Alarmingly, healthcare websites (n=59) are particularly concerning; despite being entrusted with sensitive patient data and obligations to comply with data regulations, these sites recorded the lowest average score (18.14). We conclude by recommending that developers should prioritize secure redirection strategies and use implementation ease as a guide when deciding where to focus their development efforts.
Title: Part-Whole Relational Fusion Towards Multi-Modal Scene Understanding
Authors: Yi Liu, Chengxin Li, Shoukun Xu, Jungong Han
Copy Paste: [[2410.14944]] Part-Whole Relational Fusion Towards Multi-Modal Scene Understanding(https://arxiv.org/abs/2410.14944)
Keywords: segmentation
Abstract: Multi-modal fusion has played a vital role in multi-modal scene understanding. Most existing methods focus on cross-modal fusion involving two modalities, often overlooking more complex multi-modal fusion, which is essential for real-world applications like autonomous driving, where visible, depth, event, LiDAR, etc., are used. Besides, few attempts for multi-modal fusion, \emph{e.g.}, simple concatenation, cross-modal attention, and token selection, cannot well dig into the intrinsic shared and specific details of multiple modalities. To tackle the challenge, in this paper, we propose a Part-Whole Relational Fusion (PWRF) framework. For the first time, this framework treats multi-modal fusion as part-whole relational fusion. It routes multiple individual part-level modalities to a fused whole-level modality using the part-whole relational routing ability of Capsule Networks (CapsNets). Through this part-whole routing, our PWRF generates modal-shared and modal-specific semantics from the whole-level modal capsules and the routing coefficients, respectively. On top of that, modal-shared and modal-specific details can be employed to solve the issue of multi-modal scene understanding, including synthetic multi-modal segmentation and visible-depth-thermal salient object detection in this paper. Experiments on several datasets demonstrate the superiority of the proposed PWRF framework for multi-modal scene understanding. The source code has been released on this https URL.
Title: SemiHVision: Enhancing Medical Multimodal Models with a Semi-Human Annotated Dataset and Fine-Tuned Instruction Generation
Authors: Junda Wang, Yujan Ting, Eric Z. Chen, Hieu Tran, Hong Yu, Weijing Huang, Terrence Chen
Copy Paste: [[2410.14948]] SemiHVision: Enhancing Medical Multimodal Models with a Semi-Human Annotated Dataset and Fine-Tuned Instruction Generation(https://arxiv.org/abs/2410.14948)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) have made significant strides, yet they face challenges in the medical domain due to limited specialized knowledge. While recent medical MLLMs demonstrate strong performance in lab settings, they often struggle in real-world applications, highlighting a substantial gap between research and practice. In this paper, we seek to address this gap at various stages of the end-to-end learning pipeline, including data collection, model fine-tuning, and evaluation. At the data collection stage, we introduce SemiHVision, a dataset that combines human annotations with automated augmentation techniques to improve both medical knowledge representation and diagnostic reasoning. For model fine-tuning, we trained PMC-Cambrian-8B-AN over 2400 H100 GPU hours, resulting in performance that surpasses public medical models like HuatuoGPT-Vision-34B (79.0% vs. 66.7%) and private general models like Claude3-Opus (55.7%) on traditional benchmarks such as SLAKE and VQA-RAD. In the evaluation phase, we observed that traditional benchmarks cannot accurately reflect realistic clinical task capabilities. To overcome this limitation and provide more targeted guidance for model evaluation, we introduce the JAMA Clinical Challenge, a novel benchmark specifically designed to evaluate diagnostic reasoning. On this benchmark, PMC-Cambrian-AN achieves state-of-the-art performance with a GPT-4 score of 1.29, significantly outperforming HuatuoGPT-Vision-34B (1.13) and Claude3-Opus (1.17), demonstrating its superior diagnostic reasoning abilities.
Title: Straightness of Rectified Flow: A Theoretical Insight into Wasserstein Convergence
Copy Paste: [[2410.14949]] Straightness of Rectified Flow: A Theoretical Insight into Wasserstein Convergence(https://arxiv.org/abs/2410.14949)
Keywords: diffusion, generative
Abstract: Diffusion models have emerged as a powerful tool for image generation and denoising. Typically, generative models learn a trajectory between the starting noise distribution and the target data distribution. Recently Liu et al. (2023b) designed a novel alternative generative model Rectified Flow (RF), which aims to learn straight flow trajectories from noise to data using a sequence of convex optimization problems with close ties to optimal transport. If the trajectory is curved, one must use many Euler discretization steps or novel strategies, such as exponential integrators, to achieve a satisfactory generation quality. In contrast, RF has been shown to theoretically straighten the trajectory through successive rectifications, reducing the number of function evaluations (NFEs) while sampling. It has also been shown empirically that RF may improve the straightness in two rectifications if one can solve the underlying optimization problem within a sufficiently small error. In this paper, we make two key theoretical contributions: 1) we provide the first theoretical analysis of the Wasserstein distance between the sampling distribution of RF and the target distribution. Our error rate is characterized by the number of discretization steps and a new formulation of straightness stronger than that in the original work. 2) In line with the previous empirical findings, we show that, for a rectified flow from a Gaussian to a mixture of two Gaussians, two rectifications are sufficient to achieve a straight flow. Additionally, we also present empirical results on both simulated and real datasets to validate our theoretical findings.
Title: A Fast AI Surrogate for Coastal Ocean Circulation Models
Authors: Zelin Xu, Jie Ren, Yupu Zhang, Jose Maria Gonzalez Ondina, Maitane Olabarrieta, Tingsong Xiao, Wenchong He, Zibo Liu, Shigang Chen, Kaleb Smith, Zhe Jiang
Copy Paste: [[2410.14952]] A Fast AI Surrogate for Coastal Ocean Circulation Models(https://arxiv.org/abs/2410.14952)
Keywords: transformer
Abstract: Nearly 900 million people live in low-lying coastal zones around the world and bear the brunt of impacts from more frequent and severe hurricanes and storm surges. Oceanographers simulate ocean current circulation along the coasts to develop early warning systems that save lives and prevent loss and damage to property from coastal hazards. Traditionally, such simulations are conducted using coastal ocean circulation models such as the Regional Ocean Modeling System (ROMS), which usually runs on an HPC cluster with multiple CPU cores. However, the process is time-consuming and energy expensive. While coarse-grained ROMS simulations offer faster alternatives, they sacrifice detail and accuracy, particularly in complex coastal environments. Recent advances in deep learning and GPU architecture have enabled the development of faster AI (neural network) surrogates. This paper introduces an AI surrogate based on a 4D Swin Transformer to simulate coastal tidal wave propagation in an estuary for both hindcast and forecast (up to 12 days). Our approach not only accelerates simulations but also incorporates a physics-based constraint to detect and correct inaccurate results, ensuring reliability while minimizing manual intervention. We develop a fully GPU-accelerated workflow, optimizing the model training and inference pipeline on NVIDIA DGX-2 A100 GPUs. Our experiments demonstrate that our AI surrogate reduces the time cost of 12-day forecasting of traditional ROMS simulations from 9,908 seconds (on 512 CPU cores) to 22 seconds (on one A100 GPU), achieving over 450$\times$ speedup while maintaining high-quality simulation results. This work contributes to oceanographic modeling by offering a fast, accurate, and physically consistent alternative to traditional simulation models, particularly for real-time forecasting in rapid disaster response.
Title: Dual-Technique Privacy & Security Analysis for E-Commerce Websites Through Automated and Manual Implementation
Copy Paste: [[2410.14960]] Dual-Technique Privacy & Security Analysis for E-Commerce Websites Through Automated and Manual Implementation(https://arxiv.org/abs/2410.14960)
Keywords: secure, security, privacy
Abstract: As e-commerce continues to expand, the urgency for stronger privacy and security measures becomes increasingly critical, particularly on platforms frequented by younger users who are often less aware of potential risks. In our analysis of 90 US-based e-commerce websites, we employed a dual-technique approach, combining automated tools with manual evaluations. Tools like CookieServe and PrivacyCheck revealed that 38.5% of the websites deployed over 50 cookies per session, many of which were categorized as unnecessary or unclear in function, posing significant risks to users' Personally Identifiable Information (PII). Our manual assessment further uncovered critical gaps in standard security practices, including the absence of mandatory multi-factor authentication (MFA) and breach notification protocols. Additionally, we observed inadequate input validation, which compromises the integrity of user data and transactions. Based on these findings, we recommend targeted improvements to privacy policies, enhanced transparency in cookie usage, and the implementation of stronger authentication protocols. These measures are essential for ensuring compliance with CCPA and COPPA, thereby fostering more secure online environments, particularly for younger users.
Title: LangGFM: A Large Language Model Alone Can be a Powerful Graph Foundation Model
Copy Paste: [[2410.14961]] LangGFM: A Large Language Model Alone Can be a Powerful Graph Foundation Model(https://arxiv.org/abs/2410.14961)
Keywords: large language model
Abstract: Graph foundation models (GFMs) have recently gained significant attention. However, the unique data processing and evaluation setups employed by different studies hinder a deeper understanding of their progress. Additionally, current research tends to focus on specific subsets of graph learning tasks, such as structural tasks, node-level tasks, or classification tasks. As a result, they often incorporate specialized modules tailored to particular task types, losing their applicability to other graph learning tasks and contradicting the original intent of foundation models to be universal. Therefore, to enhance consistency, coverage, and diversity across domains, tasks, and research interests within the graph learning community in the evaluation of GFMs, we propose GFMBench-a systematic and comprehensive benchmark comprising 26 datasets. Moreover, we introduce LangGFM, a novel GFM that relies entirely on large language models. By revisiting and exploring the effective graph textualization principles, as well as repurposing successful techniques from graph augmentation and graph self-supervised learning within the language space, LangGFM achieves performance on par with or exceeding the state of the art across GFMBench, which can offer us new perspectives, experiences, and baselines to drive forward the evolution of GFMs.
Title: Deep Learning for Weather Forecasting: A CNN-LSTM Hybrid Model for Predicting Historical Temperature Data
Authors: Yuhao Gong, Yuchen Zhang, Fei Wang, Chi-Han Lee
Copy Paste: [[2410.14963]] Deep Learning for Weather Forecasting: A CNN-LSTM Hybrid Model for Predicting Historical Temperature Data(https://arxiv.org/abs/2410.14963)
Keywords: protect, extraction
Abstract: As global climate change intensifies, accurate weather forecasting has become increasingly important, affecting agriculture, energy management, environmental protection, and daily life. This study introduces a hybrid model combining Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks to predict historical temperature data. CNNs are utilized for spatial feature extraction, while LSTMs handle temporal dependencies, resulting in significantly improved prediction accuracy and stability. By using Mean Absolute Error (MAE) as the loss function, the model demonstrates excellent performance in processing complex meteorological data, addressing challenges such as missing data and high-dimensionality. The results show a strong alignment between the prediction curve and test data, validating the model's potential in climate prediction. This study offers valuable insights for fields such as agriculture, energy management, and urban planning, and lays the groundwork for future applications in weather forecasting under the context of global climate change.
Abstract: Automated fact verification plays an essential role in fostering trust in the digital space. Despite the growing interest, the verification of temporal facts has not received much attention in the community. Temporal fact verification brings new challenges where cues of the temporal information need to be extracted and temporal reasoning involving various temporal aspects of the text must be applied. In this work, we propose an end-to-end solution for temporal fact verification that considers the temporal information in claims to obtain relevant evidence sentences and harness the power of large language model for temporal reasoning. Recognizing that temporal facts often involve events, we model these events in the claim and evidence sentences. We curate two temporal fact datasets to learn time-sensitive representations that encapsulate not only the semantic relationships among the events, but also their chronological proximity. This allows us to retrieve the top-k relevant evidence sentences and provide the context for a large language model to perform temporal reasoning and outputs whether a claim is supported or refuted by the retrieved evidence sentences. Experiment results demonstrate that the proposed approach significantly enhances the accuracy of temporal claim verification, thereby advancing current state-of-the-art in automated fact verification.
Title: Attack as Defense: Run-time Backdoor Implantation for Image Content Protection
Abstract: As generative models achieve great success, tampering and modifying the sensitive image contents (i.e., human faces, artist signatures, commercial logos, etc.) have induced a significant threat with social impact. The backdoor attack is a method that implants vulnerabilities in a target model, which can be activated through a trigger. In this work, we innovatively prevent the abuse of image content modification by implanting the backdoor into image-editing models. Once the protected sensitive content on an image is modified by an editing model, the backdoor will be triggered, making the editing fail. Unlike traditional backdoor attacks that use data poisoning, to enable protection on individual images and eliminate the need for model training, we developed the first framework for run-time backdoor implantation, which is both time- and resource- efficient. We generate imperceptible perturbations on the images to inject the backdoor and define the protected area as the only backdoor trigger. Editing other unprotected insensitive areas will not trigger the backdoor, which minimizes the negative impact on legal image modifications. Evaluations with state-of-the-art image editing models show that our protective method can increase the CLIP-FID of generated images from 12.72 to 39.91, or reduce the SSIM from 0.503 to 0.167 when subjected to malicious editing. At the same time, our method exhibits minimal impact on benign editing, which demonstrates the efficacy of our proposed framework. The proposed run-time backdoor can also achieve effective protection on the latest diffusion models. Code are available.
Title: Visual Navigation of Digital Libraries: Retrieval and Classification of Images in the National Library of Norway's Digitised Book Collection
Authors: Marie Roald, Magnus Breder Birkenes, Lars Gunnarsønn Bagøien Johnsen
Copy Paste: [[2410.14969]] Visual Navigation of Digital Libraries: Retrieval and Classification of Images in the National Library of Norway's Digitised Book Collection(https://arxiv.org/abs/2410.14969)
Keywords: transformer
Abstract: Digital tools for text analysis have long been essential for the searchability and accessibility of digitised library collections. Recent computer vision advances have introduced similar capabilities for visual materials, with deep learning-based embeddings showing promise for analysing visual heritage. Given that many books feature visuals in addition to text, taking advantage of these breakthroughs is critical to making library collections open and accessible. In this work, we present a proof-of-concept image search application for exploring images in the National Library of Norway's pre-1900 books, comparing Vision Transformer (ViT), Contrastive Language-Image Pre-training (CLIP), and Sigmoid loss for Language-Image Pre-training (SigLIP) embeddings for image retrieval and classification. Our results show that the application performs well for exact image retrieval, with SigLIP embeddings slightly outperforming CLIP and ViT in both retrieval and classification tasks. Additionally, SigLIP-based image classification can aid in cleaning image datasets from a digitisation pipeline.
Title: 3D Multi-Object Tracking Employing MS-GLMB Filter for Autonomous Driving
Authors: Linh Van Ma, Muhammad Ishfaq Hussain, Kin-Choong Yow, Moongu Jeon
Copy Paste: [[2410.14977]] 3D Multi-Object Tracking Employing MS-GLMB Filter for Autonomous Driving(https://arxiv.org/abs/2410.14977)
Keywords: robust
Abstract: The MS-GLMB filter offers a robust framework for tracking multiple objects through the use of multi-sensor data. Building on this, the MV-GLMB and MV-GLMB-AB filters enhance the MS-GLMB capabilities by employing cameras for 3D multi-sensor multi-object tracking, effectively addressing occlusions. However, both filters depend on overlapping fields of view from the cameras to combine complementary information. In this paper, we introduce an improved approach that integrates an additional sensor, such as LiDAR, into the MS-GLMB framework for 3D multi-object tracking. Specifically, we present a new LiDAR measurement model, along with a multi-camera and LiDAR multi-object measurement model. Our experimental results demonstrate a significant improvement in tracking performance compared to existing MS-GLMB-based methods. Importantly, our method eliminates the need for overlapping fields of view, broadening the applicability of the MS-GLMB filter. Our source code for nuScenes dataset is available at this https URL.
Title: Subversive Characters and Stereotyping Readers: Characterizing Queer Relationalities with Dialogue-Based Relation Extraction
Copy Paste: [[2410.14978]] Subversive Characters and Stereotyping Readers: Characterizing Queer Relationalities with Dialogue-Based Relation Extraction(https://arxiv.org/abs/2410.14978)
Keywords: extraction
Abstract: Television is often seen as a site for subcultural identification and subversive fantasy, including in queer cultures. How might we measure subversion, or the degree to which the depiction of social relationship between a dyad (e.g. two characters who are colleagues) deviates from its typical representation on TV? To explore this question, we introduce the task of stereotypic relationship extraction. Built on cognitive stylistics, linguistic anthropology, and dialogue relation extraction, in this paper, we attempt to model the cognitive process of stereotyping TV characters in dialogic interactions. Given a dyad, we want to predict: what social relationship do the speakers exhibit through their words? Subversion is then characterized by the discrepancy between the distribution of the model's predictions and the ground truth labels. To demonstrate the usefulness of this task and gesture at a methodological intervention, we enclose four case studies to characterize the representation of queer relationalities in the Big Bang Theory, Frasier, and Gilmore Girls, as we explore the suspicious and reparative modes of reading with our computational methods.
Title: D-SarcNet: A Dual-stream Deep Learning Framework for Automatic Analysis of Sarcomere Structures in Fluorescently Labeled hiPSC-CMs
Authors: Huyen Le, Khiet Dang, Nhung Nguyen, Mai Tran, Hieu Pham
Copy Paste: [[2410.14983]] D-SarcNet: A Dual-stream Deep Learning Framework for Automatic Analysis of Sarcomere Structures in Fluorescently Labeled hiPSC-CMs(https://arxiv.org/abs/2410.14983)
Keywords: extraction
Abstract: Human-induced pluripotent stem cell-derived cardiomyocytes (hiPSC-CMs) are a powerful tool in advancing cardiovascular research and clinical applications. The maturation of sarcomere organization in hiPSC-CMs is crucial, as it supports the contractile function and structural integrity of these cells. Traditional methods for assessing this maturation like manual annotation and feature extraction are labor-intensive, time-consuming, and unsuitable for high-throughput analysis. To address this, we propose D-SarcNet, a dual-stream deep learning framework that takes fluorescent hiPSC-CM single-cell images as input and outputs the stage of the sarcomere structural organization on a scale from 1.0 to 5.0. The framework also integrates Fast Fourier Transform (FFT), deep learning-generated local patterns, and gradient magnitude to capture detailed structural information at both global and local levels. Experiments on a publicly available dataset from the Allen Institute for Cell Science show that the proposed approach not only achieves a Spearman correlation of 0.868 marking a 3.7% improvement over the previous state-of-the-art but also significantly enhances other key performance metrics, including MSE, MAE, and R2 score. Beyond establishing a new state-of-the-art in sarcomere structure assessment from hiPSC-CM images, our ablation studies highlight the significance of integrating global and local information to enhance deep learning networks ability to discern and learn vital visual features of sarcomere structure.
Title: SeaS: Few-shot Industrial Anomaly Image Generation with Separation and Sharing Fine-tuning
Copy Paste: [[2410.14987]] SeaS: Few-shot Industrial Anomaly Image Generation with Separation and Sharing Fine-tuning(https://arxiv.org/abs/2410.14987)
Keywords: segmentation
Abstract: Current segmentation methods require many training images and precise masks, while insufficient anomaly images hinder their application in industrial scenarios. To address such an issue, we explore producing diverse anomalies and accurate pixel-wise annotations. By observing the real production lines, we find that anomalies vary randomly in shape and appearance, whereas products hold globally consistent patterns with slight local variations. Such a characteristic inspires us to develop a Separation and Sharing Fine-tuning (SeaS) approach using only a few abnormal and some normal images. Firstly, we propose the Unbalanced Abnormal (UA) Text Prompt tailored to industrial anomaly generation, consisting of one product token and several anomaly tokens. Then, for anomaly images, we propose a Decoupled Anomaly Alignment (DA) loss to bind the attributes of the anomalies to different anomaly tokens. Re-blending such attributes may produce never-seen anomalies, achieving a high diversity of anomalies. For normal images, we propose a Normal-image Alignment (NA) loss to learn the products' key features that are used to synthesize products with both global consistency and local variations. The two training processes are separated but conducted on a shared U-Net. Finally, SeaS produces high-fidelity annotations for the generated anomalies by fusing discriminative features of U-Net and high-resolution VAE features. Extensive evaluations on the challenging MVTec AD and MVTec 3D AD dataset demonstrate the effectiveness of our approach. For anomaly image generation, we achieve 1.88 on IS and 0.34 on IC-LPIPS on MVTec AD dataset, 1.95 on IS and 0.30 on IC-LPIPS on MVTec 3D AD dataset. For downstream task, using our generated anomaly image-mask pairs, three common segmentation methods achieve an average 11.17% improvement on IoU on MVTec AD dataset, and a 15.49% enhancement in IoU on MVTec 3D AD dataset.
Title: ChitroJera: A Regionally Relevant Visual Question Answering Dataset for Bangla
Copy Paste: [[2410.14991]] ChitroJera: A Regionally Relevant Visual Question Answering Dataset for Bangla(https://arxiv.org/abs/2410.14991)
Keywords: large language model
Abstract: Visual Question Answer (VQA) poses the problem of answering a natural language question about a visual context. Bangla, despite being a widely spoken language, is considered low-resource in the realm of VQA due to the lack of a proper benchmark dataset. The absence of such datasets challenges models that are known to be performant in other languages. Furthermore, existing Bangla VQA datasets offer little cultural relevance and are largely adapted from their foreign counterparts. To address these challenges, we introduce a large-scale Bangla VQA dataset titled ChitroJera, totaling over 15k samples where diverse and locally relevant data sources are used. We assess the performance of text encoders, image encoders, multimodal models, and our novel dual-encoder models. The experiments reveal that the pre-trained dual-encoders outperform other models of its scale. We also evaluate the performance of large language models (LLMs) using prompt-based techniques, with LLMs achieving the best performance. Given the underdeveloped state of existing datasets, we envision ChitroJera expanding the scope of Vision-Language tasks in Bangla.
Title: A comparative study of NeuralODE and Universal ODE approaches to solving Chandrasekhar White Dwarf equation
Copy Paste: [[2410.14998]] A comparative study of NeuralODE and Universal ODE approaches to solving Chandrasekhar White Dwarf equation(https://arxiv.org/abs/2410.14998)
Keywords: robust
Abstract: In this study, we apply two pillars of Scientific Machine Learning: Neural Ordinary Differential Equations (Neural ODEs) and Universal Differential Equations (UDEs) to the Chandrasekhar White Dwarf Equation (CWDE). The CWDE is fundamental for understanding the life cycle of a star, and describes the relationship between the density of the white dwarf and its distance from the center. Despite the rise in Scientific Machine Learning frameworks, very less attention has been paid to the systematic applications of the above SciML pillars on astronomy based ODEs. Through robust modeling in the Julia programming language, we show that both Neural ODEs and UDEs can be used effectively for both prediction as well as forecasting of the CWDE. More importantly, we introduce the forecasting breakdown point - the time at which forecasting fails for both Neural ODEs and UDEs. Through a robust hyperparameter optimization testing, we provide insights on the neural network architecture, activation functions and optimizers which provide the best results. This study provides opens a door to investigate the applicability of Scientific Machine Learning frameworks in forecasting tasks for a wide range of scientific domains.
Title: How Many Van Goghs Does It Take to Van Gogh? Finding the Imitation Threshold
Copy Paste: [[2410.15002]] How Many Van Goghs Does It Take to Van Gogh? Finding the Imitation Threshold(https://arxiv.org/abs/2410.15002)
Keywords: privacy
Abstract: Text-to-image models are trained using large datasets collected by scraping image-text pairs from the internet. These datasets often include private, copyrighted, and licensed material. Training models on such datasets enables them to generate images with such content, which might violate copyright laws and individual privacy. This phenomenon is termed imitation -- generation of images with content that has recognizable similarity to its training images. In this work we study the relationship between a concept's frequency in the training dataset and the ability of a model to imitate it. We seek to determine the point at which a model was trained on enough instances to imitate a concept -- the imitation threshold. We posit this question as a new problem: Finding the Imitation Threshold (FIT) and propose an efficient approach that estimates the imitation threshold without incurring the colossal cost of training multiple models from scratch. We experiment with two domains -- human faces and art styles -- for which we create four datasets, and evaluate three text-to-image models which were trained on two pretraining datasets. Our results reveal that the imitation threshold of these models is in the range of 200-600 images, depending on the domain and the model. The imitation threshold can provide an empirical basis for copyright violation claims and acts as a guiding principle for text-to-image model developers that aim to comply with copyright and privacy laws. We release the code and data at \url{this https URL} and the project's website is hosted at \url{this https URL}.
Title: CAP: Data Contamination Detection via Consistency Amplification
Copy Paste: [[2410.15005]] CAP: Data Contamination Detection via Consistency Amplification(https://arxiv.org/abs/2410.15005)
Keywords: large language model
Abstract: Large language models (LLMs) are widely used, but concerns about data contamination challenge the reliability of LLM evaluations. Existing contamination detection methods are often task-specific or require extra prerequisites, limiting practicality. We propose a novel framework, Consistency Amplification-based Data Contamination Detection (CAP), which introduces the Performance Consistency Ratio (PCR) to measure dataset leakage by leveraging LM consistency. To the best of our knowledge, this is the first method to explicitly differentiate between fine-tuning and contamination, which is crucial for detecting contamination in domain-specific models. Additionally, CAP is applicable to various benchmarks and works for both white-box and black-box models. We validate CAP's effectiveness through experiments on seven LLMs and four domain-specific benchmarks. Our findings also show that composite benchmarks from various dataset sources are particularly prone to unintentional contamination. Codes will be publicly available soon.
Title: DiffuseST: Unleashing the Capability of the Diffusion Model for Style Transfer
Copy Paste: [[2410.15007]] DiffuseST: Unleashing the Capability of the Diffusion Model for Style Transfer(https://arxiv.org/abs/2410.15007)
Keywords: robust, diffusion
Abstract: Style transfer aims to fuse the artistic representation of a style image with the structural information of a content image. Existing methods train specific networks or utilize pre-trained models to learn content and style features. However, they rely solely on textual or spatial representations that are inadequate to achieve the balance between content and style. In this work, we propose a novel and training-free approach for style transfer, combining textual embedding with spatial features and separating the injection of content or style. Specifically, we adopt the BLIP-2 encoder to extract the textual representation of the style image. We utilize the DDIM inversion technique to extract intermediate embeddings in content and style branches as spatial features. Finally, we harness the step-by-step property of diffusion models by separating the injection of content and style in the target branch, which improves the balance between content preservation and style fusion. Various experiments have demonstrated the effectiveness and robustness of our proposed DiffeseST for achieving balanced and controllable style transfer results, as well as the potential to extend to other tasks.
Title: FlexMol: A Flexible Toolkit for Benchmarking Molecular Relational Learning
Authors: Sizhe Liu, Jun Xia, Lecheng Zhang, Yuchen Liu, Yue Liu, Wenjie Du, Zhangyang Gao, Bozhen Hu, Cheng Tan, Hongxin Xiang, Stan Z. Li
Copy Paste: [[2410.15010]] FlexMol: A Flexible Toolkit for Benchmarking Molecular Relational Learning(https://arxiv.org/abs/2410.15010)
Keywords: robust, fair
Abstract: Molecular relational learning (MRL) is crucial for understanding the interaction behaviors between molecular pairs, a critical aspect of drug discovery and development. However, the large feasible model space of MRL poses significant challenges to benchmarking, and existing MRL frameworks face limitations in flexibility and scope. To address these challenges, avoid repetitive coding efforts, and ensure fair comparison of models, we introduce FlexMol, a comprehensive toolkit designed to facilitate the construction and evaluation of diverse model architectures across various datasets and performance metrics. FlexMol offers a robust suite of preset model components, including 16 drug encoders, 13 protein sequence encoders, 9 protein structure encoders, and 7 interaction layers. With its easy-to-use API and flexibility, FlexMol supports the dynamic construction of over 70, 000 distinct combinations of model architectures. Additionally, we provide detailed benchmark results and code examples to demonstrate FlexMol's effectiveness in simplifying and standardizing MRL model development and comparison.
Title: DST-TransitNet: A Dynamic Spatio-Temporal Deep Learning Model for Scalable and Efficient Network-Wide Prediction of Station-Level Transit Ridership
Copy Paste: [[2410.15013]] DST-TransitNet: A Dynamic Spatio-Temporal Deep Learning Model for Scalable and Efficient Network-Wide Prediction of Station-Level Transit Ridership(https://arxiv.org/abs/2410.15013)
Keywords: robust, extraction, interpretability
Abstract: Accurate prediction of public transit ridership is vital for efficient planning and management of transit in rapidly growing urban areas in Canada. Unexpected increases in passengers can cause overcrowded vehicles, longer boarding times, and service disruptions. Traditional time series models like ARIMA and SARIMA face limitations, particularly in short-term predictions and integration of spatial and temporal features. These models struggle with the dynamic nature of ridership patterns and often ignore spatial correlations between nearby stops. Deep Learning (DL) models present a promising alternative, demonstrating superior performance in short-term prediction tasks by effectively capturing both spatial and temporal features. However, challenges such as dynamic spatial feature extraction, balancing accuracy with computational efficiency, and ensuring scalability remain. This paper introduces DST-TransitNet, a hybrid DL model for system-wide station-level ridership prediction. This proposed model uses graph neural networks (GNN) and recurrent neural networks (RNN) to dynamically integrate the changing temporal and spatial correlations within the stations. The model also employs a precise time series decomposition framework to enhance accuracy and interpretability. Tested on Bogota's BRT system data, with three distinct social scenarios, DST-TransitNet outperformed state-of-the-art models in precision, efficiency and robustness. Meanwhile, it maintains stability over long prediction intervals, demonstrating practical applicability.
Abstract: The purpose of RGB-D Salient Object Detection (SOD) is to pinpoint the most visually conspicuous areas within images accurately. While conventional deep models heavily rely on CNN extractors and overlook the long-range contextual dependencies, subsequent transformer-based models have addressed the issue to some extent but introduce high computational complexity. Moreover, incorporating spatial information from depth maps has been proven effective for this task. A primary challenge of this issue is how to fuse the complementary information from RGB and depth effectively. In this paper, we propose a dual Mamba-driven cross-modal fusion network for RGB-D SOD, named MambaSOD. Specifically, we first employ a dual Mamba-driven feature extractor for both RGB and depth to model the long-range dependencies in multiple modality inputs with linear complexity. Then, we design a cross-modal fusion Mamba for the captured multi-modal features to fully utilize the complementary information between the RGB and depth features. To the best of our knowledge, this work is the first attempt to explore the potential of the Mamba in the RGB-D SOD task, offering a novel perspective. Numerous experiments conducted on six prevailing datasets demonstrate our method's superiority over sixteen state-of-the-art RGB-D SOD models. The source code will be released at this https URL.
Title: Transit Pulse: Utilizing Social Media as a Source for Customer Feedback and Information Extraction with Large Language Model
Copy Paste: [[2410.15016]] Transit Pulse: Utilizing Social Media as a Source for Customer Feedback and Information Extraction with Large Language Model(https://arxiv.org/abs/2410.15016)
Keywords: extraction, large language model
Abstract: Users of the transit system flood social networks daily with messages that contain valuable insights crucial for improving service quality. These posts help transit agencies quickly identify emerging issues. Parsing topics and sentiments is key to gaining comprehensive insights to foster service excellence. However, the volume of messages makes manual analysis impractical, and standard NLP techniques like Term Frequency-Inverse Document Frequency (TF-IDF) fall short in nuanced interpretation. Traditional sentiment analysis separates topics and sentiments before integrating them, often missing the interaction between them. This incremental approach complicates classification and reduces analytical productivity. To address these challenges, we propose a novel approach to extracting and analyzing transit-related information, including sentiment and sarcasm detection, identification of unusual system problems, and location data from social media. Our method employs Large Language Models (LLM), specifically Llama 3, for a streamlined analysis free from pre-established topic labels. To enhance the model's domain-specific knowledge, we utilize Retrieval-Augmented Generation (RAG), integrating external knowledge sources into the information extraction pipeline. We validated our method through extensive experiments comparing its performance with traditional NLP approaches on user tweet data from the real world transit system. Our results demonstrate the potential of LLMs to transform social media data analysis in the public transit domain, providing actionable insights and enhancing transit agencies' responsiveness by extracting a broader range of information.
Title: A Survey of Ontology Expansion for Conversational Understanding
Copy Paste: [[2410.15019]] A Survey of Ontology Expansion for Conversational Understanding(https://arxiv.org/abs/2410.15019)
Keywords: robust
Abstract: In the rapidly evolving field of conversational AI, Ontology Expansion (OnExp) is crucial for enhancing the adaptability and robustness of conversational agents. Traditional models rely on static, predefined ontologies, limiting their ability to handle new and unforeseen user needs. This survey paper provides a comprehensive review of the state-of-the-art techniques in OnExp for conversational understanding. It categorizes the existing literature into three main areas: (1) New Intent Discovery, (2) New Slot-Value Discovery, and (3) Joint OnExp. By examining the methodologies, benchmarks, and challenges associated with these areas, we highlight several emerging frontiers in OnExp to improve agent performance in real-world scenarios and discuss their corresponding challenges. This survey aspires to be a foundational reference for researchers and practitioners, promoting further exploration and innovation in this crucial domain.
Title: Group Diffusion Transformers are Unsupervised Multitask Learners
Copy Paste: [[2410.15027]] Group Diffusion Transformers are Unsupervised Multitask Learners(https://arxiv.org/abs/2410.15027)
Keywords: diffusion, transformer, large language model
Abstract: While large language models (LLMs) have revolutionized natural language processing with their task-agnostic capabilities, visual generation tasks such as image translation, style transfer, and character customization still rely heavily on supervised, task-specific datasets. In this work, we introduce Group Diffusion Transformers (GDTs), a novel framework that unifies diverse visual generation tasks by redefining them as a group generation problem. In this approach, a set of related images is generated simultaneously, optionally conditioned on a subset of the group. GDTs build upon diffusion transformers with minimal architectural modifications by concatenating self-attention tokens across images. This allows the model to implicitly capture cross-image relationships (e.g., identities, styles, layouts, surroundings, and color schemes) through caption-based correlations. Our design enables scalable, unsupervised, and task-agnostic pretraining using extensive collections of image groups sourced from multimodal internet articles, image galleries, and video frames. We evaluate GDTs on a comprehensive benchmark featuring over 200 instructions across 30 distinct visual generation tasks, including picture book creation, font design, style transfer, sketching, colorization, drawing sequence generation, and character customization. Our models achieve competitive zero-shot performance without any additional fine-tuning or gradient updates. Furthermore, ablation studies confirm the effectiveness of key components such as data scaling, group size, and model design. These results demonstrate the potential of GDTs as scalable, general-purpose visual generation systems.
Title: A Novel Reinforcement Learning Model for Post-Incident Malware Investigations
Copy Paste: [[2410.15028]] A Novel Reinforcement Learning Model for Post-Incident Malware Investigations(https://arxiv.org/abs/2410.15028)
Keywords: extraction
Abstract: This Research proposes a Novel Reinforcement Learning (RL) model to optimise malware forensics investigation during cyber incident response. It aims to improve forensic investigation efficiency by reducing false negatives and adapting current practices to evolving malware signatures. The proposed RL framework leverages techniques such as Q-learning and the Markov Decision Process (MDP) to train the system to identify malware patterns in live memory dumps, thereby automating forensic tasks. The RL model is based on a detailed malware workflow diagram that guides the analysis of malware artefacts using static and behavioural techniques as well as machine learning algorithms. Furthermore, it seeks to address challenges in the UK justice system by ensuring the accuracy of forensic evidence. We conduct testing and evaluation in controlled environments, using datasets created with Windows operating systems to simulate malware infections. The experimental results demonstrate that RL improves malware detection rates compared to conventional methods, with the RL model's performance varying depending on the complexity and learning rate of the environment. The study concludes that while RL offers promising potential for automating malware forensics, its efficacy across diverse malware types requires ongoing refinement of reward systems and feature extraction methods.
Title: Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention
Authors: Yuzhe Weng, Haotian Wang, Tian Gao, Kewei Li, Shutong Niu, Jun Du
Copy Paste: [[2410.15029]] Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention(https://arxiv.org/abs/2410.15029)
Keywords: robust
Abstract: In multimodal sentiment analysis, collecting text data is often more challenging than video or audio due to higher annotation costs and inconsistent automatic speech recognition (ASR) quality. To address this challenge, our study has developed a robust model that effectively integrates multimodal sentiment information, even in the absence of text modality. Specifically, we have developed a Double-Flow Self-Distillation Framework, including Unified Modality Cross-Attention (UMCA) and Modality Imagination Autoencoder (MIA), which excels at processing both scenarios with complete modalities and those with missing text modality. In detail, when the text modality is missing, our framework uses the LLM-based model to simulate the text representation from the audio modality, while the MIA module supplements information from the other two modalities to make the simulated text representation similar to the real text representation. To further align the simulated and real representations, and to enable the model to capture the continuous nature of sample orders in sentiment valence regression tasks, we have also introduced the Rank-N Contrast (RNC) loss function. When testing on the CMU-MOSEI, our model achieved outstanding performance on MAE and significantly outperformed other models when text modality is missing. The code is available at: this https URL
Title: Improving General Text Embedding Model: Tackling Task Conflict and Data Imbalance through Model Merging
Copy Paste: [[2410.15035]] Improving General Text Embedding Model: Tackling Task Conflict and Data Imbalance through Model Merging(https://arxiv.org/abs/2410.15035)
Keywords: robust
Abstract: Text embeddings are vital for tasks such as text retrieval and semantic textual similarity (STS). Recently, the advent of pretrained language models, along with unified benchmarks like the Massive Text Embedding Benchmark (MTEB), has facilitated the development of versatile general-purpose text embedding models. Advanced embedding models are typically developed using large-scale multi-task data and joint training across multiple tasks. However, our experimental analysis reveals two significant drawbacks of joint training: 1) Task Conflict: Gradients from different tasks interfere with each other, leading to negative transfer. 2) Data Imbalance: Disproportionate data distribution introduces biases that negatively impact performance across tasks. To overcome these challenges, we explore model merging-a technique that combines independently trained models to mitigate gradient conflicts and balance data distribution. We introduce a novel method, Self Positioning, which efficiently searches for optimal model combinations within the interpolation space of task vectors using stochastic gradient descent. Our experiments demonstrate that Self Positioning significantly enhances multi-task performance on the MTEB dataset, achieving an absolute improvement of 0.7 points. It outperforms traditional resampling methods while reducing computational costs. This work offers a robust approach to building generalized text embedding models with superior performance across diverse embedding-related tasks.
Title: mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation
Copy Paste: [[2410.15037]] mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation(https://arxiv.org/abs/2410.15037)
Keywords: large language model
Abstract: Recent advancements in large language models (LLMs) have significantly enhanced code generation from natural language prompts. The HumanEval Benchmark, developed by OpenAI, remains the most widely used code generation benchmark. However, this and other Code LLM benchmarks face critical limitations, particularly in task diversity, test coverage, and linguistic scope. Current evaluations primarily focus on English-to-Python conversion tasks with limited test cases, potentially overestimating model performance. While recent works have addressed test coverage and programming language (PL) diversity, code generation from low-resource language prompts remains largely unexplored. To address this gap, we introduce mHumanEval, an extended benchmark supporting prompts in over 200 natural languages. We employ established machine translation methods to compile the benchmark, coupled with a quality assurance process. Furthermore, we provide expert human translations for 15 diverse natural languages (NLs). We conclude by analyzing the multilingual code generation capabilities of state-of-the-art (SOTA) Code LLMs, offering insights into the current landscape of cross-lingual code generation.
Title: A General-Purpose Multimodal Foundation Model for Dermatology
Authors: Siyuan Yan, Zhen Yu, Clare Primiero, Cristina Vico-Alonso, Zhonghua Wang, Litao Yang, Philipp Tschandl, Ming Hu, Gin Tan, Vincent Tang, Aik Beng Ng, David Powell, Paul Bonnington, Simon See, Monika Janda, Victoria Mar, Harald Kittler, H. Peter Soyer, Zongyuan Ge
Copy Paste: [[2410.15038]] A General-Purpose Multimodal Foundation Model for Dermatology(https://arxiv.org/abs/2410.15038)
Keywords: robust, segmentation
Abstract: Diagnosing and treating skin diseases require advanced visual skills across multiple domains and the ability to synthesize information from various imaging modalities. Current deep learning models, while effective at specific tasks such as diagnosing skin cancer from dermoscopic images, fall short in addressing the complex, multimodal demands of clinical practice. Here, we introduce PanDerm, a multimodal dermatology foundation model pretrained through self-supervised learning on a dataset of over 2 million real-world images of skin diseases, sourced from 11 clinical institutions across 4 imaging modalities. We evaluated PanDerm on 28 diverse datasets covering a range of clinical tasks, including skin cancer screening, phenotype assessment and risk stratification, diagnosis of neoplastic and inflammatory skin diseases, skin lesion segmentation, change monitoring, and metastasis prediction and prognosis. PanDerm achieved state-of-the-art performance across all evaluated tasks, often outperforming existing models even when using only 5-10% of labeled data. PanDerm's clinical utility was demonstrated through reader studies in real-world clinical settings across multiple imaging modalities. It outperformed clinicians by 10.2% in early-stage melanoma detection accuracy and enhanced clinicians' multiclass skin cancer diagnostic accuracy by 11% in a collaborative human-AI setting. Additionally, PanDerm demonstrated robust performance across diverse demographic factors, including different body locations, age groups, genders, and skin tones. The strong results in benchmark evaluations and real-world clinical scenarios suggest that PanDerm could enhance the management of skin diseases and serve as a model for developing multimodal foundation models in other medical specialties, potentially accelerating the integration of AI support in healthcare.
Title: Adversarial Training: A Survey
Authors: Mengnan Zhao, Lihe Zhang, Jingwen Ye, Huchuan Lu, Baocai Yin, Xinchao Wang
Copy Paste: [[2410.15042]] Adversarial Training: A Survey(https://arxiv.org/abs/2410.15042)
Keywords: attack, robust
Abstract: Adversarial training (AT) refers to integrating adversarial examples -- inputs altered with imperceptible perturbations that can significantly impact model predictions -- into the training process. Recent studies have demonstrated the effectiveness of AT in improving the robustness of deep neural networks against diverse adversarial attacks. However, a comprehensive overview of these developments is still missing. This survey addresses this gap by reviewing a broad range of recent and representative studies. Specifically, we first describe the implementation procedures and practical applications of AT, followed by a comprehensive review of AT techniques from three perspectives: data enhancement, network design, and training configurations. Lastly, we discuss common challenges in AT and propose several promising directions for future research.
Title: Are LLMs Good Zero-Shot Fallacy Classifiers?
Authors: Fengjun Pan, Xiaobao Wu, Zongrui Li, Anh Tuan Luu
Copy Paste: [[2410.15050]] Are LLMs Good Zero-Shot Fallacy Classifiers?(https://arxiv.org/abs/2410.15050)
Keywords: extraction, large language model
Abstract: Fallacies are defective arguments with faulty reasoning. Detecting and classifying them is a crucial NLP task to prevent misinformation, manipulative claims, and biased decisions. However, existing fallacy classifiers are limited by the requirement for sufficient labeled data for training, which hinders their out-of-distribution (OOD) generalization abilities. In this paper, we focus on leveraging Large Language Models (LLMs) for zero-shot fallacy classification. To elicit fallacy-related knowledge and reasoning abilities of LLMs, we propose diverse single-round and multi-round prompting schemes, applying different task-specific instructions such as extraction, summarization, and Chain-of-Thought reasoning. With comprehensive experiments on benchmark datasets, we suggest that LLMs could be potential zero-shot fallacy classifiers. In general, LLMs under single-round prompting schemes have achieved acceptable zero-shot performances compared to the best full-shot baselines and can outperform them in all OOD inference scenarios and some open-domain tasks. Our novel multi-round prompting schemes can effectively bring about more improvements, especially for small LLMs. Our analysis further underlines the future research on zero-shot fallacy classification. Codes and data are available at: this https URL.
Title: Weakly-supervised diagnosis identification from Italian discharge letters
Authors: Vittorio Torri, Elisa Barbieri, Anna Cantarutti, Carlo Giaquinto, Francesca Ieva
Copy Paste: [[2410.15051]] Weakly-supervised diagnosis identification from Italian discharge letters(https://arxiv.org/abs/2410.15051)
Keywords: robust
Abstract: Objective: Recognizing diseases from discharge letters is crucial for cohort selection and epidemiological analyses, as this is the only type of data consistently produced across hospitals. This is a classic document classification problem, typically requiring supervised learning. However, manual annotation of large datasets of discharge letters is uncommon since it is extremely time-consuming. We propose a novel weakly-supervised pipeline to recognize diseases from Italian discharge letters. Methods: Our Natural Language Processing pipeline is based on a fine-tuned version of the Italian Umberto model. The pipeline extracts diagnosis-related sentences from a subset of letters and applies a two-level clustering using the embeddings generated by the fine-tuned Umberto model. These clusters are summarized and those mapped to the diseases of interest are selected as weak labels. Finally, the same BERT-based model is trained using these weak labels to detect the targeted diseases. Results: A case study related to the identification of bronchiolitis with 33'176 Italian discharge letters from 44 hospitals in the Veneto Region shows the potential of our method, with an AUC of 77.7 % and an F1-Score of 75.1 % on manually annotated labels, improving compared to other non-supervised methods and with a limited loss compared to fully supervised methods. Results are robust to the cluster selection and the identified clusters highlight the potential to recognize a variety of diseases. Conclusions: This study demonstrates the feasibility of diagnosis identification from Italian discharge letters in the absence of labelled data. Our pipeline showed strong performance and robustness, and its flexibility allows for easy adaptation to various diseases. This approach offers a scalable solution for clinical text classification, reducing the need for manual annotation while maintaining good accuracy.
Title: BYOCL: Build Your Own Consistent Latent with Hierarchical Representative Latent Clustering
Copy Paste: [[2410.15060]] BYOCL: Build Your Own Consistent Latent with Hierarchical Representative Latent Clustering(https://arxiv.org/abs/2410.15060)
Keywords: extraction, segmentation
Abstract: To address the semantic inconsistency issue with SAM or other single-image segmentation models handling image sequences, we introduce BYOCL. This novel model outperforms SAM in extensive experiments, showcasing its Hierarchical prototype capabilities across CLIP and other representations. BYOCL significantly reduces time and space consumption by dividing inputs into smaller batches, achieving exponential time reduction compared to previous methods. Our approach leverages the SAM image encoder for feature extraction, followed by Intra-Batch and Inter-Batch clustering algorithms. Extensive experiments demonstrate that BYOCL far exceeds the previous state-of-the-art single image segmentation model. Our work is the first to apply consistent segmentation using foundation models without requiring training, utilizing plug-and-play modules for any latent space, making our method highly efficientModels are available at \href{this https URL
Title: A Survey on All-in-One Image Restoration: Taxonomy, Evaluation and Future Trends
Authors: Junjun Jiang, Zengyuan Zuo, Gang Wu, Kui Jiang, Xianming Liu
Copy Paste: [[2410.15067]] A Survey on All-in-One Image Restoration: Taxonomy, Evaluation and Future Trends(https://arxiv.org/abs/2410.15067)
Keywords: robust
Abstract: Image restoration (IR) refers to the process of improving visual quality of images while removing degradation, such as noise, blur, weather effects, and so on. Traditional IR methods typically target specific types of degradation, which limits their effectiveness in real-world scenarios with complex distortions. In response to this challenge, the all-in-one image restoration (AiOIR) paradigm has emerged, offering a unified framework that adeptly addresses multiple degradation types. These innovative models enhance both convenience and versatility by adaptively learning degradation-specific features while simultaneously leveraging shared knowledge across diverse corruptions. In this review, we delve into the AiOIR methodologies, emphasizing their architecture innovations and learning paradigm and offering a systematic review of prevalent approaches. We systematically categorize prevalent approaches and critically assess the challenges these models encounter, proposing future research directions to advance this dynamic field. Our paper begins with an introduction to the foundational concepts of AiOIR models, followed by a categorization of cutting-edge designs based on factors such as prior knowledge and generalization capability. Next, we highlight key advancements in AiOIR, aiming to inspire further inquiry and innovation within the community. To facilitate a robust evaluation of existing methods, we collate and summarize commonly used datasets, implementation details, and evaluation metrics. Additionally, we present an objective comparison of open-sourced methods, providing valuable insights for researchers and practitioners alike. This paper stands as the first comprehensive and insightful review of AiOIR. A related repository is available at this https URL.
Title: Personalized Federated Learning with Adaptive Feature Aggregation and Knowledge Transfer
Copy Paste: [[2410.15073]] Personalized Federated Learning with Adaptive Feature Aggregation and Knowledge Transfer(https://arxiv.org/abs/2410.15073)
Keywords: privacy, federate
Abstract: Federated Learning(FL) is popular as a privacy-preserving machine learning paradigm for generating a single model on decentralized data. However, statistical heterogeneity poses a significant challenge for FL. As a subfield of FL, personalized FL (pFL) has attracted attention for its ability to achieve personalized models that perform well on non-independent and identically distributed (Non-IID) data. However, existing pFL methods are limited in terms of leveraging the global model's knowledge to enhance generalization while achieving personalization on local data. To address this, we proposed a new method personalized Federated learning with Adaptive Feature Aggregation and Knowledge Transfer (FedAFK), to train better feature extractors while balancing generalization and personalization for participating clients, which improves the performance of personalized models on Non-IID data. We conduct extensive experiments on three datasets in two widely-used heterogeneous settings and show the superior performance of our proposed method over thirteen state-of-the-art baselines.
Title: LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound
Authors: Xuechen Guo, Wenhao Chai, Shi-Yan Li, Gaoang Wang
Copy Paste: [[2410.15074]] LLaVA-Ultra: Large Chinese Language and Vision Assistant for Ultrasound(https://arxiv.org/abs/2410.15074)
Keywords: robust, generative, large language model
Abstract: Multimodal Large Language Model (MLLM) has recently garnered attention as a prominent research focus. By harnessing powerful LLM, it facilitates a transition of conversational generative AI from unimodal text to performing multimodal tasks. This boom begins to significantly impact medical field. However, general visual language model (VLM) lacks sophisticated comprehension for medical visual question answering (Med-VQA). Even models specifically tailored for medical domain tend to produce vague answers with weak visual relevance. In this paper, we propose a fine-grained adaptive VLM architecture for Chinese medical visual conversations through parameter-efficient tuning. Specifically, we devise a fusion module with fine-grained vision encoders to achieve enhancement for subtle medical visual semantics. Then we note data redundancy common to medical scenes is ignored in most prior works. In cases of a single text paired with multiple figures, we utilize weighted scoring with knowledge distillation to adaptively screen valid images mirroring text descriptions. For execution, we leverage a large-scale multimodal Chinese ultrasound dataset obtained from the hospital. We create instruction-following data based on text from professional doctors, which ensures effective tuning. With enhanced model and quality data, our Large Chinese Language and Vision Assistant for Ultrasound (LLaVA-Ultra) shows strong capability and robustness to medical scenarios. On three Med-VQA datasets, LLaVA-Ultra surpasses previous state-of-the-art models on various metrics.
Title: SLIC: Secure Learned Image Codec through Compressed Domain Watermarking to Defend Image Manipulation
Abstract: The digital image manipulation and advancements in Generative AI, such as Deepfake, has raised significant concerns regarding the authenticity of images shared on social media. Traditional image forensic techniques, while helpful, are often passive and insufficient against sophisticated tampering methods. This paper introduces the Secure Learned Image Codec (SLIC), a novel active approach to ensuring image authenticity through watermark embedding in the compressed domain. SLIC leverages neural network-based compression to embed watermarks as adversarial perturbations in the latent space, creating images that degrade in quality upon re-compression if tampered with. This degradation acts as a defense mechanism against unauthorized modifications. Our method involves fine-tuning a neural encoder/decoder to balance watermark invisibility with robustness, ensuring minimal quality loss for non-watermarked images. Experimental results demonstrate SLIC's effectiveness in generating visible artifacts in tampered images, thereby preventing their redistribution. This work represents a significant step toward developing secure image codecs that can be widely adopted to safeguard digital image integrity.
Title: Spatial-Mamba: Effective Visual State Space Models via Structure-Aware State Fusion
Copy Paste: [[2410.15091]] Spatial-Mamba: Effective Visual State Space Models via Structure-Aware State Fusion(https://arxiv.org/abs/2410.15091)
Keywords: segmentation
Abstract: Selective state space models (SSMs), such as Mamba, highly excel at capturing long-range dependencies in 1D sequential data, while their applications to 2D vision tasks still face challenges. Current visual SSMs often convert images into 1D sequences and employ various scanning patterns to incorporate local spatial dependencies. However, these methods are limited in effectively capturing the complex image spatial structures and the increased computational cost caused by the lengthened scanning paths. To address these limitations, we propose Spatial-Mamba, a novel approach that establishes neighborhood connectivity directly in the state space. Instead of relying solely on sequential state transitions, we introduce a structure-aware state fusion equation, which leverages dilated convolutions to capture image spatial structural dependencies, significantly enhancing the flow of visual contextual information. Spatial-Mamba proceeds in three stages: initial state computation in a unidirectional scan, spatial context acquisition through structure-aware state fusion, and final state computation using the observation equation. Our theoretical analysis shows that Spatial-Mamba unifies the original Mamba and linear attention under the same matrix multiplication framework, providing a deeper understanding of our method. Experimental results demonstrate that Spatial-Mamba, even with a single scan, attains or surpasses the state-of-the-art SSM-based models in image classification, detection and segmentation. Source codes and trained models can be found at $\href{this https URL}{\text{this https URL}}$.
Title: DPVS-Shapley:Faster and Universal Contribution Evaluation Component in Federated Learning
Copy Paste: [[2410.15093]] DPVS-Shapley:Faster and Universal Contribution Evaluation Component in Federated Learning(https://arxiv.org/abs/2410.15093)
Keywords: privacy, robust, federate, fair
Abstract: In the current era of artificial intelligence, federated learning has emerged as a novel approach to addressing data privacy concerns inherent in centralized learning paradigms. This decentralized learning model not only mitigates the risk of data breaches but also enhances the system's scalability and robustness. However, this approach introduces a new challenge: how to fairly and accurately assess the contribution of each participant. Developing an effective contribution evaluation mechanism is crucial for federated learning. Such a mechanism incentivizes participants to actively contribute their data and computational resources, thereby improving the overall performance of the federated learning system. By allocating resources and rewards based on the size of the contributions, it ensures that each participant receives fair treatment, fostering sustained this http URL, Shapley value-based methods are widely used to evaluate participants' contributions, with many researchers proposing modifications to adapt these methods to real-world scenarios. In this paper, we introduce a component called Dynamic Pruning Validation Set Shapley (DPVS-Shapley). This method accelerates the contribution assessment process by dynamically pruning the original dataset without compromising the evaluation's accuracy. Furthermore, this component can assign different weights to various samples, thereby allowing clients capable of distinguishing difficult examples to receive higher contribution scores.
Title: CosFairNet:A Parameter-Space based Approach for Bias Free Learning
Authors: Rajeev Ranjan Dwivedi, Priyadarshini Kumari, Vinod K Kurmi
Copy Paste: [[2410.15094]] CosFairNet:A Parameter-Space based Approach for Bias Free Learning(https://arxiv.org/abs/2410.15094)
Keywords: robust, fair
Abstract: Deep neural networks trained on biased data often inadvertently learn unintended inference rules, particularly when labels are strongly correlated with biased features. Existing bias mitigation methods typically involve either a) predefining bias types and enforcing them as prior knowledge or b) reweighting training samples to emphasize bias-conflicting samples over bias-aligned samples. However, both strategies address bias indirectly in the feature or sample space, with no control over learned weights, making it difficult to control the bias propagation across different layers. Based on this observation, we introduce a novel approach to address bias directly in the model's parameter space, preventing its propagation across layers. Our method involves training two models: a bias model for biased features and a debias model for unbiased details, guided by the bias model. We enforce dissimilarity in the debias model's later layers and similarity in its initial layers with the bias model, ensuring it learns unbiased low-level features without adopting biased high-level abstractions. By incorporating this explicit constraint during training, our approach shows enhanced classification accuracy and debiasing effectiveness across various synthetic and real-world datasets of different sizes. Moreover, the proposed method demonstrates robustness across different bias types and percentages of biased samples in the training data. The code is available at: this https URL
Title: Standardizing Generative Face Video Compression using Supplemental Enhancement Information
Authors: Bolin Chen, Yan Ye, Jie Chen, Ru-Ling Liao, Shanzhi Yin, Shiqi Wang, Kaifa Yang, Yue Li, Yiling Xu, Ye-Kui Wang, Shiv Gehlot, Guan-Ming Su, Peng Yin, Sean McCarthy, Gary J. Sullivan
Copy Paste: [[2410.15105]] Standardizing Generative Face Video Compression using Supplemental Enhancement Information(https://arxiv.org/abs/2410.15105)
Keywords: generative
Abstract: This paper proposes a Generative Face Video Compression (GFVC) approach using Supplemental Enhancement Information (SEI), where a series of compact spatial and temporal representations of a face video signal (i.e., 2D/3D key-points, facial semantics and compact features) can be coded using SEI message and inserted into the coded video bitstream. At the time of writing, the proposed GFVC approach is an official "technology under consideration" (TuC) for standardization by the Joint Video Experts Team (JVET) of ISO/IEC JVT 1/SC 29 and ITU-T SG16. To the best of the authors' knowledge, the JVET work on the proposed SEI-based GFVC approach is the first standardization activity for generative video compression. The proposed SEI approach has not only advanced the reconstruction quality of early-day Model-Based Coding (MBC) via the state-of-the-art generative technique, but also established a new SEI definition for future GFVC applications and deployment. Experimental results illustrate that the proposed SEI-based GFVC approach can achieve remarkable rate-distortion performance compared with the latest Versatile Video Coding (VVC) standard, whilst also potentially enabling a wide variety of functionalities including user-specified animation/filtering and metaverse-related applications.
Title: Toward Robust RALMs: Revealing the Impact of Imperfect Retrieval on Retrieval-Augmented Language Models
Copy Paste: [[2410.15107]] Toward Robust RALMs: Revealing the Impact of Imperfect Retrieval on Retrieval-Augmented Language Models(https://arxiv.org/abs/2410.15107)
Keywords: attack, robust, generative
Abstract: Retrieval Augmented Language Models (RALMs) have gained significant attention for their ability to generate accurate answer and improve efficiency. However, RALMs are inherently vulnerable to imperfect information due to their reliance on the imperfect retriever or knowledge source. We identify three common scenarios-unanswerable, adversarial, conflicting-where retrieved document sets can confuse RALM with plausible real-world examples. We present the first comprehensive investigation to assess how well RALMs detect and handle such problematic scenarios. Among these scenarios, to systematically examine adversarial robustness we propose a new adversarial attack method, Generative model-based ADVersarial attack (GenADV) and a novel metric Robustness under Additional Document (RAD). Our findings reveal that RALMs often fail to identify the unanswerability or contradiction of a document set, which frequently leads to hallucinations. Moreover, we show the addition of an adversary significantly degrades RALM's performance, with the model becoming even more vulnerable when the two scenarios overlap (adversarial+unanswerable). Our research identifies critical areas for assessing and enhancing the robustness of RALMs, laying the foundation for the development of more robust models.
Title: Coarse-to-Fine Highlighting: Reducing Knowledge Hallucination in Large Language Models
Authors: Qitan Lv, Jie Wang, Hanzhu Chen, Bin Li, Yongdong Zhang, Feng Wu
Copy Paste: [[2410.15116]] Coarse-to-Fine Highlighting: Reducing Knowledge Hallucination in Large Language Models(https://arxiv.org/abs/2410.15116)
Keywords: large language model
Abstract: Generation of plausible but incorrect factual information, often termed hallucination, has attracted significant research interest. Retrieval-augmented language model (RALM) -- which enhances models with up-to-date knowledge -- emerges as a promising method to reduce hallucination. However, existing RALMs may instead exacerbate hallucination when retrieving lengthy contexts. To address this challenge, we propose COFT, a novel \textbf{CO}arse-to-\textbf{F}ine highligh\textbf{T}ing method to focus on different granularity-level key texts, thereby avoiding getting lost in lengthy contexts. Specifically, COFT consists of three components: \textit{recaller}, \textit{scorer}, and \textit{selector}. First, \textit{recaller} applies a knowledge graph to extract potential key entities in a given context. Second, \textit{scorer} measures the importance of each entity by calculating its contextual weight. Finally, \textit{selector} selects high contextual weight entities with a dynamic threshold algorithm and highlights the corresponding paragraphs, sentences, or words in a coarse-to-fine manner. Extensive experiments on the knowledge hallucination benchmark demonstrate the effectiveness of COFT, leading to a superior performance over $30\%$ in the F1 score metric. Moreover, COFT also exhibits remarkable versatility across various long-form tasks, such as reading comprehension and question answering.
Title: Reinfier and Reintrainer: Verification and Interpretation-Driven Safe Deep Reinforcement Learning Frameworks
Copy Paste: [[2410.15127]] Reinfier and Reintrainer: Verification and Interpretation-Driven Safe Deep Reinforcement Learning Frameworks(https://arxiv.org/abs/2410.15127)
Keywords: interpretability
Abstract: Ensuring verifiable and interpretable safety of deep reinforcement learning (DRL) is crucial for its deployment in real-world applications. Existing approaches like verification-in-the-loop training, however, face challenges such as difficulty in deployment, inefficient training, lack of interpretability, and suboptimal performance in property satisfaction and reward performance. In this work, we propose a novel verification-driven interpretation-in-the-loop framework Reintrainer to develop trustworthy DRL models, which are guaranteed to meet the expected constraint properties. Specifically, in each iteration, this framework measures the gap between the on-training model and predefined properties using formal verification, interprets the contribution of each input feature to the model's output, and then generates the training strategy derived from the on-the-fly measure results, until all predefined properties are proven. Additionally, the low reusability of existing verifiers and interpreters motivates us to develop Reinfier, a general and fundamental tool within Reintrainer for DRL verification and interpretation. Reinfier features breakpoints searching and verification-driven interpretation, associated with a concise constraint-encoding language DRLP. Evaluations demonstrate that Reintrainer outperforms the state-of-the-art on six public benchmarks in both performance and property guarantees. Our framework can be accessed at this https URL.
Title: Augmenting the Veracity and Explanations of Complex Fact Checking via Iterative Self-Revision with LLMs
Authors: Xiaocheng Zhang, Xi Wang, Yifei Lu, Zhuangzhuang Ye, Jianing Wang, Mengjiao Bao, Peng Yan, Xiaohong Su
Copy Paste: [[2410.15135]] Augmenting the Veracity and Explanations of Complex Fact Checking via Iterative Self-Revision with LLMs(https://arxiv.org/abs/2410.15135)
Keywords: large language model
Abstract: Explanation generation plays a more pivotal role than fact verification in producing interpretable results and facilitating comprehensive fact-checking, which has recently garnered considerable attention. However, previous studies on explanation generation has shown several limitations, such as being confined to English scenarios, involving overly complex inference processes, and not fully unleashing the potential of the mutual feedback between veracity labels and explanation texts. To address these issues, we construct two complex fact-checking datasets in the Chinese scenarios: CHEF-EG and TrendFact. These datasets involve complex facts in areas such as health, politics, and society, presenting significant challenges for fact verification methods. In response to these challenges, we propose a unified framework called FactISR (Augmenting Fact-Checking via Iterative Self-Revision) to perform mutual feedback between veracity and explanations by leveraging the capabilities of large language models(LLMs). FactISR uses a single model to address tasks such as fact verification and explanation generation. Its self-revision mechanism can further revision the consistency between veracity labels, explanation texts, and evidence, as well as eliminate irrelevant noise. We conducted extensive experiments with baselines and FactISR on the proposed datasets. The experimental results demonstrate the effectiveness of our method.
Title: Collaborative State Fusion in Partially Known Multi-agent Environments
Copy Paste: [[2410.15137]] Collaborative State Fusion in Partially Known Multi-agent Environments(https://arxiv.org/abs/2410.15137)
Keywords: robust
Abstract: In this paper, we study the collaborative state fusion problem in a multi-agent environment, where mobile agents collaborate to track movable targets. Due to the limited sensing range and potential errors of on-board sensors, it is necessary to aggregate individual observations to provide target state fusion for better target state estimation. Existing schemes do not perform well due to (1) impractical assumption of the fully known prior target state-space model and (2) observation outliers from individual sensors. To address the issues, we propose a two-stage collaborative fusion framework, namely \underline{L}earnable Weighted R\underline{o}bust \underline{F}usion (\textsf{LoF}). \textsf{LoF} combines a local state estimator (e.g., Kalman Filter) with a learnable weight generator to address the mismatch between the prior state-space model and underlying patterns of moving targets. Moreover, given observation outliers, we develop a time-series soft medoid(TSM) scheme to perform robust fusion. We evaluate \textsf{LoF} in a collaborative detection simulation environment with promising results. In an example setting with 4 agents and 2 targets, \textsf{LoF} leads to a 9.1\% higher fusion gain compared to the state-of-the-art.
Title: Budgeted Online Continual Learning by Adaptive Layer Freezing and Frequency-based Sampling
Copy Paste: [[2410.15143]] Budgeted Online Continual Learning by Adaptive Layer Freezing and Frequency-based Sampling(https://arxiv.org/abs/2410.15143)
Keywords: fair
Abstract: The majority of online continual learning (CL) advocates single-epoch training and imposes restrictions on the size of replay memory. However, single-epoch training would incur a different amount of computations per CL algorithm, and the additional storage cost to store logit or model in addition to replay memory is largely ignored in calculating the storage budget. Arguing different computational and storage budgets hinder fair comparison among CL algorithms in practice, we propose to use floating point operations (FLOPs) and total memory size in Byte as a metric for computational and memory budgets, respectively, to compare and develop CL algorithms in the same 'total resource budget.' To improve a CL method in a limited total budget, we propose adaptive layer freezing that does not update the layers for less informative batches to reduce computational costs with a negligible loss of accuracy. In addition, we propose a memory retrieval method that allows the model to learn the same amount of knowledge as using random retrieval in fewer iterations. Empirical validations on the CIFAR-10/100, CLEAR-10/100, and ImageNet-1K datasets demonstrate that the proposed approach outperforms the state-of-the-art methods within the same total budget
Title: Evaluating Deep Unlearning in Large Language Models
Copy Paste: [[2410.15153]] Evaluating Deep Unlearning in Large Language Models(https://arxiv.org/abs/2410.15153)
Keywords: protect, large language model
Abstract: Machine unlearning is a key requirement of many data protection regulations such as GDPR. Prior work on unlearning has mostly considered superficial unlearning tasks where a single or a few related pieces of information are required to be removed. However, the task of unlearning a fact is much more challenging in recent large language models (LLMs), because the facts in LLMs can be deduced from each other. In this work, we investigate whether current unlearning methods for LLMs succeed beyond superficial unlearning of facts. Specifically, we formally propose a framework and a definition for deep unlearning facts that are interrelated. We design the metric, recall, to quantify the extent of deep unlearning. To systematically evaluate deep unlearning, we construct a synthetic dataset EDU-RELAT, which consists of a synthetic knowledge base of family relationships and biographies, together with a realistic logical rule set that connects them. We use this dataset to test four unlearning methods in four LLMs at different sizes. Our findings reveal that in the task of deep unlearning only a single fact, they either fail to properly unlearn with high recall, or end up unlearning many other irrelevant facts. Our dataset and code are publicly available at: this https URL.
Title: Explaining Graph Neural Networks with Large Language Models: A Counterfactual Perspective for Molecular Property Prediction
Authors: Yinhan He, Zaiyi Zheng, Patrick Soga, Yaozhen Zhu, yushun Dong, Jundong Li
Copy Paste: [[2410.15165]] Explaining Graph Neural Networks with Large Language Models: A Counterfactual Perspective for Molecular Property Prediction(https://arxiv.org/abs/2410.15165)
Keywords: large language model
Abstract: In recent years, Graph Neural Networks (GNNs) have become successful in molecular property prediction tasks such as toxicity analysis. However, due to the black-box nature of GNNs, their outputs can be concerning in high-stakes decision-making scenarios, e.g., drug discovery. Facing such an issue, Graph Counterfactual Explanation (GCE) has emerged as a promising approach to improve GNN transparency. However, current GCE methods usually fail to take domain-specific knowledge into consideration, which can result in outputs that are not easily comprehensible by humans. To address this challenge, we propose a novel GCE method, LLM-GCE, to unleash the power of large language models (LLMs) in explaining GNNs for molecular property prediction. Specifically, we utilize an autoencoder to generate the counterfactual graph topology from a set of counterfactual text pairs (CTPs) based on an input graph. Meanwhile, we also incorporate a CTP dynamic feedback module to mitigate LLM hallucination, which provides intermediate feedback derived from the generated counterfactuals as an attempt to give more faithful guidance. Extensive experiments demonstrate the superior performance of LLM-GCE. Our code is released on this https URL\_LLM4GNNExplanation.
Title: An Electoral Approach to Diversify LLM-based Multi-Agent Collective Decision-Making
Copy Paste: [[2410.15168]] An Electoral Approach to Diversify LLM-based Multi-Agent Collective Decision-Making(https://arxiv.org/abs/2410.15168)
Keywords: robust, large language model
Abstract: Modern large language models (LLMs) have exhibited cooperative synergy on complex task-solving, and collective decision-making (CDM) is a pivotal component in LLM-based multi-agent collaboration frameworks. Our survey on 52 recent such systems uncovers a severe lack of diversity, with a heavy reliance on dictatorial and plurality voting for CDM. Through the lens of social choice theory, we scrutinize widely-adopted CDM methods and identify their limitations. To enrich current landscape of LLM-based CDM, we present GEDI, an electoral CDM module that incorporates various ordinal preferential voting mechanisms. Our empirical case study across three benchmarks shows that the integration of certain CDM methods can markedly improve the reasoning capabilities and robustness of some leading LLMs, all without requiring intricate system designs. Additionally, we find that some CDM mechanisms generate positive synergies even with as few as three agents. The voting-based methods also demonstrate robustness against single points of failure, as well as diversity in terms of hit-rate@k and subject-wise impacts.
Title: Beyond Pruning Criteria: The Dominant Role of Fine-Tuning and Adaptive Ratios in Neural Network Robustness
Copy Paste: [[2410.15176]] Beyond Pruning Criteria: The Dominant Role of Fine-Tuning and Adaptive Ratios in Neural Network Robustness(https://arxiv.org/abs/2410.15176)
Keywords: attack, robust
Abstract: Deep neural networks (DNNs) excel in tasks like image recognition and natural language processing, but their increasing complexity complicates deployment in resource-constrained environments and increases susceptibility to adversarial attacks. While traditional pruning methods reduce model size, they often compromise the network's ability to withstand subtle perturbations. This paper challenges the conventional emphasis on weight importance scoring as the primary determinant of a pruned network's performance. Through extensive analysis, including experiments conducted on CIFAR, Tiny-ImageNet, and various network architectures, we demonstrate that effective fine-tuning plays a dominant role in enhancing both performance and adversarial robustness, often surpassing the impact of the chosen pruning criteria. To address this issue, we introduce Module Robust Sensitivity, a novel metric that adaptively adjusts the pruning ratio for each network layer based on its sensitivity to adversarial perturbations. By integrating this metric into the pruning process, we develop a stable algorithm that maintains accuracy and robustness simultaneously. Experimental results show that our approach enables the practical deployment of more robust and efficient neural networks.
Title: Action abstractions for amortized sampling
Authors: Oussama Boussif, Léna Néhale Ezzine, Joseph D Viviano, Michał Koziarski, Moksh Jain, Nikolay Malkin, Emmanuel Bengio, Rim Assouel, Yoshua Bengio
Copy Paste: [[2410.15184]] Action abstractions for amortized sampling(https://arxiv.org/abs/2410.15184)
Keywords: generative
Abstract: As trajectories sampled by policies used by reinforcement learning (RL) and generative flow networks (GFlowNets) grow longer, credit assignment and exploration become more challenging, and the long planning horizon hinders mode discovery and generalization. The challenge is particularly pronounced in entropy-seeking RL methods, such as generative flow networks, where the agent must learn to sample from a structured distribution and discover multiple high-reward states, each of which take many steps to reach. To tackle this challenge, we propose an approach to incorporate the discovery of action abstractions, or high-level actions, into the policy optimization process. Our approach involves iteratively extracting action subsequences commonly used across many high-reward trajectories and `chunking' them into a single action that is added to the action space. In empirical evaluation on synthetic and real-world environments, our approach demonstrates improved sample efficiency performance in discovering diverse high-reward objects, especially on harder exploration problems. We also observe that the abstracted high-order actions are interpretable, capturing the latent structure of the reward landscape of the action space. This work provides a cognitively motivated approach to action abstraction in RL and is the first demonstration of hierarchical planning in amortized sequential sampling.
Title: Fine-tuning foundational models to code diagnoses from veterinary health records
Authors: Mayla R. Boguslav, Adam Kiehl, David Kott, G. Joseph Strecker, Tracy Webb, Nadia Saklou, Terri Ward, Michael Kirby
Copy Paste: [[2410.15186]] Fine-tuning foundational models to code diagnoses from veterinary health records(https://arxiv.org/abs/2410.15186)
Keywords: transformer, large language model
Abstract: Veterinary medical records represent a large data resource for application to veterinary and One Health clinical research efforts. Use of the data is limited by interoperability challenges including inconsistent data formats and data siloing. Clinical coding using standardized medical terminologies enhances the quality of medical records and facilitates their interoperability with veterinary and human health records from other sites. Previous studies, such as DeepTag and VetTag, evaluated the application of Natural Language Processing (NLP) to automate veterinary diagnosis coding, employing long short-term memory (LSTM) and transformer models to infer a subset of Systemized Nomenclature of Medicine - Clinical Terms (SNOMED-CT) diagnosis codes from free-text clinical notes. This study expands on these efforts by incorporating all 7,739 distinct SNOMED-CT diagnosis codes recognized by the Colorado State University (CSU) Veterinary Teaching Hospital (VTH) and by leveraging the increasing availability of pre-trained large language models (LLMs). Ten freely-available pre-trained LLMs were fine-tuned on the free-text notes from 246,473 manually-coded veterinary patient visits included in the CSU VTH's electronic health records (EHRs), which resulted in superior performance relative to previous efforts. The most accurate results were obtained when expansive labeled data were used to fine-tune relatively large clinical LLMs, but the study also showed that comparable results can be obtained using more limited resources and non-clinical LLMs. The results of this study contribute to the improvement of the quality of veterinary EHRs by investigating accessible methods for automated coding and support both animal and human health research by paving the way for more integrated and comprehensive health databases that span species and institutions.
Title: FSCsec: Collaboration in Financial Sector Cybersecurity -- Exploring the Impact of Resource Sharing on IT Security
Authors: Sayed Abu Sayeed, Mir Mehedi Rahman, Samiul Alam, Naresh Kshetri
Copy Paste: [[2410.15194]] FSCsec: Collaboration in Financial Sector Cybersecurity -- Exploring the Impact of Resource Sharing on IT Security(https://arxiv.org/abs/2410.15194)
Keywords: security, protect
Abstract: The financial sector's dependence on digital infrastructure increases its vulnerability to cybersecurity threats, requiring strong IT security protocols with other entities. This collaboration, however, is often identified as the most vulnerable link in the chain of cybersecurity. Adopting both symbolic and substantive measures lessens the impact of IT security spending on decreasing the frequency of data security breaches in the long run. The Protection Motivation Theory clarifies actions triggered by data sharing with other organizations, and the Institutional theory aids in comprehending the intricate relationship between transparency and organizational conduct. We investigate how things like regulatory pressure, teamwork among institutions, and people's motivations to protect themselves influence cybersecurity. By using simple theories to understand these factors, this research aims to provide insights that can help financial institutions make better decisions to protect. We have also included the discussion, conclusion, and future directions in regard to collaboration in financial sector cybersecurity for exploring impact of resource sharing.
Title: Low-cost Robust Night-time Aerial Material Segmentation through Hyperspectral Data and Sparse Spatio-Temporal Learning
Copy Paste: [[2410.15208]] Low-cost Robust Night-time Aerial Material Segmentation through Hyperspectral Data and Sparse Spatio-Temporal Learning(https://arxiv.org/abs/2410.15208)
Keywords: robust, segmentation
Abstract: Material segmentation is a complex task, particularly when dealing with aerial data in poor lighting and atmospheric conditions. To address this, hyperspectral data from specialized cameras can be very useful in addition to RGB images. However, due to hardware constraints, high spectral data often come with lower spatial resolution. Additionally, incorporating such data into a learning-based segmentation framework is challenging due to the numerous data channels involved. To overcome these difficulties, we propose an innovative Siamese framework that uses time series-based compression to effectively and scalably integrate the additional spectral data into the segmentation task. We demonstrate our model's effectiveness through competitive benchmarks on aerial datasets in various environmental conditions.
Title: DataSeal: Ensuring the Verifiability of Private Computation on Encrypted Data
Authors: Muhammad Husni Santriaji, Jiaqi Xue, Qian Lou, Yan Solihin
Copy Paste: [[2410.15215]] DataSeal: Ensuring the Verifiability of Private Computation on Encrypted Data(https://arxiv.org/abs/2410.15215)
Keywords: secure, privacy
Abstract: Fully Homomorphic Encryption (FHE) allows computations to be performed directly on encrypted data without needing to decrypt it first. This "encryption-in-use" feature is crucial for securely outsourcing computations in privacy-sensitive areas such as healthcare and finance. Nevertheless, in the context of FHE-based cloud computing, clients often worry about the integrity and accuracy of the outcomes. This concern arises from the potential for a malicious server or server-side vulnerabilities that could result in tampering with the data, computations, and results. Ensuring integrity and verifiability with low overhead remains an open problem, as prior attempts have not yet achieved this goal. To tackle this challenge and ensure the verification of FHE's private computations on encrypted data, we introduce DataSeal, which combines the low overhead of the algorithm-based fault tolerance (ABFT) technique with the confidentiality of FHE, offering high efficiency and verification capability. Through thorough testing in diverse contexts, we demonstrate that DataSeal achieves much lower overheads for providing computation verifiability for FHE than other techniques that include MAC, ZKP, and TEE. DataSeal's space and computation overheads decrease to nearly negligible as the problem size increases.
Title: On the Diversity of Synthetic Data and its Impact on Training Large Language Models
Authors: Hao Chen, Abdul Waheed, Xiang Li, Yidong Wang, Jindong Wang, Bhiksha Raj, Marah I. Abdin
Copy Paste: [[2410.15226]] On the Diversity of Synthetic Data and its Impact on Training Large Language Models(https://arxiv.org/abs/2410.15226)
Keywords: large language model
Abstract: The rise of Large Language Models (LLMs) has accentuated the need for diverse, high-quality pre-training data. Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility. While previous literature has focused predominantly on the quality and quantity of real data, our work enables the measurement of diversity in synthetic data and explores its impact on LLM performance. We study the downstream effects of synthetic data diversity during both the pre-training and fine-tuning stages by introducing a new diversity metric, \textit{LLM cluster-agent}, designed to evaluate the diversity of synthetic datasets. Through a series of controlled experiments with models of 350M and 1.4B parameters, we demonstrate that the proposed cluster-based LLM scoring of diversity correlates positively with both pre-training and supervised fine-tuning performance. Our findings also reveal that synthetic data diversity in pre-training affects supervised fine-tuning more significantly than pre-training itself, even for smaller models. We hope this study advances our understanding of the optimal use of synthetic data in LLM training and opens new avenues for efficient data generation processes.
Title: Deep Learning-based Detection of Bacterial Swarm Motion Using a Single Image
Copy Paste: [[2410.15229]] Deep Learning-based Detection of Bacterial Swarm Motion Using a Single Image(https://arxiv.org/abs/2410.15229)
Keywords: robust
Abstract: Distinguishing between swarming and swimming, the two principal forms of bacterial movement, holds significant conceptual and clinical relevance. This is because bacteria that exhibit swarming capabilities often possess unique properties crucial to the pathogenesis of infectious diseases and may also have therapeutic potential. Here, we report a deep learning-based swarming classifier that rapidly and autonomously predicts swarming probability using a single blurry image. Compared with traditional video-based, manually-processed approaches, our method is particularly suited for high-throughput environments and provides objective, quantitative assessments of swarming probability. The swarming classifier demonstrated in our work was trained on Enterobacter sp. SM3 and showed good performance when blindly tested on new swarming (positive) and swimming (negative) test images of SM3, achieving a sensitivity of 97.44% and a specificity of 100%. Furthermore, this classifier demonstrated robust external generalization capabilities when applied to unseen bacterial species, such as Serratia marcescens DB10 and Citrobacter koseri H6. It blindly achieved a sensitivity of 97.92% and a specificity of 96.77% for DB10, and a sensitivity of 100% and a specificity of 97.22% for H6. This competitive performance indicates the potential to adapt our approach for diagnostic applications through portable devices or even smartphones. This adaptation would facilitate rapid, objective, on-site screening for bacterial swarming motility, potentially enhancing the early detection and treatment assessment of various diseases, including inflammatory bowel diseases (IBD) and urinary tract infections (UTI).
Title: A Semidefinite Relaxation Approach for Fair Graph Clustering
Copy Paste: [[2410.15233]] A Semidefinite Relaxation Approach for Fair Graph Clustering(https://arxiv.org/abs/2410.15233)
Keywords: fair
Abstract: Fair graph clustering is crucial for ensuring equitable representation and treatment of diverse communities in network analysis. Traditional methods often ignore disparities among social, economic, and demographic groups, perpetuating biased outcomes and reinforcing inequalities. This study introduces fair graph clustering within the framework of the disparate impact doctrine, treating it as a joint optimization problem integrating clustering quality and fairness constraints. Given the NP-hard nature of this problem, we employ a semidefinite relaxation approach to approximate the underlying optimization problem. For up to medium-sized graphs, we utilize a singular value decomposition-based algorithm, while for larger graphs, we propose a novel algorithm based on the alternative direction method of multipliers. Unlike existing methods, our formulation allows for tuning the trade-off between clustering quality and fairness. Experimental results on graphs generated from the standard stochastic block model demonstrate the superiority of our approach in achieving an optimal accuracy-fairness trade-off compared to state-of-the-art methods.
Title: Modeling Visual Memorability Assessment with Autoencoders Reveals Characteristics of Memorable Images
Copy Paste: [[2410.15235]] Modeling Visual Memorability Assessment with Autoencoders Reveals Characteristics of Memorable Images(https://arxiv.org/abs/2410.15235)
Keywords: robust, interpretability
Abstract: Background: Image memorability refers to the phenomenon where certain images are more likely to be remembered than others. It is a quantifiable and intrinsic image attribute, defined as the likelihood of being remembered upon a single exposure. Despite advances in understanding human visual perception and memory, it is unclear what features contribute to an image's memorability. To address this question, we propose a deep learning-based computational modeling approach. Methods: We modeled the subjective experience of visual memorability using an autoencoder based on VGG16 Convolutional Neural Networks (CNNs). The model was trained on images for one epoch, to simulate the single-exposure condition used in human memory tests. We investigated the relationship between memorability and reconstruction error, assessed latent space representations distinctiveness, and developed a Gated Recurrent Unit (GRU) model to predict memorability likelihood. Interpretability analysis was conducted to identify key image characteristics contributing to memorability. Results: Our results demonstrate a significant correlation between the images memorability score and autoencoder's reconstruction error, and the robust predictive performance of its latent representations. Distinctiveness in these representations correlated significantly with memorability. Additionally, certain visual characteristics, such as strong contrasts, distinctive objects, and prominent foreground elements were among the features contributing to image memorability in our model. Conclusions: Images with unique features that challenge the autoencoder's capacity are inherently more memorable. Moreover, these memorable images are distinct from others the model has encountered, and the latent space of the encoder contains features predictive of memorability.
Title: Jailbreaking and Mitigation of Vulnerabilities in Large Language Models
Copy Paste: [[2410.15236]] Jailbreaking and Mitigation of Vulnerabilities in Large Language Models(https://arxiv.org/abs/2410.15236)
Keywords: security, defense, attack, robust, large language model
Abstract: Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation, enabling applications across fields beyond healthcare, software engineering, and conversational systems. Despite these advancements in the past few years, LLMs have shown considerable vulnerabilities, particularly to prompt injection and jailbreaking attacks. This review analyzes the state of research on these vulnerabilities and presents available defense strategies. We roughly categorize attack approaches into prompt-based, model-based, multimodal, and multilingual, covering techniques such as adversarial prompting, backdoor injections, and cross-modality exploits. We also review various defense mechanisms, including prompt filtering, transformation, alignment techniques, multi-agent defenses, and self-regulation, evaluating their strengths and shortcomings. We also discuss key metrics and benchmarks used to assess LLM safety and robustness, noting challenges like the quantification of attack success in interactive contexts and biases in existing datasets. Identifying current research gaps, we suggest future directions for resilient alignment strategies, advanced defenses against evolving attacks, automation of jailbreak detection, and consideration of ethical and societal impacts. This review emphasizes the need for continued research and cooperation within the AI community to enhance LLM security and ensure their safe deployment.
Title: Conditional Prediction ROC Bands for Graph Classification
Copy Paste: [[2410.15239]] Conditional Prediction ROC Bands for Graph Classification(https://arxiv.org/abs/2410.15239)
Keywords: robust
Abstract: Graph classification in medical imaging and drug discovery requires accuracy and robust uncertainty quantification. To address this need, we introduce Conditional Prediction ROC (CP-ROC) bands, offering uncertainty quantification for ROC curves and robustness to distributional shifts in test data. Although developed for Tensorized Graph Neural Networks (TGNNs), CP-ROC is adaptable to general Graph Neural Networks (GNNs) and other machine learning models. We establish statistically guaranteed coverage for CP-ROC under a local exchangeability condition. This addresses uncertainty challenges for ROC curves under non-iid setting, ensuring reliability when test graph distributions differ from training data. Empirically, to establish local exchangeability for TGNNs, we introduce a data-driven approach to construct local calibration sets for graphs. Comprehensive evaluations show that CP-ROC significantly improves prediction reliability across diverse tasks. This method enhances uncertainty quantification efficiency and reliability for ROC curves, proving valuable for real-world applications with non-iid objects.
Title: Fastrack: Fast IO for Secure ML using GPU TEEs
Copy Paste: [[2410.15240]] Fastrack: Fast IO for Secure ML using GPU TEEs(https://arxiv.org/abs/2410.15240)
Keywords: secure, security
Abstract: As cloud-based ML expands, ensuring data security during training and inference is critical. GPU-based Trusted Execution Environments (TEEs) offer secure, high-performance solutions, with CPU TEEs managing data movement and GPU TEEs handling authentication and computation. However, CPU-to-GPU communication overheads significantly hinder performance, as data must be encrypted, authenticated, decrypted, and verified, increasing costs by 12.69 to 33.53 times. This results in GPU TEE inference becoming 54.12% to 903.9% slower and training 10% to 455% slower than non-TEE systems, undermining GPU TEE advantages in latency-sensitive applications. This paper analyzes Nvidia H100 TEE protocols and identifies three key overheads: 1) redundant CPU re-encryption, 2) limited authentication parallelism, and 3) unnecessary operation serialization. We propose Fastrack, optimizing with 1) direct GPU TEE communication, 2) parallelized authentication, and 3) overlapping decryption with PCI-e transmission. These optimizations cut communication costs and reduce inference/training runtime by up to 84.6%, with minimal overhead compared to non-TEE systems.
Title: Conditional Uncertainty Quantification for Tensorized Topological Neural Networks
Authors: Yujia Wu, Bo Yang, Yang Zhao, Elynn Chen, Yuzhou Chen, Zheshi Zheng
Abstract: Graph Neural Networks (GNNs) have become the de facto standard for analyzing graph-structured data, leveraging message-passing techniques to capture both structural and node feature information. However, recent studies have raised concerns about the statistical reliability of uncertainty estimates produced by GNNs. This paper addresses this crucial challenge by introducing a novel technique for quantifying uncertainty in non-exchangeable graph-structured data, while simultaneously reducing the size of label prediction sets in graph classification tasks. We propose Conformalized Tensor-based Topological Neural Networks (CF-T2NN), a new approach for rigorous prediction inference over graphs. CF-T2NN employs tensor decomposition and topological knowledge learning to navigate and interpret the inherent uncertainty in decision-making processes. This method enables a more nuanced understanding and handling of prediction uncertainties, enhancing the reliability and interpretability of neural network outcomes. Our empirical validation, conducted across 10 real-world datasets, demonstrates the superiority of CF-T2NN over a wide array of state-of-the-art methods on various graph benchmarks. This work not only enhances the GNN framework with robust uncertainty quantification capabilities but also sets a new standard for reliability and precision in graph-structured data analysis.
Abstract: Graph contrastive learning (GCL) has emerged as a promising approach to enhance graph neural networks' (GNNs) ability to learn rich representations from unlabeled graph-structured data. However, current GCL models face challenges with computational demands and limited feature utilization, often relying only on basic graph properties like node degrees and edge attributes. This constrains their capacity to fully capture the complex topological characteristics of real-world phenomena represented by graphs. To address these limitations, we propose Tensor-Fused Multi-View Graph Contrastive Learning (TensorMV-GCL), a novel framework that integrates extended persistent homology (EPH) with GCL representations and facilitates multi-scale feature extraction. Our approach uniquely employs tensor aggregation and compression to fuse information from graph and topological features obtained from multiple augmented views of the same graph. By incorporating tensor concatenation and contraction modules, we reduce computational overhead by separating feature tensor aggregation and transformation. Furthermore, we enhance the quality of learned topological features and model robustness through noise-injected EPH. Experiments on molecular, bioinformatic, and social network datasets demonstrate TensorMV-GCL's superiority, outperforming 15 state-of-the-art methods in graph classification tasks across 9 out of 11 benchmarks while achieving comparable results on the remaining two. The code for this paper is publicly available at this https URL.
Title: FastSTI: A Fast Conditional Pseudo Numerical Diffusion Model for Spatio-temporal Traffic Data Imputation
Copy Paste: [[2410.15248]] FastSTI: A Fast Conditional Pseudo Numerical Diffusion Model for Spatio-temporal Traffic Data Imputation(https://arxiv.org/abs/2410.15248)
Keywords: diffusion, generative
Abstract: High-quality spatiotemporal traffic data is crucial for intelligent transportation systems (ITS) and their data-driven applications. Inevitably, the issue of missing data caused by various disturbances threatens the reliability of data acquisition. Recent studies of diffusion probability models have demonstrated the superiority of deep generative models in imputation tasks by precisely capturing the spatio-temporal correlation of traffic data. One drawback of diffusion models is their slow sampling/denoising process. In this work, we aim to accelerate the imputation process while retaining the performance. We propose a fast conditional diffusion model for spatiotemporal traffic data imputation (FastSTI). To speed up the process yet, obtain better performance, we propose the application of a high-order pseudo-numerical solver. Our method further revs the imputation by introducing a predefined alignment strategy of variance schedule during the sampling process. Evaluating FastSTI on two types of real-world traffic datasets (traffic speed and flow) with different missing data scenarios proves its ability to impute higher-quality samples in only six sampling steps, especially under high missing rates (60\% $\sim$ 90\%). The experimental results illustrate a speed-up of $\textbf{8.3} \times$ faster than the current state-of-the-art model while achieving better performance.
Title: Lossless KV Cache Compression to 2%
Authors: Zhen Yang, J.N.Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang
Copy Paste: [[2410.15252]] Lossless KV Cache Compression to 2%(https://arxiv.org/abs/2410.15252)
Keywords: large language model
Abstract: Large language models have revolutionized data processing in numerous domains, with their ability to handle extended context reasoning receiving notable recognition. To speed up inference, maintaining a key-value (KV) cache memory is essential. Nonetheless, the growing demands for KV cache memory create significant hurdles for efficient implementation. This work introduces a novel architecture, Cross-Layer Latent Attention (CLLA), aimed at compressing the KV cache to less than 2% of its original size while maintaining comparable performance levels. CLLA integrates multiple aspects of KV cache compression, including attention head/dimension reduction, layer sharing, and quantization techniques, into a cohesive framework. Our extensive experiments demonstrate that CLLA achieves lossless performance on most tasks while utilizing minimal KV cache, marking a significant advancement in practical KV cache compression.
Title: Back to School: Translation Using Grammar Books
Copy Paste: [[2410.15263]] Back to School: Translation Using Grammar Books(https://arxiv.org/abs/2410.15263)
Keywords: large language model
Abstract: Machine translation systems for high resource languages perform exceptionally well and produce high quality translations. Unfortunately, the vast majority of languages are not considered high resource and lack the quantity of parallel sentences needed to train such systems. These under-represented languages are not without resources, however, and bilingual dictionaries and grammar books are available as linguistic reference material. With current large language models (LLMs) supporting near book-length contexts, we can begin to use the available material to ensure advancements are shared among all of the world's languages. In this paper, we demonstrate incorporating grammar books in the prompt of GPT-4 to improve machine translation and evaluate the performance on 16 topologically diverse low-resource languages, using a combination of reference material to show that the machine translation performance of LLMs can be improved using this method.
Title: When Machine Unlearning Meets Retrieval-Augmented Generation (RAG): Keep Secret or Forget Knowledge?
Authors: Shang Wang, Tianqing Zhu, Dayong Ye, Wanlei Zhou
Copy Paste: [[2410.15267]] When Machine Unlearning Meets Retrieval-Augmented Generation (RAG): Keep Secret or Forget Knowledge?(https://arxiv.org/abs/2410.15267)
Keywords: robust, large language model
Abstract: The deployment of large language models (LLMs) like ChatGPT and Gemini has shown their powerful natural language generation capabilities. However, these models can inadvertently learn and retain sensitive information and harmful content during training, raising significant ethical and legal concerns. To address these issues, machine unlearning has been introduced as a potential solution. While existing unlearning methods take into account the specific characteristics of LLMs, they often suffer from high computational demands, limited applicability, or the risk of catastrophic forgetting. To address these limitations, we propose a lightweight unlearning framework based on Retrieval-Augmented Generation (RAG) technology. By modifying the external knowledge base of RAG, we simulate the effects of forgetting without directly interacting with the unlearned LLM. We approach the construction of unlearned knowledge as a constrained optimization problem, deriving two key components that underpin the effectiveness of RAG-based unlearning. This RAG-based approach is particularly effective for closed-source LLMs, where existing unlearning methods often fail. We evaluate our framework through extensive experiments on both open-source and closed-source models, including ChatGPT, Gemini, Llama-2-7b-chat-hf, and PaLM 2. The results demonstrate that our approach meets five key unlearning criteria: effectiveness, universality, harmlessness, simplicity, and robustness. Meanwhile, this approach can extend to multimodal large language models and LLM-based agents.
Title: TAGExplainer: Narrating Graph Explanations for Text-Attributed Graph Learning Models
Abstract: Representation learning of Text-Attributed Graphs (TAGs) has garnered significant attention due to its applications in various domains, including recommendation systems and social networks. Despite advancements in TAG learning methodologies, challenges remain in explainability due to the black-box nature of existing TAG representation learning models. This paper presents TAGExplainer, the first method designed to generate natural language explanations for TAG learning. TAGExplainer employs a generative language model that maps input-output pairs to explanations reflecting the model's decision-making process. To address the lack of annotated ground truth explanations in real-world scenarios, we propose first generating pseudo-labels that capture the model's decisions from saliency-based explanations, then the pseudo-label generator is iteratively trained based on three training objectives focusing on faithfulness and brevity via Expert Iteration, to improve the quality of generated pseudo-labels. The high-quality pseudo-labels are finally utilized to train an end-to-end explanation generator model. Extensive experiments are conducted to demonstrate the effectiveness of TAGExplainer in producing faithful and concise natural language explanations.
Title: Can LVLMs Describe Videos like Humans? A Five-in-One Video Annotations Benchmark for Better Human-Machine Comparison
Copy Paste: [[2410.15270]] Can LVLMs Describe Videos like Humans? A Five-in-One Video Annotations Benchmark for Better Human-Machine Comparison(https://arxiv.org/abs/2410.15270)
Keywords: robust
Abstract: Large vision-language models (LVLMs) have made significant strides in addressing complex video tasks, sparking researchers' interest in their human-like multimodal understanding capabilities. Video description serves as a fundamental task for evaluating video comprehension, necessitating a deep understanding of spatial and temporal dynamics, which presents challenges for both humans and machines. Thus, investigating whether LVLMs can describe videos as comprehensively as humans (through reasonable human-machine comparisons using video captioning as a proxy task) will enhance our understanding and application of these models. However, current benchmarks for video comprehension have notable limitations, including short video durations, brief annotations, and reliance on a single annotator's perspective. These factors hinder a comprehensive assessment of LVLMs' ability to understand complex, lengthy videos and prevent the establishment of a robust human baseline that accurately reflects human video comprehension capabilities. To address these issues, we propose a novel benchmark, FIOVA (Five In One Video Annotations), designed to evaluate the differences between LVLMs and human understanding more comprehensively. FIOVA includes 3,002 long video sequences (averaging 33.6 seconds) that cover diverse scenarios with complex spatiotemporal relationships. Each video is annotated by five distinct annotators, capturing a wide range of perspectives and resulting in captions that are 4-15 times longer than existing benchmarks, thereby establishing a robust baseline that represents human understanding comprehensively for the first time in video description tasks. Using the FIOVA benchmark, we conducted an in-depth evaluation of six state-of-the-art LVLMs, comparing their performance with humans. More detailed information can be found at this https URL.
Title: BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression
Copy Paste: [[2410.15277]] BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression(https://arxiv.org/abs/2410.15277)
Keywords: large language model
Abstract: Retrieval-augmented generation (RAG) can supplement large language models (LLMs) by integrating external knowledge. However, as the number of retrieved documents increases, the input length to LLMs grows linearly, causing a dramatic increase in latency and a degradation in long-context understanding. This is particularly serious for multi-hop questions that require a chain of reasoning across documents. To accelerate inference, reduce costs, and minimize distractions, this paper presents BRIEF (Bridging Retrieval and Inference through Evidence Fusion), a lightweight approach that performs query-aware multi-hop reasoning by compressing retrieved documents into highly dense textual summaries to integrate into in-context learning. To enable learning compression for multi-hop reasoning, we curate synthetic data by extracting atomic proposition expressions that encapsulate distinct factoids from the source documents to compose synthetic summaries. Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries and enables a range of LLMs to achieve exceptional open-domain question answering (QA) performance. For example, on HotpotQA, BRIEF improves the compression rate by 2 times compared to the state-of-the-art baseline, while outperforming it by 3.00% EM and 4.16% F1 with Flan-UL2 as the reader LM. It also generates more concise summaries than proprietary GPT-3.5, while demonstrating nearly identical QA performance.
Title: Neural Normalized Compression Distance and the Disconnect Between Compression and Classification
Authors: John Hurwitz, Charles Nicholas, Edward Raff
Copy Paste: [[2410.15280]] Neural Normalized Compression Distance and the Disconnect Between Compression and Classification(https://arxiv.org/abs/2410.15280)
Keywords: large language model
Abstract: It is generally well understood that predictive classification and compression are intrinsically related concepts in information theory. Indeed, many deep learning methods are explained as learning a kind of compression, and that better compression leads to better performance. We interrogate this hypothesis via the Normalized Compression Distance (NCD), which explicitly relies on compression as the means of measuring similarity between sequences and thus enables nearest-neighbor classification. By turning popular large language models (LLMs) into lossless compressors, we develop a Neural NCD and compare LLMs to classic general-purpose algorithms like gzip. In doing so, we find that classification accuracy is not predictable by compression rate alone, among other empirical aberrations not predicted by current understanding. Our results imply that our intuition on what it means for a neural network to ``compress'' and what is needed for effective classification are not yet well understood.
Title: TRIZ Method for Urban Building Energy Optimization: GWO-SARIMA-LSTM Forecasting model
Copy Paste: [[2410.15283]] TRIZ Method for Urban Building Energy Optimization: GWO-SARIMA-LSTM Forecasting model(https://arxiv.org/abs/2410.15283)
Keywords: robust
Abstract: With the advancement of global climate change and sustainable development goals, urban building energy consumption optimization and carbon emission reduction have become the focus of research. Traditional energy consumption prediction methods often lack accuracy and adaptability due to their inability to fully consider complex energy consumption patterns, especially in dealing with seasonal fluctuations and dynamic changes. This study proposes a hybrid deep learning model that combines TRIZ innovation theory with GWO, SARIMA and LSTM to improve the accuracy of building energy consumption prediction. TRIZ plays a key role in model design, providing innovative solutions to achieve an effective balance between energy efficiency, cost and comfort by systematically analyzing the contradictions in energy consumption optimization. GWO is used to optimize the parameters of the model to ensure that the model maintains high accuracy under different conditions. The SARIMA model focuses on capturing seasonal trends in the data, while the LSTM model handles short-term and long-term dependencies in the data, further improving the accuracy of the prediction. The main contribution of this research is the development of a robust model that leverages the strengths of TRIZ and advanced deep learning techniques, improving the accuracy of energy consumption predictions. Our experiments demonstrate a significant 15% reduction in prediction error compared to existing models. This innovative approach not only enhances urban energy management but also provides a new framework for optimizing energy use and reducing carbon emissions, contributing to sustainable development.
Title: LTPNet Integration of Deep Learning and Environmental Decision Support Systems for Renewable Energy Demand Forecasting
Copy Paste: [[2410.15286]] LTPNet Integration of Deep Learning and Environmental Decision Support Systems for Renewable Energy Demand Forecasting(https://arxiv.org/abs/2410.15286)
Keywords: transformer
Abstract: Against the backdrop of increasingly severe global environmental changes, accurately predicting and meeting renewable energy demands has become a key challenge for sustainable business development. Traditional energy demand forecasting methods often struggle with complex data processing and low prediction accuracy. To address these issues, this paper introduces a novel approach that combines deep learning techniques with environmental decision support systems. The model integrates advanced deep learning techniques, including LSTM and Transformer, and PSO algorithm for parameter optimization, significantly enhancing predictive performance and practical applicability. Results show that our model achieves substantial improvements across various metrics, including a 30% reduction in MAE, a 20% decrease in MAPE, a 25% drop in RMSE, and a 35% decline in MSE. These results validate the model's effectiveness and reliability in renewable energy demand forecasting. This research provides valuable insights for applying deep learning in environmental decision support systems.
Title: Attention Is All You Need for LLM-based Code Vulnerability Localization
Copy Paste: [[2410.15288]] Attention Is All You Need for LLM-based Code Vulnerability Localization(https://arxiv.org/abs/2410.15288)
Keywords: robust, large language model
Abstract: The rapid expansion of software systems and the growing number of reported vulnerabilities have emphasized the importance of accurately identifying vulnerable code segments. Traditional methods for vulnerability localization, such as manual code audits or rule-based tools, are often time-consuming and limited in scope, typically focusing on specific programming languages or types of vulnerabilities. In recent years, the introduction of large language models (LLMs) such as GPT and LLaMA has opened new possibilities for automating vulnerability detection. However, while LLMs show promise in this area, they face challenges, particularly in maintaining accuracy over longer code contexts. This paper introduces LOVA, a novel framework leveraging the self-attention mechanisms inherent in LLMs to enhance vulnerability localization. Our key insight is that self-attention mechanisms assign varying importance to different parts of the input, making it possible to track how much attention the model focuses on specific lines of code. In the context of vulnerability localization, the hypothesis is that vulnerable lines of code will naturally attract higher attention weights because they have a greater influence on the model's output. By systematically tracking changes in attention weights and focusing on specific lines of code, LOVA improves the precision of identifying vulnerable lines across various programming languages. Through rigorous experimentation and evaluation, we demonstrate that LOVA significantly outperforms existing LLM-based approaches, achieving up to a 5.3x improvement in F1-scores. LOVA also demonstrated strong scalability, with up to a 14.6x improvement in smart contract vulnerability localization across languages like C, Python, Java, and Solidity. Its robustness was proven through consistent performance across different LLM architectures.
Abstract: Computer-aided analysis of security protocols heavily relies on equational theories to model cryptographic primitives. Most automated verifiers for security protocols focus on equational theories that satisfy the Finite Variant Property (FVP), for which solving unification is decidable. However, they either require to prove FVP by hand or at least to provide a representation as an E-convergent rewrite system, usually E being at most the equational theory for an associative and commutative function symbol (AC). The verifier ProVerif is probably the only exception amongst these tools as it automatically proves FVP without requiring a representation, but on a small class of equational theories. In this work, we propose a novel semi-decision procedure for proving FVP, without the need for a specific representation, and for a class of theories that goes beyond the ones expressed by an E-convergent rewrite system. We implemented a prototype and successfully applied it on several theories from the literature.
Title: Multiple Kernel Clustering via Local Regression Integration
Authors: Liang Du, Xin Ren, Haiying Zhang, Peng Zhou
Copy Paste: [[2410.15304]] Multiple Kernel Clustering via Local Regression Integration(https://arxiv.org/abs/2410.15304)
Keywords: robust
Abstract: Multiple kernel methods less consider the intrinsic manifold structure of multiple kernel data and estimate the consensus kernel matrix with quadratic number of variables, which makes it vulnerable to the noise and outliers within multiple candidate kernels. This paper first presents the clustering method via kernelized local regression (CKLR). It captures the local structure of kernel data and employs kernel regression on the local region to predict the clustering results. Moreover, this paper further extends it to perform clustering via the multiple kernel local regression (CMKLR). We construct the kernel level local regression sparse coefficient matrix for each candidate kernel, which well characterizes the kernel level manifold structure. We then aggregate all the kernel level local regression coefficients via linear weights and generate the consensus sparse local regression coefficient, which largely reduces the number of candidate variables and becomes more robust against noises and outliers within multiple kernel data. Thus, the proposed method CMKLR avoids the above two limitations. It only contains one additional hyperparameter for tuning. Extensive experimental results show that the clustering performance of the proposed method on benchmark datasets is better than that of 10 state-of-the-art multiple kernel clustering methods.
Title: LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content
Authors: Mohamed Bayan Kmainasi, Ali Ezzat Shahroor, Maram Hasanain, Sahinur Rahman Laskar, Naeemul Hassan, Firoj Alam
Copy Paste: [[2410.15308]] LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content(https://arxiv.org/abs/2410.15308)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable success as general-purpose task solvers across various fields, including NLP, healthcare, finance, and law. However, their capabilities remain limited when addressing domain-specific problems, particularly in downstream NLP tasks. Research has shown that models fine-tuned on instruction-based downstream NLP datasets outperform those that are not fine-tuned. While most efforts in this area have primarily focused on resource-rich languages like English and broad domains, little attention has been given to multilingual settings and specific domains. To address this gap, this study focuses on developing a specialized LLM, LlamaLens, for analyzing news and social media content in a multilingual context. To the best of our knowledge, this is the first attempt to tackle both domain specificity and multilinguality, with a particular focus on news and social media. Our experimental setup includes 19 tasks, represented by 52 datasets covering Arabic, English, and Hindi. We demonstrate that LlamaLens outperforms the current state-of-the-art (SOTA) on 16 testing sets, and achieves comparable performance on 10 sets. We make the models and resources publicly available for the research community.(this https URL)
Title: Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image
Authors: Yu Zhao, Hao Fei, Xiangtai Li, Libo Qin, Jiayi Ji, Hongyuan Zhu, Meishan Zhang, Min Zhang, Jianguo Wei
Copy Paste: [[2410.15312]] Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image(https://arxiv.org/abs/2410.15312)
Keywords: diffusion
Abstract: In the visual spatial understanding (VSU) area, spatial image-to-text (SI2T) and spatial text-to-image (ST2I) are two fundamental tasks that appear in dual form. Existing methods for standalone SI2T or ST2I perform imperfectly in spatial understanding, due to the difficulty of 3D-wise spatial feature modeling. In this work, we consider modeling the SI2T and ST2I together under a dual learning framework. During the dual framework, we then propose to represent the 3D spatial scene features with a novel 3D scene graph (3DSG) representation that can be shared and beneficial to both tasks. Further, inspired by the intuition that the easier 3D$\to$image and 3D$\to$text processes also exist symmetrically in the ST2I and SI2T, respectively, we propose the Spatial Dual Discrete Diffusion (SD$^3$) framework, which utilizes the intermediate features of the 3D$\to$X processes to guide the hard X$\to$3D processes, such that the overall ST2I and SI2T will benefit each other. On the visual spatial understanding dataset VSD, our system outperforms the mainstream T2I and I2T methods significantly. Further in-depth analysis reveals how our dual learning strategy advances.
Abstract: Large Language Models (LLMs) have revolutionized natural language processing, but their application to speech-based tasks remains challenging due to the complexities of integrating audio and text modalities. This paper introduces Ichigo, a mixed-modal model that seamlessly processes interleaved sequences of speech and text. Utilizing a tokenized early-fusion approach, Ichigo quantizes speech into discrete tokens and employs a uniform transformer-based architecture for both speech and text modalities. This method enables joint reasoning and generation across modalities without the need for separate adapters. We present a comprehensive training methodology, including pre-training on multilingual speech recognition datasets and fine-tuning on a curated instruction dataset. Ichigo demonstrates state-of-the-art performance on speech question-answering benchmarks, outperforming existing open-source speech language models and achieving comparable results to cascaded systems. Notably, Ichigo exhibits a latency of just 111 ms to first token generation, significantly lower than current models. Our approach not only advances the field of multimodal AI but also provides a framework for smaller research teams to contribute effectively to open-source speech-language models.
Copy Paste: [[2410.15319]] Causality for Large Language Models(https://arxiv.org/abs/2410.15319)
Keywords: large language model
Abstract: Recent breakthroughs in artificial intelligence have driven a paradigm shift, where large language models (LLMs) with billions or trillions of parameters are trained on vast datasets, achieving unprecedented success across a series of language tasks. However, despite these successes, LLMs still rely on probabilistic modeling, which often captures spurious correlations rooted in linguistic patterns and social stereotypes, rather than the true causal relationships between entities and events. This limitation renders LLMs vulnerable to issues such as demographic biases, social stereotypes, and LLM hallucinations. These challenges highlight the urgent need to integrate causality into LLMs, moving beyond correlation-driven paradigms to build more reliable and ethically aligned AI systems. While many existing surveys and studies focus on utilizing prompt engineering to activate LLMs for causal knowledge or developing benchmarks to assess their causal reasoning abilities, most of these efforts rely on human intervention to activate pre-trained models. How to embed causality into the training process of LLMs and build more general and intelligent models remains unexplored. Recent research highlights that LLMs function as causal parrots, capable of reciting causal knowledge without truly understanding or applying it. These prompt-based methods are still limited to human interventional improvements. This survey aims to address this gap by exploring how causality can enhance LLMs at every stage of their lifecycle-from token embedding learning and foundation model training to fine-tuning, alignment, inference, and evaluation-paving the way for more interpretable, reliable, and causally-informed models. Additionally, we further outline six promising future directions to advance LLM development, enhance their causal reasoning capabilities, and address the current limitations these models face.
Title: FoMo: A Foundation Model for Mobile Traffic Forecasting with Diffusion Model
Authors: Haoye Chai, Shiyuan Zhang, Xiaoqian Qi, Yong Li
Copy Paste: [[2410.15322]] FoMo: A Foundation Model for Mobile Traffic Forecasting with Diffusion Model(https://arxiv.org/abs/2410.15322)
Keywords: diffusion, transformer
Abstract: Mobile traffic forecasting allows operators to anticipate network dynamics and performance in advance, offering substantial potential for enhancing service quality and improving user experience. However, existing models are often task-oriented and are trained with tailored data, which limits their effectiveness in diverse mobile network tasks of Base Station (BS) deployment, resource allocation, energy optimization, etc. and hinders generalization across different urban environments. Foundation models have made remarkable strides across various domains of NLP and CV due to their multi-tasking adaption and zero/few-shot learning capabilities. In this paper, we propose an innovative Foundation model for Mo}bile traffic forecasting (FoMo), aiming to handle diverse forecasting tasks of short/long-term predictions and distribution generation across multiple cities to support network planning and optimization. FoMo combines diffusion models and transformers, where various spatio-temporal masks are proposed to enable FoMo to learn intrinsic features of different tasks, and a contrastive learning strategy is developed to capture the correlations between mobile traffic and urban contexts, thereby improving its transfer learning capability. Extensive experiments on 9 real-world datasets demonstrate that FoMo outperforms current models concerning diverse forecasting tasks and zero/few-shot learning, showcasing a strong universality. We further deploy the FoMo on the JiuTian optimization platform of China Mobile, where we use the predicted mobile data to formulate network planning and optimization applications, including BS deployment, resource block scheduling, and BS sleep control.
Title: A Survey of Uncertainty Estimation in LLMs: Theory Meets Practice
Copy Paste: [[2410.15326]] A Survey of Uncertainty Estimation in LLMs: Theory Meets Practice(https://arxiv.org/abs/2410.15326)
Keywords: large language model
Abstract: As large language models (LLMs) continue to evolve, understanding and quantifying the uncertainty in their predictions is critical for enhancing application credibility. However, the existing literature relevant to LLM uncertainty estimation often relies on heuristic approaches, lacking systematic classification of the methods. In this survey, we clarify the definitions of uncertainty and confidence, highlighting their distinctions and implications for model predictions. On this basis, we integrate theoretical perspectives, including Bayesian inference, information theory, and ensemble strategies, to categorize various classes of uncertainty estimation methods derived from heuristic approaches. Additionally, we address challenges that arise when applying these methods to LLMs. We also explore techniques for incorporating uncertainty into diverse applications, including out-of-distribution detection, data annotation, and question clarification. Our review provides insights into uncertainty estimation from both definitional and theoretical angles, contributing to a comprehensive understanding of this critical aspect in LLMs. We aim to inspire the development of more reliable and effective uncertainty estimation approaches for LLMs in real-world scenarios.
Title: EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models
Copy Paste: [[2410.15332]] EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models(https://arxiv.org/abs/2410.15332)
Keywords: large language model
Abstract: Large Language Models (LLMs) are critical for a wide range of applications, but serving them efficiently becomes increasingly challenging as inputs become more complex. Context caching improves serving performance by exploiting inter-request dependency and reusing key-value (KV) cache across requests, thus improving time-to-first-token (TTFT). However, existing prefix-based context caching requires exact token prefix matches, limiting cache reuse in few-shot learning, multi-document QA, or retrieval-augmented generation, where prefixes may vary. In this paper, we present EPIC, an LLM serving system that introduces position-independent context caching (PIC), enabling modular KV cache reuse regardless of token chunk position (or prefix). EPIC features two key designs: AttnLink, which leverages static attention sparsity to minimize recomputation for accuracy recovery, and KVSplit, a customizable chunking method that preserves semantic coherence. Our experiments demonstrate that Epic delivers up to 8x improvements in TTFT and 7x throughput over existing systems, with negligible or no accuracy loss. By addressing the limitations of traditional caching approaches, Epic enables more scalable and efficient LLM inference.
Title: Modality-Fair Preference Optimization for Trustworthy MLLM Alignment
Authors: Songtao Jiang, Yan Zhang, Ruizhe Chen, Yeying Jin, Zuozhu Liu
Copy Paste: [[2410.15334]] Modality-Fair Preference Optimization for Trustworthy MLLM Alignment(https://arxiv.org/abs/2410.15334)
Keywords: fair, large language model
Abstract: Direct Preference Optimization (DPO) is effective for aligning large language models (LLMs), but when applied to multimodal models (MLLMs), it often favors text over image information, leading to unreliable outputs and visual hallucinations. To address this, we propose Modality-Fair Preference Optimization (MFPO) to balance text and image preferences. First, we found that the lack of image-related rewards in preference data biases optimization toward text, so we created automated, fine-grained image preference data to correct this. Then, we designed a learning objective to ensure the model captures both text and image preferences while maintaining high-quality outputs. Finally, we use a multi-stage alignment approach to stabilize training and improve learning across both modalities. Extensive experiments demonstrate that MFPO significantly enhances MLLM trustworthiness. On models like LLaVA-v1.5 (7B, 13B), our approach reduces hallucinations substantially. On the 7B model, MFPO outperforms GPT-4V and achieves a nearly 40\% improvement over previous methods on Object HalBench, as well as achieving state-of-the-art performance on both Object HalBench and AMBER when combined with the latest LLaVA-v1.6. Code will be released.
Title: YOLO-RD: Introducing Relevant and Compact Explicit Knowledge to YOLO by Retriever-Dictionary
Authors: Hao-Tang Tsui, Chien-Yao Wang, Hong-Yuan Mark Liao
Copy Paste: [[2410.15346]] YOLO-RD: Introducing Relevant and Compact Explicit Knowledge to YOLO by Retriever-Dictionary(https://arxiv.org/abs/2410.15346)
Keywords: large language model, segmentation
Abstract: Identifying and localizing objects within images is a fundamental challenge, and numerous efforts have been made to enhance model accuracy by experimenting with diverse architectures and refining training strategies. Nevertheless, a prevalent limitation in existing models is overemphasizing the current input while ignoring the information from the entire dataset. We introduce an innovative {\em \textbf{R}etriever}-{\em\textbf{D}ictionary} (RD) module to address this issue. This architecture enables YOLO-based models to efficiently retrieve features from a Dictionary that contains the insight of the dataset, which is built by the knowledge from Visual Models (VM), Large Language Models (LLM), or Visual Language Models (VLM). The flexible RD enables the model to incorporate such explicit knowledge that enhances the ability to benefit multiple tasks, specifically, segmentation, detection, and classification, from pixel to image level. The experiments show that using the RD significantly improves model performance, achieving more than a 3\% increase in mean Average Precision for object detection with less than a 1\% increase in model parameters. Beyond 1-stage object detection models, the RD module improves the effectiveness of 2-stage models and DETR-based architectures, such as Faster R-CNN and Deformable DETR
Title: Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models
Copy Paste: [[2410.15362]] Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models(https://arxiv.org/abs/2410.15362)
Keywords: attack, large language model
Abstract: Aligned Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, LLMs remain susceptible to jailbreak adversarial attacks, where adversaries manipulate prompts to elicit malicious responses that aligned LLMs should have avoided. Identifying these vulnerabilities is crucial for understanding the inherent weaknesses of LLMs and preventing their potential misuse. One pioneering work in jailbreaking is the GCG attack, a discrete token optimization algorithm that seeks to find a suffix capable of jailbreaking aligned LLMs. Despite the success of GCG, we find it suboptimal, requiring significantly large computational costs, and the achieved jailbreaking performance is limited. In this work, we propose Faster-GCG, an efficient adversarial jailbreak method by delving deep into the design of GCG. Experiments demonstrate that Faster-GCG can surpass the original GCG with only 1/10 of the computational cost, achieving significantly higher attack success rates on various open-source aligned LLMs. In addition, We demonstrate that Faster-GCG exhibits improved attack transferability when testing on closed-sourced LLMs such as ChatGPT.
Title: Scene Graph Generation with Role-Playing Large Language Models
Copy Paste: [[2410.15364]] Scene Graph Generation with Role-Playing Large Language Models(https://arxiv.org/abs/2410.15364)
Keywords: large language model
Abstract: Current approaches for open-vocabulary scene graph generation (OVSGG) use vision-language models such as CLIP and follow a standard zero-shot pipeline -- computing similarity between the query image and the text embeddings for each category (i.e., text classifiers). In this work, we argue that the text classifiers adopted by existing OVSGG methods, i.e., category-/part-level prompts, are scene-agnostic as they remain unchanged across contexts. Using such fixed text classifiers not only struggles to model visual relations with high variance, but also falls short in adapting to distinct contexts. To plug these intrinsic shortcomings, we devise SDSGG, a scene-specific description based OVSGG framework where the weights of text classifiers are adaptively adjusted according to the visual content. In particular, to generate comprehensive and diverse descriptions oriented to the scene, an LLM is asked to play different roles (e.g., biologist and engineer) to analyze and discuss the descriptive features of a given scene from different views. Unlike previous efforts simply treating the generated descriptions as mutually equivalent text classifiers, SDSGG is equipped with an advanced renormalization mechanism to adjust the influence of each text classifier based on its relevance to the presented scene (this is what the term "specific" means). Furthermore, to capture the complicated interplay between subjects and objects, we propose a new lightweight module called mutual visual adapter. It refines CLIP's ability to recognize relations by learning an interaction-aware semantic space. Extensive experiments on prevalent benchmarks show that SDSGG outperforms top-leading methods by a clear margin.
Title: FrameBridge: Improving Image-to-Video Generation with Bridge Models
Copy Paste: [[2410.15371]] FrameBridge: Improving Image-to-Video Generation with Bridge Models(https://arxiv.org/abs/2410.15371)
Keywords: diffusion, generative
Abstract: Image-to-video (I2V) generation is gaining increasing attention with its wide application in video synthesis. Recently, diffusion-based I2V models have achieved remarkable progress given their novel design on network architecture, cascaded framework, and motion representation. However, restricted by their noise-to-data generation process, diffusion-based methods inevitably suffer the difficulty to generate video samples with both appearance consistency and temporal coherence from an uninformative Gaussian noise, which may limit their synthesis quality. In this work, we present FrameBridge, taking the given static image as the prior of video target and establishing a tractable bridge model between them. By formulating I2V synthesis as a frames-to-frames generation task and modelling it with a data-to-data process, we fully exploit the information in input image and facilitate the generative model to learn the image animation process. In two popular settings of training I2V models, namely fine-tuning a pre-trained text-to-video (T2V) model or training from scratch, we further propose two techniques, SNR-Aligned Fine-tuning (SAF) and neural prior, which improve the fine-tuning efficiency of diffusion-based T2V models to FrameBridge and the synthesis quality of bridge-based I2V models respectively. Experiments conducted on WebVid-2M and UCF-101 demonstrate that: (1) our FrameBridge achieves superior I2V quality in comparison with the diffusion counterpart (zero-shot FVD 83 vs. 176 on MSR-VTT and non-zero-shot FVD 122 vs. 171 on UCF-101); (2) our proposed SAF and neural prior effectively enhance the ability of bridge-based I2V models in the scenarios of fine-tuning and training from scratch. Demo samples can be visited at: this https URL.
Title: Explainability of Point Cloud Neural Networks Using SMILE: Statistical Model-Agnostic Interpretability with Local Explanations
Authors: Seyed Mohammad Ahmadi, Koorosh Aslansefat, Ruben Valcarce-Dineiro, Joshua Barnfather
Copy Paste: [[2410.15374]] Explainability of Point Cloud Neural Networks Using SMILE: Statistical Model-Agnostic Interpretability with Local Explanations(https://arxiv.org/abs/2410.15374)
Abstract: In today's world, the significance of explainable AI (XAI) is growing in robotics and point cloud applications, as the lack of transparency in decision-making can pose considerable safety risks, particularly in autonomous systems. As these technologies are integrated into real-world environments, ensuring that model decisions are interpretable and trustworthy is vital for operational reliability and safety assurance. This study explores the implementation of SMILE, a novel explainability method originally designed for deep neural networks, on point cloud-based models. SMILE builds on LIME by incorporating Empirical Cumulative Distribution Function (ECDF) statistical distances, offering enhanced robustness and interpretability, particularly when the Anderson-Darling distance is used. The approach demonstrates superior performance in terms of fidelity loss, R2 scores, and robustness across various kernel widths, perturbation numbers, and clustering configurations. Moreover, this study introduces a stability analysis for point cloud data using the Jaccard index, establishing a new benchmark and baseline for model stability in this field. The study further identifies dataset biases in the classification of the 'person' category, emphasizing the necessity for more comprehensive datasets in safety-critical applications like autonomous driving and robotics. The results underscore the potential of advanced explainability models and highlight areas for future research, including the application of alternative surrogate models and explainability techniques in point cloud data.
Title: ActiveNeuS: Neural Signed Distance Fields for Active Stereo
Copy Paste: [[2410.15376]] ActiveNeuS: Neural Signed Distance Fields for Active Stereo(https://arxiv.org/abs/2410.15376)
Keywords: robust
Abstract: 3D-shape reconstruction in extreme environments, such as low illumination or scattering condition, has been an open problem and intensively researched. Active stereo is one of potential solution for such environments for its robustness and high accuracy. However, active stereo systems usually consist of specialized system configurations with complicated algorithms, which narrow their application. In this paper, we propose Neural Signed Distance Field for active stereo systems to enable implicit correspondence search and triangulation in generalized Structured Light. With our technique, textureless or equivalent surfaces by low light condition are successfully reconstructed even with a small number of captured images. Experiments were conducted to confirm that the proposed method could achieve state-of-the-art reconstruction quality under such severe condition. We also demonstrated that the proposed method worked in an underwater scenario.
Title: Neural Active Structure-from-Motion in Dark and Textureless Environment
Authors: Kazuto Ichimaru, Diego Thomas, Takafumi Iwaguchi, Hiroshi Kawasaki
Copy Paste: [[2410.15378]] Neural Active Structure-from-Motion in Dark and Textureless Environment(https://arxiv.org/abs/2410.15378)
Keywords: robust
Abstract: Active 3D measurement, especially structured light (SL) has been widely used in various fields for its robustness against textureless or equivalent surfaces by low light illumination. In addition, reconstruction of large scenes by moving the SL system has become popular, however, there have been few practical techniques to obtain the system's precise pose information only from images, since most conventional techniques are based on image features, which cannot be retrieved under textureless environments. In this paper, we propose a simultaneous shape reconstruction and pose estimation technique for SL systems from an image set where sparsely projected patterns onto the scene are observed (i.e. no scene texture information), which we call Active SfM. To achieve this, we propose a full optimization framework of the volumetric shape that employs neural signed distance fields (Neural-SDF) for SL with the goal of not only reconstructing the scene shape but also estimating the poses for each motion of the system. Experimental results show that the proposed method is able to achieve accurate shape reconstruction as well as pose estimation from images where only projected patterns are observed.
Title: Synthetic Data Generation for Residential Load Patterns via Recurrent GAN and Ensemble Method
Copy Paste: [[2410.15379]] Synthetic Data Generation for Residential Load Patterns via Recurrent GAN and Ensemble Method(https://arxiv.org/abs/2410.15379)
Keywords: privacy, generative
Abstract: Generating synthetic residential load data that can accurately represent actual electricity consumption patterns is crucial for effective power system planning and operation. The necessity for synthetic data is underscored by the inherent challenges associated with using real-world load data, such as privacy considerations and logistical complexities in large-scale data collection. In this work, we tackle the above-mentioned challenges by developing the Ensemble Recurrent Generative Adversarial Network (ERGAN) framework to generate high-fidelity synthetic residential load data. ERGAN leverages an ensemble of recurrent Generative Adversarial Networks, augmented by a loss function that concurrently takes into account adversarial loss and differences between statistical properties. Our developed ERGAN can capture diverse load patterns across various households, thereby enhancing the realism and diversity of the synthetic data generated. Comprehensive evaluations demonstrate that our method consistently outperforms established benchmarks in the synthetic generation of residential load data across various performance metrics including diversity, similarity, and statistical measures. The findings confirm the potential of ERGAN as an effective tool for energy applications requiring synthetic yet realistic load data. We also make the generated synthetic residential load patterns publicly available.
Title: LoRA-IR: Taming Low-Rank Experts for Efficient All-in-One Image Restoration
Abstract: Prompt-based all-in-one image restoration (IR) frameworks have achieved remarkable performance by incorporating degradation-specific information into prompt modules. Nevertheless, handling the complex and diverse degradations encountered in real-world scenarios remains a significant challenge. To address this challenge, we propose LoRA-IR, a flexible framework that dynamically leverages compact low-rank experts to facilitate efficient all-in-one image restoration. Specifically, LoRA-IR consists of two training stages: degradation-guided pre-training and parameter-efficient fine-tuning. In the pre-training stage, we enhance the pre-trained CLIP model by introducing a simple mechanism that scales it to higher resolutions, allowing us to extract robust degradation representations that adaptively guide the IR network. In the fine-tuning stage, we refine the pre-trained IR network using low-rank adaptation (LoRA). Built upon a Mixture-of-Experts (MoE) architecture, LoRA-IR dynamically integrates multiple low-rank restoration experts through a degradation-guided router. This dynamic integration mechanism significantly enhances our model's adaptability to diverse and unknown degradations in complex real-world scenarios. Extensive experiments demonstrate that LoRA-IR achieves state-of-the-art performance across 14 image restoration tasks and 29 benchmarks. Code and pre-trained models will be available at: this https URL.
Title: CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges
Copy Paste: [[2410.15393]] CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges(https://arxiv.org/abs/2410.15393)
Keywords: robust, fair, large language model
Abstract: The use of large language models (LLMs) as automated evaluation tools to assess the quality of generated natural language, known as LLMs-as-Judges, has demonstrated promising capabilities and is rapidly gaining widespread attention. However, when applied to pairwise comparisons of candidate responses, LLM-based evaluators often exhibit selection bias. Specifically, their judgments may become inconsistent when the option positions or ID tokens are swapped, compromising the effectiveness and fairness of the evaluation result. To address this challenge, we introduce CalibraEval, a novel label-free method for mitigating selection bias during inference. Specifically, CalibraEval reformulates debiasing as an optimization task aimed at adjusting observed prediction distributions to align with unbiased prediction distributions. To solve this optimization problem, we propose a non-parametric order-preserving algorithm (NOA). This algorithm leverages the partial order relationships between model prediction distributions, thereby eliminating the need for explicit labels and precise mathematical function this http URL evaluations of LLMs in multiple representative benchmarks demonstrate that CalibraEval effectively mitigates selection bias and improves performance compared to existing debiasing methods. This work marks a step toward building more robust and unbiased automated evaluation frameworks, paving the way for improved reliability in AI-driven assessments
Title: The Best Defense is a Good Offense: Countering LLM-Powered Cyberattacks
Authors: Daniel Ayzenshteyn, Roy Weiss, Yisroel Mirsky
Copy Paste: [[2410.15396]] The Best Defense is a Good Offense: Countering LLM-Powered Cyberattacks(https://arxiv.org/abs/2410.15396)
Keywords: defense, attack, large language model
Abstract: As large language models (LLMs) continue to evolve, their potential use in automating cyberattacks becomes increasingly likely. With capabilities such as reconnaissance, exploitation, and command execution, LLMs could soon become integral to autonomous cyber agents, capable of launching highly sophisticated attacks. In this paper, we introduce novel defense strategies that exploit the inherent vulnerabilities of attacking LLMs. By targeting weaknesses such as biases, trust in input, memory limitations, and their tunnel-vision approach to problem-solving, we develop techniques to mislead, delay, or neutralize these autonomous agents. We evaluate our defenses under black-box conditions, starting with single prompt-response scenarios and progressing to real-world tests using custom-built CTF machines. Our results show defense success rates of up to 90\%, demonstrating the effectiveness of turning LLM vulnerabilities into defensive strategies against LLM-driven cyber threats.
Title: IPO: Interpretable Prompt Optimization for Vision-Language Models
Authors: Yingjun Du, Wenfang Sun, Cees G. M. Snoek
Copy Paste: [[2410.15397]] IPO: Interpretable Prompt Optimization for Vision-Language Models(https://arxiv.org/abs/2410.15397)
Keywords: interpretability, large language model
Abstract: Pre-trained vision-language models like CLIP have remarkably adapted to various downstream tasks. Nonetheless, their performance heavily depends on the specificity of the input text prompts, which requires skillful prompt template engineering. Instead, current approaches to prompt optimization learn the prompts through gradient descent, where the prompts are treated as adjustable parameters. However, these methods tend to lead to overfitting of the base classes seen during training and produce prompts that are no longer understandable by humans. This paper introduces a simple but interpretable prompt optimizer (IPO), that utilizes large language models (LLMs) to generate textual prompts dynamically. We introduce a Prompt Optimization Prompt that not only guides LLMs in creating effective prompts but also stores past prompts with their performance metrics, providing rich in-context information. Additionally, we incorporate a large multimodal model (LMM) to condition on visual content by generating image descriptions, which enhance the interaction between textual and visual modalities. This allows for thae creation of dataset-specific prompts that improve generalization performance, while maintaining human comprehension. Extensive testing across 11 datasets reveals that IPO not only improves the accuracy of existing gradient-descent-based prompt learning methods but also considerably enhances the interpretability of the generated prompts. By leveraging the strengths of LLMs, our approach ensures that the prompts remain human-understandable, thereby facilitating better transparency and oversight for vision-language models.
Title: MMCS: A Multimodal Medical Diagnosis System Integrating Image Analysis and Knowledge-based Departmental Consultation
Authors: Yi Ren, HanZhi Zhang, Weibin Li, Diandong Liu, Tianyi Zhang, Jie He
Copy Paste: [[2410.15403]] MMCS: A Multimodal Medical Diagnosis System Integrating Image Analysis and Knowledge-based Departmental Consultation(https://arxiv.org/abs/2410.15403)
Keywords: large language model
Abstract: We present MMCS, a system capable of recognizing medical images and patient facial details, and providing professional medical diagnoses. The system consists of two core components: The first component is the analysis of medical images and videos. We trained a specialized multimodal medical model capable of interpreting medical images and accurately analyzing patients' facial emotions and facial paralysis conditions. The model achieved an accuracy of 72.59% on the FER2013 facial emotion recognition dataset, with a 91.1% accuracy in recognizing the happy emotion. In facial paralysis recognition, the model reached an accuracy of 92%, which is 30% higher than that of GPT-4o. Based on this model, we developed a parser for analyzing facial movement videos of patients with facial paralysis, achieving precise grading of the paralysis severity. In tests on 30 videos of facial paralysis patients, the system demonstrated a grading accuracy of 83.3%.The second component is the generation of professional medical responses. We employed a large language model, integrated with a medical knowledge base, to generate professional diagnoses based on the analysis of medical images or videos. The core innovation lies in our development of a department-specific knowledge base routing management mechanism, in which the large language model categorizes data by medical departments and, during the retrieval process, determines the appropriate knowledge base to query. This significantly improves retrieval accuracy in the RAG (retrieval-augmented generation) process. This mechanism led to an average increase of 4 percentage points in accuracy for various large language models on the MedQA this http URL code is open-sourced and available at: this https URL.
Title: PEAS: A Strategy for Crafting Transferable Adversarial Examples
Copy Paste: [[2410.15409]] PEAS: A Strategy for Crafting Transferable Adversarial Examples(https://arxiv.org/abs/2410.15409)
Keywords: attack
Abstract: Black box attacks, where adversaries have limited knowledge of the target model, pose a significant threat to machine learning systems. Adversarial examples generated with a substitute model often suffer from limited transferability to the target model. While recent work explores ranking perturbations for improved success rates, these methods see only modest gains. We propose a novel strategy called PEAS that can boost the transferability of existing black box attacks. PEAS leverages the insight that samples which are perceptually equivalent exhibit significant variability in their adversarial transferability. Our approach first generates a set of images from an initial sample via subtle augmentations. We then evaluate the transferability of adversarial perturbations on these images using a set of substitute models. Finally, the most transferable adversarial example is selected and used for the attack. Our experiments show that PEAS can double the performance of existing attacks, achieving a 2.5x improvement in attack success rates on average over current ranking methods. We thoroughly evaluate PEAS on ImageNet and CIFAR-10, analyze hyperparameter impacts, and provide an ablation study to isolate each component's importance.
Title: A Comprehensive Evaluation of Cognitive Biases in LLMs
Authors: Simon Malberg, Roman Poletukhin, Carolin M. Schuster, Georg Groh
Copy Paste: [[2410.15413]] A Comprehensive Evaluation of Cognitive Biases in LLMs(https://arxiv.org/abs/2410.15413)
Keywords: large language model
Abstract: We present a large-scale evaluation of 30 cognitive biases in 20 state-of-the-art large language models (LLMs) under various decision-making scenarios. Our contributions include a novel general-purpose test framework for reliable and large-scale generation of tests for LLMs, a benchmark dataset with 30,000 tests for detecting cognitive biases in LLMs, and a comprehensive assessment of the biases found in the 20 evaluated LLMs. Our work confirms and broadens previous findings suggesting the presence of cognitive biases in LLMs by reporting evidence of all 30 tested biases in at least some of the 20 LLMs. We publish our framework code to encourage future research on biases in LLMs: this https URL
Title: Where to Build Food Banks and Pantries: A Two-Level Machine Learning Approach
Copy Paste: [[2410.15420]] Where to Build Food Banks and Pantries: A Two-Level Machine Learning Approach(https://arxiv.org/abs/2410.15420)
Keywords: secure, security
Abstract: Over 44 million Americans currently suffer from food insecurity, of whom 13 million are children. Across the United States, thousands of food banks and pantries serve as vital sources of food and other forms of aid for food insecure families. By optimizing food bank and pantry locations, food would become more accessible to families who desperately require it. In this work, we introduce a novel two-level optimization framework, which utilizes the K-Medoids clustering algorithm in conjunction with the Open-Source Routing Machine engine, to optimize food bank and pantry locations based on real road distances to houses and house blocks. Our proposed framework also has the adaptability to factor in considerations such as median household income using a pseudo-weighted K-Medoids algorithm. Testing conducted with California and Indiana household data, as well as comparisons with real food bank and pantry locations showed that interestingly, our proposed framework yields food pantry locations superior to those of real existing ones and saves significant distance for households, while there is a marginal penalty on the first level food bank to food pantry distance. Overall, we believe that the second-level benefits of this framework far outweigh any drawbacks and yield a net benefit result.
Title: Accelerated Sub-Image Search For Variable-Size Patches Identification Based On Virtual Time Series Transformation And Segmentation
Copy Paste: [[2410.15425]] Accelerated Sub-Image Search For Variable-Size Patches Identification Based On Virtual Time Series Transformation And Segmentation(https://arxiv.org/abs/2410.15425)
Keywords: segmentation
Abstract: This paper addresses two tasks: (i) fixed-size objects such as hay bales are to be identified in an aerial image for a given reference image of the object, and (ii) variable-size patches such as areas on fields requiring spot spraying or other handling are to be identified in an image for a given small-scale reference image. Both tasks are related. The second differs in that identified sub-images similar to the reference image are further clustered before patches contours are determined by solving a traveling salesman problem. Both tasks are complex in that the exact number of similar sub-images is not known a priori. The main discussion of this paper is presentation of an acceleration mechanism for sub-image search that is based on a transformation of an image to multivariate time series along the RGB-channels and subsequent segmentation to reduce the 2D search space in the image. Two variations of the acceleration mechanism are compared to exhaustive search on diverse synthetic and real-world images. Quantitatively, proposed method results in solve time reductions of up to 2 orders of magnitude, while qualitatively delivering comparative visual results. Proposed method is neural network-free and does not use any image pre-processing.
Title: Efficient Model Extraction via Boundary Sampling
Copy Paste: [[2410.15429]] Efficient Model Extraction via Boundary Sampling(https://arxiv.org/abs/2410.15429)
Keywords: attack, extraction, data-free
Abstract: This paper introduces a novel data-free model extraction attack that significantly advances the current state-of-the-art in terms of efficiency, accuracy, and effectiveness. Traditional black-box methods rely on using the victim's model as an oracle to label a vast number of samples within high-confidence areas. This approach not only requires an extensive number of queries but also results in a less accurate and less transferable model. In contrast, our method innovates by focusing on sampling low-confidence areas (along the decision boundaries) and employing an evolutionary algorithm to optimize the sampling process. These novel contributions allow for a dramatic reduction in the number of queries needed by the attacker by a factor of 10x to 600x while simultaneously improving the accuracy of the stolen model. Moreover, our approach improves boundary alignment, resulting in better transferability of adversarial examples from the stolen model to the victim's model (increasing the attack success rate from 60\% to 82\% on average). Finally, we accomplish all of this with a strict black-box assumption on the victim, with no knowledge of the target's architecture or dataset. We demonstrate our attack on three datasets with increasingly larger resolutions and compare our performance to four state-of-the-art model extraction attacks.
Title: MedDiff-FM: A Diffusion-based Foundation Model for Versatile Medical Image Applications
Copy Paste: [[2410.15432]] MedDiff-FM: A Diffusion-based Foundation Model for Versatile Medical Image Applications(https://arxiv.org/abs/2410.15432)
Keywords: diffusion
Abstract: Diffusion models have achieved significant success in both the natural image and medical image domains, encompassing a wide range of applications. Previous investigations in medical images have often been constrained to specific anatomical regions, particular applications, and limited datasets, resulting in isolated diffusion models. This paper introduces a diffusion-based foundation model to address a diverse range of medical image tasks, namely MedDiff-FM. MedDiff-FM leverages 3D CT images from multiple publicly available datasets, covering anatomical regions from head to abdomen, to pre-train a diffusion foundation model, and explores the capabilities of the diffusion foundation model across a variety of application scenarios. The diffusion foundation model handles multi-level image processing both at the image-level and patch-level, and utilizes position embedding to establish multi-level spatial relationships as well as anatomical structures and region classes to control certain anatomical regions. MedDiff-FM manages several downstream tasks seamlessly, including image denoising, anomaly detection, and image synthesis. MedDiff-FM is also capable of performing lesion generation and lesion inpainting by rapidly fine-tuning the diffusion foundation model using ControlNet with task-specific conditions. Experimental results demonstrate the effectiveness of MedDiff-FM in addressing diverse downstream medical image tasks.
Title: Evaluating Consistencies in LLM responses through a Semantic Clustering of Question Answering
Copy Paste: [[2410.15440]] Evaluating Consistencies in LLM responses through a Semantic Clustering of Question Answering(https://arxiv.org/abs/2410.15440)
Keywords: large language model
Abstract: In the realm of Large Language Model (LLM) functionalities, providing reliable information is paramount, yet reports suggest that LLM outputs lack consistency. This inconsistency, often at-tributed to randomness in token sampling, under-mines user trust as it leads to varying responses even for identical queries. In this paper, we present a new approach for evaluating semantic consistencies of LLM including comparison of alternative tech-niques. Our approach evaluates whether LLM re-sponses are semantically congruent for a given question, recognizing that as syntactically different sentences may convey the same meaning. Here-tofore, To enhance LLM consistency, two main approaches have been explored: Leverage external knowledge as context like the RAG pattern or use Zero-shot-CoT to improve performance of LLM itself. We apply our evaluation approach to these techniques, and demonstrate to compare the im-pact of these methods on LLM response con-sistency across different domains of question an-swering tasks. Using the TruthfulQA dataset to assess LLM responses, the study induces N re-sponses per question from the LLM and clusters semantically equivalent sentences to measure semantic consistency across 37 categories. Through this, it quantitatively analyzes the effectiveness of the aforementioned methods in improving LLM performance before and after their adoption.
Title: MDFI-Net: Multiscale Differential Feature Interaction Network for Accurate Retinal Vessel Segmentation
Copy Paste: [[2410.15444]] MDFI-Net: Multiscale Differential Feature Interaction Network for Accurate Retinal Vessel Segmentation(https://arxiv.org/abs/2410.15444)
Keywords: extraction, segmentation
Abstract: The accurate segmentation of retinal vessels in fundus images is a great challenge in medical image segmentation tasks due to their highly complex structure from other this http URL, deep-learning based methods for retinal cessel segmentation achieved suboptimal outcoms,since vessels with indistinct features are prone to being overlooked in deeper layers of the network. Additionally, the abundance of redundant information in the background poses significant interference to feature extraction, thus increasing the segmentation difficulty. To address this issue, this paper proposes a feature-enhanced interaction network based on DPCN, named this http URL, we design a feature enhancement structure, the Deformable-convolutional Pulse Coupling Network (DPCN), to provide an enhanced feature iteration sequence to the segmentation network in a simple and efficient manner. Subsequently, these features will interact within the segmentation this http URL experiments were conducted on publicly available retinal vessel segmentation datasets to validate the effectiveness of our network structure. Experimental results of our algorithm show that the detection accuracy of the retinal blood vessel achieves 97.91%, 97.97% and 98.16% across all datasets. Finally, plentiful experimental results also prove that the proposed MDFI-Net achieves segmentation performance superior to state-of-the-art methods on public datasets.
Title: Concept Complement Bottleneck Model for Interpretable Medical Image Diagnosis
Copy Paste: [[2410.15446]] Concept Complement Bottleneck Model for Interpretable Medical Image Diagnosis(https://arxiv.org/abs/2410.15446)
Keywords: fair, interpretability, large language model
Abstract: Models based on human-understandable concepts have received extensive attention to improve model interpretability for trustworthy artificial intelligence in the field of medical image analysis. These methods can provide convincing explanations for model decisions but heavily rely on the detailed annotation of pre-defined concepts. Consequently, they may not be effective in cases where concepts or annotations are incomplete or low-quality. Although some methods automatically discover effective and new visual concepts rather than using pre-defined concepts or could find some human-understandable concepts via large Language models, they are prone to veering away from medical diagnostic evidence and are challenging to understand. In this paper, we propose a concept complement bottleneck model for interpretable medical image diagnosis with the aim of complementing the existing concept set and finding new concepts bridging the gap between explainable models. Specifically, we propose to use concept adapters for specific concepts to mine the concept differences and score concepts in their own attention channels to support almost fairly concept learning. Then, we devise a concept complement strategy to learn new concepts while jointly using known concepts to improve model performance. Comprehensive experiments on medical datasets demonstrate that our model outperforms the state-of-the-art competitors in concept detection and disease diagnosis tasks while providing diverse explanations to ensure model interpretability effectively.
Title: MedLogic-AQA: Enhancing Medical Question Answering with Abstractive Models Focusing on Logical Structures
Copy Paste: [[2410.15463]] MedLogic-AQA: Enhancing Medical Question Answering with Abstractive Models Focusing on Logical Structures(https://arxiv.org/abs/2410.15463)
Keywords: robust
Abstract: In Medical question-answering (QA) tasks, the need for effective systems is pivotal in delivering accurate responses to intricate medical queries. However, existing approaches often struggle to grasp the intricate logical structures and relationships inherent in medical contexts, thus limiting their capacity to furnish precise and nuanced answers. In this work, we address this gap by proposing a novel Abstractive QA system MedLogic-AQA that harnesses First Order Logic (FOL) based rules extracted from both context and questions to generate well-grounded answers. Through initial experimentation, we identified six pertinent first-order logical rules, which were then used to train a Logic-Understanding (LU) model capable of generating logical triples for a given context, question, and answer. These logic triples are then integrated into the training of MedLogic-AQA, enabling effective and coherent reasoning during answer generation. This distinctive fusion of logical reasoning with abstractive QA equips our system to produce answers that are logically sound, relevant, and engaging. Evaluation with respect to both automated and human-based demonstrates the robustness of MedLogic-AQA against strong baselines. Through empirical assessments and case studies, we validate the efficacy of MedLogic-AQA in elevating the quality and comprehensiveness of answers in terms of reasoning as well as informativeness
Title: A Novel Interpretability Metric for Explaining Bias in Language Models: Applications on Multilingual Models from Southeast Asia
Copy Paste: [[2410.15464]] A Novel Interpretability Metric for Explaining Bias in Language Models: Applications on Multilingual Models from Southeast Asia(https://arxiv.org/abs/2410.15464)
Keywords: interpretability
Abstract: Work on bias in pretrained language models (PLMs) focuses on bias evaluation and mitigation and fails to tackle the question of bias attribution and this http URL propose a novel metric, the $\textit{bias attribution score}$, which draws from information theory to measure token-level contributions to biased behavior in PLMs. We then demonstrate the utility of this metric by applying it on multilingual PLMs, including models from Southeast Asia which have not yet been thoroughly examined in bias evaluation literature. Our results confirm the presence of sexist and homophobic bias in Southeast Asian PLMs. Interpretability and semantic analyses also reveal that PLM bias is strongly induced by words relating to crime, intimate relationships, and helping among other discursive categories, suggesting that these are topics where PLMs strongly reproduce bias from pretraining data and where PLMs should be used with more caution.
Title: Keep Guessing? When Considering Inference Scaling, Mind the Baselines
Authors: Gal Yona, Or Honovich, Omer Levy, Roee Aharoni
Copy Paste: [[2410.15466]] Keep Guessing? When Considering Inference Scaling, Mind the Baselines(https://arxiv.org/abs/2410.15466)
Keywords: large language model
Abstract: Scaling inference compute in large language models (LLMs) through repeated sampling consistently increases the coverage (fraction of problems solved) as the number of samples increases. We conjecture that this observed improvement is partially due to the answer distribution of standard evaluation benchmarks, which is skewed towards a relatively small set of common answers. To test this conjecture, we define a baseline that enumerates answers according to their prevalence in the training set. Experiments spanning two domains -- mathematical reasoning and factual knowledge -- reveal that this baseline outperforms repeated model sampling for some LLMs, while the coverage for others is on par with that of a mixture strategy that obtains $k$ answers by using only $10$ model samples and similarly guessing the remaining $k-10$ attempts via enumeration. Our baseline enables a more accurate measurement of how much repeated sampling improves coverage in such settings beyond prompt-agnostic guessing.
Title: Hey GPT, Can You be More Racist? Analysis from Crowdsourced Attempts to Elicit Biased Content from Generative AI
Copy Paste: [[2410.15467]] Hey GPT, Can You be More Racist? Analysis from Crowdsourced Attempts to Elicit Biased Content from Generative AI(https://arxiv.org/abs/2410.15467)
Keywords: generative, large language model
Abstract: The widespread adoption of large language models (LLMs) and generative AI (GenAI) tools across diverse applications has amplified the importance of addressing societal biases inherent within these technologies. While the NLP community has extensively studied LLM bias, research investigating how non-expert users perceive and interact with biases from these systems remains limited. As these technologies become increasingly prevalent, understanding this question is crucial to inform model developers in their efforts to mitigate bias. To address this gap, this work presents the findings from a university-level competition, which challenged participants to design prompts for eliciting biased outputs from GenAI tools. We quantitatively and qualitatively analyze the competition submissions and identify a diverse set of biases in GenAI and strategies employed by participants to induce bias in GenAI. Our finding provides unique insights into how non-expert users perceive and interact with biases from GenAI tools.
Title: Data Augmentation via Diffusion Model to Enhance AI Fairness
Copy Paste: [[2410.15470]] Data Augmentation via Diffusion Model to Enhance AI Fairness(https://arxiv.org/abs/2410.15470)
Keywords: fair, explainability, diffusion
Abstract: AI fairness seeks to improve the transparency and explainability of AI systems by ensuring that their outcomes genuinely reflect the best interests of users. Data augmentation, which involves generating synthetic data from existing datasets, has gained significant attention as a solution to data scarcity. In particular, diffusion models have become a powerful technique for generating synthetic data, especially in fields like computer vision. This paper explores the potential of diffusion models to generate synthetic tabular data to improve AI fairness. The Tabular Denoising Diffusion Probabilistic Model (Tab-DDPM), a diffusion model adaptable to any tabular dataset and capable of handling various feature types, was utilized with different amounts of generated data for data augmentation. Additionally, reweighting samples from AIF360 was employed to further enhance AI fairness. Five traditional machine learning models-Decision Tree (DT), Gaussian Naive Bayes (GNB), K-Nearest Neighbors (KNN), Logistic Regression (LR), and Random Forest (RF)-were used to validate the proposed approach. Experimental results demonstrate that the synthetic data generated by Tab-DDPM improves fairness in binary classification.
Title: Bayesian data fusion for distributed learning
Copy Paste: [[2410.15473]] Bayesian data fusion for distributed learning(https://arxiv.org/abs/2410.15473)
Keywords: federate
Abstract: One of the main challenges of federated learning (FL) is handling non-independent and identically distributed (non-IID) client data, which may occur in practice due to unbalanced datasets and use of different data sources across clients. Knowledge sharing and model personalization are key strategies for addressing this issue. Clustered federated learning is a class of FL methods that groups clients that observe similarly distributed data into clusters, such that every client is typically associated with one data distribution and participates in training a model for that distribution along their cluster peers. In this paper, we present a unified Bayesian framework for clustered FL which associates clients to clusters. Then we propose several practical algorithms to handle the, otherwise growing, data associations in a way that trades off performance and computational complexity. This work provides insights on client-cluster associations and enables client knowledge sharing in new ways. The proposed framework circumvents the need for unique client-cluster associations, which is seen to increase the performance of the resulting models in a variety of experiments.
Title: Optimizing Backward Policies in GFlowNets via Trajectory Likelihood Maximization
Copy Paste: [[2410.15474]] Optimizing Backward Policies in GFlowNets via Trajectory Likelihood Maximization(https://arxiv.org/abs/2410.15474)
Keywords: generative
Abstract: Generative Flow Networks (GFlowNets) are a family of generative models that learn to sample objects with probabilities proportional to a given reward function. The key concept behind GFlowNets is the use of two stochastic policies: a forward policy, which incrementally constructs compositional objects, and a backward policy, which sequentially deconstructs them. Recent results show a close relationship between GFlowNet training and entropy-regularized reinforcement learning (RL) problems with a particular reward design. However, this connection applies only in the setting of a fixed backward policy, which might be a significant limitation. As a remedy to this problem, we introduce a simple backward policy optimization algorithm that involves direct maximization of the value function in an entropy-regularized Markov Decision Process (MDP) over intermediate rewards. We provide an extensive experimental evaluation of the proposed approach across various benchmarks in combination with both RL and GFlowNet algorithms and demonstrate its faster convergence and mode discovery in complex environments.
Title: Generalized Multimodal Fusion via Poisson-Nernst-Planck Equation
Copy Paste: [[2410.15475]] Generalized Multimodal Fusion via Poisson-Nernst-Planck Equation(https://arxiv.org/abs/2410.15475)
Keywords: extraction
Abstract: Previous studies have highlighted significant advancements in multimodal fusion. Nevertheless, such methods often encounter challenges regarding the efficacy of feature extraction, data integrity, consistency of feature dimensions, and adaptability across various downstream tasks. This paper proposes a generalized multimodal fusion method (GMF) via the Poisson-Nernst-Planck (PNP) equation, which adeptly addresses the aforementioned issues. Theoretically, the optimization objective for traditional multimodal tasks is formulated and redefined by integrating information entropy and the flow of gradient backward step. Leveraging these theoretical insights, the PNP equation is applied to feature fusion, rethinking multimodal features through the framework of charged particles in physics and controlling their movement through dissociation, concentration, and reconstruction. Building on these theoretical foundations, GMF disassociated features which extracted by the unimodal feature extractor into modality-specific and modality-invariant subspaces, thereby reducing mutual information and subsequently lowering the entropy of downstream tasks. The identifiability of the feature's origin enables our approach to function independently as a frontend, seamlessly integrated with a simple concatenation backend, or serve as a prerequisite for other modules. Experimental results on multiple downstream tasks show that the proposed GMF achieves performance close to the state-of-the-art (SOTA) accuracy while utilizing fewer parameters and computational resources. Furthermore, by integrating GMF with advanced fusion methods, we surpass the SOTA results.
Title: "What is the value of {templates}?" Rethinking Document Information Extraction Datasets for LLMs
Copy Paste: [[2410.15484]] "What is the value of {templates}?" Rethinking Document Information Extraction Datasets for LLMs(https://arxiv.org/abs/2410.15484)
Keywords: robust, extraction, generative, large language model
Abstract: The rise of large language models (LLMs) for visually rich document understanding (VRDU) has kindled a need for prompt-response, document-based datasets. As annotating new datasets from scratch is labor-intensive, the existing literature has generated prompt-response datasets from available resources using simple templates. For the case of key information extraction (KIE), one of the most common VRDU tasks, past work has typically employed the template "What is the value for the {key}?". However, given the variety of questions encountered in the wild, simple and uniform templates are insufficient for creating robust models in research and industrial contexts. In this work, we present K2Q, a diverse collection of five datasets converted from KIE to a prompt-response format using a plethora of bespoke templates. The questions in K2Q can span multiple entities and be extractive or boolean. We empirically compare the performance of seven baseline generative models on K2Q with zero-shot prompting. We further compare three of these models when training on K2Q versus training on simpler templates to motivate the need of our work. We find that creating diverse and intricate KIE questions enhances the performance and robustness of VRDU models. We hope this work encourages future studies on data quality for generative model training.
Abstract: The rising need for explainable deep neural network architectures has utilized semantic concepts as explainable units. Several approaches utilizing disentangled representation learning estimate the generative factors and utilize them as concepts for explaining DNNs. However, even though the generative factors for a dataset remain fixed, concepts are not fixed entities and vary based on downstream tasks. In this paper, we propose a disentanglement mechanism utilizing a variational autoencoder (VAE) for learning mutually independent generative factors for a given dataset and subsequently learning task-specific concepts using a structural causal model (SCM). Our method assumes generative factors and concepts to form a bipartite graph, with directed causal edges from generative factors to concepts. Experiments are conducted on datasets with known generative factors: D-sprites and Shapes3D. On specific downstream tasks, our proposed method successfully learns task-specific concepts which are explained well by the causal edges from the generative factors. Lastly, separate from current causal concept discovery methods, our methodology is generalizable to an arbitrary number of concepts and flexible to any downstream tasks.
Title: SEA: State-Exchange Attention for High-Fidelity Physics-Based Transformers
Authors: Parsa Esmati, Amirhossein Dadashzadeh, Vahid Goodarzi, Nicolas Larrosa, Nicolo Grilli
Copy Paste: [[2410.15495]] SEA: State-Exchange Attention for High-Fidelity Physics-Based Transformers(https://arxiv.org/abs/2410.15495)
Keywords: transformer
Abstract: Current approaches using sequential networks have shown promise in estimating field variables for dynamical systems, but they are often limited by high rollout errors. The unresolved issue of rollout error accumulation results in unreliable estimations as the network predicts further into the future, with each step's error compounding and leading to an increase in inaccuracy. Here, we introduce the State-Exchange Attention (SEA) module, a novel transformer-based module enabling information exchange between encoded fields through multi-head cross-attention. The cross-field multidirectional information exchange design enables all state variables in the system to exchange information with one another, capturing physical relationships and symmetries between fields. In addition, we incorporate a ViT-like architecture to generate spatially coherent mesh embeddings, further improving the model's ability to capture spatial dependencies in the data. This enhances the model's ability to represent complex interactions between the field variables, resulting in improved rollout error accumulation. Our results show that the Transformer model integrated with the State-Exchange Attention (SEA) module outperforms competitive baseline models, including the PbGMR-GMUS Transformer-RealNVP and GMR-GMUS Transformer, with a reduction in error of 88\% and 91\%, respectively, achieving state-of-the-art performance. Furthermore, we demonstrate that the SEA module alone can reduce errors by 97\% for state variables that are highly dependent on other states of the system.
Title: Taming Mambas for Voxel Level 3D Medical Image Segmentation
Copy Paste: [[2410.15496]] Taming Mambas for Voxel Level 3D Medical Image Segmentation(https://arxiv.org/abs/2410.15496)
Keywords: transformer, segmentation
Abstract: Recently, the field of 3D medical segmentation has been dominated by deep learning models employing Convolutional Neural Networks (CNNs) and Transformer-based architectures, each with their distinctive strengths and limitations. CNNs are constrained by a local receptive field, whereas transformers are hindered by their substantial memory requirements as well as they data hungriness, making them not ideal for processing 3D medical volumes at a fine-grained level. For these reasons, fully convolutional neural networks, as nnUNet, still dominate the scene when segmenting medical structures in 3D large medical volumes. Despite numerous advancements towards developing transformer variants with subquadratic time and memory complexity, these models still fall short in content-based reasoning. A recent breakthrough is Mamba, a Recurrent Neural Network (RNN) based on State Space Models (SSMs) outperforming Transformers in many long-context tasks (million-length sequences) on famous natural language processing and genomic benchmarks while keeping a linear complexity.
Title: Generating Tabular Data Using Heterogeneous Sequential Feature Forest Flow Matching
Authors: Ange-Clément Akazan, Alexia Jolicoeur-Martineau, Ioannis Mitliagkas
Copy Paste: [[2410.15516]] Generating Tabular Data Using Heterogeneous Sequential Feature Forest Flow Matching(https://arxiv.org/abs/2410.15516)
Keywords: privacy, robust, generative
Abstract: Privacy and regulatory constraints make data generation vital to advancing machine learning without relying on real-world datasets. A leading approach for tabular data generation is the Forest Flow (FF) method, which combines Flow Matching with XGBoost. Despite its good performance, FF is slow and makes errors when treating categorical variables as one-hot continuous features. It is also highly sensitive to small changes in the initial conditions of the ordinary differential equation (ODE). To overcome these limitations, we develop Heterogeneous Sequential Feature Forest Flow (HS3F). Our method generates data sequentially (feature-by-feature), reducing the dependency on noisy initial conditions through the additional information from previously generated features. Furthermore, it generates categorical variables using multinomial sampling (from an XGBoost classifier) instead of flow matching, improving generation speed. We also use a Runge-Kutta 4th order (Rg4) ODE solver for improved performance over the Euler solver used in FF. Our experiments with 25 datasets reveal that HS3F produces higher quality and more diverse synthetic data than FF, especially for categorical variables. It also generates data 21-27 times faster for datasets with $\geq20%$ categorical variables. HS3F further demonstrates enhanced robustness to affine transformation in flow ODE initial conditions compared to FF. This study not only validates the HS3F but also unveils promising new strategies to advance generative models.
Title: SceneGraMMi: Scene Graph-boosted Hybrid-fusion for Multi-Modal Misinformation Veracity Prediction
Copy Paste: [[2410.15517]] SceneGraMMi: Scene Graph-boosted Hybrid-fusion for Multi-Modal Misinformation Veracity Prediction(https://arxiv.org/abs/2410.15517)
Keywords: explainability
Abstract: Misinformation undermines individual knowledge and affects broader societal narratives. Despite growing interest in the research community in multi-modal misinformation detection, existing methods exhibit limitations in capturing semantic cues, key regions, and cross-modal similarities within multi-modal datasets. We propose SceneGraMMi, a Scene Graph-boosted Hybrid-fusion approach for Multi-modal Misinformation veracity prediction, which integrates scene graphs across different modalities to improve detection performance. Experimental results across four benchmark datasets show that SceneGraMMi consistently outperforms state-of-the-art methods. In a comprehensive ablation study, we highlight the contribution of each component, while Shapley values are employed to examine the explainability of the model's decision-making process.
Title: TrackMe:A Simple and Effective Multiple Object Tracking Annotation Tool
Authors: Thinh Phan, Isaac Phillips, Andrew Lockett, Michael T.Kidd, Ngan Le
Abstract: Object tracking, especially animal tracking, is one of the key topics that attract a lot of attention due to its benefits of animal behavior understanding and monitoring. Recent state-of-the-art tracking methods are founded on deep learning architectures for object detection, appearance feature extraction and track association. Despite the good tracking performance, these methods are trained and evaluated on common objects such as human and cars. To perform on the animal, there is a need to create large datasets of different types in multiple conditions. The dataset construction comprises of data collection and data annotation. In this work, we put more focus on the latter task. Particularly, we renovate the well-known tool, LabelMe, so as to assist common user with or without in-depth knowledge about computer science to annotate the data with less effort. The new tool named as TrackMe inherits the simplicity, high compatibility with varied systems, minimal hardware requirement and convenient feature utilization from the predecessor. TrackMe is an upgraded version with essential features for multiple object tracking annotation.
Title: MIRA: A Method of Federated MultI-Task Learning for LaRge LAnguage Models
Authors: Ahmed Elbakary, Chaouki Ben Issaid, Tamer ElBatt, Karim Seddik, Mehdi Bennis
Copy Paste: [[2410.15524]] MIRA: A Method of Federated MultI-Task Learning for LaRge LAnguage Models(https://arxiv.org/abs/2410.15524)
Keywords: federate, large language model
Abstract: In this paper, we introduce a method for fine-tuning Large Language Models (LLMs), inspired by Multi-Task learning in a federated manner. Our approach leverages the structure of each client's model and enables a learning scheme that considers other clients' tasks and data distribution. To mitigate the extensive computational and communication overhead often associated with LLMs, we utilize a parameter-efficient fine-tuning method, specifically Low-Rank Adaptation (LoRA), reducing the number of trainable parameters. Experimental results, with different datasets and models, demonstrate the proposed method's effectiveness compared to existing frameworks for federated fine-tuning of LLMs in terms of average and local performances. The proposed scheme outperforms existing baselines by achieving lower local loss for each client while maintaining comparable global performance.
Title: Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage
Copy Paste: [[2410.15531]] Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage(https://arxiv.org/abs/2410.15531)
Keywords: generative
Abstract: Evaluating retrieval-augmented generation (RAG) systems remains challenging, particularly for open-ended questions that lack definitive answers and require coverage of multiple sub-topics. In this paper, we introduce a novel evaluation framework based on sub-question coverage, which measures how well a RAG system addresses different facets of a question. We propose decomposing questions into sub-questions and classifying them into three types -- core, background, and follow-up -- to reflect their roles and importance. Using this categorization, we introduce a fine-grained evaluation protocol that provides insights into the retrieval and generation characteristics of RAG systems, including three commercial generative answer engines: this http URL, Perplexity AI, and Bing Chat. Interestingly, we find that while all answer engines cover core sub-questions more often than background or follow-up ones, they still miss around 50% of core sub-questions, revealing clear opportunities for improvement. Further, sub-question coverage metrics prove effective for ranking responses, achieving 82% accuracy compared to human preference annotations. Lastly, we also demonstrate that leveraging core sub-questions enhances both retrieval and answer generation in a RAG system, resulting in a 74% win rate over the baseline that lacks sub-questions.
Title: Grammatical Error Correction for Low-Resource Languages: The Case of Zarma
Authors: Mamadou K. Keita, Christopher Homan, Sofiane Abdoulaye Hamani, Adwoa Bremang, Marcos Zampieri, Habibatou Abdoulaye Alfari, Elysabhete Amadou Ibrahim, Dennis Owusu
Copy Paste: [[2410.15539]] Grammatical Error Correction for Low-Resource Languages: The Case of Zarma(https://arxiv.org/abs/2410.15539)
Keywords: large language model
Abstract: Grammatical error correction (GEC) is important for improving written materials for low-resource languages like Zarma -- spoken by over 5 million people in West Africa. Yet it remains a challenging problem. This study compares rule-based methods, machine translation (MT) models, and large language models (LLMs) for GEC in Zarma. We evaluate each approach's effectiveness on our manually-built dataset of over 250,000 examples using synthetic and human-annotated data. Our experiments show that the MT-based approach using the M2M100 model outperforms others, achieving a detection rate of 95.82% and a suggestion accuracy of 78.90% in automatic evaluations, and scoring 3.0 out of 5.0 in logical/grammar error correction during MEs by native speakers. The rule-based method achieved perfect detection (100%) and high suggestion accuracy (96.27%) for spelling corrections but struggled with context-level errors. LLMs like MT5-small showed moderate performance with a detection rate of 90.62% and a suggestion accuracy of 57.15%. Our work highlights the potential of MT models to enhance GEC in low-resource languages, paving the way for more inclusive NLP tools.
Title: Hiding in Plain Sight: Reframing Hardware Trojan Benchmarking as a Hide&Seek Modification
Authors: Amin Sarihi, Ahmad Patooghy, Peter Jamieson, Abdel-Hameed A. Badawy
Copy Paste: [[2410.15550]] Hiding in Plain Sight: Reframing Hardware Trojan Benchmarking as a Hide&Seek Modification(https://arxiv.org/abs/2410.15550)
Keywords: security
Abstract: This work focuses on advancing security research in the hardware design space by formally defining the realistic problem of Hardware Trojan (HT) detection. The goal is to model HT detection more closely to the real world, i.e., describing the problem as The Seeker's Dilemma where a detecting agent is unaware of whether circuits are infected by HTs or not. Using this theoretical problem formulation, we create a benchmark that consists of a mixture of HT-free and HT-infected restructured circuits while preserving their original functionalities. The restructured circuits are randomly infected by HTs, causing a situation where the defender is uncertain if a circuit is infected or not. We believe that our innovative benchmark and methodology of creating benchmarks will help the community judge the detection quality of different methods by comparing their success rates in circuit classification. We use our developed benchmark to evaluate three state-of-the-art HT detection tools to show baseline results for this approach. We use Principal Component Analysis to assess the strength of our benchmark, where we observe that some restructured HT-infected circuits are mapped closely to HT-free circuits, leading to significant label misclassification by detectors.
Title: Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following
Authors: Yun He, Di Jin, Chaoqi Wang, Chloe Bi, Karishma Mandyam, Hejia Zhang, Chen Zhu, Ning Li, Tengyu Xu, Hongjiang Lv, Shruti Bhosale, Chenguang Zhu, Karthik Abinav Sankararaman, Eryk Helenowski, Melanie Kambadur, Aditya Tayade, Hao Ma, Han Fang, Sinong Wang
Copy Paste: [[2410.15553]] Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following(https://arxiv.org/abs/2410.15553)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in various tasks, including instruction following, which is crucial for aligning model outputs with user expectations. However, evaluating LLMs' ability to follow instructions remains challenging due to the complexity and subjectivity of human language. Current benchmarks primarily focus on single-turn, monolingual instructions, which do not adequately reflect the complexities of real-world applications that require handling multi-turn and multilingual interactions. To address this gap, we introduce Multi-IF, a new benchmark designed to assess LLMs' proficiency in following multi-turn and multilingual instructions. Multi-IF, which utilizes a hybrid framework combining LLM and human annotators, expands upon the IFEval by incorporating multi-turn sequences and translating the English prompts into another 7 languages, resulting in a dataset of 4,501 multilingual conversations, where each has three turns. Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks. All the models tested showed a higher rate of failure in executing instructions correctly with each additional turn. For example, o1-preview drops from 0.877 at the first turn to 0.707 at the third turn in terms of average accuracy over all languages. Moreover, languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities. We release Multi-IF prompts and the evaluation code base to encourage further research in this critical area.
Title: Bayesian Concept Bottleneck Models with LLM Priors
Authors: Jean Feng, Avni Kothari, Luke Zier, Chandan Singh, Yan Shuo Tan
Copy Paste: [[2410.15555]] Bayesian Concept Bottleneck Models with LLM Priors(https://arxiv.org/abs/2410.15555)
Keywords: robust, extraction, interpretability, large language model
Abstract: Concept Bottleneck Models (CBMs) have been proposed as a compromise between white-box and black-box models, aiming to achieve interpretability without sacrificing accuracy. The standard training procedure for CBMs is to predefine a candidate set of human-interpretable concepts, extract their values from the training data, and identify a sparse subset as inputs to a transparent prediction model. However, such approaches are often hampered by the tradeoff between enumerating a sufficiently large set of concepts to include those that are truly relevant versus controlling the cost of obtaining concept extractions. This work investigates a novel approach that sidesteps these challenges: BC-LLM iteratively searches over a potentially infinite set of concepts within a Bayesian framework, in which Large Language Models (LLMs) serve as both a concept extraction mechanism and prior. BC-LLM is broadly applicable and multi-modal. Despite imperfections in LLMs, we prove that BC-LLM can provide rigorous statistical inference and uncertainty quantification. In experiments, it outperforms comparator methods including black-box models, converges more rapidly towards relevant concepts and away from spuriously correlated ones, and is more robust to out-of-distribution samples.
Title: Pruning Foundation Models for High Accuracy without Retraining
Authors: Pu Zhao, Fei Sun, Xuan Shen, Pinrui Yu, Zhenglun Kong, Yanzhi Wang, Xue Lin
Copy Paste: [[2410.15567]] Pruning Foundation Models for High Accuracy without Retraining(https://arxiv.org/abs/2410.15567)
Keywords: transformer, large language model
Abstract: Despite the superior performance, it is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations. While pruning is a promising technique to reduce model size and accelerate the inference, the traditional pruning techniques can hardly be applied for LLMs as they need to finetune the model on the full dataset with multiple epochs consuming massive data and hardware resources. To deal with this problem, post-training pruning methods are proposed to prune LLMs in one-shot without retraining. However, their accuracy after pruning may suffer from certain performance degradation due to the lack of retraining with massive data. To address this issue, in this paper, we first formulate the post-training problem for layer-wise LLM compression to simultaneously prune multiple weights in LLMs. Next, we provide an optimal solution for this problem and design our post-training pruning algorithm for both unstructured and semi-structured sparsity. Our extensive experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines across various LLM families including transformer-based LLMs and Mamba-based LLMs. Code link: this https URL
Title: ZK-DPPS: A Zero-Knowledge Decentralised Data Sharing and Processing Middleware
Authors: Amir Jabbari, Gowri Ramachandran, Sidra Malik, Raja Jurdak
Copy Paste: [[2410.15568]] ZK-DPPS: A Zero-Knowledge Decentralised Data Sharing and Processing Middleware(https://arxiv.org/abs/2410.15568)
Keywords: secure, privacy, protect, extraction
Abstract: In the current digital landscape, supply chains have transformed into complex networks driven by the Internet of Things (IoT), necessitating enhanced data sharing and processing capabilities to ensure traceability and transparency. Leveraging Blockchain technology in IoT applications advances reliability and transparency in near-real-time insight extraction processes. However, it raises significant concerns regarding data privacy. Existing privacy-preserving approaches often rely on Smart Contracts for automation and Zero Knowledge Proofs (ZKP) for privacy. However, apart from being inflexible in adopting system changes while effectively protecting data confidentiality, these approaches introduce significant computational expenses and overheads that make them impractical for dynamic supply chain environments. To address these challenges, we propose ZK-DPPS, a framework that ensures zero-knowledge communications without the need for traditional ZKPs. In ZK-DPPS, privacy is preserved through a combination of Fully Homomorphic Encryption (FHE) for computations and Secure Multi-Party Computations (SMPC) for key reconstruction. To ensure that the raw data remains private throughout the entire process, we use FHE to execute computations directly on encrypted data. The "zero-knowledge" aspect of ZK-DPPS refers to the system's ability to process and share data insights without exposing sensitive information, thus offering a practical and efficient alternative to ZKP-based methods. We demonstrate the efficacy of ZK-DPPS through a simulated supply chain scenario, showcasing its ability to tackle the dual challenges of privacy preservation and computational trust in decentralised environments.
Title: Stacking Small Language Models for Generalizability
Copy Paste: [[2410.15570]] Stacking Small Language Models for Generalizability(https://arxiv.org/abs/2410.15570)
Keywords: interpretability, large language model
Abstract: Recent advances show that large language models (LLMs) generalize strong performance across different natural language benchmarks. However, the large size of LLMs makes training and inference expensive and impractical to run in resource-limited settings. This paper introduces a new approach called fine-tuning stacks of language models (FSLM), which involves stacking small language models (SLM) as an alternative to LLMs. By fine-tuning each SLM to perform a specific task, this approach breaks down high level reasoning into multiple lower-level steps that specific SLMs are responsible for. As a result, FSLM allows for lower training and inference costs, and also improves model interpretability as each SLM communicates with the subsequent one through natural language. By evaluating FSLM on common natural language benchmarks, this paper highlights promising early results toward generalizable performance using FSLM as a cost-effective alternative to LLMs.
Title: Leveraging Retrieval-Augmented Generation for Culturally Inclusive Hakka Chatbots: Design Insights and User Perceptions
Authors: Chen-Chi Chang, Han-Pi Chang, Hung-Shin Lee
Copy Paste: [[2410.15572]] Leveraging Retrieval-Augmented Generation for Culturally Inclusive Hakka Chatbots: Design Insights and User Perceptions(https://arxiv.org/abs/2410.15572)
Keywords: generative, large language model
Abstract: In an era where cultural preservation is increasingly intertwined with technological innovation, this study introduces a groundbreaking approach to promoting and safeguarding the rich heritage of Taiwanese Hakka culture through the development of a Retrieval-Augmented Generation (RAG)-enhanced chatbot. Traditional large language models (LLMs), while powerful, often fall short in delivering accurate and contextually rich responses, particularly in culturally specific domains. By integrating external databases with generative AI models, RAG technology bridges this gap, empowering chatbots to not only provide precise answers but also resonate deeply with the cultural nuances that are crucial for authentic interactions. This study delves into the intricate process of augmenting the chatbot's knowledge base with targeted cultural data, specifically curated to reflect the unique aspects of Hakka traditions, language, and practices. Through dynamic information retrieval, the RAG-enhanced chatbot becomes a versatile tool capable of handling complex inquiries that demand an in-depth understanding of Hakka cultural context. This is particularly significant in an age where digital platforms often dilute cultural identities, making the role of culturally aware AI systems more critical than ever. System usability studies conducted as part of our research reveal a marked improvement in both user satisfaction and engagement, highlighting the chatbot's effectiveness in fostering a deeper connection with Hakka culture. The feedback underscores the potential of RAG technology to not only enhance user experience but also to serve as a vital instrument in the broader mission of ethnic mainstreaming and cultural celebration.
Copy Paste: [[2410.15576]] A Survey of Conversational Search(https://arxiv.org/abs/2410.15576)
Keywords: robust, large language model
Abstract: As a cornerstone of modern information access, search engines have become indispensable in everyday life. With the rapid advancements in AI and natural language processing (NLP) technologies, particularly large language models (LLMs), search engines have evolved to support more intuitive and intelligent interactions between users and systems. Conversational search, an emerging paradigm for next-generation search engines, leverages natural language dialogue to facilitate complex and precise information retrieval, thus attracting significant attention. Unlike traditional keyword-based search engines, conversational search systems enhance user experience by supporting intricate queries, maintaining context over multi-turn interactions, and providing robust information integration and processing capabilities. Key components such as query reformulation, search clarification, conversational retrieval, and response generation work in unison to enable these sophisticated interactions. In this survey, we explore the recent advancements and potential future directions in conversational search, examining the critical modules that constitute a conversational search system. We highlight the integration of LLMs in enhancing these systems and discuss the challenges and opportunities that lie ahead in this dynamic field. Additionally, we provide insights into real-world applications and robust evaluations of current conversational search systems, aiming to guide future research and development in conversational search.
Title: Generalized Probabilistic Attention Mechanism in Transformers
Copy Paste: [[2410.15578]] Generalized Probabilistic Attention Mechanism in Transformers(https://arxiv.org/abs/2410.15578)
Keywords: transformer
Abstract: The Transformer architecture has become widely adopted due to its demonstrated success, attributed to the attention mechanism at its core. Despite these successes, the attention mechanism of Transformers is associated with two well-known issues: rank-collapse and gradient vanishing. In this paper, we present a theoretical analysis that it is inherently difficult to address both issues simultaneously in the conventional attention mechanism. To handle these issues, we introduce a novel class of attention mechanism, referred to as generalized probabilistic attention mechanism (GPAM), and its dual-attention implementation within the Transformer architecture. Unlike conventional attention mechanisms, GPAM allows for negative attention scores while preserving a fixed total sum. We provide theoretical evidence that the proposed dual-attention GPAM (daGPAM) effectively mitigates both the rank-collapse and gradient vanishing issues which are difficult to resolve simultaneously with the conventional attention mechanisms. Furthermore, we empirically validate this theoretical evidence, demonstrating the superiority of daGPAM compared to other alternative attention mechanisms that were proposed to address the same issues. Additionally, we demonstrate the practical benefits of GPAM in natural language processing tasks, such as language modeling and neural machine translation.
Title: Language Models are Symbolic Learners in Arithmetic
Copy Paste: [[2410.15580]] Language Models are Symbolic Learners in Arithmetic(https://arxiv.org/abs/2410.15580)
Keywords: large language model
Abstract: Large Language Models (LLMs) are thought to struggle with arithmetic learning due to the inherent differences between language modeling and numerical computation, but concrete evidence has been lacking. This work responds to this claim through a two-side experiment. We first investigate whether LLMs leverage partial products during arithmetic learning. We find that although LLMs can identify some partial products after learning, they fail to leverage them for arithmetic tasks, conversely. We then explore how LLMs approach arithmetic symbolically by breaking tasks into subgroups, hypothesizing that difficulties arise from subgroup complexity and selection. Our results show that when subgroup complexity is fixed, LLMs treat a collection of different arithmetic operations similarly. By analyzing position-level accuracy across different training sizes, we further observe that it follows a U-shaped pattern: LLMs quickly learn the easiest patterns at the first and last positions, while progressively learning the more difficult patterns in the middle positions. This suggests that LLMs select subgroup following an easy-to-hard paradigm during learning. Our work confirms that LLMs are pure symbolic learners in arithmetic tasks and underscores the importance of understanding them deeply through subgroup-level quantification.
Title: Deep Learning and Machine Learning -- Object Detection and Semantic Segmentation: From Theory to Applications
Authors: Jintao Ren, Ziqian Bi, Qian Niu, Junyu Liu, Benji Peng, Sen Zhang, Xuanhe Pan, Jinlang Wang, Keyu Chen, Caitlyn Heqi Yin, Pohsun Feng, Yizhu Wen, Tianyang Wang, Silin Chen, Ming Li, Jiawei Xu, Ming Liu
Copy Paste: [[2410.15584]] Deep Learning and Machine Learning -- Object Detection and Semantic Segmentation: From Theory to Applications(https://arxiv.org/abs/2410.15584)
Keywords: transformer, large language model, segmentation
Abstract: This book offers an in-depth exploration of object detection and semantic segmentation, combining theoretical foundations with practical applications. It covers state-of-the-art advancements in machine learning and deep learning, with a focus on convolutional neural networks (CNNs), YOLO architectures, and transformer-based approaches like DETR. The book also delves into the integration of artificial intelligence (AI) techniques and large language models for enhanced object detection in complex environments. A thorough discussion of big data analysis is presented, highlighting the importance of data processing, model optimization, and performance evaluation metrics. By bridging the gap between traditional methods and modern deep learning frameworks, this book serves as a comprehensive guide for researchers, data scientists, and engineers aiming to leverage AI-driven methodologies in large-scale object detection tasks.
Abstract: Detecting fake news in large datasets is challenging due to its diversity and complexity, with traditional approaches often focusing on textual features while underutilizing semantic and emotional elements. Current methods also rely heavily on large annotated datasets, limiting their effectiveness in more nuanced analysis. To address these challenges, this paper introduces Emotion-\textbf{A}ware \textbf{M}ultimodal Fusion \textbf{P}rompt \textbf{L}\textbf{E}arning (\textbf{AMPLE}) framework to address the above issue by combining text sentiment analysis with multimodal data and hybrid prompt templates. This framework extracts emotional elements from texts by leveraging sentiment analysis tools. It then employs Multi-Head Cross-Attention (MCA) mechanisms and similarity-aware fusion methods to integrate multimodal data. The proposed AMPLE framework demonstrates strong performance on two public datasets in both few-shot and data-rich settings, with results indicating the potential of emotional aspects in fake news detection. Furthermore, the study explores the impact of integrating large language models with this method for text sentiment extraction, revealing substantial room for further improvement. The code can be found at :\url{this https URL
Title: All You Need is an Improving Column: Enhancing Column Generation for Parallel Machine Scheduling via Transformers
Copy Paste: [[2410.15601]] All You Need is an Improving Column: Enhancing Column Generation for Parallel Machine Scheduling via Transformers(https://arxiv.org/abs/2410.15601)
Keywords: transformer
Abstract: We present a neural network-enhanced column generation (CG) approach for a parallel machine scheduling problem. The proposed approach utilizes an encoder-decoder attention model, namely the transformer and pointer architectures, to develop job sequences with negative reduced cost and thus generate columns to add to the master problem. By training the neural network offline and using it in inference mode to predict negative reduced costs columns, we achieve significant computational time savings compared to dynamic programming (DP). Since the exact DP procedure is used to verify that no further columns with negative reduced cost can be identified at termination, the optimality guarantee of the original CG procedure is preserved. For small to medium-sized instances, our approach achieves an average 45% reduction in computation time compared to solving the subproblems with DP. Furthermore, the model generalizes not only to unseen, larger problem instances from the same probability distribution but also to instances from different probability distributions than those presented at training time. For large-sized instances, the proposed approach achieves an 80% improvement in the objective value in under 500 seconds, demonstrating both its scalability and efficiency.
Title: Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding
Copy Paste: [[2410.15609]] Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding(https://arxiv.org/abs/2410.15609)
Keywords: robust
Abstract: Recently, pre-trained language models (PLMs) have been increasingly adopted in spoken language understanding (SLU). However, automatic speech recognition (ASR) systems frequently produce inaccurate transcriptions, leading to noisy inputs for SLU models, which can significantly degrade their performance. To address this, our objective is to train SLU models to withstand ASR errors by exposing them to noises commonly observed in ASR systems, referred to as ASR-plausible noises. Speech noise injection (SNI) methods have pursued this objective by introducing ASR-plausible noises, but we argue that these methods are inherently biased towards specific ASR systems, or ASR-specific noises. In this work, we propose a novel and less biased augmentation method of introducing the noises that are plausible to any ASR system, by cutting off the non-causal effect of noises. Experimental results and analyses demonstrate the effectiveness of our proposed methods in enhancing the robustness and generalizability of SLU models against unseen ASR systems by introducing more diverse and plausible ASR noises in advance.
Title: On The Global Convergence Of Online RLHF With Neural Parametrization
Copy Paste: [[2410.15610]] On The Global Convergence Of Online RLHF With Neural Parametrization(https://arxiv.org/abs/2410.15610)
Keywords: large language model
Abstract: The importance of Reinforcement Learning from Human Feedback (RLHF) in aligning large language models (LLMs) with human values cannot be overstated. RLHF is a three-stage process that includes supervised fine-tuning (SFT), reward learning, and policy learning. Although there are several offline and online approaches to aligning LLMs, they often suffer from distribution shift issues. These issues arise from the inability to accurately capture the distributional interdependence between the reward learning and policy learning stages. Consequently, this has led to various approximated approaches, but the theoretical insights and motivations remain largely limited to tabular settings, which do not hold in practice. This gap between theoretical insights and practical implementations is critical. It is challenging to address this gap as it requires analyzing the performance of AI alignment algorithms in neural network-parameterized settings. Although bi-level formulations have shown promise in addressing distribution shift issues, they suffer from the hyper-gradient problem, and current approaches lack efficient algorithms to solve this. In this work, we tackle these challenges employing the bi-level formulation laid out in Kwon et al. (2024) along with the assumption \emph{Weak Gradient Domination} to demonstrate convergence in an RLHF setup, obtaining a sample complexity of $\epsilon^{-\frac{7}{2}}$ . Our key contributions are twofold: (i) We propose a bi-level formulation for AI alignment in parameterized settings and introduce a first-order approach to solve this problem. (ii) We analyze the theoretical convergence rates of the proposed algorithm and derive state-of-the-art bounds. To the best of our knowledge, this is the first work to establish convergence rate bounds and global optimality for the RLHF framework in neural network-parameterized settings.
Title: Exploring Stronger Transformer Representation Learning for Occluded Person Re-Identificatio
Copy Paste: [[2410.15613]] Exploring Stronger Transformer Representation Learning for Occluded Person Re-Identificatio(https://arxiv.org/abs/2410.15613)
Keywords: transformer
Abstract: Due to some complex factors (e.g., occlusion, pose variation and diverse camera perspectives), extracting stronger feature representation in person re-identification remains a challenging task. In this paper, we proposed a novel self-supervision and supervision combining transformer-based person re-identification framework, namely SSSC-TransReID. Different from the general transformer-based person re-identification models, we designed a self-supervised contrastive learning branch, which can enhance the feature representation for person re-identification without negative samples or additional pre-training. In order to train the contrastive learning branch, we also proposed a novel random rectangle mask strategy to simulate the occlusion in real scenes, so as to enhance the feature representation for occlusion. Finally, we utilized the joint-training loss function to integrate the advantages of supervised learning with ID tags and self-supervised contrastive learning without negative samples, which can reinforce the ability of our model to excavate stronger discriminative features, especially for occlusion. Extensive experimental results on several benchmark datasets show our proposed model obtains superior Re-ID performance consistently and outperforms the state-of-the-art ReID methods by large margins on the mean average accuracy (mAP) and Rank-1 accuracy.
Title: Erasing Undesirable Concepts in Diffusion Models with Adversarial Preservation
Authors: Anh Bui, Long Vuong, Khanh Doan, Trung Le, Paul Montague, Tamas Abraham, Dinh Phung
Copy Paste: [[2410.15618]] Erasing Undesirable Concepts in Diffusion Models with Adversarial Preservation(https://arxiv.org/abs/2410.15618)
Keywords: diffusion
Abstract: Diffusion models excel at generating visually striking content from text but can inadvertently produce undesirable or harmful content when trained on unfiltered internet data. A practical solution is to selectively removing target concepts from the model, but this may impact the remaining concepts. Prior approaches have tried to balance this by introducing a loss term to preserve neutral content or a regularization term to minimize changes in the model parameters, yet resolving this trade-off remains challenging. In this work, we propose to identify and preserving concepts most affected by parameter changes, termed as \textit{adversarial concepts}. This approach ensures stable erasure with minimal impact on the other concepts. We demonstrate the effectiveness of our method using the Stable Diffusion model, showing that it outperforms state-of-the-art erasure methods in eliminating unwanted content while maintaining the integrity of other unrelated elements. Our code is available at \url{this https URL}.
Title: Guardians of Discourse: Evaluating LLMs on Multilingual Offensive Language Detection
Copy Paste: [[2410.15623]] Guardians of Discourse: Evaluating LLMs on Multilingual Offensive Language Detection(https://arxiv.org/abs/2410.15623)
Keywords: large language model
Abstract: Identifying offensive language is essential for maintaining safety and sustainability in the social media era. Though large language models (LLMs) have demonstrated encouraging potential in social media analytics, they lack thorough evaluation when in offensive language detection, particularly in multilingual environments. We for the first time evaluate multilingual offensive language detection of LLMs in three languages: English, Spanish, and German with three LLMs, GPT-3.5, Flan-T5, and Mistral, in both monolingual and multilingual settings. We further examine the impact of different prompt languages and augmented translation data for the task in non-English contexts. Furthermore, we discuss the impact of the inherent bias in LLMs and the datasets in the mispredictions related to sensitive topics.
Title: Selecting Influential Samples for Long Context Alignment via Homologous Models' Guidance and Contextual Awareness Measurement
Authors: Shuzheng Si, Haozhe Zhao, Gang Chen, Yunshui Li, Kangyang Luo, Chuancheng Lv, Kaikai An, Fanchao Qi, Baobao Chang, Maosong Sun
Copy Paste: [[2410.15633]] Selecting Influential Samples for Long Context Alignment via Homologous Models' Guidance and Contextual Awareness Measurement(https://arxiv.org/abs/2410.15633)
Keywords: large language model
Abstract: The expansion of large language models to effectively handle instructions with extremely long contexts has yet to be fully investigated. The primary obstacle lies in constructing a high-quality long instruction-following dataset devised for long context alignment. Existing studies have attempted to scale up the available data volume by synthesizing long instruction-following samples. However, indiscriminately increasing the quantity of data without a well-defined strategy for ensuring data quality may introduce low-quality samples and restrict the final performance. To bridge this gap, we aim to address the unique challenge of long-context alignment, i.e., modeling the long-range dependencies for handling instructions and lengthy input contexts. We propose GATEAU, a novel framework designed to identify the influential and high-quality samples enriched with long-range dependency relations by utilizing crafted Homologous Models' Guidance (HMG) and Contextual Awareness Measurement (CAM). Specifically, HMG attempts to measure the difficulty of generating corresponding responses due to the long-range dependencies, using the perplexity scores of the response from two homologous models with different context windows. Also, the role of CAM is to measure the difficulty of understanding the long input contexts due to long-range dependencies by evaluating whether the model's attention is focused on important segments. Built upon both proposed methods, we select the most challenging samples as the influential data to effectively frame the long-range dependencies, thereby achieving better performance of LLMs. Comprehensive experiments indicate that GATEAU effectively identifies samples enriched with long-range dependency relations and the model trained on these selected samples exhibits better instruction-following and long-context understanding capabilities.
Title: Can Large Language Models Invent Algorithms to Improve Themselves?
Copy Paste: [[2410.15639]] Can Large Language Models Invent Algorithms to Improve Themselves?(https://arxiv.org/abs/2410.15639)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown remarkable performance improvements and are rapidly gaining adoption in industry. However, the methods for improving LLMs are still designed by humans, which restricts the invention of new model-improving algorithms to human expertise and imagination. To address this, we propose the Self-Developing framework, which enables LLMs to autonomously generate and learn model-improvement algorithms. In this framework, the seed model generates, applies, and evaluates model-improving algorithms, continuously improving both the seed model and the algorithms themselves. In mathematical reasoning tasks, Self-Developing not only creates models that surpass the seed model but also consistently outperforms models created using human-designed algorithms. Additionally, these LLM-discovered algorithms demonstrate strong effectiveness, including transferability to out-of-domain models.
Title: SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis
Copy Paste: [[2410.15641]] SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis(https://arxiv.org/abs/2410.15641)
Keywords: security, attack, large language model
Abstract: The increasing integration of large language models (LLMs) across various fields has heightened concerns about their potential to propagate dangerous information. This paper specifically explores the security vulnerabilities of LLMs within the field of chemistry, particularly their capacity to provide instructions for synthesizing hazardous substances. We evaluate the effectiveness of several prompt injection attack methods, including red-teaming, explicit prompting, and implicit prompting. Additionally, we introduce a novel attack technique named SMILES-prompting, which uses the Simplified Molecular-Input Line-Entry System (SMILES) to reference chemical substances. Our findings reveal that SMILES-prompting can effectively bypass current safety mechanisms. These findings highlight the urgent need for enhanced domain-specific safeguards in LLMs to prevent misuse and improve their potential for positive social impact.
Title: Resource-Efficient Medical Report Generation using Large Language Models
Copy Paste: [[2410.15642]] Resource-Efficient Medical Report Generation using Large Language Models(https://arxiv.org/abs/2410.15642)
Keywords: large language model
Abstract: Medical report generation is the task of automatically writing radiology reports for chest X-ray images. Manually composing these reports is a time-consuming process that is also prone to human errors. Generating medical reports can therefore help reduce the burden on radiologists. In other words, we can promote greater clinical automation in the medical domain. In this work, we propose a new framework leveraging vision-enabled Large Language Models (LLM) for the task of medical report generation. We introduce a lightweight solution that achieves better or comparative performance as compared to previous solutions on the task of medical report generation. We conduct extensive experiments exploring different model sizes and enhancement approaches, such as prefix tuning to improve the text generation abilities of the LLMs. We evaluate our approach on a prominent large-scale radiology report dataset - MIMIC-CXR. Our results demonstrate the capability of our resource-efficient framework to generate patient-specific reports with strong medical contextual understanding and high precision.
Title: Understanding and Alleviating Memory Consumption in RLHF for LLMs
Authors: Jin Zhou, Hanmei Yang, Steven (Jiaxun)Tang, Mingcan Xiang, Hui Guan, Tongping Liu
Copy Paste: [[2410.15651]] Understanding and Alleviating Memory Consumption in RLHF for LLMs(https://arxiv.org/abs/2410.15651)
Keywords: large language model
Abstract: Fine-tuning with Reinforcement Learning with Human Feedback (RLHF) is essential for aligning large language models (LLMs). However, RLHF often encounters significant memory challenges. This study is the first to examine memory usage in the RLHF context, exploring various memory management strategies and unveiling the reasons behind excessive memory consumption. Additionally, we introduce a simple yet effective approach that substantially reduces the memory required for RLHF fine-tuning.
Title: CL-HOI: Cross-Level Human-Object Interaction Distillation from Vision Large Language Models
Copy Paste: [[2410.15657]] CL-HOI: Cross-Level Human-Object Interaction Distillation from Vision Large Language Models(https://arxiv.org/abs/2410.15657)
Keywords: large language model
Abstract: Human-object interaction (HOI) detection has seen advancements with Vision Language Models (VLMs), but these methods often depend on extensive manual annotations. Vision Large Language Models (VLLMs) can inherently recognize and reason about interactions at the image level but are computationally heavy and not designed for instance-level HOI detection. To overcome these limitations, we propose a Cross-Level HOI distillation (CL-HOI) framework, which distills instance-level HOIs from VLLMs image-level understanding without the need for manual annotations. Our approach involves two stages: context distillation, where a Visual Linguistic Translator (VLT) converts visual information into linguistic form, and interaction distillation, where an Interaction Cognition Network (ICN) reasons about spatial, visual, and context relations. We design contrastive distillation losses to transfer image-level context and interaction knowledge from the teacher to the student model, enabling instance-level HOI detection. Evaluations on HICO-DET and V-COCO datasets demonstrate that our CL-HOI surpasses existing weakly supervised methods and VLLM supervised methods, showing its efficacy in detecting HOIs without manual labels.
Title: Scalable Data Ablation Approximations for Language Models through Modular Training and Merging
Authors: Clara Na, Ian Magnusson, Ananya Harsh Jha, Tom Sherborne, Emma Strubell, Jesse Dodge, Pradeep Dasigi
Copy Paste: [[2410.15661]] Scalable Data Ablation Approximations for Language Models through Modular Training and Merging(https://arxiv.org/abs/2410.15661)
Keywords: large language model
Abstract: Training data compositions for Large Language Models (LLMs) can significantly affect their downstream performance. However, a thorough data ablation study exploring large sets of candidate data mixtures is typically prohibitively expensive since the full effect is seen only after training the models; this can lead practitioners to settle for sub-optimal data mixtures. We propose an efficient method for approximating data ablations which trains individual models on subsets of a training corpus and reuses them across evaluations of combinations of subsets. In continued pre-training experiments, we find that, given an arbitrary evaluation set, the perplexity score of a single model trained on a candidate set of data is strongly correlated with perplexity scores of parameter averages of models trained on distinct partitions of that data. From this finding, we posit that researchers and practitioners can conduct inexpensive simulations of data ablations by maintaining a pool of models that were each trained on partitions of a large training corpus, and assessing candidate data mixtures by evaluating parameter averages of combinations of these models. This approach allows for substantial improvements in amortized training efficiency -- scaling only linearly with respect to new data -- by enabling reuse of previous training computation, opening new avenues for improving model performance through rigorous, incremental data assessment and mixing.
Title: RAC: Efficient LLM Factuality Correction with Retrieval Augmentation
Abstract: Large Language Models (LLMs) exhibit impressive results across a wide range of natural language processing (NLP) tasks, yet they can often produce factually incorrect outputs. This paper introduces a simple but effective low-latency post-correction method, \textbf{Retrieval Augmented Correction (RAC)}, aimed at enhancing the factual performance of LLMs without requiring additional fine-tuning. Our method is general and can be used with any instruction-tuned LLM, and has greatly reduced latency compared to prior approaches. RAC decomposes the LLM's output into atomic facts and applies a fine-grained verification and correction process with retrieved content to verify and correct the LLM-generated output. Our extensive experiments show that RAC yields up to 30\% improvements over state-of-the-art baselines across two popular factuality evaluation datasets, validating its efficacy and robustness in both with and without the integration of Retrieval-Augmented Generation (RAG) across different LLMs.\footnote{Our code is at \url{this https URL}}
Title: Learning to Generate and Evaluate Fact-checking Explanations with Transformers
Authors: Darius Feher, Abdullah Khered, Hao Zhang, Riza Batista-Navarro, Viktor Schlegel
Copy Paste: [[2410.15669]] Learning to Generate and Evaluate Fact-checking Explanations with Transformers(https://arxiv.org/abs/2410.15669)
Keywords: transformer, generative
Abstract: In an era increasingly dominated by digital platforms, the spread of misinformation poses a significant challenge, highlighting the need for solutions capable of assessing information veracity. Our research contributes to the field of Explainable Artificial Antelligence (XAI) by developing transformer-based fact-checking models that contextualise and justify their decisions by generating human-accessible explanations. Importantly, we also develop models for automatic evaluation of explanations for fact-checking verdicts across different dimensions such as \texttt{(self)-contradiction}, \texttt{hallucination}, \texttt{convincingness} and \texttt{overall quality}. By introducing human-centred evaluation methods and developing specialised datasets, we emphasise the need for aligning Artificial Intelligence (AI)-generated explanations with human judgements. This approach not only advances theoretical knowledge in XAI but also holds practical implications by enhancing the transparency, reliability and users' trust in AI-driven fact-checking systems. Furthermore, the development of our metric learning models is a first step towards potentially increasing efficiency and reducing reliance on extensive manual assessment. Based on experimental results, our best performing generative model \textsc{ROUGE-1} score of 47.77, demonstrating superior performance in generating fact-checking explanations, particularly when provided with high-quality evidence. Additionally, the best performing metric learning model showed a moderately strong correlation with human judgements on objective dimensions such as \texttt{(self)-contradiction and \texttt{hallucination}, achieving a Matthews Correlation Coefficient (MCC) of around 0.7.}
Title: TALoS: Enhancing Semantic Scene Completion via Test-time Adaptation on the Line of Sight
Authors: Hyun-Kurl Jang, Jihun Kim, Hyeokjun Kweon, Kuk-Jin Yoon
Copy Paste: [[2410.15674]] TALoS: Enhancing Semantic Scene Completion via Test-time Adaptation on the Line of Sight(https://arxiv.org/abs/2410.15674)
Keywords: segmentation
Abstract: Semantic Scene Completion (SSC) aims to perform geometric completion and semantic segmentation simultaneously. Despite the promising results achieved by existing studies, the inherently ill-posed nature of the task presents significant challenges in diverse driving scenarios. This paper introduces TALoS, a novel test-time adaptation approach for SSC that excavates the information available in driving environments. Specifically, we focus on that observations made at a certain moment can serve as Ground Truth (GT) for scene completion at another moment. Given the characteristics of the LiDAR sensor, an observation of an object at a certain location confirms both 1) the occupation of that location and 2) the absence of obstacles along the line of sight from the LiDAR to that point. TALoS utilizes these observations to obtain self-supervision about occupancy and emptiness, guiding the model to adapt to the scene in test time. In a similar manner, we aggregate reliable SSC predictions among multiple moments and leverage them as semantic pseudo-GT for adaptation. Further, to leverage future observations that are not accessible at the current time, we present a dual optimization scheme using the model in which the update is delayed until the future observation is available. Evaluations on the SemanticKITTI validation and test sets demonstrate that TALoS significantly improves the performance of the pre-trained SSC model. Our code is available at this https URL.
Title: Revealing and Mitigating the Local Pattern Shortcuts of Mamba
Copy Paste: [[2410.15678]] Revealing and Mitigating the Local Pattern Shortcuts of Mamba(https://arxiv.org/abs/2410.15678)
Keywords: large language model
Abstract: Large language models (LLMs) have advanced significantly due to the attention mechanism, but their quadratic complexity and linear memory demands limit their performance on long-context tasks. Recently, researchers introduced Mamba, an advanced model built upon State Space Models(SSMs) that offers linear complexity and constant memory. Although Mamba is reported to match or surpass the performance of attention-based models, our analysis reveals a performance gap: Mamba excels in tasks that involve localized key information but faces challenges with tasks that require handling distributed key information. Our controlled experiments suggest that this inconsistency arises from Mamba's reliance on local pattern shortcuts, which enable the model to remember local key information within its limited memory but hinder its ability to retain more dispersed information. Therefore, we introduce a global selection module into the Mamba model to address this issue. Experiments on both existing and proposed synthetic tasks, as well as real-world tasks, demonstrate the effectiveness of our method. Notably, with the introduction of only 4M extra parameters, our approach enables the Mamba model(130M) to achieve a significant improvement on tasks with distributed information, increasing its performance from 0 to 80.54 points.
Title: Federated Learning with MMD-based Early Stopping for Adaptive GNSS Interference Classification
Authors: Nishant S. Gaikwad, Lucas Heublein, Nisha L. Raichur, Tobias Feigl, Christopher Mutschler, Felix Ott
Copy Paste: [[2410.15681]] Federated Learning with MMD-based Early Stopping for Adaptive GNSS Interference Classification(https://arxiv.org/abs/2410.15681)
Keywords: federate
Abstract: Federated learning (FL) enables multiple devices to collaboratively train a global model while maintaining data on local servers. Each device trains the model on its local server and shares only the model updates (i.e., gradient weights) during the aggregation step. A significant challenge in FL is managing the feature distribution of novel, unbalanced data across devices. In this paper, we propose an FL approach using few-shot learning and aggregation of the model weights on a global server. We introduce a dynamic early stopping method to balance out-of-distribution classes based on representation learning, specifically utilizing the maximum mean discrepancy of feature embeddings between local and global models. An exemplary application of FL is orchestrating machine learning models along highways for interference classification based on snapshots from global navigation satellite system (GNSS) receivers. Extensive experiments on four GNSS datasets from two real-world highways and controlled environments demonstrate that our FL method surpasses state-of-the-art techniques in adapting to both novel interference classes and multipath scenarios.
Title: DomainSum: A Hierarchical Benchmark for Fine-Grained Domain Shift in Abstractive Text Summarization
Copy Paste: [[2410.15687]] DomainSum: A Hierarchical Benchmark for Fine-Grained Domain Shift in Abstractive Text Summarization(https://arxiv.org/abs/2410.15687)
Keywords: large language model
Abstract: Most research on abstractive summarization focuses on single-domain applications, often neglecting how domain shifts between documents affect performance and the generalization ability of summarization models. To address this issue, we introduce DomainSum, a hierarchical benchmark designed to capture fine-grained domain shifts in abstractive summarization. We categorize these shifts into three levels: genre, style, and topic, and demonstrate through comprehensive benchmark analysis that they follow a hierarchical structure. Furthermore, we evaluate the domain generalization capabilities of commonly used pre-trained language models (PLMs) and large language models (LLMs) in in-domain and cross-domain settings.
Title: MIK: Modified Isolation Kernel for Biological Sequence Visualization, Classification, and Clustering
Copy Paste: [[2410.15688]] MIK: Modified Isolation Kernel for Biological Sequence Visualization, Classification, and Clustering(https://arxiv.org/abs/2410.15688)
Keywords: robust
Abstract: The t-Distributed Stochastic Neighbor Embedding (t-SNE) has emerged as a popular dimensionality reduction technique for visualizing high-dimensional data. It computes pairwise similarities between data points by default using an RBF kernel and random initialization (in low-dimensional space), which successfully captures the overall structure but may struggle to preserve the local structure efficiently. This research proposes a novel approach called the Modified Isolation Kernel (MIK) as an alternative to the Gaussian kernel, which is built upon the concept of the Isolation Kernel. MIK uses adaptive density estimation to capture local structures more accurately and integrates robustness measures. It also assigns higher similarity values to nearby points and lower values to distant points. Comparative research using the normal Gaussian kernel, the isolation kernel, and several initialization techniques, including random, PCA, and random walk initializations, are used to assess the proposed approach (MIK). Additionally, we compare the computational efficiency of all $3$ kernels with $3$ different initialization methods. Our experimental results demonstrate several advantages of the proposed kernel (MIK) and initialization method selection. It exhibits improved preservation of the local and global structure and enables better visualization of clusters and subclusters in the embedded space. These findings contribute to advancing dimensionality reduction techniques and provide researchers and practitioners with an effective tool for data exploration, visualization, and analysis in various domains.
Title: Enhancing SNN-based Spatio-Temporal Learning: A Benchmark Dataset and Cross-Modality Attention Model
Authors: Shibo Zhou, Bo Yang, Mengwen Yuan, Runhao Jiang, Rui Yan, Gang Pan, Huajin Tang
Copy Paste: [[2410.15689]] Enhancing SNN-based Spatio-Temporal Learning: A Benchmark Dataset and Cross-Modality Attention Model(https://arxiv.org/abs/2410.15689)
Keywords: robust
Abstract: Spiking Neural Networks (SNNs), renowned for their low power consumption, brain-inspired architecture, and spatio-temporal representation capabilities, have garnered considerable attention in recent years. Similar to Artificial Neural Networks (ANNs), high-quality benchmark datasets are of great importance to the advances of SNNs. However, our analysis indicates that many prevalent neuromorphic datasets lack strong temporal correlation, preventing SNNs from fully exploiting their spatio-temporal representation capabilities. Meanwhile, the integration of event and frame modalities offers more comprehensive visual spatio-temporal information. Yet, the SNN-based cross-modality fusion remains underexplored. In this work, we present a neuromorphic dataset called DVS-SLR that can better exploit the inherent spatio-temporal properties of SNNs. Compared to existing datasets, it offers advantages in terms of higher temporal correlation, larger scale, and more varied scenarios. In addition, our neuromorphic dataset contains corresponding frame data, which can be used for developing SNN-based fusion methods. By virtue of the dual-modal feature of the dataset, we propose a Cross-Modality Attention (CMA) based fusion method. The CMA model efficiently utilizes the unique advantages of each modality, allowing for SNNs to learn both temporal and spatial attention scores from the spatio-temporal features of event and frame modalities, subsequently allocating these scores across modalities to enhance their synergy. Experimental results demonstrate that our method not only improves recognition accuracy but also ensures robustness across diverse scenarios.
Title: Efficient Terminology Integration for LLM-based Translation in Specialized Domains
Authors: Sejoon Kim, Mingi Sung, Jeonghwan Lee, Hyunkuk Lim, Jorge Froilan Gimenez Perez
Copy Paste: [[2410.15690]] Efficient Terminology Integration for LLM-based Translation in Specialized Domains(https://arxiv.org/abs/2410.15690)
Keywords: extraction
Abstract: Traditional machine translation methods typically involve training models directly on large parallel corpora, with limited emphasis on specialized terminology. However, In specialized fields such as patent, finance, or biomedical domains, terminology is crucial for translation, with many terms that needs to be translated following agreed-upon conventions. In this paper we introduce a methodology that efficiently trains models with a smaller amount of data while preserving the accuracy of terminology translation. We achieve this through a systematic process of term extraction and glossary creation using the Trie Tree algorithm, followed by data reconstruction to teach the LLM how to integrate these specialized terms. This methodology enhances the model's ability to handle specialized terminology and ensures high-quality translations, particularly in fields where term consistency is crucial. Our approach has demonstrated exceptional performance, achieving the highest translation score among participants in the WMT patent task to date, showcasing its effectiveness and broad applicability in specialized translation domains where general methods often fall short.
Title: Solving Continual Offline RL through Selective Weights Activation on Aligned Spaces
Authors: Jifeng Hu, Sili Huang, Li Shen, Zhejian Yang, Shengchao Hu, Shisong Tang, Hechang Chen, Yi Chang, Dacheng Tao, Lichao Sun
Copy Paste: [[2410.15698]] Solving Continual Offline RL through Selective Weights Activation on Aligned Spaces(https://arxiv.org/abs/2410.15698)
Keywords: diffusion
Abstract: Continual offline reinforcement learning (CORL) has shown impressive ability in diffusion-based lifelong learning systems by modeling the joint distributions of trajectories. However, most research only focuses on limited continual task settings where the tasks have the same observation and action space, which deviates from the realistic demands of training agents in various environments. In view of this, we propose Vector-Quantized Continual Diffuser, named VQ-CD, to break the barrier of different spaces between various tasks. Specifically, our method contains two complementary sections, where the quantization spaces alignment provides a unified basis for the selective weights activation. In the quantized spaces alignment, we leverage vector quantization to align the different state and action spaces of various tasks, facilitating continual training in the same space. Then, we propose to leverage a unified diffusion model attached by the inverse dynamic model to master all tasks by selectively activating different weights according to the task-related sparse masks. Finally, we conduct extensive experiments on 15 continual learning (CL) tasks, including conventional CL task settings (identical state and action spaces) and general CL task settings (various state and action spaces). Compared with 16 baselines, our method reaches the SOTA performance.
Title: Students Rather Than Experts: A New AI For Education Pipeline To Model More Human-Like And Personalised Early Adolescences
Copy Paste: [[2410.15701]] Students Rather Than Experts: A New AI For Education Pipeline To Model More Human-Like And Personalised Early Adolescences(https://arxiv.org/abs/2410.15701)
Keywords: large language model
Abstract: The capabilities of large language models (LLMs) have been applied in expert systems across various domains, providing new opportunities for AI in Education. Educational interactions involve a cyclical exchange between teachers and students. Current research predominantly focuses on using LLMs to simulate teachers, leveraging their expertise to enhance student learning outcomes. However, the simulation of students, which could improve teachers' instructional skills, has received insufficient attention due to the challenges of modeling and evaluating virtual students. This research asks: Can LLMs be utilized to develop virtual student agents that mimic human-like behavior and individual variability? Unlike expert systems focusing on knowledge delivery, virtual students must replicate learning difficulties, emotional responses, and linguistic uncertainties. These traits present significant challenges in both modeling and evaluation. To address these issues, this study focuses on language learning as a context for modeling virtual student agents. We propose a novel AI4Education framework, called SOE (Scene-Object-Evaluation), to systematically construct LVSA (LLM-based Virtual Student Agents). By curating a dataset of personalized teacher-student interactions with various personality traits, question types, and learning stages, and fine-tuning LLMs using LoRA, we conduct multi-dimensional evaluation experiments. Specifically, we: (1) develop a theoretical framework for generating LVSA; (2) integrate human subjective evaluation metrics into GPT-4 assessments, demonstrating a strong correlation between human evaluators and GPT-4 in judging LVSA authenticity; and (3) validate that LLMs can generate human-like, personalized virtual student agents in educational contexts, laying a foundation for future applications in pre-service teacher training and multi-agent simulation environments.
Title: Mitigating Hallucinations of Large Language Models in Medical Information Extraction via Contrastive Decoding
Copy Paste: [[2410.15702]] Mitigating Hallucinations of Large Language Models in Medical Information Extraction via Contrastive Decoding(https://arxiv.org/abs/2410.15702)
Keywords: extraction, large language model
Abstract: The impressive capabilities of large language models (LLMs) have attracted extensive interests of applying LLMs to medical field. However, the complex nature of clinical environments presents significant hallucination challenges for LLMs, hindering their widespread adoption. In this paper, we address these hallucination issues in the context of Medical Information Extraction (MIE) tasks by introducing ALternate Contrastive Decoding (ALCD). We begin by redefining MIE tasks as an identify-and-classify process. We then separate the identification and classification functions of LLMs by selectively masking the optimization of tokens during fine-tuning. During the inference stage, we alternately contrast output distributions derived from sub-task models. This approach aims to selectively enhance the identification and classification capabilities while minimizing the influence of other inherent abilities in LLMs. Additionally, we propose an alternate adaptive constraint strategy to more effectively adjust the scale and scope of contrastive tokens. Through comprehensive experiments on two different backbones and six diverse medical information extraction tasks, ALCD demonstrates significant improvements in resolving hallucination issues compared to conventional decoding methods.
Title: Residual vector quantization for KV cache compression in large language model
Copy Paste: [[2410.15704]] Residual vector quantization for KV cache compression in large language model(https://arxiv.org/abs/2410.15704)
Keywords: large language model
Abstract: KV cache compression methods have mainly relied on scalar quantization techniques to reduce the memory requirements during decoding. In this work, we apply residual vector quantization, which has been widely used for high fidelity audio compression, to compress KV cache in large language models (LLM). We adapt the standard recipe with minimal changes to compress the output of any key or value projection matrix in a pretrained LLM: we scale the vector by its standard deviation, divide channels into groups and then quantize each group with the same residual vector quantizer. We learn the codebook using exponential moving average and there are no other learnable parameters including the input and output projections normally used in a vector quantization set up. We find that a residual depth of 8 recovers most of the performance of the unquantized model. We also find that grouping non-contiguous channels together works better than grouping contiguous channels for compressing key matrix and the method further benefits from a light weight finetuning of LLM together with the quantization. Overall, the proposed technique is competitive with existing quantization methods while being much simpler and results in 5.5x compression compared to half precision.
Title: Estimating Individual Dose-Response Curves under Unobserved Confounders from Observational Data
Copy Paste: [[2410.15706]] Estimating Individual Dose-Response Curves under Unobserved Confounders from Observational Data(https://arxiv.org/abs/2410.15706)
Keywords: robust
Abstract: Estimating an individual's potential response to continuously varied treatments is crucial for addressing causal questions across diverse domains, from healthcare to social sciences. However, existing methods are limited either to estimating causal effects of binary treatments, or scenarios where all confounding variables are measurable. In this work, we present ContiVAE, a novel framework for estimating causal effects of continuous treatments, measured by individual dose-response curves, considering the presence of unobserved confounders using observational data. Leveraging a variational auto-encoder with a Tilted Gaussian prior distribution, ContiVAE models the hidden confounders as latent variables, and is able to predict the potential outcome of any treatment level for each individual while effectively capture the heterogeneity among individuals. Experiments on semi-synthetic datasets show that ContiVAE outperforms existing methods by up to 62%, demonstrating its robustness and flexibility. Application on a real-world dataset illustrates its practical utility.
Title: Traffic Matrix Estimation based on Denoising Diffusion Probabilistic Model
Copy Paste: [[2410.15716]] Traffic Matrix Estimation based on Denoising Diffusion Probabilistic Model(https://arxiv.org/abs/2410.15716)
Keywords: diffusion, generative
Abstract: The traffic matrix estimation (TME) problem has been widely researched for decades of years. Recent progresses in deep generative models offer new opportunities to tackle TME problems in a more advanced way. In this paper, we leverage the powerful ability of denoising diffusion probabilistic models (DDPMs) on distribution learning, and for the first time adopt DDPM to address the TME problem. To ensure a good performance of DDPM on learning the distributions of TMs, we design a preprocessing module to reduce the dimensions of TMs while keeping the data variety of each OD flow. To improve the estimation accuracy, we parameterize the noise factors in DDPM and transform the TME problem into a gradient-descent optimization problem. Finally, we compared our method with the state-of-the-art TME methods using two real-world TM datasets, the experimental results strongly demonstrate the superiority of our method on both TM synthesis and TM estimation.
Title: Efficient and Universally Accessible Cross-Chain Options without Upfront Holder Collateral
Copy Paste: [[2410.15724]] Efficient and Universally Accessible Cross-Chain Options without Upfront Holder Collateral(https://arxiv.org/abs/2410.15724)
Keywords: secure, security
Abstract: Options are fundamental to blockchain-based financial markets, offering essential tools for risk management and price speculation, which enhance liquidity, flexibility, and market efficiency in decentralized finance (DeFi). Despite the growing interest in options for blockchain-resident assets, such as cryptocurrencies, current option mechanisms face significant challenges, including limited asset support, high trading delays, and the requirement for option holders to provide upfront collateral. In this paper, we present a protocol that addresses the aforementioned issues by facilitating efficient and universally accessible option trading without requiring holders to post collateral when establishing options. Our protocol's universality allows for cross-chain options involving nearly $\textit{any}$ assets on $\textit{any}$ two different blockchains, provided the chains' programming languages can enforce and execute the necessary contract logic. A key innovation in our approach is the use of Double-Authentication-Preventing Signatures (DAPS), which significantly reduces trading latency. Additionally, by introducing a guarantee from the option writer, our protocol removes the need of upfront collateral from holders. Our evaluation demonstrates that the proposed scheme reduces option transfer latency to less than half of that in existing methods. Rigorous security analysis proves that our protocol achieves secure option trading, even when facing adversarial behaviors.
Title: Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases
Abstract: Unsupervised object-centric learning from videos is a promising approach towards learning compositional representations that can be applied to various downstream tasks, such as prediction and reasoning. Recently, it was shown that pretrained Vision Transformers (ViTs) can be useful to learn object-centric representations on real-world video datasets. However, while these approaches succeed at extracting objects from the scenes, the slot-based representations fail to maintain temporal consistency across consecutive frames in a video, i.e. the mapping of objects to slots changes across the video. To address this, we introduce Conditional Autoregressive Slot Attention (CA-SA), a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks. Leveraging an autoregressive prior network to condition representations on previous timesteps and a novel consistency loss function, CA-SA predicts future slot representations and imposes consistency across frames. We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks, such as video prediction and visual question-answering tasks.
Title: ViMoE: An Empirical Study of Designing Vision Mixture-of-Experts
Copy Paste: [[2410.15732]] ViMoE: An Empirical Study of Designing Vision Mixture-of-Experts(https://arxiv.org/abs/2410.15732)
Keywords: transformer
Abstract: Mixture-of-Experts (MoE) models embody the divide-and-conquer concept and are a promising approach for increasing model capacity, demonstrating excellent scalability across multiple domains. In this paper, we integrate the MoE structure into the classic Vision Transformer (ViT), naming it ViMoE, and explore the potential of applying MoE to vision through a comprehensive study on image classification. However, we observe that the performance is sensitive to the configuration of MoE layers, making it challenging to obtain optimal results without careful design. The underlying cause is that inappropriate MoE layers lead to unreliable routing and hinder experts from effectively acquiring helpful knowledge. To address this, we introduce a shared expert to learn and capture common information, serving as an effective way to construct stable ViMoE. Furthermore, we demonstrate how to analyze expert routing behavior, revealing which MoE layers are capable of specializing in handling specific information and which are not. This provides guidance for retaining the critical layers while removing redundancies, thereby advancing ViMoE to be more efficient without sacrificing accuracy. We aspire for this work to offer new insights into the design of vision MoE models and provide valuable empirical guidance for future research.
Title: Who's Who: Large Language Models Meet Knowledge Conflicts in Practice
Authors: Quang Hieu Pham, Hoang Ngo, Anh Tuan Luu, Dat Quoc Nguyen
Copy Paste: [[2410.15737]] Who's Who: Large Language Models Meet Knowledge Conflicts in Practice(https://arxiv.org/abs/2410.15737)
Keywords: large language model
Abstract: Retrieval-augmented generation (RAG) methods are viable solutions for addressing the static memory limits of pre-trained language models. Nevertheless, encountering conflicting sources of information within the retrieval context is an inevitable practical challenge. In such situations, the language models are recommended to transparently inform users about the conflicts rather than autonomously deciding what to present based on their inherent biases. To analyze how current large language models (LLMs) align with our recommendation, we introduce WhoQA, a public benchmark dataset to examine model's behavior in knowledge conflict situations. We induce conflicts by asking about a common property among entities having the same name, resulting in questions with up to 8 distinctive answers. WhoQA evaluation set includes 5K questions across 13 Wikidata property types and 150K Wikipedia entities. Our experiments show that despite the simplicity of WhoQA questions, knowledge conflicts significantly degrades LLMs' performance in RAG settings.
Title: Toeing the Party Line: Election Manifestos as a Key to Understand Political Discourse on Twitter
Authors: Maximilian Maurer, Tanise Ceron, Sebastian Padó, Gabriella Lapesa
Copy Paste: [[2410.15743]] Toeing the Party Line: Election Manifestos as a Key to Understand Political Discourse on Twitter(https://arxiv.org/abs/2410.15743)
Keywords: robust
Abstract: Political discourse on Twitter is a moving target: politicians continuously make statements about their positions. It is therefore crucial to track their discourse on social media to understand their ideological positions and goals. However, Twitter data is also challenging to work with since it is ambiguous and often dependent on social context, and consequently, recent work on political positioning has tended to focus strongly on manifestos (parties' electoral programs) rather than social media. In this paper, we extend recently proposed methods to predict pairwise positional similarities between parties from the manifesto case to the Twitter case, using hashtags as a signal to fine-tune text representations, without the need for manual annotation. We verify the efficacy of fine-tuning and conduct a series of experiments that assess the robustness of our method for low-resource scenarios. We find that our method yields stable positioning reflective of manifesto positioning, both in scenarios with all tweets of candidates across years available and when only smaller subsets from shorter time periods are available. This indicates that it is possible to reliably analyze the relative positioning of actors forgoing manual annotation, even in the noisier context of social media.
Title: Unleashing the Potential of Vision-Language Pre-Training for 3D Zero-Shot Lesion Segmentation via Mask-Attribute Alignment
Copy Paste: [[2410.15744]] Unleashing the Potential of Vision-Language Pre-Training for 3D Zero-Shot Lesion Segmentation via Mask-Attribute Alignment(https://arxiv.org/abs/2410.15744)
Keywords: segmentation
Abstract: Recent advancements in medical vision-language pre-training models have driven significant progress in zero-shot disease recognition. However, transferring image-level knowledge to pixel-level tasks, such as lesion segmentation in 3D CT scans, remains a critical challenge. Due to the complexity and variability of pathological visual characteristics, existing methods struggle to align fine-grained lesion features not encountered during training with disease-related textual representations. In this paper, we present Malenia, a novel multi-scale lesion-level mask-attribute alignment framework, specifically designed for 3D zero-shot lesion segmentation. Malenia improves the compatibility between mask representations and their associated elemental attributes, explicitly linking the visual features of unseen lesions with the extensible knowledge learned from previously seen ones. Furthermore, we design a Cross-Modal Knowledge Injection module to enhance both visual and textual features with mutually beneficial information, effectively guiding the generation of segmentation results. Comprehensive experiments across three datasets and 12 lesion categories validate the superior performance of Malenia. Codes will be publicly available.
Title: Solving Sparse \& High-Dimensional-Output Regression via Compression
Copy Paste: [[2410.15762]] Solving Sparse \& High-Dimensional-Output Regression via Compression(https://arxiv.org/abs/2410.15762)
Keywords: interpretability
Abstract: Multi-Output Regression (MOR) has been widely used in scientific data analysis for decision-making. Unlike traditional regression models, MOR aims to simultaneously predict multiple real-valued outputs given an input. However, the increasing dimensionality of the outputs poses significant challenges regarding interpretability and computational scalability for modern MOR applications. As a first step to address these challenges, this paper proposes a Sparse \& High-dimensional-Output REgression (SHORE) model by incorporating additional sparsity requirements to resolve the output interpretability, and then designs a computationally efficient two-stage optimization framework capable of solving SHORE with provable accuracy via compression on outputs. Theoretically, we show that the proposed framework is computationally scalable while maintaining the same order of training loss and prediction loss before-and-after compression under arbitrary or relatively weak sample set conditions. Empirically, numerical results further validate the theoretical findings, showcasing the efficiency and accuracy of the proposed framework.
Title: How Important are Data Augmentations to Close the Domain Gap for Object Detection in Orbit?
Authors: Maximilian Ulmer, Leonard Klüpfel, Maximilian Durner, Rudolph Triebel
Copy Paste: [[2410.15766]] How Important are Data Augmentations to Close the Domain Gap for Object Detection in Orbit?(https://arxiv.org/abs/2410.15766)
Keywords: robust
Abstract: We investigate the efficacy of data augmentations to close the domain gap in spaceborne computer vision, crucial for autonomous operations like on-orbit servicing. As the use of computer vision in space increases, challenges such as hostile illumination and low signal-to-noise ratios significantly hinder performance. While learning-based algorithms show promising results, their adoption is limited by the need for extensive annotated training data and the domain gap that arises from differences between synthesized and real-world imagery. This study explores domain generalization in terms of data augmentations -- classical color and geometric transformations, corruptions, and noise -- to enhance model performance across the domain gap. To this end, we conduct an large scale experiment using a hyperparameter optimization pipeline that samples hundreds of different configurations and searches for the best set to bridge the domain gap. As a reference task, we use 2D object detection and evaluate on the SPEED+ dataset that contains real hardware-in-the-loop satellite images in its test set. Moreover, we evaluate four popular object detectors, including Mask R-CNN, Faster R-CNN, YOLO-v7, and the open set detector GroundingDINO, and highlight their trade-offs between performance, inference speed, and training time. Our results underscore the vital role of data augmentations in bridging the domain gap, improving model performance, robustness, and reliability for critical space applications. As a result, we propose two novel data augmentations specifically developed to emulate the visual effects observed in orbital imagery. We conclude by recommending the most effective augmentations for advancing computer vision in challenging orbital environments. Code for training detectors and hyperparameter search will be made publicly available.
Title: High-Fidelity Transfer of Functional Priors for Wide Bayesian Neural Networks by Learning Activations
Authors: Marcin Sendera, Amin Sorkhei, Tomasz Kuśmierczyk
Copy Paste: [[2410.15777]] High-Fidelity Transfer of Functional Priors for Wide Bayesian Neural Networks by Learning Activations(https://arxiv.org/abs/2410.15777)
Keywords: robust
Abstract: Function-space priors in Bayesian Neural Networks provide a more intuitive approach to embedding beliefs directly into the model's output, thereby enhancing regularization, uncertainty quantification, and risk-aware decision-making. However, imposing function-space priors on BNNs is challenging. We address this task through optimization techniques that explore how trainable activations can accommodate complex priors and match intricate target function distributions. We discuss critical learning challenges, including identifiability, loss construction, and symmetries that arise in this context. Furthermore, we enable evidence maximization to facilitate model selection by conditioning the functional priors on additional hyperparameters. Our empirical findings demonstrate that even BNNs with a single wide hidden layer, when equipped with these adaptive trainable activations and conditioning strategies, can effectively achieve high-fidelity function-space priors, providing a robust and flexible framework for enhancing Bayesian neural network performance.
Title: Reducing Hallucinations in Vision-Language Models via Latent Space Steering
Copy Paste: [[2410.15778]] Reducing Hallucinations in Vision-Language Models via Latent Space Steering(https://arxiv.org/abs/2410.15778)
Keywords: large language model
Abstract: Hallucination poses a challenge to the deployment of large vision-language models (LVLMs) in applications. Unlike in large language models (LLMs), hallucination in LVLMs often arises from misalignments between visual inputs and textual outputs. This paper investigates the underlying mechanisms of hallucination, focusing on the unique structure of LVLMs that distinguishes them from large language models (LLMs). We identify that hallucinations often arise from the sensitivity of text decoders to vision inputs, a natural phenomenon when image encoders and text decoders are pre-trained separately. Inspired by this, we introduce Visual and Textual Intervention (VTI), a novel technique designed to reduce hallucinations by steering latent space representations during inference to enhance the stability of vision features. As a task-agnostic test-time intervention, VTI can be easily applied to any problem without additional cost. Extensive experiments demonstrate that it can effectively reduce hallucinations and outperform baseline methods across multiple metrics, highlighting the critical role of vision feature stability in LVLMs.
Title: Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count
Copy Paste: [[2410.15787]] Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count(https://arxiv.org/abs/2410.15787)
Keywords: transformer
Abstract: Transformers often struggle with length generalization, meaning they fail to generalize to sequences longer than those encountered during training. While arithmetic tasks are commonly used to study length generalization, certain tasks are considered notoriously difficult, e.g., multi-operand addition (requiring generalization over both the number of operands and their lengths) and multiplication (requiring generalization over both operand lengths). In this work, we achieve approximately 2-3x length generalization on both tasks, which is the first such achievement in arithmetic Transformers. We design task-specific scratchpads enabling the model to focus on a fixed number of tokens per each next-token prediction step, and apply multi-level versions of Position Coupling (Cho et al., 2024; McLeish et al., 2024) to let Transformers know the right position to attend to. On the theory side, we prove that a 1-layer Transformer using our method can solve multi-operand addition, up to operand length and operand count that are exponential in embedding dimension.
Title: Habaek: High-performance water segmentation through dataset expansion and inductive bias optimization
Copy Paste: [[2410.15794]] Habaek: High-performance water segmentation through dataset expansion and inductive bias optimization(https://arxiv.org/abs/2410.15794)
Keywords: transformer, segmentation
Abstract: Water segmentation is critical to disaster response and water resource management. Authorities may employ high-resolution photography to monitor rivers, lakes, and reservoirs, allowing for more proactive management in agriculture, industry, and conservation. Deep learning has improved flood monitoring by allowing models like CNNs, U-Nets, and transformers to handle large volumes of satellite and aerial data. However, these models usually have significant processing requirements, limiting their usage in real-time applications. This research proposes upgrading the SegFormer model for water segmentation by data augmentation with datasets such as ADE20K and RIWA to boost generalization. We examine how inductive bias affects attention-based models and discover that SegFormer performs better on bigger datasets. To further demonstrate the function of data augmentation, Low-Rank Adaptation (LoRA) is used to lower processing complexity while preserving accuracy. We show that the suggested Habaek model outperforms current models in segmentation, with an Intersection over Union (IoU) ranging from 0.91986 to 0.94397. In terms of F1-score, recall, accuracy, and precision, Habaek performs better than rival models, indicating its potential for real-world applications. This study highlights the need to enhance structures and include datasets for effective water segmentation.
Title: Data-Efficient CLIP-Powered Dual-Branch Networks for Source-Free Unsupervised Domain Adaptation
Abstract: Source-Free Unsupervised Domain Adaptation (SF-UDA) aims to transfer a model's performance from a labeled source domain to an unlabeled target domain without direct access to source samples, addressing data privacy issues. However, most existing SF-UDA approaches assume the availability of abundant source domain samples, which is often impractical due to the high cost of data annotation. In this paper, we explore a more challenging scenario where direct access to source domain samples is restricted, and the source domain contains only a few samples. To tackle the dual challenges of limited source data and privacy concerns, we introduce a data-efficient, CLIP-powered dual-branch network (CDBN in short). We design a cross-modal dual-branch network that integrates source domain class semantics into the unsupervised fine-tuning of the target domain. It preserves the class information from the source domain while enhancing the model's generalization to the target domain. Additionally, we propose an unsupervised optimization strategy driven by accurate classification and diversity, which aims to retain the classification capability learned from the source domain while producing more confident and diverse predictions in the target domain. Extensive experiments across 31 transfer tasks on 7 public datasets demonstrate that our approach achieves state-of-the-art performance compared to existing methods.
Title: Kaninfradet3D:A Road-side Camera-LiDAR Fusion 3D Perception Model based on Nonlinear Feature Extraction and Intrinsic Correlation
Authors: Pei Liu (1), Nanfang Zheng (2), Yiqun Li (2), Junlan Chen (2), Ziyuan Pu (2) ((1) Intelligent Transportation Thrust, Systems Hub, The Hong Kong University of Science and Technology (Guangzhou), (2) Transportation, Southeast University)
Copy Paste: [[2410.15814]] Kaninfradet3D:A Road-side Camera-LiDAR Fusion 3D Perception Model based on Nonlinear Feature Extraction and Intrinsic Correlation(https://arxiv.org/abs/2410.15814)
Keywords: extraction
Abstract: With the development of AI-assisted driving, numerous methods have emerged for ego-vehicle 3D perception tasks, but there has been limited research on roadside perception. With its ability to provide a global view and a broader sensing range, the roadside perspective is worth developing. LiDAR provides precise three-dimensional spatial information, while cameras offer semantic information. These two modalities are complementary in 3D detection. However, adding camera data does not increase accuracy in some studies since the information extraction and fusion procedure is not sufficiently reliable. Recently, Kolmogorov-Arnold Networks (KANs) have been proposed as replacements for MLPs, which are better suited for high-dimensional, complex data. Both the camera and the LiDAR provide high-dimensional information, and employing KANs should enhance the extraction of valuable features to produce better fusion outcomes. This paper proposes Kaninfradet3D, which optimizes the feature extraction and fusion modules. To extract features from complex high-dimensional data, the model's encoder and fuser modules were improved using KAN Layers. Cross-attention was applied to enhance feature fusion, and visual comparisons verified that camera features were more evenly integrated. This addressed the issue of camera features being abnormally concentrated, negatively impacting fusion. Compared to the benchmark, our approach shows improvements of +9.87 mAP and +10.64 mAP in the two viewpoints of the TUMTraf Intersection Dataset and an improvement of +1.40 mAP in the roadside end of the TUMTraf V2X Cooperative Perception Dataset. The results indicate that Kaninfradet3D can effectively fuse features, demonstrating the potential of applying KANs in roadside perception tasks.
Title: Explainability of Highly Associated Fuzzy Churn Patterns in Binary Classification
Authors: D.Y.C. Wang, Lars Arne Jordanger, Jerry Chun-Wei Lin
Copy Paste: [[2410.15827]] Explainability of Highly Associated Fuzzy Churn Patterns in Binary Classification(https://arxiv.org/abs/2410.15827)
Keywords: explainability
Abstract: Customer churn, particularly in the telecommunications sector, influences both costs and profits. As the explainability of models becomes increasingly important, this study emphasizes not only the explainability of customer churn through machine learning models, but also the importance of identifying multivariate patterns and setting soft bounds for intuitive interpretation. The main objective is to use a machine learning model and fuzzy-set theory with top-\textit{k} HUIM to identify highly associated patterns of customer churn with intuitive identification, referred to as Highly Associated Fuzzy Churn Patterns (HAFCP). Moreover, this method aids in uncovering association rules among multiple features across low, medium, and high distributions. Such discoveries are instrumental in enhancing the explainability of findings. Experiments show that when the top-5 HAFCPs are included in five datasets, a mixture of performance results is observed, with some showing notable improvements. It becomes clear that high importance features enhance explanatory power through their distribution and patterns associated with other features. As a result, the study introduces an innovative approach that improves the explainability and effectiveness of customer churn prediction models.
Title: LiOn-XA: Unsupervised Domain Adaptation via LiDAR-Only Cross-Modal Adversarial Training
Authors: Thomas Kreutz, Jens Lemke, Max Mühlhäuser, Alejandro Sanchez Guinea
Abstract: In this paper, we propose LiOn-XA, an unsupervised domain adaptation (UDA) approach that combines LiDAR-Only Cross-Modal (X) learning with Adversarial training for 3D LiDAR point cloud semantic segmentation to bridge the domain gap arising from environmental and sensor setup changes. Unlike existing works that exploit multiple data modalities like point clouds and RGB image data, we address UDA in scenarios where RGB images might not be available and show that two distinct LiDAR data representations can learn from each other for UDA. More specifically, we leverage 3D voxelized point clouds to preserve important geometric structure in combination with 2D projection-based range images that provide information such as object orientations or surfaces. To further align the feature space between both domains, we apply adversarial training using both features and predictions of both 2D and 3D neural networks. Our experiments on 3 real-to-real adaptation scenarios demonstrate the effectiveness of our approach, achieving new state-of-the-art performance when compared to previous uni- and multi-model UDA methods. Our source code is publicly available at this https URL.
Title: Private, Efficient and Scalable Kernel Learning for Medical Image Analysis
Authors: Anika Hannemann, Arjhun Swaminathan, Ali Burak Ünal, Mete Akgün
Copy Paste: [[2410.15840]] Private, Efficient and Scalable Kernel Learning for Medical Image Analysis(https://arxiv.org/abs/2410.15840)
Keywords: privacy
Abstract: Medical imaging is key in modern medicine. From magnetic resonance imaging (MRI) to microscopic imaging for blood cell detection, diagnostic medical imaging reveals vital insights into patient health. To predict diseases or provide individualized therapies, machine learning techniques like kernel methods have been widely used. Nevertheless, there are multiple challenges for implementing kernel methods. Medical image data often originates from various hospitals and cannot be combined due to privacy concerns, and the high dimensionality of image data presents another significant obstacle. While randomised encoding offers a promising direction, existing methods often struggle with a trade-off between accuracy and efficiency. Addressing the need for efficient privacy-preserving methods on distributed image data, we introduce OKRA (Orthonormal K-fRAmes), a novel randomized encoding-based approach for kernel-based machine learning. This technique, tailored for widely used kernel functions, significantly enhances scalability and speed compared to current state-of-the-art solutions. Through experiments conducted on various clinical image datasets, we evaluated model quality, computational performance, and resource overhead. Additionally, our method outperforms comparable approaches
Title: Random Token Fusion for Multi-View Medical Diagnosis
Authors: Jingyu Guo, Christos Matsoukas, Fredrik Strand, Kevin Smith
Copy Paste: [[2410.15847]] Random Token Fusion for Multi-View Medical Diagnosis(https://arxiv.org/abs/2410.15847)
Keywords: robust, transformer
Abstract: In multi-view medical diagnosis, deep learning-based models often fuse information from different imaging perspectives to improve diagnostic performance. However, existing approaches are prone to overfitting and rely heavily on view-specific features, which can lead to trivial solutions. In this work, we introduce Random Token Fusion (RTF), a novel technique designed to enhance multi-view medical image analysis using vision transformers. By integrating randomness into the feature fusion process during training, RTF addresses the issue of overfitting and enhances the robustness and accuracy of diagnostic models without incurring any additional cost at inference. We validate our approach on standard mammography and chest X-ray benchmark datasets. Through extensive experiments, we demonstrate that RTF consistently improves the performance of existing fusion methods, paving the way for a new generation of multi-view medical foundation models.
Title: Mesa-Extrapolation: A Weave Position Encoding Method for Enhanced Extrapolation in LLMs
Authors: Xin Ma, Yang Liu, Jingjing Liu, Xiaoxu Ma
Copy Paste: [[2410.15859]] Mesa-Extrapolation: A Weave Position Encoding Method for Enhanced Extrapolation in LLMs(https://arxiv.org/abs/2410.15859)
Keywords: large language model
Abstract: Large language models (LLMs), although having revolutionized many fields, still suffer from the challenging extrapolation problem, where the inference ability of LLMs sharply declines beyond their max training lengths. In this work, we conduct a theoretical analysis to better understand why No Position Encoding (NoPE) fails outside its effective range, as well as examining the power of Position Encoding (PE) in this context. Our findings reveal that with meticulous weave position, PE can indeed be extended beyond effective range. Our theorems establish that LLMs equipped with weave PE can achieve improved extrapolation performance without additional cost. Furthermore, we introduce a novel weave PE method, Mesa-Extrapolation, which utilizes a chunk-based triangular attention matrix and applies Stair PE to manage the final chunk. This method not only retains competitive performance but also offers substantial benefits such as significantly reduced memory demand and faster inference speed. Extensive experiments validate the effectiveness of Mesa-Extrapolation, demonstrating its potential as a scalable solution to enhancing LLMs applicative reach.
Copy Paste: [[2410.15882]] Distributed Learning for UAV Swarms(https://arxiv.org/abs/2410.15882)
Keywords: security, privacy, federate
Abstract: Unmanned Aerial Vehicle (UAV) swarms are increasingly deployed in dynamic, data-rich environments for applications such as environmental monitoring and surveillance. These scenarios demand efficient data processing while maintaining privacy and security, making Federated Learning (FL) a promising solution. FL allows UAVs to collaboratively train global models without sharing raw data, but challenges arise due to the non-Independent and Identically Distributed (non-IID) nature of the data collected by UAVs. In this study, we show an integration of the state-of-the-art FL methods to UAV Swarm application and invetigate the performance of multiple aggregation methods (namely FedAvg, FedProx, FedOpt, and MOON) with a particular focus on tackling non-IID on a variety of datasets, specifically MNIST for baseline performance, CIFAR10 for natural object classification, EuroSAT for environment monitoring, and CelebA for surveillance. These algorithms were selected to cover improved techniques on both client-side updates and global aggregation. Results show that while all algorithms perform comparably on IID data, their performance deteriorates significantly under non-IID conditions. FedProx demonstrated the most stable overall performance, emphasising the importance of regularising local updates in non-IID environments to mitigate drastic deviations in local models.
Title: Model Mimic Attack: Knowledge Distillation for Provably Transferable Adversarial Examples
Authors: Kirill Lukyanov, Andrew Perminov, Denis Turdakov, Mikhail Pautov
Copy Paste: [[2410.15889]] Model Mimic Attack: Knowledge Distillation for Provably Transferable Adversarial Examples(https://arxiv.org/abs/2410.15889)
Keywords: attack
Abstract: The vulnerability of artificial neural networks to adversarial perturbations in the black-box setting is widely studied in the literature. The majority of attack methods to construct these perturbations suffer from an impractically large number of queries required to find an adversarial example. In this work, we focus on knowledge distillation as an approach to conduct transfer-based black-box adversarial attacks and propose an iterative training of the surrogate model on an expanding dataset. This work is the first, to our knowledge, to provide provable guarantees on the success of knowledge distillation-based attack on classification neural networks: we prove that if the student model has enough learning capabilities, the attack on the teacher model is guaranteed to be found within the finite number of distillation iterations.
Title: Leveraging CORAL-Correlation Consistency Network for Semi-Supervised Left Atrium MRI Segmentation
Copy Paste: [[2410.15916]] Leveraging CORAL-Correlation Consistency Network for Semi-Supervised Left Atrium MRI Segmentation(https://arxiv.org/abs/2410.15916)
Keywords: segmentation
Abstract: Semi-supervised learning (SSL) has been widely used to learn from both a few labeled images and many unlabeled images to overcome the scarcity of labeled samples in medical image segmentation. Most current SSL-based segmentation methods use pixel values directly to identify similar features in labeled and unlabeled data. They usually fail to accurately capture the intricate attachment structures in the left atrium, such as the areas of inconsistent density or exhibit outward curvatures, adding to the complexity of the task. In this paper, we delve into this issue and introduce an effective solution, CORAL(Correlation-Aligned)-Correlation Consistency Network (CORN), to capture the global structure shape and local details of Left Atrium. Diverging from previous methods focused on each local pixel value, the CORAL-Correlation Consistency Module (CCM) in the CORN leverages second-order statistical information to capture global structural features by minimizing the distribution discrepancy between labeled and unlabeled samples in feature space. Yet, direct construction of features from unlabeled data frequently results in ``Sample Selection Bias'', leading to flawed supervision. We thus further propose the Dynamic Feature Pool (DFP) for the CCM, which utilizes a confidence-based filtering strategy to remove incorrectly selected features and regularize both teacher and student models by constraining the similarity matrix to be consistent. Extensive experiments on the Left Atrium dataset have shown that the proposed CORN outperforms previous state-of-the-art semi-supervised learning methods.
Title: GReFEL: Geometry-Aware Reliable Facial Expression Learning under Bias and Imbalanced Data Distribution
Authors: Azmine Toushik Wasi, Taki Hasan Rafi, Raima Islam, Karlo Serbetar, Dong Kyu Chae
Copy Paste: [[2410.15927]] GReFEL: Geometry-Aware Reliable Facial Expression Learning under Bias and Imbalanced Data Distribution(https://arxiv.org/abs/2410.15927)
Keywords: transformer
Abstract: Reliable facial expression learning (FEL) involves the effective learning of distinctive facial expression characteristics for more reliable, unbiased and accurate predictions in real-life settings. However, current systems struggle with FEL tasks because of the variance in people's facial expressions due to their unique facial structures, movements, tones, and demographics. Biased and imbalanced datasets compound this challenge, leading to wrong and biased prediction labels. To tackle these, we introduce GReFEL, leveraging Vision Transformers and a facial geometry-aware anchor-based reliability balancing module to combat imbalanced data distributions, bias, and uncertainty in facial expression learning. Integrating local and global data with anchors that learn different facial data points and structural features, our approach adjusts biased and mislabeled emotions caused by intra-class disparity, inter-class similarity, and scale sensitivity, resulting in comprehensive, accurate, and reliable facial expression predictions. Our model outperforms current state-of-the-art methodologies, as demonstrated by extensive experiments on various datasets.
Title: Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection
Authors: Koji Inoue, Divesh Lala, Gabriel Skantze, Tatsuya Kawahara
Copy Paste: [[2410.15929]] Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection(https://arxiv.org/abs/2410.15929)
Keywords: robust
Abstract: In human conversations, short backchannel utterances such as "yeah" and "oh" play a crucial role in facilitating smooth and engaging dialogue. These backchannels signal attentiveness and understanding without interrupting the speaker, making their accurate prediction essential for creating more natural conversational agents. This paper proposes a novel method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection (VAP) model. While existing approaches have relied on turn-based or artificially balanced datasets, our approach predicts both the timing and type of backchannels in a continuous and frame-wise manner on unbalanced, real-world datasets. We first pre-train the VAP model on a general dialogue corpus to capture conversational dynamics and then fine-tune it on a specialized dataset focused on backchannel behavior. Experimental results demonstrate that our model outperforms baseline methods in both timing and type prediction tasks, achieving robust performance in real-time environments. This research offers a promising step toward more responsive and human-like dialogue systems, with implications for interactive spoken dialogue applications such as virtual assistants and robots.
Title: Focus on BEV: Self-calibrated Cycle View Transformation for Monocular Birds-Eye-View Segmentation
Authors: Jiawei Zhao, Qixing Jiang, Xuede Li, Junfeng Luo
Copy Paste: [[2410.15932]] Focus on BEV: Self-calibrated Cycle View Transformation for Monocular Birds-Eye-View Segmentation(https://arxiv.org/abs/2410.15932)
Keywords: segmentation
Abstract: Birds-Eye-View (BEV) segmentation aims to establish a spatial mapping from the perspective view to the top view and estimate the semantic maps from monocular images. Recent studies have encountered difficulties in view transformation due to the disruption of BEV-agnostic features in image space. To tackle this issue, we propose a novel FocusBEV framework consisting of $(i)$ a self-calibrated cross view transformation module to suppress the BEV-agnostic image areas and focus on the BEV-relevant areas in the view transformation stage, $(ii)$ a plug-and-play ego-motion-based temporal fusion module to exploit the spatiotemporal structure consistency in BEV space with a memory bank, and $(iii)$ an occupancy-agnostic IoU loss to mitigate both semantic and positional uncertainties. Experimental evidence demonstrates that our approach achieves new state-of-the-art on two popular benchmarks,\ie, 29.2\% mIoU on nuScenes and 35.2\% mIoU on Argoverse.
Title: CausalGraph2LLM: Evaluating LLMs for Causal Queries
Copy Paste: [[2410.15939]] CausalGraph2LLM: Evaluating LLMs for Causal Queries(https://arxiv.org/abs/2410.15939)
Keywords: large language model
Abstract: Causality is essential in scientific research, enabling researchers to interpret true relationships between variables. These causal relationships are often represented by causal graphs, which are directed acyclic graphs. With the recent advancements in Large Language Models (LLMs), there is an increasing interest in exploring their capabilities in causal reasoning and their potential use to hypothesize causal graphs. These tasks necessitate the LLMs to encode the causal graph effectively for subsequent downstream tasks. In this paper, we propose a comprehensive benchmark, \emph{CausalGraph2LLM}, encompassing a variety of causal graph settings to assess the causal graph understanding capability of LLMs. We categorize the causal queries into two types: graph-level and node-level queries. We benchmark both open-sourced and closed models for our study. Our findings reveal that while LLMs show promise in this domain, they are highly sensitive to the encoding used. Even capable models like GPT-4 and Gemini-1.5 exhibit sensitivity to encoding, with deviations of about $60\%$. We further demonstrate this sensitivity for downstream causal intervention tasks. Moreover, we observe that LLMs can often display biases when presented with contextual information about a causal graph, potentially stemming from their parametric memory.
Title: A Low-Cost Privacy-Preserving Digital Wallet for Humanitarian Aid Distribution
Authors: Eva Luvison, Sylvain Chatel, Justinas Sukaitis, Vincent Graf Narbel, Carmela Troncoso, Wouter Lueks
Copy Paste: [[2410.15942]] A Low-Cost Privacy-Preserving Digital Wallet for Humanitarian Aid Distribution(https://arxiv.org/abs/2410.15942)
Keywords: security, privacy, fair
Abstract: Humanitarian organizations distribute aid to people affected by armed conflicts or natural disasters. Digitalization has the potential to increase the efficiency and fairness of aid-distribution systems, and recent work by Wang et al. has shown that these benefits are possible without creating privacy harms for aid recipients. However, their work only provides a solution for one particular aid-distribution scenario in which aid recipients receive a pre-defined set of goods. Yet, in many situations it is desirable to enable recipients to decide which items they need at each moment to satisfy their specific needs. We formalize these needs into functional, deployment, security, and privacy requirements, and design a privacy-preserving digital wallet for aid distribution. Our smart-card-based solution enables aid recipients to spend a pre-defined budget at different vendors to obtain the items that they need. We prove our solution's security and privacy properties, and show it is practical at scale.
Title: TS-ACL: A Time Series Analytic Continual Learning Framework for Privacy-Preserving and Class-Incremental Pattern Recognition
Copy Paste: [[2410.15954]] TS-ACL: A Time Series Analytic Continual Learning Framework for Privacy-Preserving and Class-Incremental Pattern Recognition(https://arxiv.org/abs/2410.15954)
Keywords: privacy, protect, extraction
Abstract: Class-incremental Learning (CIL) in Time Series Classification (TSC) aims to incrementally train models using the streaming time series data that arrives continuously. The main problem in this scenario is catastrophic forgetting, i.e., training models with new samples inevitably leads to the forgetting of previously learned knowledge. Among existing methods, the replay-based methods achieve satisfactory performance but compromise privacy, while exemplar-free methods protect privacy but suffer from low accuracy. However, more critically, owing to their reliance on gradient-based update techniques, these existing methods fundamentally cannot solve the catastrophic forgetting problem. In TSC scenarios with continuously arriving data and temporally shifting distributions, these methods become even less practical. In this paper, we propose a Time Series Analytic Continual Learning framework, called TS-ACL. Inspired by analytical learning, TS-ACL transforms neural network updates into gradient-free linear regression problems, thereby fundamentally mitigating catastrophic forgetting. Specifically, employing a pre-trained and frozen feature extraction encoder, TS-ACL only needs to update its analytic classifier recursively in a lightweight manner that is highly suitable for real-time applications and large-scale data processing. Additionally, we theoretically demonstrate that the model obtained recursively through the TS-ACL is exactly equivalent to a model trained on the complete dataset in a centralized manner, thereby establishing the property of absolute knowledge memory. Extensive experiments validate the superior performance of our TS-ACL.
Title: Do Large Language Models Have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs
Authors: Yanzhu Guo, Simone Conia, Zelin Zhou, Min Li, Saloni Potdar, Henry Xiao
Copy Paste: [[2410.15956]] Do Large Language Models Have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs(https://arxiv.org/abs/2410.15956)
Keywords: large language model
Abstract: Current Large Language Models (LLMs) are predominantly designed with English as the primary language, and even the few that are multilingual tend to exhibit strong English-centric biases. Much like speakers who might produce awkward expressions when learning a second language, LLMs often generate unnatural outputs in non-English languages, reflecting English-centric patterns in both vocabulary and grammar. Despite the importance of this issue, the naturalness of multilingual LLM outputs has received limited attention. In this paper, we address this gap by introducing novel automatic corpus-level metrics to assess the lexical and syntactic naturalness of LLM outputs in a multilingual context. Using our new metrics, we evaluate state-of-the-art LLMs on a curated benchmark in French and Chinese, revealing a tendency towards English-influenced patterns. To mitigate this issue, we also propose a simple and effective alignment method to improve the naturalness of an LLM in a target language and domain, achieving consistent improvements in naturalness without compromising the performance on general-purpose benchmarks. Our work highlights the importance of developing multilingual metrics, resources and methods for the new wave of multilingual LLMs.
Title: CamI2V: Camera-Controlled Image-to-Video Diffusion Model
Authors: Guangcong Zheng, Teng Li, Rui Jiang, Yehao Lu, Tao Wu, Xi Li
Abstract: Recently, camera pose, as a user-friendly and physics-related condition, has been introduced into text-to-video diffusion model for camera control. However, existing methods simply inject camera conditions through a side input. These approaches neglect the inherent physical knowledge of camera pose, resulting in imprecise camera control, inconsistencies, and also poor interpretability. In this paper, we emphasize the necessity of integrating explicit physical constraints into model design. Epipolar attention is proposed for modeling all cross-frame relationships from a novel perspective of noised condition. This ensures that features are aggregated from corresponding epipolar lines in all noised frames, overcoming the limitations of current attention mechanisms in tracking displaced features across frames, especially when features move significantly with the camera and become obscured by noise. Additionally, we introduce register tokens to handle cases without intersections between frames, commonly caused by rapid camera movements, dynamic objects, or occlusions. To support image-to-video, we propose the multiple guidance scale to allow for precise control for image, text, and camera, respectively. Furthermore, we establish a more robust and reproducible evaluation pipeline to solve the inaccuracy and instability of existing camera control measurement. We achieve a 25.5\% improvement in camera controllability on RealEstate10K while maintaining strong generalization to out-of-domain images. Only 24GB and 12GB are required for training and inference, respectively. We plan to release checkpoints, along with training and evaluation codes. Dynamic videos are best viewed at \url{this https URL}.
Title: Systematic Exploration of Dialogue Summarization Approaches for Reproducibility, Comparative Assessment, and Methodological Innovations for Advancing Natural Language Processing in Abstractive Summarization
Copy Paste: [[2410.15962]] Systematic Exploration of Dialogue Summarization Approaches for Reproducibility, Comparative Assessment, and Methodological Innovations for Advancing Natural Language Processing in Abstractive Summarization(https://arxiv.org/abs/2410.15962)
Keywords: robust
Abstract: Reproducibility in scientific research, particularly within the realm of natural language processing (NLP), is essential for validating and verifying the robustness of experimental findings. This paper delves into the reproduction and evaluation of dialogue summarization models, focusing specifically on the discrepancies observed between original studies and our reproduction efforts. Dialogue summarization is a critical aspect of NLP, aiming to condense conversational content into concise and informative summaries, thus aiding in efficient information retrieval and decision-making processes. Our research involved a thorough examination of several dialogue summarization models using the AMI (Augmented Multi-party Interaction) dataset. The models assessed include Hierarchical Memory Networks (HMNet) and various versions of Pointer-Generator Networks (PGN), namely PGN(DKE), PGN(DRD), PGN(DTS), and PGN(DALL). The primary objective was to evaluate the informativeness and quality of the summaries generated by these models through human assessment, a method that introduces subjectivity and variability in the evaluation process. The analysis began with Dataset 1, where the sample standard deviation of 0.656 indicated a moderate dispersion of data points around the mean.
Title: Self-Explained Keywords Empower Large Language Models for Code Generation
Copy Paste: [[2410.15966]] Self-Explained Keywords Empower Large Language Models for Code Generation(https://arxiv.org/abs/2410.15966)
Keywords: large language model
Abstract: Large language models (LLMs) have achieved impressive performance in code generation. However, due to the long-tail distribution of LLMs' training data, low-frequency terms are typically underrepresented in the training process. Consequently, LLMs often misunderstand or overlook problem-specific, low-frequency keywords during code generation, compromising the accuracy of the generated code. To address this, we propose a novel technique named SEK(\textbf{S}elf-\textbf{E}xplained \textbf{K}eywords), which empowers an LLM for better code generation by extracting and explaining the key terms in the problem description with the LLM itself and ranking them based on frequency. Comprehensive experiments across three benchmarks, i.e., HumanEval(+), MBPP(+), and APPS, with five representative LLMs, show that SEK can significantly improve LLMs in code generation, yielding substantial and consistent gains. For instance, SEK improves the Pass@1 of DeepSeek-Coder-V2-Instruct from 85.4\% to 93.3\% on the Humaneval benchmark. Further analysis confirms that SEK enables the LLMs to shift their attention from low-frequency keywords to their corresponding high-frequency counterparts.
Title: Zero-Shot Scene Reconstruction from Single Images with Deep Prior Assembly
Copy Paste: [[2410.15971]] Zero-Shot Scene Reconstruction from Single Images with Deep Prior Assembly(https://arxiv.org/abs/2410.15971)
Keywords: robust
Abstract: Large language and vision models have been leading a revolution in visual computing. By greatly scaling up sizes of data and model parameters, the large models learn deep priors which lead to remarkable performance in various tasks. In this work, we present deep prior assembly, a novel framework that assembles diverse deep priors from large models for scene reconstruction from single images in a zero-shot manner. We show that this challenging task can be done without extra knowledge but just simply generalizing one deep prior in one sub-task. To this end, we introduce novel methods related to poses, scales, and occlusion parsing which are keys to enable deep priors to work together in a robust way. Deep prior assembly does not require any 3D or 2D data-driven training in the task and demonstrates superior performance in generalizing priors to open-world scenes. We conduct evaluations on various datasets, and report analysis, numerical and visual comparisons with the latest methods to show our superiority. Project page: this https URL.
Title: Large Language Models for Cross-lingual Emotion Detection
Copy Paste: [[2410.15974]] Large Language Models for Cross-lingual Emotion Detection(https://arxiv.org/abs/2410.15974)
Keywords: large language model
Abstract: This paper presents a detailed system description of our entry for the WASSA 2024 Task 2, focused on cross-lingual emotion detection. We utilized a combination of large language models (LLMs) and their ensembles to effectively understand and categorize emotions across different languages. Our approach not only outperformed other submissions with a large margin, but also demonstrated the strength of integrating multiple models to enhance performance. Additionally, We conducted a thorough comparison of the benefits and limitations of each model used. An error analysis is included along with suggested areas for future improvement. This paper aims to offer a clear and comprehensive understanding of advanced techniques in emotion detection, making it accessible even to those new to the field.
Copy Paste: [[2410.15980]] Granularity Matters in Long-Tail Learning(https://arxiv.org/abs/2410.15980)
Keywords: robust, large language model
Abstract: Balancing training on long-tail data distributions remains a long-standing challenge in deep learning. While methods such as re-weighting and re-sampling help alleviate the imbalance issue, limited sample diversity continues to hinder models from learning robust and generalizable feature representations, particularly for tail classes. In contrast to existing methods, we offer a novel perspective on long-tail learning, inspired by an observation: datasets with finer granularity tend to be less affected by data imbalance. In this paper, we investigate this phenomenon through both quantitative and qualitative studies, showing that increased granularity enhances the generalization of learned features in tail categories. Motivated by these findings, we propose a method to increase dataset granularity through category extrapolation. Specifically, we introduce open-set auxiliary classes that are visually similar to existing ones, aiming to enhance representation learning for both head and tail classes. This forms the core contribution and insight of our approach. To automate the curation of auxiliary data, we leverage large language models (LLMs) as knowledge bases to search for auxiliary categories and retrieve relevant images through web crawling. To prevent the overwhelming presence of auxiliary classes from disrupting training, we introduce a neighbor-silencing loss that encourages the model to focus on class discrimination within the target dataset. During inference, the classifier weights for auxiliary categories are masked out, leaving only the target class weights for use. Extensive experiments and ablation studies on three standard long-tail benchmarks demonstrate the effectiveness of our approach, notably outperforming strong baseline methods that use the same amount of data. The code will be made publicly available.
Title: 1024m at SMM4H 2024: Tasks 3, 5 & 6 -- Ensembles of Transformers and Large Language Models for Medical Text Classification
Copy Paste: [[2410.15998]] 1024m at SMM4H 2024: Tasks 3, 5 & 6 -- Ensembles of Transformers and Large Language Models for Medical Text Classification(https://arxiv.org/abs/2410.15998)
Keywords: transformer, large language model
Abstract: Social media is a great source of data for users reporting information and regarding their health and how various things have had an effect on them. This paper presents various approaches using Transformers and Large Language Models and their ensembles, their performance along with advantages and drawbacks for various tasks of SMM4H'24 - Classifying texts on impact of nature and outdoor spaces on the author's mental health (Task 3), Binary classification of tweets reporting their children's health disorders like Asthma, Autism, ADHD and Speech disorder (task 5), Binary classification of users self-reporting their age (task 6).
Title: Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
Copy Paste: [[2410.15999]] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering(https://arxiv.org/abs/2410.15999)
Keywords: large language model
Abstract: Large language models (LLMs) can store a significant amount of factual knowledge in their parameters. However, their parametric knowledge may conflict with the information provided in the context -- this phenomenon, known as \emph{context-memory knowledge conflicts}, can lead to undesirable model behaviour, such as reliance on outdated or incorrect information. Analysing the internal activations of LLMs, we find that they can internally register the signals of knowledge conflict at mid-layers. Such signals allow us to detect whether a knowledge conflict occurs and use \emph{inference-time} intervention strategies to resolve it. In this work, we propose \textsc{SpARE}, a \emph{training-free} representation engineering method that uses pre-trained sparse auto-encoders (SAEs) to control the knowledge selection behaviour of LLMs. \textsc{SpARE} identifies the functional features that control the knowledge selection behaviours and applies them to edit the internal activations of LLMs at inference time. Our experimental results show that \textsc{SpARE} can effectively control the usage of either knowledge source to resolve knowledge conflict in open-domain question-answering tasks, surpassing existing representation engineering methods ($+10\%$) as well as contrastive decoding methods ($+15\%$).
Title: Exploring Continual Fine-Tuning for Enhancing Language Ability in Large Language Model
Copy Paste: [[2410.16006]] Exploring Continual Fine-Tuning for Enhancing Language Ability in Large Language Model(https://arxiv.org/abs/2410.16006)
Keywords: generative, large language model
Abstract: A common challenge towards the adaptability of Large Language Models (LLMs) is their ability to learn new languages over time without hampering the model's performance on languages in which the model is already proficient (usually English). Continual fine-tuning (CFT) is the process of sequentially fine-tuning an LLM to enable the model to adapt to downstream tasks with varying data distributions and time shifts. This paper focuses on the language adaptability of LLMs through CFT. We study a two-phase CFT process in which an English-only end-to-end fine-tuned LLM from Phase 1 (predominantly Task Ability) is sequentially fine-tuned on a multilingual dataset -- comprising task data in new languages -- in Phase 2 (predominantly Language Ability). We observe that the ``similarity'' of Phase 2 tasks with Phase 1 determines the LLM's adaptability. For similar phase-wise datasets, the LLM after Phase 2 does not show deterioration in task ability. In contrast, when the phase-wise datasets are not similar, the LLM's task ability deteriorates. We test our hypothesis on the open-source \mis\ and \llm\ models with multiple phase-wise dataset pairs. To address the deterioration, we analyze tailored variants of two CFT methods: layer freezing and generative replay. Our findings demonstrate their effectiveness in enhancing the language ability of LLMs while preserving task performance, in comparison to relevant baselines.
Title: Information-Theoretic Minimax Regret Bounds for Reinforcement Learning based on Duality
Authors: Raghav Bongole, Amaury Gouverneur, Borja Rodríguez-Gálvez, Tobias J. Oechtering, Mikael Skoglund
Copy Paste: [[2410.16013]] Information-Theoretic Minimax Regret Bounds for Reinforcement Learning based on Duality(https://arxiv.org/abs/2410.16013)
Keywords: robust
Abstract: We study agents acting in an unknown environment where the agent's goal is to find a robust policy. We consider robust policies as policies that achieve high cumulative rewards for all possible environments. To this end, we consider agents minimizing the maximum regret over different environment parameters, leading to the study of minimax regret. This research focuses on deriving information-theoretic bounds for minimax regret in Markov Decision Processes (MDPs) with a finite time horizon. Building on concepts from supervised learning, such as minimum excess risk (MER) and minimax excess risk, we use recent bounds on the Bayesian regret to derive minimax regret bounds. Specifically, we establish minimax theorems and use bounds on the Bayesian regret to perform minimax regret analysis using these minimax theorems. Our contributions include defining a suitable minimax regret in the context of MDPs, finding information-theoretic bounds for it, and applying these bounds in various scenarios.
Abstract: Cybersecurity has become a crucial concern in the field of connected autonomous vehicles. Cyber threat intelligence (CTI), as the collection of cyber threat information, offers an ideal way for responding to emerging cyber threats and realizing proactive security defense. However, instant analysis and modeling of vehicle cybersecurity data is a fundamental challenge since its complex and professional context. In this paper, we suggest an automotive CTI modeling framework, Actim, to extract and analyse the interrelated relationships among cyber threat elements. Specifically, we first design a vehicle security-safety conceptual ontology model to depict various threat entity classes and their relations. Then, we manually annotate the first automobile CTI corpus by using real cybersecurity data, which comprises 908 threat intelligence texts, including 8195 entities and 4852 relationships. To effectively extract cyber threat entities and their relations, we propose an automotive CTI mining model based on cross-sentence context. Experiment results show that the proposed BERT-DocHiatt-BiLSTM-LSTM model exceeds the performance of existing methods. Finally, we define entity-relation matching rules and create a CTI knowledge graph that structurally fuses various elements of cyber threats. The Actim framework enables mining the intrinsic connections among threat entities, providing valuable insight on the evolving cyber threat landscape.
Title: START: A Generalized State Space Model with Saliency-Driven Token-Aware Transformation
Authors: Jintao Guo, Lei Qi, Yinghuan Shi, Yang Gao
Copy Paste: [[2410.16020]] START: A Generalized State Space Model with Saliency-Driven Token-Aware Transformation(https://arxiv.org/abs/2410.16020)
Keywords: transformer
Abstract: Domain Generalization (DG) aims to enable models to generalize to unseen target domains by learning from multiple source domains. Existing DG methods primarily rely on convolutional neural networks (CNNs), which inherently learn texture biases due to their limited receptive fields, making them prone to overfitting source domains. While some works have introduced transformer-based methods (ViTs) for DG to leverage the global receptive field, these methods incur high computational costs due to the quadratic complexity of self-attention. Recently, advanced state space models (SSMs), represented by Mamba, have shown promising results in supervised learning tasks by achieving linear complexity in sequence length during training and fast RNN-like computation during inference. Inspired by this, we investigate the generalization ability of the Mamba model under domain shifts and find that input-dependent matrices within SSMs could accumulate and amplify domain-specific features, thus hindering model generalization. To address this issue, we propose a novel SSM-based architecture with saliency-based token-aware transformation (namely START), which achieves state-of-the-art (SOTA) performances and offers a competitive alternative to CNNs and ViTs. Our START can selectively perturb and suppress domain-specific features in salient tokens within the input-dependent matrices of SSMs, thus effectively reducing the discrepancy between different domains. Extensive experiments on five benchmarks demonstrate that START outperforms existing SOTA DG methods with efficient linear complexity. Our code is available at this https URL.
Title: ComPO: Community Preferences for Language Model Personalization
Authors: Sachin Kumar, Chan Young Park, Yulia Tsvetkov, Noah A. Smith, Hannaneh Hajishirzi
Copy Paste: [[2410.16027]] ComPO: Community Preferences for Language Model Personalization(https://arxiv.org/abs/2410.16027)
Keywords: privacy
Abstract: Conventional algorithms for training language models (LMs) with human feedback rely on preferences that are assumed to account for an "average" user, disregarding subjectivity and finer-grained variations. Recent studies have raised concerns that aggregating such diverse and often contradictory human feedback to finetune models results in generic models that generate outputs not preferred by many user groups, as they tend to average out styles and norms. To address this issue, we draw inspiration from recommendation systems and propose ComPO, a method to personalize preference optimization in LMs by contextualizing the probability distribution of model outputs with the preference provider. Focusing on group-level preferences rather than individuals, we collect and release ComPRed, a question answering dataset with community-level preferences from Reddit. This dataset facilitates studying diversity in preferences without incurring privacy concerns associated with individual feedback. Our experiments reveal that conditioning language models on a community identifier (i.e., subreddit name) during preference tuning substantially enhances model performance. Conversely, replacing this context with random subreddit identifiers significantly diminishes performance, highlighting the effectiveness of our approach in tailoring responses to communities' preferences.
Title: TimeMixer++: A General Time Series Pattern Machine for Universal Predictive Analysis
Authors: Shiyu Wang, Jiawei Li, Xiaoming Shi, Zhou Ye, Baichuan Mo, Wenze Lin, Shengtong Ju, Zhixuan Chu, Ming Jin
Copy Paste: [[2410.16032]] TimeMixer++: A General Time Series Pattern Machine for Universal Predictive Analysis(https://arxiv.org/abs/2410.16032)
Keywords: extraction
Abstract: Time series analysis plays a critical role in numerous applications, supporting tasks such as forecasting, classification, anomaly detection, and imputation. In this work, we present the time series pattern machine (TSPM), a model designed to excel in a broad range of time series tasks through powerful representation and pattern extraction capabilities. Traditional time series models often struggle to capture universal patterns, limiting their effectiveness across diverse tasks. To address this, we define multiple scales in the time domain and various resolutions in the frequency domain, employing various mixing strategies to extract intricate, task-adaptive time series patterns. Specifically, we introduce a general-purpose TSPM that processes multi-scale time series using (1) multi-resolution time imaging (MRTI), (2) time image decomposition (TID), (3) multi-scale mixing (MCM), and (4) multi-resolution mixing (MRM) to extract comprehensive temporal patterns. MRTI transforms multi-scale time series into multi-resolution time images, capturing patterns across both temporal and frequency domains. TID leverages dual-axis attention to extract seasonal and trend patterns, while MCM hierarchically aggregates these patterns across scales. MRM adaptively integrates all representations across resolutions. This method achieves state-of-the-art performance across 8 time series analytical tasks, consistently surpassing both general-purpose and task-specific models. Our work marks a promising step toward the next generation of TSPMs, paving the way for further advancements in time series analysis.
Title: TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling
Copy Paste: [[2410.16033]] TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling(https://arxiv.org/abs/2410.16033)
Keywords: large language model
Abstract: Inference-time alignment enhances the performance of large language models without requiring additional training or fine-tuning but presents challenges due to balancing computational efficiency with high-quality output. Best-of-N (BoN) sampling, as a simple yet powerful approach, generates multiple responses and selects the best one, achieving improved performance but with a high computational cost. We propose TreeBoN, a novel framework that integrates a speculative tree-search strategy into Best-of-N (BoN) Sampling. TreeBoN maintains a set of parent nodes, iteratively branching and pruning low-quality responses, thereby reducing computational overhead while maintaining high output quality. Our approach also leverages token-level rewards from Direct Preference Optimization (DPO) to guide tree expansion and prune low-quality paths. We evaluate TreeBoN using AlpacaFarm, UltraFeedback, GSM8K, HH-RLHF, and TutorEval datasets, demonstrating consistent improvements. Specifically, TreeBoN achieves a 65% win rate at maximum lengths of 192 and 384 tokens, outperforming standard BoN with the same computational cost. Furthermore, TreeBoN achieves around a 60% win rate across longer responses, showcasing its scalability and alignment efficacy.
Title: Improving the Multi-label Atomic Activity Recognition by Robust Visual Feature and Advanced Attention @ ROAD++ Atomic Activity Recognition 2024
Copy Paste: [[2410.16037]] Improving the Multi-label Atomic Activity Recognition by Robust Visual Feature and Advanced Attention @ ROAD++ Atomic Activity Recognition 2024(https://arxiv.org/abs/2410.16037)
Keywords: robust, extraction
Abstract: Road++ Track3 proposes a multi-label atomic activity recognition task in traffic scenarios, which can be standardized as a 64-class multi-label video action recognition task. In the multi-label atomic activity recognition task, the robustness of visual feature extraction remains a key challenge, which directly affects the model performance and generalization ability. To cope with these issues, our team optimized three aspects: data processing, model and post-processing. Firstly, the appropriate resolution and video sampling strategy are selected, and a fixed sampling strategy is set on the validation and test sets. Secondly, in terms of model training, the team selects a variety of visual backbone networks for feature extraction, and then introduces the action-slot model, which is trained on the training and validation sets, and reasoned on the test set. Finally, for post-processing, the team combined the strengths and weaknesses of different models for weighted fusion, and the final mAP on the test set was 58%, which is 4% higher than the challenge baseline.
Title: Large Language Models Know What To Say But Not When To Speak
Authors: Muhammad Umair, Vasanth Sarathy, JP de Ruiter
Copy Paste: [[2410.16044]] Large Language Models Know What To Say But Not When To Speak(https://arxiv.org/abs/2410.16044)
Keywords: large language model
Abstract: Turn-taking is a fundamental mechanism in human communication that ensures smooth and coherent verbal interactions. Recent advances in Large Language Models (LLMs) have motivated their use in improving the turn-taking capabilities of Spoken Dialogue Systems (SDS), such as their ability to respond at appropriate times. However, existing models often struggle to predict opportunities for speaking -- called Transition Relevance Places (TRPs) -- in natural, unscripted conversations, focusing only on turn-final TRPs and not within-turn TRPs. To address these limitations, we introduce a novel dataset of participant-labeled within-turn TRPs and use it to evaluate the performance of state-of-the-art LLMs in predicting opportunities for speaking. Our experiments reveal the current limitations of LLMs in modeling unscripted spoken interactions, highlighting areas for improvement and paving the way for more naturalistic dialogue systems.
Title: Label Filling via Mixed Supervision for Medical Image Segmentation from Noisy Annotations
Copy Paste: [[2410.16057]] Label Filling via Mixed Supervision for Medical Image Segmentation from Noisy Annotations(https://arxiv.org/abs/2410.16057)
Keywords: segmentation
Abstract: The success of medical image segmentation usually requires a large number of high-quality labels. But since the labeling process is usually affected by the raters' varying skill levels and characteristics, the estimated masks provided by different raters usually suffer from high inter-rater variability. In this paper, we propose a simple yet effective Label Filling framework, termed as LF-Net, predicting the groundtruth segmentation label given only noisy annotations during training. The fundamental idea of label filling is to supervise the segmentation model by a subset of pixels with trustworthy labels, meanwhile filling labels of other pixels by mixed supervision. More concretely, we propose a qualified majority voting strategy, i.e., a threshold voting scheme is designed to model agreement among raters and the majority-voted labels of the selected subset of pixels are regarded as supervision. To fill labels of other pixels, two types of mixed auxiliary supervision are proposed: a soft label learned from intrinsic structures of noisy annotations, and raters' characteristics labels which propagate individual rater's characteristics information. LF-Net has two main advantages. 1) Training with trustworthy pixels incorporates training with confident supervision, guiding the direction of groundtruth label learning. 2) Two types of mixed supervision prevent over-fitting issues when the network is supervised by a subset of pixels, and guarantee high fidelity with the true label. Results on five datasets of diverse imaging modalities show that our LF-Net boosts segmentation accuracy in all datasets compared with state-of-the-art methods, with even a 7% improvement in DSC for MS lesion segmentation.
Title: Surprise! Uniform Information Density Isn't the Whole Story: Predicting Surprisal Contours in Long-form Discourse
Authors: Eleftheria Tsipidi, Franz Nowak, Ryan Cotterell, Ethan Wilcox, Mario Giulianelli, Alex Warstadt
Copy Paste: [[2410.16062]] Surprise! Uniform Information Density Isn't the Whole Story: Predicting Surprisal Contours in Long-form Discourse(https://arxiv.org/abs/2410.16062)
Keywords: large language model
Abstract: The Uniform Information Density (UID) hypothesis posits that speakers tend to distribute information evenly across linguistic units to achieve efficient communication. Of course, information rate in texts and discourses is not perfectly uniform. While these fluctuations can be viewed as theoretically uninteresting noise on top of a uniform target, another explanation is that UID is not the only functional pressure regulating information content in a language. Speakers may also seek to maintain interest, adhere to writing conventions, and build compelling arguments. In this paper, we propose one such functional pressure; namely that speakers modulate information rate based on location within a hierarchically-structured model of discourse. We term this the Structured Context Hypothesis and test it by predicting the surprisal contours of naturally occurring discourses extracted from large language models using predictors derived from discourse structure. We find that hierarchical predictors are significant predictors of a discourse's information contour and that deeply nested hierarchical predictors are more predictive than shallow ones. This work takes an initial step beyond UID to propose testable hypotheses for why the information rate fluctuates in predictable ways
Title: Integrated Image-Text Based on Semi-supervised Learning for Small Sample Instance Segmentation
Copy Paste: [[2410.16063]] Integrated Image-Text Based on Semi-supervised Learning for Small Sample Instance Segmentation(https://arxiv.org/abs/2410.16063)
Keywords: segmentation
Abstract: Small sample instance segmentation is a very challenging task, and many existing methods follow the training strategy of meta-learning which pre-train models on support set and fine-tune on query set. The pre-training phase, which is highly task related, requires a significant amount of additional training time and the selection of datasets with close proximity to ensure effectiveness. The article proposes a novel small sample instance segmentation solution from the perspective of maximizing the utilization of existing information without increasing annotation burden and training costs. The proposed method designs two modules to address the problems encountered in small sample instance segmentation. First, it helps the model fully utilize unlabeled data by learning to generate pseudo labels, increasing the number of available samples. Second, by integrating the features of text and image, more accurate classification results can be obtained. These two modules are suitable for box-free and box-dependent frameworks. In the way, the proposed method not only improves the performance of small sample instance segmentation, but also greatly reduce reliance on pre-training. We have conducted experiments in three datasets from different scenes: on land, underwater and under microscope. As evidenced by our experiments, integrated image-text corrects the confidence of classification, and pseudo labels help the model obtain preciser masks. All the results demonstrate the effectiveness and superiority of our method.
Title: CartesianMoE: Boosting Knowledge Sharing among Experts via Cartesian Product Routing in Mixture-of-Experts
Copy Paste: [[2410.16077]] CartesianMoE: Boosting Knowledge Sharing among Experts via Cartesian Product Routing in Mixture-of-Experts(https://arxiv.org/abs/2410.16077)
Keywords: robust, large language model
Abstract: Large language models (LLM) have been attracting much attention from the community recently, due to their remarkable performance in all kinds of downstream tasks. According to the well-known scaling law, scaling up a dense LLM enhances its capabilities, but also significantly increases the computational complexity. Mixture-of-Experts (MoE) models address that by allowing the model size to grow without substantially raising training or inference costs. Yet MoE models face challenges regarding knowledge sharing among experts, making their performance somehow sensitive to routing accuracy. To tackle that, previous works introduced shared experts and combined their outputs with those of the top $K$ routed experts in an ``addition'' manner. In this paper, inspired by collective matrix factorization to learn shared knowledge among data, we propose CartesianMoE, which implements more effective knowledge sharing among experts in more like a ``multiplication'' manner. Extensive experimental results indicate that CartesianMoE outperforms previous MoE models for building LLMs, in terms of both perplexity and downstream task performance. And we also find that CartesianMoE achieves better expert routing robustness.
Title: Fine-Tuning LLMs for Reliable Medical Question-Answering Services
Copy Paste: [[2410.16088]] Fine-Tuning LLMs for Reliable Medical Question-Answering Services(https://arxiv.org/abs/2410.16088)
Keywords: large language model
Abstract: We present an advanced approach to medical question-answering (QA) services, using fine-tuned Large Language Models (LLMs) to improve the accuracy and reliability of healthcare information. Our study focuses on optimizing models like LLaMA-2 and Mistral, which have shown great promise in delivering precise, reliable medical answers. By leveraging comprehensive datasets, we applied fine-tuning techniques such as rsDoRA+ and ReRAG. rsDoRA+ enhances model performance through a combination of decomposed model weights, varied learning rates for low-rank matrices, and rank stabilization, leading to improved efficiency. ReRAG, which integrates retrieval on demand and question rewriting, further refines the accuracy of the responses. This approach enables healthcare providers to access fast, dependable information, aiding in more efficient decision-making and fostering greater patient trust. Our work highlights the potential of fine-tuned LLMs to significantly improve the quality and accessibility of medical information services, ultimately contributing to better healthcare outcomes for all.
Title: Analysing the Residual Stream of Language Models Under Knowledge Conflicts
Copy Paste: [[2410.16090]] Analysing the Residual Stream of Language Models Under Knowledge Conflicts(https://arxiv.org/abs/2410.16090)
Keywords: large language model
Abstract: Large language models (LLMs) can store a significant amount of factual knowledge in their parameters. However, their parametric knowledge may conflict with the information provided in the context. Such conflicts can lead to undesirable model behaviour, such as reliance on outdated or incorrect information. In this work, we investigate whether LLMs can identify knowledge conflicts and whether it is possible to know which source of knowledge the model will rely on by analysing the residual stream of the LLM. Through probing tasks, we find that LLMs can internally register the signal of knowledge conflict in the residual stream, which can be accurately detected by probing the intermediate model activations. This allows us to detect conflicts within the residual stream before generating the answers without modifying the input or model parameters. Moreover, we find that the residual stream shows significantly different patterns when the model relies on contextual knowledge versus parametric knowledge to resolve conflicts. This pattern can be employed to estimate the behaviour of LLMs when conflict happens and prevent unexpected answers before producing the answers. Our analysis offers insights into how LLMs internally manage knowledge conflicts and provides a foundation for developing methods to control the knowledge selection processes.
Title: Defending Against Attack on the Cloned: In-Band Active Man-in-the-Middle Detection for the Signal Protocol
Copy Paste: [[2410.16098]] Defending Against Attack on the Cloned: In-Band Active Man-in-the-Middle Detection for the Signal Protocol(https://arxiv.org/abs/2410.16098)
Keywords: secure, security, attack
Abstract: With Signal's position as one of the most popular secure messaging protocols in use today, the threat of government coercion and mass surveillance, i.e., active Man-in-the-Middle (MitM) attacks, are more relevant than ever. On the other hand, studies [29, 33, 37, 38] have shown that user awareness is very poor when it comes to authenticating keys in instant messaging applications, e.g., comparing key fingerprints out-of-band. The ideal solution to this problem should not require the active participation of the users. Our solution to active MitM attacks builds directly on Signal. We automate the process of key confirmation without relying on the intervention of users, and without using an out-of-band communication channel, at the cost of slightly altered trust assumptions on the server. We consider a powerful active MitM that not only controls the communication channel, but also has (one time) access to all secrets on one of the clients, i.e., can perform a key compromise attack. Our solution utilises the server to keep track of the changes in the clients key fingerprint as ratcheting is performed. Given that the server can keep a message log already, we find that any impact on deniability is minimal in practice. We present our detailed modifications to Signal, and document the new security guarantees while preserving the existing security guarantees of Signal. Our proof-of-concept implementation, which is based on the open-source Signal library used in real-world instant messaging applications, shows that our solution is practical and integrates well with the library. Our experimental results further show that our solution only has a tiny performance overhead when compared to Signal.
Title: Do LLMs write like humans? Variation in grammatical and rhetorical styles
Authors: Alex Reinhart, David West Brown, Ben Markey, Michael Laudenbach, Kachatad Pantusen, Ronald Yurko, Gordon Weinberg
Copy Paste: [[2410.16107]] Do LLMs write like humans? Variation in grammatical and rhetorical styles(https://arxiv.org/abs/2410.16107)
Keywords: large language model
Abstract: Large language models (LLMs) are capable of writing grammatical text that follows instructions, answers questions, and solves problems. As they have advanced, it has become difficult to distinguish their output from human-written text. While past research has found some differences in surface features such as word choice and punctuation, and developed classifiers to detect LLM output, none has studied the rhetorical styles of LLMs. Using several variants of Llama 3 and GPT-4o, we construct two parallel corpora of human- and LLM-written texts from common prompts. Using Douglas Biber's set of lexical, grammatical, and rhetorical features, we identify systematic differences between LLMs and humans and between different LLMs. These differences persist when moving from smaller models to larger ones, and are larger for instruction-tuned models than base models. This demonstrates that despite their advanced abilities, LLMs struggle to match human styles, and hence more advanced linguistic features can detect patterns in their behavior not previously recognized.
Title: Interpreting Microbiome Relative Abundance Data Using Symbolic Regression
Authors: Swagatam Haldar, Christoph Stein-Thoeringer, Vadim Borisov
Copy Paste: [[2410.16109]] Interpreting Microbiome Relative Abundance Data Using Symbolic Regression(https://arxiv.org/abs/2410.16109)
Keywords: interpretability
Abstract: Understanding the complex interactions within the microbiome is crucial for developing effective diagnostic and therapeutic strategies. Traditional machine learning models often lack interpretability, which is essential for clinical and biological insights. This paper explores the application of symbolic regression (SR) to microbiome relative abundance data, with a focus on colorectal cancer (CRC). SR, known for its high interpretability, is compared against traditional machine learning models, e.g., random forest, gradient boosting decision trees. These models are evaluated based on performance metrics such as F1 score and accuracy. We utilize 71 studies encompassing, from various cohorts, over 10,000 samples across 749 species features. Our results indicate that SR not only competes reasonably well in terms of predictive performance, but also excels in model interpretability. SR provides explicit mathematical expressions that offer insights into the biological relationships within the microbiome, a crucial advantage for clinical and biological interpretation. Our experiments also show that SR can help understand complex models like XGBoost via knowledge distillation. To aid in reproducibility and further research, we have made the code openly available at this https URL .
Title: Increasing Interpretability of Neural Networks By Approximating Human Visual Saliency
Authors: Aidan Boyd, Mohamed Trabelsi, Huseyin Uzunalioglu, Dan Kushnir
Copy Paste: [[2410.16115]] Increasing Interpretability of Neural Networks By Approximating Human Visual Saliency(https://arxiv.org/abs/2410.16115)
Keywords: interpretability, explainability
Abstract: Understanding specifically where a model focuses on within an image is critical for human interpretability of the decision-making process. Deep learning-based solutions are prone to learning coincidental correlations in training datasets, causing over-fitting and reducing the explainability. Recent advances have shown that guiding models to human-defined regions of saliency within individual images significantly increases performance and interpretability. Human-guided models also exhibit greater generalization capabilities, as coincidental dataset features are avoided. Results show that models trained with saliency incorporation display an increase in interpretability of up to 30% over models trained without saliency information. The collection of this saliency information, however, can be costly, laborious and in some cases infeasible. To address this limitation, we propose a combination strategy of saliency incorporation and active learning to reduce the human annotation data required by 80% while maintaining the interpretability and performance increase from human saliency. Extensive experimentation outlines the effectiveness of the proposed approach across five public datasets and six active learning criteria.
Title: SeaDAG: Semi-autoregressive Diffusion for Conditional Directed Acyclic Graph Generation
Authors: Xinyi Zhou, Xing Li, Yingzhao Lian, Yiwen Wang, Lei Chen, Mingxuan Yuan, Jianye Hao, Guangyong Chen, Pheng Ann Heng
Abstract: We introduce SeaDAG, a semi-autoregressive diffusion model for conditional generation of Directed Acyclic Graphs (DAGs). Considering their inherent layer-wise structure, we simulate layer-wise autoregressive generation by designing different denoising speed for different layers. Unlike conventional autoregressive generation that lacks a global graph structure view, our method maintains a complete graph structure at each diffusion step, enabling operations such as property control that require the full graph structure. Leveraging this capability, we evaluate the DAG properties during training by employing a graph property decoder. We explicitly train the model to learn graph conditioning with a condition loss, which enhances the diffusion model's capacity to generate graphs that are both realistic and aligned with specified properties. We evaluate our method on two representative conditional DAG generation tasks: (1) circuit generation from truth tables, where precise DAG structures are crucial for realizing circuit functionality, and (2) molecule generation based on quantum properties. Our approach demonstrates promising results, generating high-quality and realistic DAGs that closely align with given conditions.
Title: Extracting Spatiotemporal Data from Gradients with Large Language Models
Copy Paste: [[2410.16121]] Extracting Spatiotemporal Data from Gradients with Large Language Models(https://arxiv.org/abs/2410.16121)
Keywords: security, privacy, protect, defense, attack, federate, large language model
Abstract: Recent works show that sensitive user data can be reconstructed from gradient updates, breaking the key privacy promise of federated learning. While success was demonstrated primarily on image data, these methods do not directly transfer to other domains, such as spatiotemporal data. To understand privacy risks in spatiotemporal federated learning, we first propose Spatiotemporal Gradient Inversion Attack (ST-GIA), a gradient attack algorithm tailored to spatiotemporal data that successfully reconstructs the original location from gradients. Furthermore, the absence of priors in attacks on spatiotemporal data has hindered the accurate reconstruction of real client data. To address this limitation, we propose ST-GIA+, which utilizes an auxiliary language model to guide the search for potential locations, thereby successfully reconstructing the original data from gradients. In addition, we design an adaptive defense strategy to mitigate gradient inversion attacks in spatiotemporal federated learning. By dynamically adjusting the perturbation levels, we can offer tailored protection for varying rounds of training data, thereby achieving a better trade-off between privacy and utility than current state-of-the-art methods. Through intensive experimental analysis on three real-world datasets, we reveal that the proposed defense strategy can well preserve the utility of spatiotemporal federated learning with effective security protection.
Title: MNIST-Nd: a set of naturalistic datasets to benchmark clustering across dimensions
Authors: Polina Turishcheva, Laura Hansel, Martin Ritzert, Marissa A. Weis, Alexander S. Ecker
Copy Paste: [[2410.16124]] MNIST-Nd: a set of naturalistic datasets to benchmark clustering across dimensions(https://arxiv.org/abs/2410.16124)
Keywords: robust
Abstract: Driven by advances in recording technology, large-scale high-dimensional datasets have emerged across many scientific disciplines. Especially in biology, clustering is often used to gain insights into the structure of such datasets, for instance to understand the organization of different cell types. However, clustering is known to scale poorly to high dimensions, even though the exact impact of dimensionality is unclear as current benchmark datasets are mostly two-dimensional. Here we propose MNIST-Nd, a set of synthetic datasets that share a key property of real-world datasets, namely that individual samples are noisy and clusters do not perfectly separate. MNIST-Nd is obtained by training mixture variational autoencoders with 2 to 64 latent dimensions on MNIST, resulting in six datasets with comparable structure but varying dimensionality. It thus offers the chance to disentangle the impact of dimensionality on clustering. Preliminary common clustering algorithm benchmarks on MNIST-Nd suggest that Leiden is the most robust for growing dimensions.
Title: Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs
Authors: Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen
Copy Paste: [[2410.16135]] Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs(https://arxiv.org/abs/2410.16135)
Keywords: transformer, large language model
Abstract: To date, 2:4 sparsity has stood as the only sparse pattern that can be accelerated using sparse tensor cores on GPUs. In practice, 2:4 sparsity often possesses low actual speedups ($\leq 1.3$) and requires fixed sparse ratios, meaning that other ratios, such as 4:8, 8:16, or those exceeding 50% sparsity, do not incur any speedups on GPUs. Recent studies suggest that V:N:M sparsity is promising in addressing these limitations of 2:4 sparsity. However, regarding accuracy, the effects of V:N:M sparsity on broader Transformer models, such as vision Transformers and large language models (LLMs), are largely unexamined. Moreover, Some specific issues related to V:N:M sparsity, such as how to select appropriate V and M values, remain unresolved. In this study, we thoroughly investigate the application of V:N:M sparsity in vision models and LLMs across multiple tasks, from pertaining to downstream tasks. We propose three key approaches to enhance the applicability and accuracy of V:N:M-sparse Transformers, including heuristic V and M selection, V:N:M-specific channel permutation, and three-staged LoRA training techniques. Experimental results show that, with our methods, the DeiT-small achieves lossless accuracy at 64:2:5 sparsity, while the DeiT-base maintains accuracy even at 64:2:8 sparsity. In addition, the fine-tuned LLama2-7B at 64:2:5 sparsity performs comparably or better than training-free 2:4 sparse alternatives on downstream tasks. More importantly, V:N:M-sparse Transformers offer a wider range of speedup-accuracy trade-offs compared to 2:4 sparsity. Overall, our exploration largely facilitates the V:N:M sparsity to act as a truly effective acceleration solution for Transformers in cost-sensitive inference scenarios.
Title: A Psycholinguistic Evaluation of Language Models' Sensitivity to Argument Roles
Authors: Eun-Kyoung Rosa Lee, Sathvik Nair, Naomi Feldman
Copy Paste: [[2410.16139]] A Psycholinguistic Evaluation of Language Models' Sensitivity to Argument Roles(https://arxiv.org/abs/2410.16139)
Keywords: large language model
Abstract: We present a systematic evaluation of large language models' sensitivity to argument roles, i.e., who did what to whom, by replicating psycholinguistic studies on human argument role processing. In three experiments, we find that language models are able to distinguish verbs that appear in plausible and implausible contexts, where plausibility is determined through the relation between the verb and its preceding arguments. However, none of the models capture the same selective patterns that human comprehenders exhibit during real-time verb prediction. This indicates that language models' capacity to detect verb plausibility does not arise from the same mechanism that underlies human real-time sentence processing.
Title: 1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs
Copy Paste: [[2410.16144]] 1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs(https://arxiv.org/abs/2410.16144)
Keywords: large language model
Abstract: Recent advances in 1-bit Large Language Models (LLMs), such as BitNet and BitNet b1.58, present a promising approach to enhancing the efficiency of LLMs in terms of speed and energy consumption. These developments also enable local LLM deployment across a broad range of devices. In this work, we introduce this http URL, a tailored software stack designed to unlock the full potential of 1-bit LLMs. Specifically, we develop a set of kernels to support fast and lossless inference of ternary BitNet b1.58 LLMs on CPUs. Extensive experiments demonstrate that this http URL achieves significant speedups, ranging from 2.37x to 6.17x on x86 CPUs and from 1.37x to 5.07x on ARM CPUs, across various model sizes. The code is available at this https URL.
Title: Modelling Structured Data Learning with Restricted Boltzmann Machines in the Teacher-Student Setting
Authors: Robin Thériault, Francesco Tosello, Daniele Tantari
Copy Paste: [[2410.16150]] Modelling Structured Data Learning with Restricted Boltzmann Machines in the Teacher-Student Setting(https://arxiv.org/abs/2410.16150)
Keywords: generative
Abstract: Restricted Boltzmann machines (RBM) are generative models capable to learn data with a rich underlying structure. We study the teacher-student setting where a student RBM learns structured data generated by a teacher RBM. The amount of structure in the data is controlled by adjusting the number of hidden units of the teacher and the correlations in the rows of the weights, a.k.a. patterns. In the absence of correlations, we validate the conjecture that the performance is independent of the number of teacher patters and hidden units of the student RBMs, and we argue that the teacher-student setting can be used as a toy model for studying the lottery ticket hypothesis. Beyond this regime, we find that the critical amount of data required to learn the teacher patterns decreases with both their number and correlations. In both regimes, we find that, even with an relatively large dataset, it becomes impossible to learn the teacher patterns if the inference temperature used for regularization is kept too low. In our framework, the student can learn teacher patterns one-to-one or many-to-one, generalizing previous findings about the teacher-student setting with two hidden units to any arbitrary finite number of hidden units.
Title: Warped Diffusion: Solving Video Inverse Problems with Image Diffusion Models
Authors: Giannis Daras, Weili Nie, Karsten Kreis, Alex Dimakis, Morteza Mardani, Nikola Borislavov Kovachki, Arash Vahdat
Copy Paste: [[2410.16152]] Warped Diffusion: Solving Video Inverse Problems with Image Diffusion Models(https://arxiv.org/abs/2410.16152)
Keywords: diffusion
Abstract: Using image models naively for solving inverse video problems often suffers from flickering, texture-sticking, and temporal inconsistency in generated videos. To tackle these problems, in this paper, we view frames as continuous functions in the 2D space, and videos as a sequence of continuous warping transformations between different frames. This perspective allows us to train function space diffusion models only on images and utilize them to solve temporally correlated inverse problems. The function space diffusion models need to be equivariant with respect to the underlying spatial transformations. To ensure temporal consistency, we introduce a simple post-hoc test-time guidance towards (self)-equivariant solutions. Our method allows us to deploy state-of-the-art latent diffusion models such as Stable Diffusion XL to solve video inverse problems. We demonstrate the effectiveness of our method for video inpainting and $8\times$ video super-resolution, outperforming existing techniques based on noise transformations. We provide generated video results: this https URL\this http URL.
Title: Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
Authors: Xiang Yue, Yueqi Song, Akari Asai, Seungone Kim, Jean de Dieu Nyandwi, Simran Khanuja, Anjali Kantharuban, Lintang Sutawika, Sathyanarayanan Ramamoorthy, Graham Neubig
Copy Paste: [[2410.16153]] Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages(https://arxiv.org/abs/2410.16153)
Keywords: robust, large language model
Abstract: Despite recent advances in multimodal large language models (MLLMs), their development has predominantly focused on English- and western-centric datasets and tasks, leaving most of the world's languages and diverse cultural contexts underrepresented. This paper introduces Pangea, a multilingual multimodal LLM trained on PangeaIns, a diverse 6M instruction dataset spanning 39 languages. PangeaIns features: 1) high-quality English instructions, 2) carefully machine-translated instructions, and 3) culturally relevant multimodal tasks to ensure cross-cultural coverage. To rigorously assess models' capabilities, we introduce PangeaBench, a holistic evaluation suite encompassing 14 datasets covering 47 languages. Results show that Pangea significantly outperforms existing open-source models in multilingual settings and diverse cultural contexts. Ablation studies further reveal the importance of English data proportions, language popularity, and the number of multimodal training samples on overall performance. We fully open-source our data, code, and trained checkpoints, to facilitate the development of inclusive and robust multilingual MLLMs, promoting equity and accessibility across a broader linguistic and cultural spectrum.
Title: A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns
Authors: Tianyi Men, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, Jun Zhao
Copy Paste: [[2410.16155]] A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns(https://arxiv.org/abs/2410.16155)
Keywords: security, attack, large language model
Abstract: With the development of large language models, they are widely used as agents in various fields. A key component of agents is memory, which stores vital information but is susceptible to jailbreak attacks. Existing research mainly focuses on single-agent attacks and shared memory attacks. However, real-world scenarios often involve independent memory. In this paper, we propose the Troublemaker Makes Chaos in Honest Town (TMCHT) task, a large-scale, multi-agent, multi-topology text-based attack evaluation framework. TMCHT involves one attacker agent attempting to mislead an entire society of agents. We identify two major challenges in multi-agent attacks: (1) Non-complete graph structure, (2) Large-scale systems. We attribute these challenges to a phenomenon we term toxicity disappearing. To address these issues, we propose an Adversarial Replication Contagious Jailbreak (ARCJ) method, which optimizes the retrieval suffix to make poisoned samples more easily retrieved and optimizes the replication suffix to make poisoned samples have contagious ability. We demonstrate the superiority of our approach in TMCHT, with 23.51%, 18.95%, and 52.93% improvements in line topology, star topology, and 100-agent settings. Encourage community attention to the security of multi-agent systems.
Title: Limpeh ga li gong: Challenges in Singlish Annotations
Copy Paste: [[2410.16156]] Limpeh ga li gong: Challenges in Singlish Annotations(https://arxiv.org/abs/2410.16156)
Keywords: transformer
Abstract: Singlish, or Colloquial Singapore English, is a language formed from oral and social communication within multicultural Singapore. In this work, we work on a fundamental Natural Language Processing (NLP) task: Parts-Of-Speech (POS) tagging of Singlish sentences. For our analysis, we build a parallel Singlish dataset containing direct English translations and POS tags, with translation and POS annotation done by native Singlish speakers. Our experiments show that automatic transition- and transformer- based taggers perform with only $\sim 80\%$ accuracy when evaluated against human-annotated POS labels, suggesting that there is indeed room for improvement on computation analysis of the language. We provide an exposition of challenges in Singlish annotation: its inconsistencies in form and semantics, the highly context-dependent particles of the language, its structural unique expressions, and the variation of the language on different mediums. Our task definition, resultant labels and results reflects the challenges in analysing colloquial languages formulated from a variety of dialects, and paves the way for future studies beyond POS tagging.
Title: Metric as Transform: Exploring beyond Affine Transform for Interpretable Neural Network
Copy Paste: [[2410.16159]] Metric as Transform: Exploring beyond Affine Transform for Interpretable Neural Network(https://arxiv.org/abs/2410.16159)
Keywords: interpretability
Abstract: Artificial Neural Networks of varying architectures are generally paired with affine transformation at the core. However, we find dot product neurons with global influence less interpretable as compared to local influence of euclidean distance (as used in Radial Basis Function Network). In this work, we explore the generalization of dot product neurons to $l^p$-norm, metrics, and beyond. We find that metrics as transform performs similarly to affine transform when used in MultiLayer Perceptron or Convolutional Neural Network. Moreover, we explore various properties of Metrics, compare it with Affine, and present multiple cases where metrics seem to provide better interpretability. We develop an interpretable local dictionary based Neural Networks and use it to understand and reject adversarial examples.
Title: DMM: Distributed Matrix Mechanism for Differentially-Private Federated Learning using Packed Secret Sharing
Authors: Alexander Bienstock, Ujjwal Kumar, Antigoni Polychroniadou
Copy Paste: [[2410.16161]] DMM: Distributed Matrix Mechanism for Differentially-Private Federated Learning using Packed Secret Sharing(https://arxiv.org/abs/2410.16161)
Keywords: secure, privacy, federate
Abstract: Federated Learning (FL) has gained lots of traction recently, both in industry and academia. In FL, a machine learning model is trained using data from various end-users arranged in committees across several rounds. Since such data can often be sensitive, a primary challenge in FL is providing privacy while still retaining utility of the model. Differential Privacy (DP) has become the main measure of privacy in the FL setting. DP comes in two flavors: central and local. In the former, a centralized server is trusted to receive the users' raw gradients from a training step, and then perturb their aggregation with some noise before releasing the next version of the model. In the latter (more private) setting, noise is applied on users' local devices, and only the aggregation of users' noisy gradients is revealed even to the server. Great strides have been made in increasing the privacy-utility trade-off in the central DP setting, by utilizing the so-called matrix mechanism. However, progress has been mostly stalled in the local DP setting. In this work, we introduce the distributed matrix mechanism to achieve the best-of-both-worlds; local DP and also better privacy-utility trade-off from the matrix mechanism. We accomplish this by proposing a cryptographic protocol that securely transfers sensitive values across rounds, which makes use of packed secret sharing. This protocol accommodates the dynamic participation of users per training round required by FL, including those that may drop out from the computation. We provide experiments which show that our mechanism indeed significantly improves the privacy-utility trade-off of FL models compared to previous local DP mechanisms, with little added overhead.
Title: Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models
Authors: Yufei Zhan, Hongyin Zhao, Yousong Zhu, Fan Yang, Ming Tang, Jinqiao Wang
Copy Paste: [[2410.16163]] Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models(https://arxiv.org/abs/2410.16163)
Keywords: large language model
Abstract: Large Multimodal Models (LMMs) have achieved significant breakthroughs in various vision-language and vision-centric tasks based on auto-regressive modeling. However, these models typically focus on either vision-centric tasks, such as visual grounding and region description, or vision-language tasks, like image caption and multi-scenario VQAs. None of the LMMs have yet comprehensively unified both types of tasks within a single model, as seen in Large Language Models in the natural language processing field. Furthermore, even with abundant multi-task instruction-following data, directly stacking these data for universal capabilities extension remains challenging. To address these issues, we introduce a novel multi-dimension curated and consolidated multimodal dataset, named CCMD-8M, which overcomes the data barriers of unifying vision-centric and vision-language tasks through multi-level data curation and multi-task consolidation. More importantly, we present Griffon-G, a general large multimodal model that addresses both vision-centric and vision-language tasks within a single end-to-end paradigm. Griffon-G resolves the training collapse issue encountered during the joint optimization of these tasks, achieving better training efficiency. Evaluations across multimodal benchmarks, general Visual Question Answering (VQA) tasks, scene text-centric VQA tasks, document-related VQA tasks, Referring Expression Comprehension, and object detection demonstrate that Griffon-G surpasses the advanced LMMs and achieves expert-level performance in complicated vision-centric tasks.
Title: From Tokens to Materials: Leveraging Language Models for Scientific Discovery
Copy Paste: [[2410.16165]] From Tokens to Materials: Leveraging Language Models for Scientific Discovery(https://arxiv.org/abs/2410.16165)
Keywords: transformer, generative
Abstract: Exploring the predictive capabilities of language models in material science is an ongoing interest. This study investigates the application of language model embeddings to enhance material property prediction in materials science. By evaluating various contextual embedding methods and pre-trained models, including Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformers (GPT), we demonstrate that domain-specific models, particularly MatBERT significantly outperform general-purpose models in extracting implicit knowledge from compound names and material properties. Our findings reveal that information-dense embeddings from the third layer of MatBERT, combined with a context-averaging approach, offer the most effective method for capturing material-property relationships from the scientific literature. We also identify a crucial "tokenizer effect," highlighting the importance of specialized text processing techniques that preserve complete compound names while maintaining consistent token counts. These insights underscore the value of domain-specific training and tokenization in materials science applications and offer a promising pathway for accelerating the discovery and development of new materials through AI-driven approaches.
Title: Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining
Authors: Han Huang, Yuqi Huo, Zijia Zhao, Haoyu Lu, Shu Wu, Bingning Wang, Qiang Liu, Weipeng Chen, Liang Wang
Abstract: Multimodal large language models (MLLMs) have made significant strides by integrating visual and textual modalities. A critical factor in training MLLMs is the quality of image-text pairs within multimodal pretraining datasets. However, $\textit {de facto}$ filter-based data quality enhancement paradigms often discard a substantial portion of high-quality image data due to inadequate semantic alignment between images and texts, leading to inefficiencies in data utilization and scalability. In this paper, we propose the Adaptive Image-Text Quality Enhancer (AITQE), a model that dynamically assesses and enhances the quality of image-text pairs. AITQE employs a text rewriting mechanism for low-quality pairs and incorporates a negative sample learning strategy to improve evaluative capabilities by integrating deliberately selected low-quality samples during training. Unlike prior approaches that significantly alter text distributions, our method minimally adjusts text to preserve data volume while enhancing quality. Experimental results demonstrate that AITQE surpasses existing methods on various benchmark, effectively leveraging raw data and scaling efficiently with increasing data volumes. We hope our work will inspire future works. The code and model are available at: this https URL.
Title: Exploring Pretraining via Active Forgetting for Improving Cross Lingual Transfer for Decoder Language Models
Copy Paste: [[2410.16168]] Exploring Pretraining via Active Forgetting for Improving Cross Lingual Transfer for Decoder Language Models(https://arxiv.org/abs/2410.16168)
Keywords: large language model
Abstract: Large Language Models (LLMs) demonstrate exceptional capabilities in a multitude of NLP tasks. However, the efficacy of such models to languages other than English is often limited. Prior works have shown that encoder-only models such as BERT or XLM-RoBERTa show impressive cross lingual transfer of their capabilities from English to other languages. In this work, we propose a pretraining strategy that uses active forgetting to achieve similar cross lingual transfer in decoder-only LLMs. We show that LLMs pretrained with active forgetting are highly effective when adapting to new and unseen languages. Through extensive experimentation, we find that LLMs pretrained with active forgetting are able to learn better multilingual representations which translates to better performance in many downstream tasks.
Title: A Framework for Evaluating Predictive Models Using Synthetic Image Covariates and Longitudinal Data
Authors: Simon Deltadahl, Andreu Vall, Vijay Ivaturi, Niklas Korsbo
Copy Paste: [[2410.16177]] A Framework for Evaluating Predictive Models Using Synthetic Image Covariates and Longitudinal Data(https://arxiv.org/abs/2410.16177)
Keywords: privacy, diffusion, generative
Abstract: We present a novel framework for synthesizing patient data with complex covariates (e.g., eye scans) paired with longitudinal observations (e.g., visual acuity over time), addressing privacy concerns in healthcare research. Our approach introduces controlled association in latent spaces generating each data modality, enabling the creation of complex covariate-longitudinal observation pairs. This framework facilitates the development of predictive models and provides openly available benchmarking datasets for healthcare research. We demonstrate our framework using optical coherence tomography (OCT) scans, though it is applicable across domains. Using 109,309 2D OCT scan slices, we trained an image generative model combining a variational autoencoder and a diffusion model. Longitudinal observations were simulated using a nonlinear mixed effect (NLME) model from a low-dimensional space of random effects. We generated 1.1M OCT scan slices paired with five sets of longitudinal observations at controlled association levels (100%, 50%, 10%, 5.26%, and 2% of between-subject variability). To assess the framework, we modeled synthetic longitudinal observations with another NLME model, computed empirical Bayes estimates of random effects, and trained a ResNet to predict these estimates from synthetic OCT scans. We then incorporated ResNet predictions into the NLME model for patient-individualized predictions. Prediction accuracy on withheld data declined as intended with reduced association between images and longitudinal measurements. Notably, in all but the 2% case, we achieved within 50% of the theoretical best possible prediction on withheld data, demonstrating our ability to detect even weak signals. This confirms the effectiveness of our framework in generating synthetic data with controlled levels of association, providing a valuable tool for healthcare research.
Title: MagicPIG: LSH Sampling for Efficient LLM Generation
Authors: Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen
Copy Paste: [[2410.16179]] MagicPIG: LSH Sampling for Efficient LLM Generation(https://arxiv.org/abs/2410.16179)
Keywords: large language model
Abstract: Large language models (LLMs) with long context windows have gained significant attention. However, the KV cache, stored to avoid re-computation, becomes a bottleneck. Various dynamic sparse or TopK-based attention approximation methods have been proposed to leverage the common insight that attention is sparse. In this paper, we first show that TopK attention itself suffers from quality degradation in certain downstream tasks because attention is not always as sparse as expected. Rather than selecting the keys and values with the highest attention scores, sampling with theoretical guarantees can provide a better estimation for attention output. To make the sampling-based approximation practical in LLM generation, we propose MagicPIG, a heterogeneous system based on Locality Sensitive Hashing (LSH). MagicPIG significantly reduces the workload of attention computation while preserving high accuracy for diverse tasks. MagicPIG stores the LSH hash tables and runs the attention computation on the CPU, which allows it to serve longer contexts and larger batch sizes with high approximation accuracy. MagicPIG can improve decoding throughput by $1.9\sim3.9\times$ across various GPU hardware and achieve 110ms decoding latency on a single RTX 4090 for Llama-3.1-8B-Instruct model with a context of 96k tokens. The code is available at \url{this https URL}.
Title: Contamination Report for Multilingual Benchmarks
Copy Paste: [[2410.16186]] Contamination Report for Multilingual Benchmarks(https://arxiv.org/abs/2410.16186)
Keywords: large language model
Abstract: Benchmark contamination refers to the presence of test datasets in Large Language Model (LLM) pre-training or post-training data. Contamination can lead to inflated scores on benchmarks, compromising evaluation results and making it difficult to determine the capabilities of models. In this work, we study the contamination of popular multilingual benchmarks in LLMs that support multiple languages. We use the Black Box test to determine whether $7$ frequently used multilingual benchmarks are contaminated in $7$ popular open and closed LLMs and find that almost all models show signs of being contaminated with almost all the benchmarks we test. Our findings can help the community determine the best set of benchmarks to use for multilingual evaluation.
Title: Training Better Deep Learning Models Using Human Saliency
Authors: Aidan Boyd, Patrick Tinsley, Kevin W. Bowyer, Adam Czajka
Copy Paste: [[2410.16190]] Training Better Deep Learning Models Using Human Saliency(https://arxiv.org/abs/2410.16190)
Keywords: attack, interpretability
Abstract: This work explores how human judgement about salient regions of an image can be introduced into deep convolutional neural network (DCNN) training. Traditionally, training of DCNNs is purely data-driven. This often results in learning features of the data that are only coincidentally correlated with class labels. Human saliency can guide network training using our proposed new component of the loss function that ConveYs Brain Oversight to Raise Generalization (CYBORG) and penalizes the model for using non-salient regions. This mechanism produces DCNNs achieving higher accuracy and generalization compared to using the same training data without human salience. Experimental results demonstrate that CYBORG applies across multiple network architectures and problem domains (detection of synthetic faces, iris presentation attacks and anomalies in chest X-rays), while requiring significantly less data than training without human saliency guidance. Visualizations show that CYBORG-trained models' saliency is more consistent across independent training runs than traditionally-trained models, and also in better agreement with humans. To lower the cost of collecting human annotations, we also explore using deep learning to provide automated annotations. CYBORG training of CNNs addresses important issues such as reducing the appetite for large training sets, increasing interpretability, and reducing fragility by generalizing better to new types of data.
Title: Systematic Review: Text Processing Algorithms in Machine Learning and Deep Learning for Mental Health Detection on Social Media
Copy Paste: [[2410.16204]] Systematic Review: Text Processing Algorithms in Machine Learning and Deep Learning for Mental Health Detection on Social Media(https://arxiv.org/abs/2410.16204)
Keywords: robust
Abstract: The global rise in depression necessitates innovative detection methods for early intervention. Social media provides a unique opportunity to identify depression through user-generated posts. This systematic review evaluates machine learning (ML) models for depression detection on social media, focusing on biases and methodological challenges throughout the ML lifecycle. A search of PubMed, IEEE Xplore, and Google Scholar identified 47 relevant studies published after 2010. The Prediction model Risk Of Bias ASsessment Tool (PROBAST) was utilized to assess methodological quality and risk of bias. Significant biases impacting model reliability and generalizability were found. There is a predominant reliance on Twitter (63.8%) and English-language content (over 90%), with most studies focusing on users from the United States and Europe. Non-probability sampling methods (approximately 80%) limit representativeness. Only 23% of studies explicitly addressed linguistic nuances like negations, crucial for accurate sentiment analysis. Inconsistent hyperparameter tuning was observed, with only 27.7% properly tuning models. About 17% did not adequately partition data into training, validation, and test sets, risking overfitting. While 74.5% used appropriate evaluation metrics for imbalanced data, others relied on accuracy without addressing class imbalance, potentially skewing results. Reporting transparency varied, often lacking critical methodological details. These findings highlight the need to diversify data sources, standardize preprocessing protocols, ensure consistent model development practices, address class imbalance, and enhance reporting transparency. By overcoming these challenges, future research can develop more robust and generalizable ML models for depression detection on social media, contributing to improved mental health outcomes globally.
Title: Pre-training Distillation for Large Language Models: A Design Space Exploration
Authors: Hao Peng, Xin Lv, Yushi Bai, Zijun Yao, Jiajie Zhang, Lei Hou, Juanzi Li
Copy Paste: [[2410.16215]] Pre-training Distillation for Large Language Models: A Design Space Exploration(https://arxiv.org/abs/2410.16215)
Keywords: large language model
Abstract: Knowledge distillation (KD) aims to transfer knowledge from a large teacher model to a smaller student model. Previous work applying KD in the field of large language models (LLMs) typically focused on the post-training phase, where the student LLM learns directly from instructions and corresponding responses generated by the teacher model. In this paper, we extend KD to the pre-training phase of LLMs, named pre-training distillation (PD). We first conduct a preliminary experiment using GLM-4-9B as the teacher LLM to distill a 1.9B parameter student LLM, validating the effectiveness of PD. Considering the key impact factors of distillation, we systematically explore the design space of pre-training distillation across four aspects: logits processing, loss selection, scaling law, and offline or online logits. We conduct extensive experiments to explore the design space of pre-training distillation and find better configurations and interesting conclusions, such as larger student LLMs generally benefiting more from pre-training distillation, while a larger teacher LLM does not necessarily guarantee better results. We hope our exploration of the design space will inform future practices in pre-training distillation.
Title: A Realistic Threat Model for Large Language Model Jailbreaks
Authors: Valentyn Boreiko, Alexander Panfilov, Vaclav Voracek, Matthias Hein, Jonas Geiping
Copy Paste: [[2410.16222]] A Realistic Threat Model for Large Language Model Jailbreaks(https://arxiv.org/abs/2410.16222)
Keywords: attack, large language model
Abstract: A plethora of jailbreaking attacks have been proposed to obtain harmful responses from safety-tuned LLMs. In their original settings, these methods all largely succeed in coercing the target output, but their attacks vary substantially in fluency and computational effort. In this work, we propose a unified threat model for the principled comparison of these methods. Our threat model combines constraints in perplexity, measuring how far a jailbreak deviates from natural text, and computational budget, in total FLOPs. For the former, we build an N-gram model on 1T tokens, which, in contrast to model-based perplexity, allows for an LLM-agnostic and inherently interpretable evaluation. We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing. After a rigorous comparison, we not only find attack success rates against safety-tuned modern models to be lower than previously presented but also find that attacks based on discrete optimization significantly outperform recent LLM-based attacks. Being inherently interpretable, our threat model allows for a comprehensive analysis and comparison of jailbreak attacks. We find that effective attacks exploit and abuse infrequent N-grams, either selecting N-grams absent from real-world text or rare ones, e.g. specific to code datasets.
Title: Building A Coding Assistant via the Retrieval-Augmented Language Model
Copy Paste: [[2410.16229]] Building A Coding Assistant via the Retrieval-Augmented Language Model(https://arxiv.org/abs/2410.16229)
Keywords: large language model
Abstract: Pretrained language models have shown strong effectiveness in code-related tasks, such as code retrieval, code generation, code summarization, and code completion tasks. In this paper, we propose COde assistaNt viA retrieval-augmeNted language model (CONAN), which aims to build a code assistant by mimicking the knowledge-seeking behaviors of humans during coding. Specifically, it consists of a code structure aware retriever (CONAN-R) and a dual-view code representation-based retrieval-augmented generation model (CONAN-G). CONAN-R pretrains CodeT5 using Code-Documentation Alignment and Masked Entity Prediction tasks to make language models code structure-aware and learn effective representations for code snippets and documentation. Then CONAN-G designs a dual-view code representation mechanism for implementing a retrieval-augmented code generation model. CONAN-G regards the code documentation descriptions as prompts, which help language models better understand the code semantics. Our experiments show that CONAN achieves convincing performance on different code generation tasks and significantly outperforms previous retrieval augmented code generation models. Our further analyses show that CONAN learns tailored representations for both code snippets and documentation by aligning code-documentation data pairs and capturing structural semantics by masking and predicting entities in the code data. Additionally, the retrieved code snippets and documentation provide necessary information from both program language and natural language to assist the code generation process. CONAN can also be used as an assistant for Large Language Models (LLMs), providing LLMs with external knowledge in shorter code document lengths to improve their effectiveness on various code tasks. It shows the ability of CONAN to extract necessary information and help filter out the noise from retrieved code documents.
Title: ToW: Thoughts of Words Improve Reasoning in Large Language Models
Authors: Zhikun Xu, Ming Shen, Jacob Dineen, Zhaonan Li, Xiao Ye, Shijie Lu, Aswin RRV, Chitta Baral, Ben Zhou
Copy Paste: [[2410.16235]] ToW: Thoughts of Words Improve Reasoning in Large Language Models(https://arxiv.org/abs/2410.16235)
Keywords: large language model
Abstract: We introduce thoughts of words (ToW), a novel training-time data-augmentation method for next-word prediction. ToW views next-word prediction as a core reasoning task and injects fine-grained thoughts explaining what the next word should be and how it is related to the previous contexts in pre-training texts. Our formulation addresses two fundamental drawbacks of existing next-word prediction learning schemes: they induce factual hallucination and are inefficient for models to learn the implicit reasoning processes in raw texts. While there are many ways to acquire such thoughts of words, we explore the first step of acquiring ToW annotations through distilling from larger models. After continual pre-training with only 70K ToW annotations, we effectively improve models' reasoning performances by 7% to 9% on average and reduce model hallucination by up to 10%. At the same time, ToW is entirely agnostic to tasks and applications, introducing no additional biases on labels or semantics.
Title: LLaVA-KD: A Framework of Distilling Multimodal Large Language Models
Authors: Yuxuan Cai, Jiangning Zhang, Haoyang He, Xinwei He, Ao Tong, Zhenye Gan, Chengjie Wang, Xiang Bai
Copy Paste: [[2410.16236]] LLaVA-KD: A Framework of Distilling Multimodal Large Language Models(https://arxiv.org/abs/2410.16236)
Keywords: large language model
Abstract: The success of Large Language Models (LLM) has led researchers to explore Multimodal Large Language Models (MLLM) for unified visual and linguistic understanding. However, the increasing model size and computational complexity of MLLM limit their use in resource-constrained environments. Small-scale MLLM (s-MLLM) aims to retain the capabilities of the large-scale model (l-MLLM) while reducing computational demands, but resulting in a significant decline in performance. To address the aforementioned issues, we propose a novel LLaVA-KD framework to transfer knowledge from l-MLLM to s-MLLM. Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM, and Relation Distillation (RDist) to transfer l-MLLM's ability to model correlations between visual features. Additionally, we propose a three-stage training scheme to fully exploit the potential of s-MLLM: 1) Distilled Pre-Training to align visual-textual representations, 2) Supervised Fine-Tuning to equip the model with multimodal understanding, and 3) Distilled Fine-Tuning to further transfer l-MLLM capabilities. Our approach significantly improves performance without altering the small model's architecture. Extensive experiments and ablation studies validate the effectiveness of each proposed component. Code will be available at this https URL.
Title: Analyzing Context Contributions in LLM-based Machine Translation
Authors: Emmanouil Zaranis, Nuno M. Guerreiro, André F. T. Martins
Copy Paste: [[2410.16246]] Analyzing Context Contributions in LLM-based Machine Translation(https://arxiv.org/abs/2410.16246)
Keywords: large language model
Abstract: Large language models (LLMs) have achieved state-of-the-art performance in machine translation (MT) and demonstrated the ability to leverage in-context learning through few-shot examples. However, the mechanisms by which LLMs use different parts of the input context remain largely unexplored. In this work, we provide a comprehensive analysis of context utilization in MT, studying how LLMs use various context parts, such as few-shot examples and the source text, when generating translations. We highlight several key findings: (1) the source part of few-shot examples appears to contribute more than its corresponding targets, irrespective of translation direction; (2) finetuning LLMs with parallel data alters the contribution patterns of different context parts; and (3) there is a positional bias where earlier few-shot examples have higher contributions to the translated sequence. Finally, we demonstrate that inspecting anomalous context contributions can potentially uncover pathological translations, such as hallucinations. Our findings shed light on the internal workings of LLM-based MT which go beyond those known for standard encoder-decoder MT models.
Title: Can Knowledge Editing Really Correct Hallucinations?
Authors: Baixiang Huang, Canyu Chen, Xiongxiao Xu, Ali Payani, Kai Shu
Copy Paste: [[2410.16251]] Can Knowledge Editing Really Correct Hallucinations?(https://arxiv.org/abs/2410.16251)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) suffer from hallucinations, referring to the non-factual information in generated content, despite their superior capacities across tasks. Meanwhile, knowledge editing has been developed as a new popular paradigm to correct the erroneous factual knowledge encoded in LLMs with the advantage of avoiding retraining from scratch. However, one common issue of existing evaluation datasets for knowledge editing is that they do not ensure LLMs actually generate hallucinated answers to the evaluation questions before editing. When LLMs are evaluated on such datasets after being edited by different techniques, it is hard to directly adopt the performance to assess the effectiveness of different knowledge editing methods in correcting hallucinations. Thus, the fundamental question remains insufficiently validated: Can knowledge editing really correct hallucinations in LLMs? We proposed HalluEditBench to holistically benchmark knowledge editing methods in correcting real-world hallucinations. First, we rigorously construct a massive hallucination dataset with 9 domains, 26 topics and more than 6,000 hallucinations. Then, we assess the performance of knowledge editing methods in a holistic way on five dimensions including Efficacy, Generalization, Portability, Locality, and Robustness. Through HalluEditBench, we have provided new insights into the potentials and limitations of different knowledge editing methods in correcting hallucinations, which could inspire future improvements and facilitate the progress in the field of knowledge editing.
Title: Distribution Learning with Valid Outputs Beyond the Worst-Case
Copy Paste: [[2410.16253]] Distribution Learning with Valid Outputs Beyond the Worst-Case(https://arxiv.org/abs/2410.16253)
Keywords: generative
Abstract: Generative models at times produce "invalid" outputs, such as images with generation artifacts and unnatural sounds. Validity-constrained distribution learning attempts to address this problem by requiring that the learned distribution have a provably small fraction of its mass in invalid parts of space -- something which standard loss minimization does not always ensure. To this end, a learner in this model can guide the learning via "validity queries", which allow it to ascertain the validity of individual examples. Prior work on this problem takes a worst-case stance, showing that proper learning requires an exponential number of validity queries, and demonstrating an improper algorithm which -- while generating guarantees in a wide-range of settings -- makes an atypical polynomial number of validity queries. In this work, we take a first step towards characterizing regimes where guaranteeing validity is easier than in the worst-case. We show that when the data distribution lies in the model class and the log-loss is minimized, the number of samples required to ensure validity has a weak dependence on the validity requirement. Additionally, we show that when the validity region belongs to a VC-class, a limited number of validity queries are often sufficient.
Title: CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution
Authors: Maosong Cao, Alexander Lam, Haodong Duan, Hongwei Liu, Songyang Zhang, Kai Chen
Copy Paste: [[2410.16256]] CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution(https://arxiv.org/abs/2410.16256)
Keywords: large language model
Abstract: Efficient and accurate evaluation is crucial for the continuous improvement of large language models (LLMs). Among various assessment methods, subjective evaluation has garnered significant attention due to its superior alignment with real-world usage scenarios and human preferences. However, human-based evaluations are costly and lack reproducibility, making precise automated evaluators (judgers) vital in this process. In this report, we introduce \textbf{CompassJudger-1}, the first open-source \textbf{all-in-one} judge LLM. CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility. It is capable of: 1. Performing unitary scoring and two-model comparisons as a reward model; 2. Conducting evaluations according to specified formats; 3. Generating critiques; 4. Executing diverse tasks like a general LLM. To assess the evaluation capabilities of different judge models under a unified setting, we have also established \textbf{JudgerBench}, a new benchmark that encompasses various subjective evaluation tasks and covers a wide range of topics. CompassJudger-1 offers a comprehensive solution for various evaluation tasks while maintaining the flexibility to adapt to diverse requirements. Both CompassJudger and JudgerBench are released and available to the research community athttps://github.com/open-compass/CompassJudger. We believe that by open-sourcing these tools, we can foster collaboration and accelerate progress in LLM evaluation methodologies.
Title: Elucidating the design space of language models for image generation
Authors: Xuantong Liu, Shaozhe Hao, Xianbiao Qi, Tianyang Hu, Jun Wang, Rong Xiao, Yuan Yao
Copy Paste: [[2410.16257]] Elucidating the design space of language models for image generation(https://arxiv.org/abs/2410.16257)
Keywords: large language model
Abstract: The success of autoregressive (AR) language models in text generation has inspired the computer vision community to adopt Large Language Models (LLMs) for image generation. However, considering the essential differences between text and image modalities, the design space of language models for image generation remains underexplored. We observe that image tokens exhibit greater randomness compared to text tokens, which presents challenges when training with token prediction. Nevertheless, AR models demonstrate their potential by effectively learning patterns even from a seemingly suboptimal optimization problem. Our analysis also reveals that while all models successfully grasp the importance of local information in image generation, smaller models struggle to capture the global context. In contrast, larger models showcase improved capabilities in this area, helping to explain the performance gains achieved when scaling up model size. We further elucidate the design space of language models for vision generation, including tokenizer choice, model choice, model scalability, vocabulary design, and sampling strategy through extensive comparative experiments. Our work is the first to analyze the optimization behavior of language models in vision generation, and we believe it can inspire more effective designs when applying LMs to other domains. Finally, our elucidated language model for image generation, termed as ELM, achieves state-of-the-art performance on the ImageNet 256*256 benchmark. The code is available at this https URL.
Title: Agent-to-Sim: Learning Interactive Behavior Models from Casual Longitudinal Videos
Authors: Gengshan Yang, Andrea Bajcsy, Shunsuke Saito, Angjoo Kanazawa
Abstract: We present Agent-to-Sim (ATS), a framework for learning interactive behavior models of 3D agents from casual longitudinal video collections. Different from prior works that rely on marker-based tracking and multiview cameras, ATS learns natural behaviors of animal and human agents non-invasively through video observations recorded over a long time-span (e.g., a month) in a single environment. Modeling 3D behavior of an agent requires persistent 3D tracking (e.g., knowing which point corresponds to which) over a long time period. To obtain such data, we develop a coarse-to-fine registration method that tracks the agent and the camera over time through a canonical 3D space, resulting in a complete and persistent spacetime 4D representation. We then train a generative model of agent behaviors using paired data of perception and motion of an agent queried from the 4D reconstruction. ATS enables real-to-sim transfer from video recordings of an agent to an interactive behavior simulator. We demonstrate results on pets (e.g., cat, dog, bunny) and human given monocular RGBD videos captured by a smartphone.
Title: Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
Copy Paste: [[2410.16261]] Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance(https://arxiv.org/abs/2410.16261)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) have demonstrated impressive performance in vision-language tasks across a broad spectrum of domains. However, the large model scale and associated high computational costs pose significant challenges for training and deploying MLLMs on consumer-grade GPUs or edge devices, thereby hindering their widespread application. In this work, we introduce Mini-InternVL, a series of MLLMs with parameters ranging from 1B to 4B, which achieves 90% of the performance with only 5% of the parameters. This significant improvement in efficiency and effectiveness makes our models more accessible and applicable in various real-world scenarios. To further promote the adoption of our models, we develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks, including autonomous driving, medical images, and remote sensing. We believe that our study can provide valuable insights and resources to advance the development of efficient and effective MLLMs. Code is available at this https URL.
Title: 3DGS-Enhancer: Enhancing Unbounded 3D Gaussian Splatting with View-consistent 2D Diffusion Priors
Copy Paste: [[2410.16266]] 3DGS-Enhancer: Enhancing Unbounded 3D Gaussian Splatting with View-consistent 2D Diffusion Priors(https://arxiv.org/abs/2410.16266)
Keywords: diffusion
Abstract: Novel-view synthesis aims to generate novel views of a scene from multiple input images or videos, and recent advancements like 3D Gaussian splatting (3DGS) have achieved notable success in producing photorealistic renderings with efficient pipelines. However, generating high-quality novel views under challenging settings, such as sparse input views, remains difficult due to insufficient information in under-sampled areas, often resulting in noticeable artifacts. This paper presents 3DGS-Enhancer, a novel pipeline for enhancing the representation quality of 3DGS representations. We leverage 2D video diffusion priors to address the challenging 3D view consistency problem, reformulating it as achieving temporal consistency within a video generation process. 3DGS-Enhancer restores view-consistent latent features of rendered novel views and integrates them with the input views through a spatial-temporal decoder. The enhanced views are then used to fine-tune the initial 3DGS model, significantly improving its rendering performance. Extensive experiments on large-scale datasets of unbounded scenes demonstrate that 3DGS-Enhancer yields superior reconstruction performance and high-fidelity rendering results compared to state-of-the-art methods. The project webpage is this https URL .
Title: SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
Authors: Shuangrui Ding, Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Yuwei Guo, Dahua Lin, Jiaqi Wang
Copy Paste: [[2410.16268]] SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree(https://arxiv.org/abs/2410.16268)
Keywords: robust, segmentation
Abstract: The Segment Anything Model 2 (SAM 2) has emerged as a powerful foundation model for object segmentation in both images and videos, paving the way for various downstream video applications. The crucial design of SAM 2 for video segmentation is its memory module, which prompts object-aware memories from previous frames for current frame prediction. However, its greedy-selection memory design suffers from the "error accumulation" problem, where an errored or missed mask will cascade and influence the segmentation of the subsequent frames, which limits the performance of SAM 2 toward complex long-term videos. To this end, we introduce SAM2Long, an improved training-free video object segmentation strategy, which considers the segmentation uncertainty within each frame and chooses the video-level optimal results from multiple segmentation pathways in a constrained tree search manner. In practice, we maintain a fixed number of segmentation pathways throughout the video. For each frame, multiple masks are proposed based on the existing pathways, creating various candidate branches. We then select the same fixed number of branches with higher cumulative scores as the new pathways for the next frame. After processing the final frame, the pathway with the highest cumulative score is chosen as the final segmentation result. Benefiting from its heuristic search design, SAM2Long is robust toward occlusions and object reappearances, and can effectively segment and track objects for complex long-term videos. Notably, SAM2Long achieves an average improvement of 3.0 points across all 24 head-to-head comparisons, with gains of up to 5.3 points in J&F on long-term video object segmentation benchmarks such as SA-V and LVOS. The code is released at this https URL.
Title: MvDrag3D: Drag-based Creative 3D Editing via Multi-view Generation-Reconstruction Priors
Copy Paste: [[2410.16272]] MvDrag3D: Drag-based Creative 3D Editing via Multi-view Generation-Reconstruction Priors(https://arxiv.org/abs/2410.16272)
Keywords: diffusion, generative
Abstract: Drag-based editing has become popular in 2D content creation, driven by the capabilities of image generative models. However, extending this technique to 3D remains a challenge. Existing 3D drag-based editing methods, whether employing explicit spatial transformations or relying on implicit latent optimization within limited-capacity 3D generative models, fall short in handling significant topology changes or generating new textures across diverse object categories. To overcome these limitations, we introduce MVDrag3D, a novel framework for more flexible and creative drag-based 3D editing that leverages multi-view generation and reconstruction priors. At the core of our approach is the usage of a multi-view diffusion model as a strong generative prior to perform consistent drag editing over multiple rendered views, which is followed by a reconstruction model that reconstructs 3D Gaussians of the edited object. While the initial 3D Gaussians may suffer from misalignment between different views, we address this via view-specific deformation networks that adjust the position of Gaussians to be well aligned. In addition, we propose a multi-view score function that distills generative priors from multiple views to further enhance the view consistency and visual quality. Extensive experiments demonstrate that MVDrag3D provides a precise, generative, and flexible solution for 3D drag-based editing, supporting more versatile editing effects across various object categories and 3D representations.