Title: Comparison of Epilepsy Induced by Ischemic Hypoxic Brain Injury and Hypoglycemic Brain Injury using Multilevel Fusion of Data Features
Authors: Sameer Kadem, Noor Sami, Ahmed Elaraby, Shahad Alyousif, Mohammed Jalil, M. Altaee, Muntather Almusawi, A. Ghany Ismaeel, Ali Kamil Kareem, Massila Kamalrudin, Adnan Allwi ftaiet
Copy Paste: [[2409.02957]] Comparison of Epilepsy Induced by Ischemic Hypoxic Brain Injury and Hypoglycemic Brain Injury using Multilevel Fusion of Data Features(https://arxiv.org/abs/2409.02957)
Keywords: extraction
Abstract: The study aims to investigate the similarities and differences in the brain damage caused by Hypoxia-Ischemia (HI), Hypoglycemia, and Epilepsy. Hypoglycemia poses a significant challenge in improving glycemic regulation for insulin-treated patients, while HI brain disease in neonates is associated with low oxygen levels. The study examines the possibility of using a combination of medical data and Electroencephalography (EEG) measurements to predict outcomes over a two-year period, and employs a multilevel fusion of data features to enhance the accuracy of the predictions. To this end, the paper proposes a hybridized classification model for Hypoxia-Ischemia, Hypoglycemia, and Epilepsy brain injury (HCM-BI). A Support Vector Machine (SVM) is applied to clinical details to determine the Hypoxia-Ischemia outcome of each infant. The newborns are reassessed every two years to determine their neurodevelopmental outcomes. A selection of four attributes is derived from the EEG records, and the SVM does not reach conclusions regarding the classification of diseases. The final feature extraction of the EEG signal is optimized by a Bayesian Neural Network (BNN) to obtain a clear picture of the health condition of Hypoglycemia and Epilepsy patients. By monitoring and assessing the physical effects observed in the EEG, the BNN is used to extract the test samples with the most log data and to report hypoglycemia and epilepsy. Keywords: Hypoxia-Ischemia, Hypoglycemia, Epilepsy, Multilevel Fusion of Data Features, Bayesian Neural Network (BNN), Support Vector Machine (SVM)
Title: SDOoop: Capturing Periodical Patterns and Out-of-phase Anomalies in Streaming Data Analysis
Authors: Alexander Hartl, Félix Iglesias Vázquez, Tanja Zseby
Copy Paste: [[2409.02973]] SDOoop: Capturing Periodical Patterns and Out-of-phase Anomalies in Streaming Data Analysis(https://arxiv.org/abs/2409.02973)
Keywords: security, interpretability
Abstract: Streaming data analysis is increasingly required in applications, e.g., IoT, cybersecurity, robotics, mechatronics or cyber-physical systems. Despite its relevance, it is still an emerging field with open challenges. SDO is a recent anomaly detection method designed to meet requirements of speed, interpretability and intuitive parameterization. In this work, we present SDOoop, which extends the capabilities of SDO's streaming version to retain temporal information of data structures. SDOoop spots contextual anomalies undetectable by traditional algorithms, while enabling the inspection of data geometries, clusters and temporal patterns. We used SDOoop to model real network communications in critical infrastructures and extract patterns that disclose their dynamics. Moreover, we evaluated SDOoop with data from intrusion detection and natural science domains and obtained performances equivalent or superior to state-of-the-art approaches. Our results show the high potential of new model-based methods to analyze and explain streaming data. Since SDOoop operates with constant per-sample space and time complexity, it is ideal for big data, being able to instantly process large volumes of information. SDOoop conforms to next-generation machine learning, which, in addition to accuracy and speed, is expected to provide highly interpretable and informative models.
Title: Hallucination Detection in LLMs: Fast and Memory-Efficient Finetuned Models
Authors: Gabriel Y. Arteaga, Thomas B. Schön, Nicolas Pielawski
Copy Paste: [[2409.02976]] Hallucination Detection in LLMs: Fast and Memory-Efficient Finetuned Models(https://arxiv.org/abs/2409.02976)
Keywords: large language model
Abstract: Uncertainty estimation is a necessary component when implementing AI in high-risk settings, such as autonomous cars, medicine, or insurance. Large Language Models (LLMs) have seen a surge in popularity in recent years, but they are subject to hallucinations, which may cause serious harm in high-risk settings. Despite their success, LLMs are expensive to train and run: they need a large amount of computation and memory, preventing the use of ensembling methods in practice. In this work, we present a novel method that allows for fast and memory-friendly training of LLM ensembles. We show that the resulting ensembles can detect hallucinations and are a viable approach in practice, as only one GPU is needed for training and inference.
Title: Vec2Face: Scaling Face Dataset Generation with Loosely Constrained Vectors
Authors: Haiyu Wu, Jaskirat Singh, Sicong Tian, Liang Zheng, Kevin W. Bowyer
Copy Paste: [[2409.02979]] Vec2Face: Scaling Face Dataset Generation with Loosely Constrained Vectors(https://arxiv.org/abs/2409.02979)
Keywords: robust
Abstract: This paper studies how to synthesize face images of non-existent persons, to create a dataset that allows effective training of face recognition (FR) models. Two important goals are (1) the ability to generate a large number of distinct identities (inter-class separation) with (2) a wide variation in appearance of each identity (intra-class variation). However, existing works 1) are typically limited in how many well-separated identities can be generated and 2) either neglect or use a separate editing model for attribute augmentation. We propose Vec2Face, a holistic model that uses only a sampled vector as input and can flexibly generate and control face images and their attributes. Composed of a feature masked autoencoder and a decoder, Vec2Face is supervised by face image reconstruction and can be conveniently used in inference. Using vectors with low similarity among themselves as inputs, Vec2Face generates well-separated identities. Randomly perturbing an input identity vector within a small range allows Vec2Face to generate faces of the same identity with robust variation in face attributes. It is also possible to generate images with designated attributes by adjusting vector values with a gradient descent method. Vec2Face has efficiently synthesized as many as 300K identities with 15 million total images, whereas 60K is the largest number of identities created in the previous works. FR models trained with the generated HSFace datasets, from 10k to 300k identities, achieve state-of-the-art accuracy, from 92% to 93.52%, on five real-world test sets. For the first time, our model created using a synthetic training set achieves higher accuracy than the model created using a same-scale training set of real face images (on the CALFW test set).
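The two input-side ideas in the abstract above (low-similarity vectors for distinct identities, small random perturbations for variations of one identity) can be illustrated with a minimal sketch. It does not reproduce Vec2Face itself, which feeds such vectors to a learned model to generate images. The Gaussian sampling, the rejection loop, the vector dimension, the similarity threshold, the noise scale, and all function names are assumptions made for illustration only.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sample_identities(n, dim, max_sim, rng):
    """Rejection-sample n vectors whose pairwise cosine similarity is below max_sim.

    Hypothetical helper: the threshold and sampling scheme are illustrative.
    """
    kept = []
    while len(kept) < n:
        cand = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if all(cosine(cand, v) < max_sim for v in kept):
            kept.append(cand)
    return kept


def perturb(vec, scale, rng):
    """Add small Gaussian noise so the result stays close to the original vector."""
    return [x + rng.gauss(0.0, scale) for x in vec]


rng = random.Random(0)
# Distinct "identity" vectors: low similarity among themselves.
identities = sample_identities(n=5, dim=64, max_sim=0.3, rng=rng)
# Variations of the first identity: small perturbations of the same vector.
variants = [perturb(identities[0], scale=0.05, rng=rng) for _ in range(3)]
```

In this toy setup the identity vectors stay well separated while each variant remains nearly parallel to its source vector, which mirrors the inter-class separation and intra-class variation goals stated in the abstract.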
Title: CLUE: Concept-Level Uncertainty Estimation for Large Language Models
Authors: Yu-Hsiang Wang, Andrew Bai, Che-Ping Tsai, Cho-Jui Hsieh
Copy Paste: [[2409.03021]] CLUE: Concept-Level Uncertainty Estimation for Large Language Models(https://arxiv.org/abs/2409.03021)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in various natural language generation (NLG) tasks. Previous studies suggest that LLMs' generation process involves uncertainty. However, existing approaches to uncertainty estimation mainly focus on sequence-level uncertainty, overlooking individual pieces of information within sequences. These methods fall short in separately assessing the uncertainty of each component in a sequence. In response, we propose a novel framework for Concept-Level Uncertainty Estimation (CLUE) for LLMs. We leverage LLMs to convert output sequences into concept-level representations, breaking down sequences into individual concepts and measuring the uncertainty of each concept separately. We conduct experiments to demonstrate that CLUE can provide more interpretable uncertainty estimation results compared with sentence-level uncertainty, and could be a useful tool for various tasks such as hallucination detection and story generation.
Title: NUMOSIM: A Synthetic Mobility Dataset with Anomaly Detection Benchmarks
Copy Paste: [[2409.03024]] NUMOSIM: A Synthetic Mobility Dataset with Anomaly Detection Benchmarks(https://arxiv.org/abs/2409.03024)
Keywords: privacy
Abstract: Collecting real-world mobility data is challenging. It is often fraught with privacy concerns, logistical difficulties, and inherent biases. Moreover, accurately annotating anomalies in large-scale data is nearly impossible, as it demands meticulous effort to distinguish subtle and complex patterns. These challenges significantly impede progress in geospatial anomaly detection research by restricting access to reliable data and complicating the rigorous evaluation, comparison, and benchmarking of methodologies. To address these limitations, we introduce a synthetic mobility dataset, NUMOSIM, that provides a controlled, ethical, and diverse environment for benchmarking anomaly detection techniques. NUMOSIM simulates a wide array of realistic mobility scenarios, encompassing both typical and anomalous behaviours, generated through advanced deep learning models trained on real mobility data. This approach allows NUMOSIM to accurately replicate the complexities of real-world movement patterns while strategically injecting anomalies to challenge and evaluate detection algorithms based on how effectively they capture the interplay between demographic, geospatial, and temporal factors. Our goal is to advance geospatial mobility analysis by offering a realistic benchmark for improving anomaly detection and mobility modeling techniques. To support this, we provide open access to the NUMOSIM dataset, along with comprehensive documentation, evaluation metrics, and benchmark results.
Title: A General Albedo Recovery Approach for Aerial Photogrammetric Images through Inverse Rendering
Copy Paste: [[2409.03032]] A General Albedo Recovery Approach for Aerial Photogrammetric Images through Inverse Rendering(https://arxiv.org/abs/2409.03032)
Keywords: extraction
Abstract: Modeling outdoor scenes for the synthetic 3D environment requires the recovery of reflectance/albedo information from raw images, which is an ill-posed problem due to the complicated unmodeled physics in this process (e.g., indirect lighting, volume scattering, specular reflection). The problem remains unsolved in a practical context. The recovered albedo can facilitate model relighting and shading, which can further enhance the realism of rendered models and the applications of digital twins. Typically, photogrammetric 3D models simply take the source images as texture materials, which inherently embed unwanted lighting artifacts (at the time of capture) into the texture. Therefore, these polluted textures are suboptimal for a synthetic environment to enable realistic rendering. In addition, these embedded environmental lightings further bring challenges to photo-consistencies across different images that cause image-matching uncertainties. This paper presents a general image formation model for albedo recovery from typical aerial photogrammetric images under natural illuminations and derives the inverse model to resolve the albedo information through inverse rendering intrinsic image decomposition. Our approach builds on the fact that both the sun illumination and scene geometry are estimable in aerial photogrammetry, thus they can provide direct inputs for this ill-posed problem. This physics-based approach does not require additional input other than data acquired through the typical drone-based photogrammetric collection and was shown to favorably outperform existing approaches. We also demonstrate that the recovered albedo image can in turn improve typical image processing tasks in photogrammetry such as feature and dense matching, edge, and line extraction.
Title: MDNF: Multi-Diffusion-Nets for Neural Fields on Meshes
Authors: Avigail Cohen Rimon, Tal Shnitzer, Mirela Ben Chen
Copy Paste: [[2409.03034]] MDNF: Multi-Diffusion-Nets for Neural Fields on Meshes(https://arxiv.org/abs/2409.03034)
Keywords: robust, diffusion
Abstract: We propose a novel framework for representing neural fields on triangle meshes that is multi-resolution across both spatial and frequency domains. Inspired by the Neural Fourier Filter Bank (NFFB), our architecture decomposes the spatial and frequency domains by associating finer spatial resolution levels with higher frequency bands, while coarser resolutions are mapped to lower frequencies. To achieve geometry-aware spatial decomposition we leverage multiple DiffusionNet components, each associated with a different spatial resolution level. Subsequently, we apply a Fourier feature mapping to encourage finer resolution levels to be associated with higher frequencies. The final signal is composed in a wavelet-inspired manner using a sine-activated MLP, aggregating higher-frequency signals on top of lower-frequency ones. Our architecture attains high accuracy in learning complex neural fields and is robust to discontinuities, exponential scale variations of the target field, and mesh modification. We demonstrate the effectiveness of our approach through its application to diverse neural fields, such as synthetic RGB functions, UV texture coordinates, and vertex normals, illustrating different challenges. To validate our method, we compare its performance against two alternatives, showcasing the advantages of our multi-resolution architecture.
Title: Can Your Generative Model Detect Out-of-Distribution Covariate Shift?
Authors: Christiaan Viviers, Amaan Valiuddin, Francisco Caetano, Lemar Abdi, Lena Filatova, Peter de With, Fons van der Sommen
Copy Paste: [[2409.03043]] Can Your Generative Model Detect Out-of-Distribution Covariate Shift?(https://arxiv.org/abs/2409.03043)
Keywords: generative
Abstract: Detecting Out-of-Distribution (OOD) sensory data and covariate distribution shift aims to identify new test examples with different high-level image statistics to the captured, normal and In-Distribution (ID) set. Existing OOD detection literature largely focuses on semantic shift with little-to-no consensus over covariate shift. Generative models capture the ID data in an unsupervised manner, enabling them to effectively identify samples that deviate significantly from this learned distribution, irrespective of the downstream task. In this work, we elucidate the ability of generative models to detect and quantify domain-specific covariate shift through extensive analyses that involve a variety of models. To this end, we conjecture that it is sufficient to detect most occurring sensory faults (anomalies and deviations in global signal statistics) by solely modeling high-frequency signal-dependent and independent details. We propose a novel method, CovariateFlow, for OOD detection, specifically tailored to covariate heteroscedastic high-frequency image components using conditional Normalizing Flows (cNFs). Our results on CIFAR10 vs. CIFAR10-C and ImageNet200 vs. ImageNet200-C demonstrate the effectiveness of the method by accurately detecting OOD covariate shift. This work contributes to enhancing the fidelity of imaging systems and aiding machine learning models in OOD detection in the presence of covariate shift.
Title: Better Verified Explanations with Applications to Incorrectness and Out-of-Distribution Detection
Authors: Min Wu, Xiaofu Li, Haoze Wu, Clark Barrett
Copy Paste: [[2409.03060]] Better Verified Explanations with Applications to Incorrectness and Out-of-Distribution Detection(https://arxiv.org/abs/2409.03060)
Keywords: explainability
Abstract: Building on VeriX (Verified eXplainability, arXiv:2212.01051), a system for producing optimal verified explanations for machine learning model outputs, we present VeriX+, which significantly improves both the size and the generation time of verified explanations. We introduce a bound propagation-based sensitivity technique to improve the size, and a binary search-based traversal with confidence ranking for improving time -- the two techniques are orthogonal and can be used independently or together. We also show how to adapt the QuickXplain (Junker 2004) algorithm to our setting to provide a trade-off between size and time. Experimental evaluations on standard benchmarks demonstrate significant improvements on both metrics, e.g., a size reduction of 38% on the GTSRB dataset and a time reduction of 90% on MNIST. We also explore applications of our verified explanations and show that explanation size is a useful proxy for both incorrectness detection and out-of-distribution detection.
Title: MobileUNETR: A Lightweight End-To-End Hybrid Vision Transformer For Efficient Medical Image Segmentation
Copy Paste: [[2409.03062]] MobileUNETR: A Lightweight End-To-End Hybrid Vision Transformer For Efficient Medical Image Segmentation(https://arxiv.org/abs/2409.03062)
Keywords: extraction, transformer, segmentation
Abstract: Skin cancer segmentation poses a significant challenge in medical image analysis. Numerous existing solutions, predominantly CNN-based, face issues related to a lack of global contextual understanding. Alternatively, some approaches resort to large-scale Transformer models to bridge the global contextual gaps, but at the expense of model size and computational complexity. Finally, many Transformer-based approaches rely primarily on CNN-based decoders, overlooking the benefits of Transformer-based decoding models. Recognizing these limitations, we address the need for efficient lightweight solutions by introducing MobileUNETR, which aims to overcome the performance constraints associated with both CNNs and Transformers while minimizing model size, presenting a promising stride towards efficient image segmentation. MobileUNETR has 3 main features: 1) a lightweight hybrid CNN-Transformer encoder that helps balance local and global contextual feature extraction in an efficient manner; 2) a novel hybrid decoder that simultaneously utilizes low-level and global features at different resolutions within the decoding stage for accurate mask generation; 3) superior performance over large and complex architectures, achieved with 3 million parameters and a computational complexity of 1.3 GFLOP, resulting in 10x and 23x reductions in parameters and FLOPS, respectively. Extensive experiments have been conducted to validate the effectiveness of our proposed method on four publicly available skin lesion segmentation datasets: ISIC 2016, ISIC 2017, ISIC 2018, and PH2. The code will be publicly available at: this https URL
Title: Backdoor defense, learnability and obfuscation
Authors: Paul Christiano, Jacob Hilton, Victor Lecomte, Mark Xu
Copy Paste: [[2409.03077]] Backdoor defense, learnability and obfuscation(https://arxiv.org/abs/2409.03077)
Keywords: defense, attack
Abstract: We introduce a formal notion of defendability against backdoors using a game between an attacker and a defender. In this game, the attacker modifies a function to behave differently on a particular input known as the "trigger", while behaving the same almost everywhere else. The defender then attempts to detect the trigger at evaluation time. If the defender succeeds with high enough probability, then the function class is said to be defendable. The key constraint on the attacker that makes defense possible is that the attacker's strategy must work for a randomly-chosen trigger. Our definition is simple and does not explicitly mention learning, yet we demonstrate that it is closely connected to learnability. In the computationally unbounded setting, we use a voting algorithm of Hanneke et al. (2022) to show that defendability is essentially determined by the VC dimension of the function class, in much the same way as PAC learnability. In the computationally bounded setting, we use a similar argument to show that efficient PAC learnability implies efficient defendability, but not conversely. On the other hand, we use indistinguishability obfuscation to show that the class of polynomial size circuits is not efficiently defendable. Finally, we present polynomial size decision trees as a natural example for which defense is strictly easier than learning. Thus, we identify efficient defendability as a notable intermediate concept in between efficient learnability and obfuscation.
Title: Leveraging Interpretability in the Transformer to Automate the Proactive Scaling of Cloud Resources
Copy Paste: [[2409.03103]] Leveraging Interpretability in the Transformer to Automate the Proactive Scaling of Cloud Resources(https://arxiv.org/abs/2409.03103)
Keywords: interpretability, transformer
Abstract: Modern web services adopt cloud-native principles to leverage the advantages of microservices. To consistently guarantee high Quality of Service (QoS) according to Service Level Agreements (SLAs), ensure satisfactory user experiences, and minimize operational costs, each microservice must be provisioned with the right amount of resources. However, accurately provisioning microservices with adequate resources is complex and depends on many factors, including workload intensity and the complex interconnections between microservices. To address this challenge, we develop a model that captures the relationship between an end-to-end latency, requests at the front-end level, and resource utilization. We then use the developed model to predict the end-to-end latency. Our solution leverages the Temporal Fusion Transformer (TFT), an attention-based architecture equipped with interpretability features. When the prediction results indicate SLA non-compliance, we use the feature importance provided by the TFT as covariates in Kernel Ridge Regression (KRR), with the response variable being the desired latency, to learn the parameters associated with the feature importance. These learned parameters reflect the adjustments required to the features to ensure SLA compliance. We demonstrate the merit of our approach with a microservice-based application and provide a roadmap to deployment.
Title: Spatial Diffusion for Cell Layout Generation
Copy Paste: [[2409.03106]] Spatial Diffusion for Cell Layout Generation(https://arxiv.org/abs/2409.03106)
Keywords: diffusion, generative
Abstract: Generative models, such as GANs and diffusion models, have been used to augment training sets and boost performance in different tasks. We focus on generative models for cell detection instead, i.e., locating and classifying cells in given pathology images. One important piece of information that has been largely overlooked is the spatial pattern of the cells. In this paper, we propose a spatial-pattern-guided generative model for cell layout generation. Specifically, we propose a novel diffusion model that is guided by spatial features and generates realistic cell layouts. We explore different density models as spatial features for the diffusion model. In downstream tasks, we show that the generated cell layouts can be used to guide the generation of high-quality pathology images. Augmenting with these images can significantly boost the performance of SOTA cell detection methods. The code is available at this https URL.
Title: Probing self-attention in self-supervised speech models for cross-linguistic differences
Copy Paste: [[2409.03115]] Probing self-attention in self-supervised speech models for cross-linguistic differences(https://arxiv.org/abs/2409.03115)
Keywords: transformer
Abstract: Speech models have gained traction thanks to an increase in accuracy from novel transformer architectures. While this impressive increase in performance across automatic speech recognition (ASR) benchmarks is noteworthy, there is still much that is unknown about the use of attention mechanisms for speech-related tasks. For example, while it is assumed that these models are learning language-independent (i.e., universal) speech representations, there has not yet been an in-depth exploration of what it would mean for the models to be language-independent. In the current paper, we explore this question within the realm of self-attention mechanisms of one small self-supervised speech transformer model (TERA). We find that even with a small model, the attention heads learned are diverse, ranging from almost entirely diagonal to almost entirely global, regardless of the training language. We highlight some notable differences in attention patterns between Turkish and English and demonstrate that the models do learn important phonological information during pretraining. We also present a head ablation study which shows that models across languages primarily rely on diagonal heads to classify phonemes.
Title: Well, that escalated quickly: The Single-Turn Crescendo Attack (STCA)
Copy Paste: [[2409.03131]] Well, that escalated quickly: The Single-Turn Crescendo Attack (STCA)(https://arxiv.org/abs/2409.03131)
Keywords: attack, robust, large language model
Abstract: This paper explores a novel approach to adversarial attacks on large language models (LLMs): the Single-Turn Crescendo Attack (STCA). The STCA builds upon the multi-turn crescendo attack established by Mark Russinovich, Ahmed Salem, and Ronen Eldan. Traditional multi-turn adversarial strategies gradually escalate the context to elicit harmful or controversial responses from LLMs. However, this paper introduces a more efficient method where the escalation is condensed into a single interaction. By carefully crafting the prompt to simulate an extended dialogue, the attack bypasses typical content moderation systems, leading to the generation of responses that would normally be filtered out. I demonstrate this technique through a few case studies. The results highlight vulnerabilities in current LLMs and underscore the need for more robust safeguards. This work contributes to the broader discourse on responsible AI (RAI) safety and adversarial testing, providing insights and practical examples for researchers and developers. This method is unexplored in the literature, making it a novel contribution to the field.
Title: Towards Autonomous Cybersecurity: An Intelligent AutoML Framework for Autonomous Intrusion Detection
Copy Paste: [[2409.03141]] Towards Autonomous Cybersecurity: An Intelligent AutoML Framework for Autonomous Intrusion Detection(https://arxiv.org/abs/2409.03141)
Keywords: security
Abstract: The rapid evolution of mobile networks from 5G to 6G has necessitated the development of autonomous network management systems, such as Zero-Touch Networks (ZTNs). However, the increased complexity and automation of these networks have also escalated cybersecurity risks. Existing Intrusion Detection Systems (IDSs) leveraging traditional Machine Learning (ML) techniques have shown effectiveness in mitigating these risks, but they often require extensive manual effort and expert knowledge. To address these challenges, this paper proposes an Automated Machine Learning (AutoML)-based autonomous IDS framework towards achieving autonomous cybersecurity for next-generation networks. To achieve autonomous intrusion detection, the proposed AutoML framework automates all critical procedures of the data analytics pipeline, including data pre-processing, feature engineering, model selection, hyperparameter tuning, and model ensemble. Specifically, it utilizes a Tabular Variational Auto-Encoder (TVAE) method for automated data balancing, tree-based ML models for automated feature selection and base model learning, Bayesian Optimization (BO) for hyperparameter optimization, and a novel Optimized Confidence-based Stacking Ensemble (OCSE) method for automated model ensemble. The proposed AutoML-based IDS was evaluated on two public benchmark network security datasets, CICIDS2017 and 5G-NIDD, and demonstrated improved performance compared to state-of-the-art cybersecurity methods. This research marks a significant step towards fully autonomous cybersecurity in next-generation networks, potentially revolutionizing network security applications.
Title: Addressing the Gaps in Early Dementia Detection: A Path Towards Enhanced Diagnostic Models through Machine Learning
Copy Paste: [[2409.03147]] Addressing the Gaps in Early Dementia Detection: A Path Towards Enhanced Diagnostic Models through Machine Learning(https://arxiv.org/abs/2409.03147)
Keywords: interpretability, transformer
Abstract: The rapid global aging trend has led to an increase in dementia cases, including Alzheimer's disease, underscoring the urgent need for early and accurate diagnostic methods. Traditional diagnostic techniques, such as cognitive tests, neuroimaging, and biomarker analysis, face significant limitations in sensitivity, accessibility, and cost, particularly in the early stages. This study explores the potential of machine learning (ML) as a transformative approach to enhance early dementia detection by leveraging ML models to analyze and integrate complex multimodal datasets, including cognitive assessments, neuroimaging, and genetic information. A comprehensive review of existing literature was conducted to evaluate various ML models, including supervised learning, deep learning, and advanced techniques such as ensemble learning and transformer models, assessing their accuracy, interpretability, and potential for clinical integration. The findings indicate that while ML models show significant promise in improving diagnostic precision and enabling earlier interventions, challenges remain in their generalizability, interpretability, and ethical deployment. This research concludes by outlining future directions aimed at enhancing the clinical utility of ML models in dementia detection, emphasizing interdisciplinary collaboration and ethically sound frameworks to improve early detection and intervention strategies for Alzheimer's disease and other forms of dementia.
Title: Debate on Graph: a Flexible and Reliable Reasoning Framework for Large Language Models
Authors: Jie Ma, Zhitao Gao, Qi Chai, Wangchun Sun, Pinghui Wang, Hongbin Pei, Jing Tao, Lingyun Song, Jun Liu, Chen Zhang, Lizhen Cui
Copy Paste: [[2409.03155]] Debate on Graph: a Flexible and Reliable Reasoning Framework for Large Language Models(https://arxiv.org/abs/2409.03155)
Keywords: large language model
Abstract: Large Language Models (LLMs) may suffer from hallucinations in real-world applications due to the lack of relevant knowledge. In contrast, knowledge graphs encompass extensive, multi-relational structures that store a vast array of symbolic facts. Consequently, integrating LLMs with knowledge graphs has been extensively explored, with Knowledge Graph Question Answering (KGQA) serving as a critical touchstone for the integration. This task requires LLMs to answer natural language questions by retrieving relevant triples from knowledge graphs. However, existing methods face two significant challenges: excessively long reasoning paths distracting from the answer generation, and false-positive relations hindering the path refinement. In this paper, we propose an iterative interactive KGQA framework that leverages the interactive learning capabilities of LLMs to perform reasoning and Debating over Graphs (DoG). Specifically, DoG employs a subgraph-focusing mechanism, allowing LLMs to attempt an answer after each reasoning step, thereby mitigating the impact of lengthy reasoning paths. In addition, DoG utilizes a multi-role debate team to gradually simplify complex questions, reducing the influence of false-positive relations. This debate mechanism ensures the reliability of the reasoning process. Experimental results on five public datasets demonstrate the effectiveness and superiority of our architecture. Notably, DoG outperforms the state-of-the-art method ToG by 23.7% and 9.1% in accuracy on WebQuestions and GrailQA, respectively. Furthermore, the integration experiments with various LLMs on the mentioned datasets highlight the flexibility of DoG. Code is available at this https URL.
Title: MaterialBENCH: Evaluating College-Level Materials Science Problem-Solving Abilities of Large Language Models
Copy Paste: [[2409.03161]] MaterialBENCH: Evaluating College-Level Materials Science Problem-Solving Abilities of Large Language Models(https://arxiv.org/abs/2409.03161)
Keywords: large language model
Abstract: A college-level benchmark dataset for large language models (LLMs) in the materials science field, MaterialBENCH, is constructed. This dataset consists of problem-answer pairs, based on university textbooks. There are two types of problems: one is the free-response answer type, and the other is the multiple-choice type. Multiple-choice problems are constructed by adding three incorrect answers as choices to a correct answer, so that LLMs can choose one of the four as a response. Most of the problems for free-response answer and multiple-choice types overlap except for the format of the answers. We also conduct experiments using the MaterialBENCH on LLMs, including ChatGPT-3.5, ChatGPT-4, Bard (at the time of the experiments), and GPT-3.5 and GPT-4 with the OpenAI API. The differences and similarities in the performance of LLMs measured by the MaterialBENCH are analyzed and discussed. Performance differences between the free-response type and multiple-choice type in the same models and the influence of using system messages on multiple-choice problems are also studied. We anticipate that MaterialBENCH will encourage further developments of LLMs in reasoning abilities to solve more complicated problems and eventually contribute to materials research and discovery.
Title: A Scalable Matrix Visualization for Understanding Tree Ensemble Classifiers
Authors: Zhen Li, Weikai Yang, Jun Yuan, Jing Wu, Changjian Chen, Yao Ming, Fan Yang, Hui Zhang, Shixia Liu
Copy Paste: [[2409.03164]] A Scalable Matrix Visualization for Understanding Tree Ensemble Classifiers(https://arxiv.org/abs/2409.03164)
Keywords: interpretability
Abstract: The high performance of tree ensemble classifiers benefits from a large set of rules, which, in turn, makes the models hard to understand. To improve interpretability, existing methods extract a subset of rules for approximation using model reduction techniques. However, by focusing on the reduced rule set, these methods often lose fidelity and ignore anomalous rules that, despite their infrequency, play crucial roles in real-world applications. This paper introduces a scalable visual analysis method to explain tree ensemble classifiers that contain tens of thousands of rules. The key idea is to address the issue of losing fidelity by adaptively organizing the rules as a hierarchy rather than reducing them. To ensure the inclusion of anomalous rules, we develop an anomaly-biased model reduction method to prioritize these rules at each hierarchical level. Synergized with this hierarchical organization of rules, we develop a matrix-based hierarchical visualization to support exploration at different levels of detail. Our quantitative experiments and case studies demonstrate how our method fosters a deeper understanding of both common and anomalous rules, thereby enhancing interpretability without sacrificing comprehensiveness.
Abstract: Neural networks (NN) classification models for Natural Language Processing (NLP) are vulnerable to the Universal Adversarial Triggers (UAT) attack that triggers a model to produce a specific prediction for any input. DARCY borrows the "honeypot" concept to bait multiple trapdoors, effectively detecting the adversarial examples generated by UAT. Unfortunately, we find a new UAT generation method, called IndisUAT, which produces triggers (i.e., tokens) and uses them to craft adversarial examples whose feature distribution is indistinguishable from that of the benign examples in a randomly-chosen category at the detection layer of DARCY. The produced adversarial examples incur the maximal loss of predicting results in the DARCY-protected models. Meanwhile, the produced triggers are effective in black-box models for text generation, text inference, and reading comprehension. Finally, the evaluation results under NN models for NLP tasks indicate that the IndisUAT method can effectively circumvent DARCY and penetrate other defenses. For example, IndisUAT can reduce the true positive rate of DARCY's detection by at least 40.8% and 90.6%, and drop the accuracy by at least 33.3% and 51.6% in the RNN and CNN models, respectively. IndisUAT reduces the accuracy of the BERT's adversarial defense model by at least 34.0%, and makes the GPT-2 language model spew racist outputs even when conditioned on non-racial context.
Title: PEPL: Precision-Enhanced Pseudo-Labeling for Fine-Grained Image Classification in Semi-Supervised Learning
Copy Paste: [[2409.03192]] PEPL: Precision-Enhanced Pseudo-Labeling for Fine-Grained Image Classification in Semi-Supervised Learning(https://arxiv.org/abs/2409.03192)
Keywords: robust
Abstract: Fine-grained image classification has witnessed significant advancements with the advent of deep learning and computer vision technologies. However, the scarcity of detailed annotations remains a major challenge, especially in scenarios where obtaining high-quality labeled data is costly or time-consuming. To address this limitation, we introduce the Precision-Enhanced Pseudo-Labeling (PEPL) approach, specifically designed for fine-grained image classification within a semi-supervised learning framework. Our method leverages the abundance of unlabeled data by generating high-quality pseudo-labels that are progressively refined through two key phases: initial pseudo-label generation and semantic-mixed pseudo-label generation. These phases utilize Class Activation Maps (CAMs) to accurately estimate the semantic content and generate refined labels that capture the essential details necessary for fine-grained classification. By focusing on semantic-level information, our approach effectively addresses the limitations of standard data augmentation and image-mixing techniques in preserving critical fine-grained features. We achieve state-of-the-art performance on benchmark datasets, demonstrating significant improvements over existing semi-supervised strategies, with notable boosts in accuracy and robustness. Our code has been open sourced at this https URL.
Title: RoomDiffusion: A Specialized Diffusion Model in the Interior Design Industry
Copy Paste: [[2409.03198]] RoomDiffusion: A Specialized Diffusion Model in the Interior Design Industry(https://arxiv.org/abs/2409.03198)
Keywords: diffusion
Abstract: Recent advancements in text-to-image diffusion models have significantly transformed visual content generation, yet their application in specialized fields such as interior design remains underexplored. In this paper, we present RoomDiffusion, a pioneering diffusion model meticulously tailored for the interior design industry. To begin with, we build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. Subsequently, techniques such as multi-aspect training, multi-stage fine-tuning, and model fusion are applied to enhance both the visual appeal and precision of the generated results. Lastly, leveraging the latent consistency distillation method, we distill and expedite the model for optimal efficiency. Unlike existing models optimized for general scenarios, RoomDiffusion addresses specific challenges in interior design, such as lack of fashion, high furniture duplication rate, and inaccurate style. Through our holistic human evaluation protocol with more than 20 professional human evaluators, RoomDiffusion demonstrates industry-leading performance in terms of aesthetics, accuracy, and efficiency, surpassing all existing open source models such as Stable Diffusion and SDXL.
Title: Active Fake: DeepFake Camouflage
Copy Paste: [[2409.03200]] Active Fake: DeepFake Camouflage(https://arxiv.org/abs/2409.03200)
Keywords: security, robust
Abstract: DeepFake technology has gained significant attention due to its ability to manipulate facial attributes with high realism, raising serious societal concerns. Face-Swap DeepFake is the most harmful among these techniques, which fabricates behaviors by swapping original faces with synthesized ones. Existing forensic methods, primarily based on Deep Neural Networks (DNNs), effectively expose these manipulations and have become important authenticity indicators. However, these methods mainly concentrate on capturing the blending inconsistency in DeepFake faces. This raises a new security issue, termed Active Fake, which emerges when individuals intentionally create blending inconsistency in their authentic videos to evade responsibility. This tactic is called DeepFake Camouflage. To achieve this, we introduce a new framework for creating DeepFake camouflage that generates blending inconsistencies while ensuring imperceptibility, effectiveness, and transferability. This framework, optimized via an adversarial learning strategy, crafts imperceptible yet effective inconsistencies to mislead forensic detectors. Extensive experiments demonstrate the effectiveness and robustness of our method, highlighting the need for further research in active fake detection.
Title: An Effective Deployment of Diffusion LM for Data Augmentation in Low-Resource Sentiment Classification
Copy Paste: [[2409.03203]] An Effective Deployment of Diffusion LM for Data Augmentation in Low-Resource Sentiment Classification(https://arxiv.org/abs/2409.03203)
Keywords: diffusion
Abstract: Sentiment classification (SC) often suffers from low-resource challenges such as domain-specific contexts, imbalanced label distributions, and few-shot scenarios. The potential of the diffusion language model (LM) for textual data augmentation (DA) remains unexplored; moreover, textual DA methods struggle to balance the diversity and consistency of new samples. Most DA methods either perform logical modifications or rephrase less important tokens in the original sequence with the language model. In the context of SC, strong emotional tokens could act critically on the sentiment of the whole sequence. Therefore, contrary to rephrasing less important context, we propose DiffusionCLS to leverage a diffusion LM to capture in-domain knowledge and generate pseudo samples by reconstructing strong label-related tokens. This approach ensures a balance between consistency and diversity, avoiding the introduction of noise and augmenting crucial features of datasets. DiffusionCLS also comprises a Noise-Resistant Training objective to help the model generalize. Experiments demonstrate the effectiveness of our method in various low-resource scenarios including domain-specific and domain-general problems. Ablation studies confirm the effectiveness of our framework's modules, and visualization studies highlight optimal deployment conditions, reinforcing our conclusions.
Title: Pricing American Options using Machine Learning Algorithms
Copy Paste: [[2409.03204]] Pricing American Options using Machine Learning Algorithms(https://arxiv.org/abs/2409.03204)
Keywords: robust
Abstract: This study investigates the application of machine learning algorithms, particularly in the context of pricing American options using Monte Carlo simulations. Traditional models, such as the Black-Scholes-Merton framework, often fail to adequately address the complexities of American options, which include the ability for early exercise and non-linear payoff structures. By leveraging Monte Carlo methods in conjunction with the Least Square Method (LSM) and machine learning, this research aims to improve the accuracy and efficiency of option pricing. The study evaluates several machine learning models, including neural networks and decision trees, highlighting their potential to outperform traditional approaches. The results from applying machine learning algorithms in LSM indicate that integrating machine learning with Monte Carlo simulations can enhance pricing accuracy and provide more robust predictions, offering significant insights into quantitative finance by merging classical financial theories with modern computational techniques. The dataset was split into features and the target variable representing bid prices, with an 80-20 train-validation split. LSTM and GRU models were constructed using TensorFlow's Keras API, each with four hidden layers of 200 neurons and an output layer for bid price prediction, optimized with the Adam optimizer and MSE loss function. The GRU model outperformed the LSTM model across all evaluated metrics, demonstrating lower mean absolute error, mean squared error, and root mean squared error, along with greater stability and efficiency in training.
Title: TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations
Authors: Mingze Gao, Jingyu Liu, Mingda Li, Jiangtao Xie, Qingbin Liu, Bo Zhao, Xi Chen, Hui Xiong
Copy Paste: [[2409.03206]] TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations(https://arxiv.org/abs/2409.03206)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have significantly improved performance across various image-language applications. Recently, there has been a growing interest in adapting image pre-trained MLLMs for video-related tasks. However, most efforts concentrate on enhancing the vision encoder and projector components, while the core part, Large Language Models (LLMs), remains comparatively under-explored. In this paper, we propose two strategies to enhance the model's capability in video understanding tasks by improving inter-layer attention computation in LLMs. Specifically, the first approach focuses on the enhancement of Rotary Position Embedding (RoPE) with Temporal-Aware Dual RoPE, which introduces temporal position information to strengthen the MLLM's temporal modeling capabilities while preserving the relative position relationships of both visual and text tokens. The second approach involves enhancing the Attention Mask with the Frame-wise Block Causal Attention Mask, a simple yet effective method that broadens visual token interactions within and across video frames while maintaining the causal inference mechanism. Based on these proposed methods, we adapt LLaVA for video understanding tasks, naming it Temporal-Considered LLaVA (TC-LLaVA). Our TC-LLaVA achieves new state-of-the-art performance across various video understanding benchmarks with only supervised fine-tuning (SFT) on video-related datasets.
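To make the idea of a frame-wise block causal attention mask concrete, the following is a minimal sketch under one plausible reading of the abstract above: tokens keep the usual causal rule (attend to earlier positions), and visual tokens that belong to the same frame may additionally attend to each other in both directions. The function name, the `frame_ids` encoding (frame index per visual token, `None` for text tokens), and the exact masking rule are assumptions made for illustration; the mask actually used in TC-LLaVA may differ.

```python
def framewise_block_causal_mask(frame_ids):
    """Build a boolean attention mask (hypothetical helper, not from the paper).

    frame_ids[i] is the frame index of visual token i, or None for a text token.
    mask[i][j] is True when token i is allowed to attend to token j.
    """
    n = len(frame_ids)
    # Start from the standard causal mask: each token sees itself and earlier tokens.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    # Assumed extension: visual tokens of the same frame see each other fully,
    # including later tokens of that frame.
    for i in range(n):
        for j in range(i + 1, n):
            if frame_ids[i] is not None and frame_ids[i] == frame_ids[j]:
                mask[i][j] = True
    return mask


# Two frames with two visual tokens each, followed by one text token.
example_mask = framewise_block_causal_mask([0, 0, 1, 1, None])
```

In this example the first token of frame 0 can attend to the second token of frame 0, but not to any token of frame 1, so causality across frames is preserved while attention within a frame is bidirectional.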
Title: iSeg: An Iterative Refinement-based Framework for Training-free Segmentation
Authors: Lin Sun, Jiale Cao, Jin Xie, Fahad Shahbaz Khan, Yanwei Pang
Copy Paste: [[2409.03209]] iSeg: An Iterative Refinement-based Framework for Training-free Segmentation(https://arxiv.org/abs/2409.03209)
Keywords: diffusion, segmentation
Abstract: Stable diffusion has demonstrated strong image synthesis ability given text descriptions, suggesting that it contains strong semantic clues for grouping objects. Inspired by this, researchers have explored employing stable diffusion for training-free segmentation. Most existing approaches either simply employ the cross-attention map or refine it with the self-attention map to generate segmentation masks. We believe that iterative refinement with the self-attention map would lead to better results. However, we empirically demonstrate that such a refinement is sub-optimal, likely because the self-attention map contains irrelevant global information that hampers accurate refinement of the cross-attention map over multiple iterations. To address this, we propose an iterative refinement framework for training-free segmentation, named iSeg, having an entropy-reduced self-attention module which utilizes a gradient descent scheme to reduce the entropy of the self-attention map, thereby suppressing the weak responses corresponding to irrelevant global information. Leveraging the entropy-reduced self-attention module, our iSeg stably improves the refined cross-attention map with iterative refinement. Further, we design a category-enhanced cross-attention module to generate an accurate cross-attention map, providing a better initial input for iterative refinement. Extensive experiments across different datasets and diverse segmentation tasks reveal the merits of the proposed contributions, leading to promising performance on diverse segmentation tasks. For unsupervised semantic segmentation on Cityscapes, our iSeg achieves an absolute gain of 3.8% in terms of mIoU compared to the best existing training-free approach in the literature. Moreover, our proposed iSeg can support segmentation with different kinds of images and interactions.
Title: Bi-capacity Choquet Integral for Sensor Fusion with Label Uncertainty
Copy Paste: [[2409.03212]] Bi-capacity Choquet Integral for Sensor Fusion with Label Uncertainty(https://arxiv.org/abs/2409.03212)
Keywords: robust
Abstract: Sensor fusion combines data from multiple sensor sources to improve reliability, robustness, and accuracy of data interpretation. The Fuzzy Integral (FI), in particular, the Choquet integral (ChI), is often used as a powerful nonlinear aggregator for fusion across multiple sensors. However, existing supervised ChI learning algorithms typically require precise training labels for each input data point, which can be difficult or impossible to obtain. Additionally, prior work on ChI fusion is often based only on the normalized fuzzy measures, which bounds the fuzzy measure values between [0, 1]. This can be limiting in cases where the underlying scales of input data sources are bipolar (i.e., between [-1, 1]). To address these challenges, this paper proposes a novel Choquet integral-based fusion framework, named Bi-MIChI (pronounced "bi-mi-kee"), which uses bi-capacities to represent the interactions between pairs of subsets of the input sensor sources on a bi-polar scale. This allows for extended non-linear interactions between the sensor sources and can lead to interesting fusion results. Bi-MIChI also addresses label uncertainty through Multiple Instance Learning, where training labels are applied to "bags" (sets) of data instead of per-instance. Our proposed Bi-MIChI framework shows effective classification and detection performance on both synthetic and real-world experiments for sensor fusion with label uncertainty. We also provide detailed analyses on the behavior of the fuzzy measures to demonstrate our fusion process.
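For readers unfamiliar with the Choquet integral mentioned in the abstract above, here is a minimal sketch of the standard discrete Choquet integral with respect to a normalized fuzzy measure on [0, 1]. It does not implement the bi-capacity extension or the Multiple Instance Learning training that Bi-MIChI proposes; the function name, the dictionary-based data layout, and the three example measures are assumptions chosen purely for illustration.

```python
from itertools import combinations


def choquet_integral(values, measure):
    """Discrete Choquet integral (illustrative helper, not the paper's code).

    values:  dict mapping each sensor source to its value in [0, 1].
    measure: dict mapping each non-empty frozenset of sources to its
             fuzzy-measure weight in [0, 1].
    """
    # Sort sources by value, largest first.
    ordered = sorted(values, key=values.get, reverse=True)
    total = 0.0
    for i, source in enumerate(ordered):
        # Value of the next-largest source, or 0 after the last one.
        next_value = values[ordered[i + 1]] if i + 1 < len(ordered) else 0.0
        # Coalition formed by the i + 1 sources with the largest values.
        coalition = frozenset(ordered[: i + 1])
        total += (values[source] - next_value) * measure[coalition]
    return total


def make_measure(sources, weight_fn):
    """Enumerate every non-empty subset of sources and assign it a weight."""
    return {
        frozenset(c): weight_fn(frozenset(c))
        for r in range(1, len(sources) + 1)
        for c in combinations(sources, r)
    }


sources = ["a", "b", "c"]
values = {"a": 0.2, "b": 0.9, "c": 0.5}

# Every non-empty coalition has full weight: the integral behaves like max.
max_like = choquet_integral(values, make_measure(sources, lambda s: 1.0))
# Only the full set has weight: the integral behaves like min.
min_like = choquet_integral(
    values, make_measure(sources, lambda s: 1.0 if len(s) == len(sources) else 0.0)
)
# Additive measure with equal weights: the integral behaves like the mean.
mean_like = choquet_integral(
    values, make_measure(sources, lambda s: len(s) / len(sources))
)
```

The three measures show why the Choquet integral is described as a flexible nonlinear aggregator: depending on the fuzzy measure, the same formula reproduces the maximum, the minimum, or the arithmetic mean of the inputs, and anything in between.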
Title: Optimizing 3D Gaussian Splatting for Sparse Viewpoint Scene Reconstruction
Copy Paste: [[2409.03213]] Optimizing 3D Gaussian Splatting for Sparse Viewpoint Scene Reconstruction(https://arxiv.org/abs/2409.03213)
Keywords: robust, diffusion
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a promising approach for 3D scene representation, offering a reduction in computational overhead compared to Neural Radiance Fields (NeRF). However, 3DGS is susceptible to high-frequency artifacts and demonstrates suboptimal performance under sparse viewpoint conditions, thereby limiting its applicability in robotics and computer vision. To address these limitations, we introduce SVS-GS, a novel framework for Sparse Viewpoint Scene reconstruction that integrates a 3D Gaussian smoothing filter to suppress artifacts. Furthermore, our approach incorporates a Depth Gradient Profile Prior (DGPP) loss with a dynamic depth mask to sharpen edges and 2D diffusion with Score Distillation Sampling (SDS) loss to enhance geometric consistency in novel view synthesis. Experimental evaluations on the MipNeRF-360 and SeaThru-NeRF datasets demonstrate that SVS-GS markedly improves 3D reconstruction from sparse viewpoints, offering a robust and efficient solution for scene understanding in robotics and computer vision applications.
Title: xLAM: A Family of Large Action Models to Empower AI Agent Systems
Authors: Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, Zhiwei Liu, Yihao Feng, Tulika Awalgaonkar, Rithesh Murthy, Eric Hu, Zeyuan Chen, Ran Xu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong
Copy Paste: [[2409.03215]] xLAM: A Family of Large Action Models to Empower AI Agent Systems(https://arxiv.org/abs/2409.03215)
Keywords: large language model
Abstract: Autonomous agents powered by large language models (LLMs) have attracted significant research interest. However, the open-source community faces many challenges in developing specialized models for agent tasks, driven by the scarcity of high-quality agent datasets and the absence of standard protocols in this area. We introduce and publicly release xLAM, a series of large action models designed for AI agent tasks. The xLAM series includes five models with both dense and mixture-of-expert architectures, ranging from 1B to 8x22B parameters, trained using a scalable, flexible pipeline that unifies, augments, and synthesizes diverse datasets to enhance AI agents' generalizability and performance across varied environments. Our experimental results demonstrate that xLAM consistently delivers exceptional performance across multiple agent ability benchmarks, notably securing the 1st position on the Berkeley Function-Calling Leaderboard, outperforming GPT-4, Claude-3, and many other models in terms of tool use. By releasing the xLAM series, we aim to advance the performance of open-source LLMs for autonomous AI agents, potentially accelerating progress and democratizing access to high-performance models for agent tasks. Models are available at this https URL
Title: FairQuant: Certifying and Quantifying Fairness of Deep Neural Networks
Authors: Brian Hyeongseok Kim, Jingbo Wang, Chao Wang
Copy Paste: [[2409.03220]] FairQuant: Certifying and Quantifying Fairness of Deep Neural Networks(https://arxiv.org/abs/2409.03220)
Keywords: protect, fair
Abstract: We propose a method for formally certifying and quantifying individual fairness of deep neural networks (DNN). Individual fairness guarantees that any two individuals who are identical except for a legally protected attribute (e.g., gender or race) receive the same treatment. While there are existing techniques that provide such a guarantee, they tend to suffer from lack of scalability or accuracy as the size and input dimension of the DNN increase. Our method overcomes this limitation by applying abstraction to a symbolic interval based analysis of the DNN followed by iterative refinement guided by the fairness property. Furthermore, our method lifts the symbolic interval based analysis from conventional qualitative certification to quantitative certification, by computing the percentage of individuals whose classification outputs are provably fair, instead of merely deciding if the DNN is fair. We have implemented our method and evaluated it on deep neural networks trained on four popular fairness research datasets. The experimental results show that our method is not only more accurate than state-of-the-art techniques but also several orders-of-magnitude faster.
Title: Why mamba is effective? Exploit Linear Transformer-Mamba Network for Multi-Modality Image Fusion
Authors: Chenguang Zhu, Shan Gao, Huafeng Chen, Guangqian Guo, Chaowei Wang, Yaoxing Wang, Chen Shu Lei, Quanjiang Fan
Copy Paste: [[2409.03223]] Why mamba is effective? Exploit Linear Transformer-Mamba Network for Multi-Modality Image Fusion(https://arxiv.org/abs/2409.03223)
Keywords: extraction, transformer
Abstract: Multi-modality image fusion aims to integrate the merits of images from different sources and render high-quality fusion images. However, existing feature extraction and fusion methods are either constrained by inherent local reduction bias and static parameters during inference (CNN) or limited by quadratic computational complexity (Transformers), and cannot effectively extract and fuse features. To solve this problem, we propose a dual-branch image fusion network called Tmamba. It consists of a linear Transformer and Mamba, which have global modeling capabilities while maintaining linear complexity. Due to the difference between the Transformer and Mamba structures, the features extracted by the two branches carry channel and position information respectively. A T-M interaction structure is designed between the two branches, using global learnable parameters and convolutional layers to transfer position and channel information respectively. We further propose cross-modal interaction at the attention level to obtain cross-modal attention. Experiments show that our Tmamba achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion. Code with checkpoints will be available after the peer-review process.
Title: Enhancing Healthcare LLM Trust with Atypical Presentations Recalibration
Abstract: Black-box large language models (LLMs) are increasingly deployed in various environments, making it essential for these models to effectively convey their confidence and uncertainty, especially in high-stakes settings. However, these models often exhibit overconfidence, leading to potential risks and misjudgments. Existing techniques for eliciting and calibrating LLM confidence have primarily focused on general reasoning datasets, yielding only modest improvements. Accurate calibration is crucial for informed decision-making and preventing adverse outcomes but remains challenging due to the complexity and variability of tasks these models perform. In this work, we investigate the miscalibration behavior of black-box LLMs within the healthcare setting. We propose a novel method, Atypical Presentations Recalibration, which leverages atypical presentations to adjust the model's confidence estimates. Our approach significantly improves calibration, reducing calibration errors by approximately 60% on three medical question answering datasets and outperforming existing methods such as vanilla verbalized confidence, CoT verbalized confidence and others. Additionally, we provide an in-depth analysis of the role of atypicality within the recalibration framework.
Title: Labeled-to-Unlabeled Distribution Alignment for Partially-Supervised Multi-Organ Medical Image Segmentation
Copy Paste: [[2409.03228]] Labeled-to-Unlabeled Distribution Alignment for Partially-Supervised Multi-Organ Medical Image Segmentation(https://arxiv.org/abs/2409.03228)
Keywords: segmentation
Abstract: Partially-supervised multi-organ medical image segmentation aims to develop a unified semantic segmentation model by utilizing multiple partially-labeled datasets, with each dataset providing labels for a single class of organs. However, the limited availability of labeled foreground organs and the absence of supervision to distinguish unlabeled foreground organs from the background pose a significant challenge, which leads to a distribution mismatch between labeled and unlabeled pixels. Although existing pseudo-labeling methods can be employed to learn from both labeled and unlabeled pixels, they are prone to performance degradation in this task, as they rely on the assumption that labeled and unlabeled pixels have the same distribution. In this paper, to address the problem of distribution mismatch, we propose a labeled-to-unlabeled distribution alignment (LTUDA) framework that aligns feature distributions and enhances discriminative capability. Specifically, we introduce a cross-set data augmentation strategy, which performs region-level mixing between labeled and unlabeled organs to reduce distribution discrepancy and enrich the training set. Besides, we propose a prototype-based distribution alignment method that implicitly reduces intra-class variation and increases the separation between the unlabeled foreground and background. This can be achieved by encouraging consistency between the outputs of two prototype classifiers and a linear classifier. Extensive experimental results on the AbdomenCT-1K dataset and a union of four benchmark datasets (including LiTS, MSD-Spleen, KiTS, and NIH82) demonstrate that our method outperforms the state-of-the-art partially-supervised methods by a considerable margin, and even surpasses the fully-supervised methods. The source code is publicly available at this https URL.
Title: State-space models are accurate and efficient neural operators for dynamical systems
Authors: Zheyuan Hu, Nazanin Ahmadi Daryakenari, Qianli Shen, Kenji Kawaguchi, George Em Karniadakis
Copy Paste: [[2409.03231]] State-space models are accurate and efficient neural operators for dynamical systems(https://arxiv.org/abs/2409.03231)
Keywords: transformer
Abstract: Physics-informed machine learning (PIML) has emerged as a promising alternative to classical methods for predicting dynamical systems, offering faster and more generalizable solutions. However, existing models, including recurrent neural networks (RNNs), transformers, and neural operators, face challenges such as long-time integration, long-range dependencies, chaotic dynamics, and extrapolation, to name a few. To this end, this paper introduces state-space models implemented in Mamba for accurate and efficient dynamical system operator learning. Mamba addresses the limitations of existing architectures by dynamically capturing long-range dependencies and enhancing computational efficiency through reparameterization techniques. To extensively test Mamba and compare against another 11 baselines, we introduce several strict extrapolation testbeds that go beyond the standard interpolation benchmarks. We demonstrate Mamba's superior performance in both interpolation and challenging extrapolation tasks. Mamba consistently ranks among the top models while maintaining the lowest computational cost and exceptional extrapolation capabilities. Moreover, we demonstrate the good performance of Mamba for a real-world application in quantitative systems pharmacology for assessing the efficacy of drugs in tumor growth under limited data scenarios. Taken together, our findings highlight Mamba's potential as a powerful tool for advancing scientific machine learning in dynamical systems modeling. (The code will be available at this https URL upon acceptance.)
Title: Robust Q-Learning under Corrupted Rewards
Copy Paste: [[2409.03237]] Robust Q-Learning under Corrupted Rewards(https://arxiv.org/abs/2409.03237)
Keywords: attack, robust
Abstract: Recently, there has been a surge of interest in analyzing the non-asymptotic behavior of model-free reinforcement learning algorithms. However, the performance of such algorithms in non-ideal environments, such as in the presence of corrupted rewards, is poorly understood. Motivated by this gap, we investigate the robustness of the celebrated Q-learning algorithm to a strong-contamination attack model, where an adversary can arbitrarily perturb a small fraction of the observed rewards. We start by proving that such an attack can cause the vanilla Q-learning algorithm to incur arbitrarily large errors. We then develop a novel robust synchronous Q-learning algorithm that uses historical reward data to construct robust empirical Bellman operators at each time step. Finally, we prove a finite-time convergence rate for our algorithm that matches known state-of-the-art bounds (in the absence of attacks) up to a small inevitable $O(\varepsilon)$ error term that scales with the adversarial corruption fraction $\varepsilon$. Notably, our results continue to hold even when the true reward distributions have infinite support, provided they admit bounded second moments.
Title: UAV (Unmanned Aerial Vehicles): Diverse Applications of UAV Datasets in Segmentation, Classification, Detection, and Tracking
Copy Paste: [[2409.03245]] UAV (Unmanned Aerial Vehicles): Diverse Applications of UAV Datasets in Segmentation, Classification, Detection, and Tracking(https://arxiv.org/abs/2409.03245)
Keywords: segmentation
Abstract: Unmanned Aerial Vehicles (UAVs) have greatly revolutionized the process of gathering and analyzing data in diverse research domains, providing unmatched adaptability and effectiveness. This paper presents a thorough examination of UAV datasets, emphasizing their wide range of applications and progress. UAV datasets consist of various types of data, such as satellite imagery, images captured by drones, and videos. These datasets can be categorized as either unimodal or multimodal, offering a wide range of detailed and comprehensive information. These datasets play a crucial role in disaster damage assessment, aerial surveillance, object recognition, and tracking. They facilitate the development of sophisticated models for tasks like semantic segmentation, pose estimation, vehicle re-identification, and gesture recognition. By leveraging UAV datasets, researchers can significantly enhance the capabilities of computer vision models, thereby advancing technology and improving our understanding of complex, dynamic environments from an aerial perspective. This review aims to encapsulate the multifaceted utility of UAV datasets, emphasizing their pivotal role in driving innovation and practical applications in multiple domains.
Title: Multiple weather images restoration using the task transformer and adaptive mixup strategy
Authors: Yang Wen, Anyu Lai, Bo Qian, Hao Wang, Wuzhen Shi, Wenming Cao
Copy Paste: [[2409.03249]] Multiple weather images restoration using the task transformer and adaptive mixup strategy(https://arxiv.org/abs/2409.03249)
Keywords: transformer
Abstract: The current state-of-the-art in severe weather removal predominantly focuses on single-task applications, such as rain removal, haze removal, and snow removal. However, real-world weather conditions often consist of a mixture of several weather types, and the degree of weather mixing in autonomous driving scenarios remains unknown. In the presence of complex and diverse weather conditions, a single weather removal model often encounters challenges in producing clear images from severe weather images. Therefore, there is a need for the development of multi-task severe weather removal models that can effectively handle mixed weather conditions and improve image quality in autonomous driving scenarios. In this paper, we introduce a novel multi-task severe weather removal model that can effectively handle complex weather conditions in an adaptive manner. Our model incorporates a weather task sequence generator, enabling the self-attention mechanism to selectively focus on features specific to different weather types. To tackle the challenge of repairing large areas of weather degradation, we introduce Fast Fourier Convolution (FFC) to increase the receptive field. Additionally, we propose an adaptive upsampling technique that effectively processes both the weather task information and underlying image features by selectively retaining relevant information. Our proposed model has achieved state-of-the-art performance on the publicly available dataset.
Title: Gr-IoU: Ground-Intersection over Union for Robust Multi-Object Tracking with 3D Geometric Constraints
Copy Paste: [[2409.03252]] Gr-IoU: Ground-Intersection over Union for Robust Multi-Object Tracking with 3D Geometric Constraints(https://arxiv.org/abs/2409.03252)
Keywords: robust
Abstract: We propose a Ground IoU (Gr-IoU) to address the data association problem in multi-object tracking. When tracking objects detected by a camera, it often occurs that the same object is assigned different IDs in consecutive frames, especially when objects are close to each other or overlapping. To address this issue, we introduce Gr-IoU, which takes into account the 3D structure of the scene. Gr-IoU transforms traditional bounding boxes from the image space to the ground plane using the vanishing point geometry. The IoU calculated with these transformed bounding boxes is more sensitive to the front-to-back relationships of objects, thereby improving data association accuracy and reducing ID switches. We evaluated our Gr-IoU method on the MOT17 and MOT20 datasets, which contain diverse tracking scenarios including crowded scenes and sequences with frequent occlusions. Experimental results demonstrated that Gr-IoU outperforms conventional real-time methods without appearance features.
Title: Granular-ball Representation Learning for Deep CNN on Learning with Label Noise
Authors: Dawei Dai, Hao Zhu, Shuyin Xia, Guoyin Wang
Copy Paste: [[2409.03254]] Granular-ball Representation Learning for Deep CNN on Learning with Label Noise(https://arxiv.org/abs/2409.03254)
Keywords: robust
Abstract: In actual scenarios, whether data are annotated manually or automatically, label noise is inevitably generated in the training data, which can affect the effectiveness of deep CNN models. The popular solutions require data cleaning or additional optimizations that penalize mislabeled data, thereby enhancing the robustness of models. However, these methods come at the cost of weakening or even losing some data during the training process. As we know, content is the inherent attribute of an image and does not change with changes in annotations. In this study, we propose a general granular-ball computing (GBC) module that can be embedded into a CNN model, where the classifier finally predicts the label of granular-ball ($gb$) samples instead of each individual sample. Specifically, considering the classification task: (1) in the forward process, we split the input samples into $gb$ samples at the feature level, each of which can correspond to a varying number of samples and shares one single label; (2) during the backpropagation process, we modify the gradient allocation strategy of the GBC module to enable it to propagate normally; and (3) we develop an experience replay policy to ensure the stability of the training process. Experiments demonstrate that the proposed method can improve the robustness of CNN models with no additional data or optimization.
Title: Understanding LLM Development Through Longitudinal Study: Insights from the Open Ko-LLM Leaderboard
Copy Paste: [[2409.03257]] Understanding LLM Development Through Longitudinal Study: Insights from the Open Ko-LLM Leaderboard(https://arxiv.org/abs/2409.03257)
Keywords: large language model
Abstract: This paper conducts a longitudinal study over eleven months to address the limitations of prior research on the Open Ko-LLM Leaderboard, which has relied on empirical studies with restricted observation periods of only five months. By extending the analysis duration, we aim to provide a more comprehensive understanding of the progression in developing Korean large language models (LLMs). Our study is guided by three primary research questions: (1) What are the specific challenges in improving LLM performance across diverse tasks on the Open Ko-LLM Leaderboard over time? (2) How does model size impact task performance correlations across various benchmarks? (3) How have the patterns in leaderboard rankings shifted over time on the Open Ko-LLM Leaderboard? By analyzing 1,769 models over this period, our research offers a comprehensive examination of the ongoing advancements in LLMs and the evolving nature of evaluation frameworks.
Title: GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding
Authors: Yukun Cao, Shuo Han, Zengyi Gao, Zezhong Ding, Xike Xie, S. Kevin Zhou
Copy Paste: [[2409.03258]] GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding(https://arxiv.org/abs/2409.03258)
Keywords: large language model
Abstract: Although Large Language Models (LLMs) have demonstrated potential in processing graphs, they struggle with comprehending graphical structure information through prompts of graph description sequences, especially as the graph size increases. We attribute this challenge to the uneven memory performance of LLMs across different positions in graph description sequences, known as "positional biases". To address this, we propose GraphInsight, a novel framework aimed at improving LLMs' comprehension of both macro- and micro-level graphical information. GraphInsight is grounded in two key strategies: 1) placing critical graphical information in positions where LLMs exhibit stronger memory performance, and 2) investigating a lightweight external knowledge base for regions with weaker memory performance, inspired by retrieval-augmented generation (RAG). Moreover, GraphInsight explores integrating these two strategies into LLM agent processes for composite graph tasks that require multi-step reasoning. Extensive empirical studies on benchmarks with a wide range of evaluation tasks show that GraphInsight significantly outperforms all other graph description methods (e.g., prompting techniques and reordering strategies) in understanding graph structures of varying sizes.
Title: SVP: Style-Enhanced Vivid Portrait Talking Head Diffusion Model
Authors: Weipeng Tan, Chuming Lin, Chengming Xu, Xiaozhong Ji, Junwei Zhu, Chengjie Wang, Yanwei Fu
Abstract: Talking Head Generation (THG), typically driven by audio, is an important and challenging task with broad application prospects in various fields such as digital humans, film production, and virtual reality. While diffusion model-based THG methods offer high-quality and stable content generation, they often overlook the intrinsic style, which encompasses personalized features such as the speaking habits and facial expressions in a video. As a consequence, the generated video content lacks diversity and vividness, which limits its use in real-life scenarios. To address these issues, we propose a novel framework named Style-Enhanced Vivid Portrait (SVP), which fully leverages style-related information in THG. Specifically, we first introduce a novel probabilistic style prior learning scheme that models the intrinsic style as a Gaussian distribution using facial expressions and audio embeddings. The distribution is learned through a 'bespoked' contrastive objective, effectively capturing the dynamic style information in each video. We then finetune a pretrained Stable Diffusion (SD) model to inject the learned intrinsic style as a controlling signal via cross attention. Experiments show that our model generates diverse, vivid, and high-quality videos with flexible control over intrinsic styles, outperforming existing state-of-the-art methods.
Title: OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving
Copy Paste: [[2409.03272]] OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving(https://arxiv.org/abs/2409.03272)
Keywords: generative, large language model
Abstract: The rise of multi-modal large language models (MLLMs) has spurred their application in autonomous driving. Recent MLLM-based methods perform actions by learning a direct mapping from perception to action, neglecting the dynamics of the world and the relations between action and world dynamics. In contrast, human beings possess a world model that enables them to simulate future states based on a 3D internal visual representation and plan actions accordingly. To this end, we propose OccLLaMA, an occupancy-language-action generative world model, which uses semantic occupancy as a general visual representation and unifies vision-language-action (VLA) modalities through an autoregressive model. Specifically, we introduce a novel VQVAE-like scene tokenizer to efficiently discretize and reconstruct semantic occupancy scenes, considering their sparsity and class imbalance. Then, we build a unified multi-modal vocabulary for vision, language, and action. Furthermore, we enhance an LLM, specifically LLaMA, to perform next token/scene prediction on the unified vocabulary to complete multiple tasks in autonomous driving. Extensive experiments demonstrate that OccLLaMA achieves competitive performance across multiple tasks, including 4D occupancy forecasting, motion planning, and visual question answering, showcasing its potential as a foundation model in autonomous driving.
Title: Recent Advances in Attack and Defense Approaches of Large Language Models
Copy Paste: [[2409.03274]] Recent Advances in Attack and Defense Approaches of Large Language Models(https://arxiv.org/abs/2409.03274)
Keywords: security, defense, attack, robust, fair, large language model
Abstract: Large Language Models (LLMs) have revolutionized artificial intelligence and machine learning through their advanced text processing and generating capabilities. However, their widespread deployment has raised significant safety and reliability concerns. Established vulnerabilities in deep neural networks, coupled with emerging threat models, may compromise security evaluations and create a false sense of security. Given the extensive research in the field of LLM security, we believe that summarizing the current state of affairs will help the research community better understand the present landscape and inform future developments. This paper reviews current research on LLM vulnerabilities and threats, and evaluates the effectiveness of contemporary defense mechanisms. We analyze recent studies on attack vectors and model weaknesses, providing insights into attack mechanisms and the evolving threat landscape. We also examine current defense strategies, highlighting their strengths and limitations. By contrasting advancements in attack and defense methodologies, we identify research gaps and propose future directions to enhance LLM security. Our goal is to advance the understanding of LLM safety challenges and guide the development of more robust security measures.
Title: Interpretable mixture of experts for time series prediction under recurrent and non-recurrent conditions
Copy Paste: [[2409.03282]] Interpretable mixture of experts for time series prediction under recurrent and non-recurrent conditions(https://arxiv.org/abs/2409.03282)
Keywords: transformer
Abstract: Non-recurrent conditions caused by incidents are different from recurrent conditions that follow periodic patterns. Existing traffic speed prediction studies are incident-agnostic and use one single model to learn all possible patterns from these drastically diverse conditions. This study proposes a novel Mixture of Experts (MoE) model to improve traffic speed prediction under two separate conditions, recurrent and non-recurrent (i.e., with and without incidents). The MoE leverages separate recurrent and non-recurrent expert models (Temporal Fusion Transformers) to capture the distinct patterns of each traffic condition. Additionally, we propose a training pipeline for non-recurrent models to remedy the limited data issues. To train our model, multi-source datasets, including traffic speed, incident reports, and weather data, are integrated and processed to be informative features. Evaluations on a real road network demonstrate that the MoE achieves lower errors compared to other benchmark algorithms. The model predictions are interpreted in terms of temporal dependencies and variable importance in each condition separately to shed light on the differences between recurrent and non-recurrent conditions.
Title: LLM Detectors Still Fall Short of Real World: Case of LLM-Generated Short News-Like Posts
Authors: Henrique Da Silva Gameiro, Andrei Kucharavy, Ljiljana Dolamic
Copy Paste: [[2409.03291]] LLM Detectors Still Fall Short of Real World: Case of LLM-Generated Short News-Like Posts(https://arxiv.org/abs/2409.03291)
Keywords: attack, large language model
Abstract: With the emergence of widely available powerful LLMs, disinformation generated by Large Language Models (LLMs) has become a major concern. Historically, LLM detectors have been touted as a solution, but their effectiveness in the real world is still to be proven. In this paper, we focus on an important setting in information operations -- short news-like posts generated by moderately sophisticated attackers. We demonstrate that existing LLM detectors, whether zero-shot or purpose-trained, are not ready for real-world use in that setting. All tested zero-shot detectors perform inconsistently with prior benchmarks and are highly vulnerable to sampling temperature increase, a trivial attack absent from recent benchmarks. A purpose-trained detector generalizing across LLMs and unseen attacks can be developed, but it fails to generalize to new human-written texts. We argue that the former indicates domain-specific benchmarking is needed, while the latter suggests a trade-off between adversarial evasion resilience and overfitting to the reference human text, both of which need evaluation in benchmarks and are currently absent. We believe this calls for a re-consideration of current LLM detector benchmarking approaches, and we provide a dynamically extensible benchmark to allow it (this https URL).
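To illustrate what a "sampling temperature increase" does to a model's next-token distribution, here is a minimal sketch using temperature-scaled softmax on made-up logits. The function names and logit values are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

logits = [4.0, 2.0, 1.0, 0.0]  # hypothetical next-token scores
for t in (0.7, 1.0, 1.5):
    p = softmax_with_temperature(logits, t)
    print(t, p.round(3), round(entropy(p), 3))
```

Raising the temperature flattens the distribution, so sampled text contains more low-probability tokens. Detectors that rely on the statistical signature of high-likelihood text plausibly degrade for that reason, which is consistent with the vulnerability the abstract reports.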
Title: N-gram Prediction and Word Difference Representations for Language Modeling
Copy Paste: [[2409.03295]] N-gram Prediction and Word Difference Representations for Language Modeling(https://arxiv.org/abs/2409.03295)
Keywords: large language model
Abstract: Causal language modeling (CLM) serves as the foundational framework underpinning remarkable successes of recent large language models (LLMs). Despite its success, the training approach for next word prediction poses a potential risk of causing the model to overly focus on local dependencies within a sentence. While prior studies have been introduced to predict future N words simultaneously, they were primarily applied to tasks such as masked language modeling (MLM) and neural machine translation (NMT). In this study, we introduce a simple N-gram prediction framework for the CLM task. Moreover, we introduce word difference representation (WDR) as a surrogate and contextualized target representation during model training on the basis of N-gram prediction framework. To further enhance the quality of next word prediction, we propose an ensemble method that incorporates the future N words' prediction results. Empirical evaluations across multiple benchmark datasets encompassing CLM and NMT tasks demonstrate the significant advantages of our proposed methods over the conventional CLM.
Title: On the construction of ultra-light MDS matrices
Copy Paste: [[2409.03298]] On the construction of ultra-light MDS matrices(https://arxiv.org/abs/2409.03298)
Keywords: security, robust
Abstract: In recent years, the Substitution-Permutation Network has emerged as a crucial structure for constructing symmetric key ciphers. Composed primarily of linear matrices and nonlinear S-boxes, it offers a robust foundation for cryptographic security. Among the various metrics used to assess the cryptographic properties of linear matrices, the branch number stands out as a particularly important index. Matrices with an optimal branch number are referred to as MDS matrices and are highly prized in the field of cryptography. In this paper we delve into the construction of lightweight MDS matrices. We commence with implementation trees of MDS matrices, a vital tool for understanding and manipulating their implementations, and then present an algorithm that efficiently enumerates all the lightest MDS matrices based on the word representation. As a result, we obtain a series of ultra-lightweight $4\times 4$ MDS matrices, notably 4-bit input MDS matrices with 35 XOR operations and 8-bit input ones with 67 XOR operations. These matrices represent the most comprehensive lightweight MDS matrices available to date. Furthermore, we craft some involutory $4\times 4$ MDS matrices with a mere 68 XOR gates. To the best of our knowledge, they are the best to date. In the realm of higher-order MDS matrices, we have successfully constructed $5\times 5$ and $6\times 6$ matrices with 114 and 148 XOR gates respectively. These findings outperform the current state-of-the-art.
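For readers unfamiliar with the MDS property, a standard characterization is that a square matrix over a finite field is MDS exactly when every square submatrix is nonsingular. The sketch below checks this over GF(2^8) with the AES reduction polynomial. It is an illustrative checker only, not the paper's enumeration algorithm, and the function names are hypothetical:

```python
from itertools import combinations

def gf_mul(a, b, poly=0x11B, bits=8):
    """Multiply a and b in GF(2^bits) modulo the irreducible polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> bits:  # degree reached `bits`: reduce
            a ^= poly
    return r

def gf_inv(a, poly=0x11B, bits=8):
    """Inverse of nonzero a, computed as a^(2^bits - 2)."""
    result, base, e = 1, a, (1 << bits) - 2
    while e:
        if e & 1:
            result = gf_mul(result, base, poly, bits)
        base = gf_mul(base, base, poly, bits)
        e >>= 1
    return result

def is_nonsingular(m, poly=0x11B, bits=8):
    """Gaussian elimination over GF(2^bits); True if m has full rank."""
    m = [row[:] for row in m]
    n = len(m)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col]), None)
        if pivot is None:
            return False
        m[col], m[pivot] = m[pivot], m[col]
        inv = gf_inv(m[col][col], poly, bits)
        for r in range(col + 1, n):
            f = gf_mul(m[r][col], inv, poly, bits)
            if f:
                for c in range(col, n):
                    # Subtraction equals XOR in characteristic 2.
                    m[r][c] ^= gf_mul(f, m[col][c], poly, bits)
    return True

def is_mds(m, poly=0x11B, bits=8):
    """True if every square submatrix of m is nonsingular."""
    n = len(m)
    for k in range(1, n + 1):
        for rows in combinations(range(n), k):
            for cols in combinations(range(n), k):
                sub = [[m[r][c] for c in cols] for r in rows]
                if not is_nonsingular(sub, poly, bits):
                    return False
    return True

# The AES MixColumns matrix, a well-known 4x4 MDS matrix over GF(2^8).
AES_MIX = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]
print(is_mds(AES_MIX))
```

A brute-force check like this only verifies the MDS property; it says nothing about the XOR count of an implementation, which is the cost metric the abstract optimizes.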
Title: Improving Robustness to Multiple Spurious Correlations by Multi-Objective Optimization
Copy Paste: [[2409.03303]] Improving Robustness to Multiple Spurious Correlations by Multi-Objective Optimization(https://arxiv.org/abs/2409.03303)
Keywords: robust
Abstract: We study the problem of training an unbiased and accurate model given a dataset with multiple biases. This problem is challenging since the multiple biases cause multiple undesirable shortcuts during training, and even worse, mitigating one may exacerbate the other. We propose a novel training method to tackle this challenge. Our method first groups training data so that different groups induce different shortcuts, and then optimizes a linear combination of group-wise losses while adjusting their weights dynamically to alleviate conflicts between the groups in performance; this approach, rooted in multi-objective optimization theory, encourages reaching a minimax Pareto solution. We also present a new benchmark with multiple biases, dubbed MultiCelebA, for evaluating debiased training methods under realistic and challenging scenarios. Our method achieved the best performance on three datasets with multiple biases, and also showed superior performance on conventional single-bias datasets.
Title: Enhancing User-Centric Privacy Protection: An Interactive Framework through Diffusion Models and Machine Unlearning
Copy Paste: [[2409.03326]] Enhancing User-Centric Privacy Protection: An Interactive Framework through Diffusion Models and Machine Unlearning(https://arxiv.org/abs/2409.03326)
Keywords: privacy, protect, diffusion, generative
Abstract: In the realm of multimedia data analysis, the extensive use of image datasets has escalated concerns over privacy protection within such data. Current research predominantly focuses on privacy protection either in data sharing or upon the release of trained machine learning models. Our study pioneers a comprehensive privacy protection framework that safeguards image data privacy concurrently during data sharing and model publication. We propose an interactive image privacy protection framework that utilizes generative machine learning models to modify image information at the attribute level and employs machine unlearning algorithms for the privacy preservation of model parameters. This user-interactive framework allows for adjustments in privacy protection intensity based on user feedback on generated images, striking a balance between maximal privacy safeguarding and maintaining model performance. Within this framework, we instantiate two modules: a differential privacy diffusion model for protecting attribute information in images and a feature unlearning algorithm for efficient updates of the trained model on the revised image dataset. Our approach demonstrated superiority over existing methods on facial datasets across various attribute classifications.
Title: Rethinking Improved Privacy-Utility Trade-off with Pre-existing Knowledge for DP Training
Authors: Yu Zheng, Wenchao Zhang, Yonggang Zhang, Wei Song, Kai Zhou, Bo Han
Copy Paste: [[2409.03344]] Rethinking Improved Privacy-Utility Trade-off with Pre-existing Knowledge for DP Training(https://arxiv.org/abs/2409.03344)
Keywords: privacy, protect
Abstract: Differential privacy (DP) provides a provable framework for protecting individuals by customizing a random mechanism over a privacy-sensitive dataset. Deep learning models have demonstrated privacy risks in model exposure as an established learning model unintentionally records membership-level privacy leakage. Differentially private stochastic gradient descent (DP-SGD) has been proposed to safeguard training individuals by adding random Gaussian noise to gradient updates in the backpropagation. Researchers identify that DP-SGD typically causes utility loss since the injected homogeneous noise alters the gradient updates calculated at each iteration. Namely, all elements in the gradient are contaminated regardless of their importance in updating model parameters. In this work, we argue that the utility loss mainly results from the homogeneity of injected noise. Consequently, we propose a generic differential privacy framework with heterogeneous noise (DP-Hero) by defining a heterogeneous random mechanism to abstract its property. The insight of DP-Hero is to leverage the knowledge encoded in the previously trained model to guide the subsequent allocation of noise heterogeneity, thereby leveraging the statistical perturbation and achieving enhanced utility. Atop DP-Hero, we instantiate a heterogeneous version of DP-SGD, where the noise injected into gradients is heterogeneous and guided by prior-established model parameters. We conduct comprehensive experiments to verify and explain the effectiveness of the proposed DP-Hero, showing improved training accuracy compared with state-of-the-art works. Broadly, we shed light on improving the privacy-utility space by learning the noise guidance from the pre-existing leaked knowledge encoded in the previously trained model, showing a different perspective of understanding the utility-improved DP training.
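For context, the homogeneous baseline that the abstract contrasts with DP-Hero can be sketched as follows: clip each per-sample gradient, add isotropic Gaussian noise, and average. This is standard DP-SGD, not DP-Hero's heterogeneous mechanism, and the function name and argument names are illustrative assumptions:

```python
import numpy as np

def dp_sgd_step(params, per_sample_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update with homogeneous (isotropic) Gaussian noise."""
    g = np.asarray(per_sample_grads, dtype=float)  # shape (batch, dim)
    # Clip each sample's gradient to L2 norm at most clip_norm.
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = g * scale
    # Every coordinate receives noise of the same standard deviation.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=g.shape[1])
    noisy_mean = (clipped.sum(axis=0) + noise) / g.shape[0]
    return params - lr * noisy_mean

rng = np.random.default_rng(0)
grads = [[3.0, 4.0], [0.3, 0.4]]
print(dp_sgd_step(np.zeros(2), grads, lr=1.0, clip_norm=1.0,
                  noise_multiplier=1.1, rng=rng))
```

Because the noise scale is identical for all coordinates, important and unimportant gradient elements are perturbed alike, which is the homogeneity the abstract identifies as the main source of utility loss.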
Title: Sketch: A Toolkit for Streamlining LLM Operations
Copy Paste: [[2409.03346]] Sketch: A Toolkit for Streamlining LLM Operations(https://arxiv.org/abs/2409.03346)
Keywords: generative, large language model
Abstract: Large language models (LLMs), represented by the GPT family, have achieved remarkable success. A defining characteristic of LLMs is their ability to accommodate a wide range of tasks through a generative approach. However, the flexibility of their output format poses challenges in controlling and harnessing the model's outputs, thereby constraining the application of LLMs in various domains. In this work, we present Sketch, an innovative toolkit designed to streamline LLM operations across diverse fields. Sketch comprises the following components: (1) a suite of task description schemas and prompt templates encompassing various NLP tasks; (2) a user-friendly, interactive process for building structured output LLM services tailored to various NLP tasks; (3) an open-source dataset for output format control, along with tools for dataset construction; and (4) an open-source model based on LLaMA3-8B-Instruct that adeptly comprehends and adheres to output formatting instructions. We anticipate this initiative to bring considerable convenience to LLM users, achieving the goal of "plug-and-play" for various applications. The components of Sketch will be progressively open-sourced at this https URL.
Title: MouseSIS: A Frames-and-Events Dataset for Space-Time Instance Segmentation of Mice
Authors: Friedhelm Hamann, Hanxiong Li, Paul Mieske, Lars Lewejohann, Guillermo Gallego
Copy Paste: [[2409.03358]] MouseSIS: A Frames-and-Events Dataset for Space-Time Instance Segmentation of Mice(https://arxiv.org/abs/2409.03358)
Keywords: robust, segmentation
Abstract: Enabled by large annotated datasets, tracking and segmentation of objects in videos has made remarkable progress in recent years. Despite these advancements, algorithms still struggle under degraded conditions and during fast movements. Event cameras are novel sensors with high temporal resolution and high dynamic range that offer promising advantages to address these challenges. However, annotated data for developing learning-based mask-level tracking algorithms with events is not available. To this end, we introduce: (i) a new task termed space-time instance segmentation, similar to video instance segmentation, whose goal is to segment instances throughout the entire duration of the sensor input (here, the inputs are quasi-continuous events and optionally aligned frames); and (ii) MouseSIS, a dataset for the new task, containing aligned grayscale frames and events. It includes annotated ground-truth labels (pixel-level instance segmentation masks) of a group of up to seven freely moving and interacting mice. We also provide two reference methods, which show that leveraging event data can consistently improve tracking performance, especially when used in combination with conventional cameras. The results highlight the potential of event-aided tracking in difficult scenarios. We hope our dataset opens the field of event-based video instance segmentation and enables the development of robust tracking algorithms for challenging conditions (this https URL).
Title: Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding
Copy Paste: [[2409.03363]] Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding(https://arxiv.org/abs/2409.03363)
Keywords: security, privacy, robust, membership infer, large language model
Abstract: The training data in large language models is key to their success, but it also presents privacy and security risks, as it may contain sensitive information. Detecting pre-training data is crucial for mitigating these concerns. Existing methods typically analyze target text in isolation or solely with non-member contexts, overlooking potential insights from simultaneously considering both member and non-member contexts. While previous work suggested that member contexts provide little information due to the minor distributional shift they induce, our analysis reveals that these subtle shifts can be effectively leveraged when contrasted with non-member contexts. In this paper, we propose Con-ReCall, a novel approach that leverages the asymmetric distributional shifts induced by member and non-member contexts through contrastive decoding, amplifying subtle differences to enhance membership inference. Extensive empirical evaluations demonstrate that Con-ReCall achieves state-of-the-art performance on the WikiMIA benchmark and is robust against various text manipulation techniques.
Title: Leveraging Large Language Models through Natural Language Processing to provide interpretable Machine Learning predictions of mental deterioration in real time
Authors: Francisco de Arriba-Pérez, Silvia García-Méndez
Copy Paste: [[2409.03375]] Leveraging Large Language Models through Natural Language Processing to provide interpretable Machine Learning predictions of mental deterioration in real time(https://arxiv.org/abs/2409.03375)
Keywords: extraction, explainability, large language model
Abstract: Based on official estimates, 50 million people worldwide are affected by dementia, and this number increases by 10 million new patients every year. Without a cure, clinical prognostication and early intervention represent the most effective ways to delay its progression. To this end, Artificial Intelligence and computational linguistics can be exploited for natural language analysis, personalized assessment, monitoring, and treatment. However, traditional approaches lack sufficient semantic knowledge management and explicability capabilities. Moreover, the use of Large Language Models (LLMs) for cognitive decline diagnosis is still scarce, even though these models represent the most advanced way for clinical-patient communication using intelligent systems. Consequently, we leverage an LLM using the latest Natural Language Processing (NLP) techniques in a chatbot solution to provide interpretable Machine Learning prediction of cognitive decline in real time. Linguistic-conceptual features are exploited for appropriate natural language analysis. Through explainability, we aim to fight potential biases of the models and improve their potential to help clinical workers in their diagnosis decisions. In more detail, the proposed pipeline is composed of (i) data extraction employing NLP-based prompt engineering; (ii) stream-based data processing including feature engineering, analysis, and selection; (iii) real-time classification; and (iv) the explainability dashboard to provide visual and natural language descriptions of the prediction outcome. Classification results exceed 80 % in all evaluation metrics, with a recall value for the mental deterioration class of about 85 %. To sum up, we contribute an affordable, flexible, non-invasive, personalized diagnostic system.
Title: CogniDual Framework: Self-Training Large Language Models within a Dual-System Theoretical Framework for Improving Cognitive Tasks
Authors: Yongxin Deng (1), Xihe Qiu (1), Xiaoyu Tan (2), Chao Qu (2), Jing Pan (3), Yuan Cheng (3), Yinghui Xu (4), Wei Chu (2) ((1) School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, China, (2) INF Technology (Shanghai) Co., Ltd., Shanghai, China, (3) School of Art, Design and Architecture, Monash University, Melbourne, Australia, (4) Artificial Intelligence Innovation and Incubation Institute, Fudan University, Shanghai, China)
Copy Paste: [[2409.03381]] CogniDual Framework: Self-Training Large Language Models within a Dual-System Theoretical Framework for Improving Cognitive Tasks(https://arxiv.org/abs/2409.03381)
Keywords: large language model
Abstract: Cognitive psychology investigates perception, attention, memory, language, problem-solving, decision-making, and reasoning. Kahneman's dual-system theory elucidates the human decision-making process, distinguishing between the rapid, intuitive System 1 and the deliberative, rational System 2. Recent advancements have positioned Large Language Models (LLMs) as formidable tools nearing human-level proficiency in various cognitive tasks. Nonetheless, the presence of a dual-system framework analogous to human cognition in LLMs remains unexplored. This study introduces the CogniDual Framework for LLMs (CFLLMs), designed to assess whether LLMs can, through self-training, evolve from deliberate deduction to intuitive responses, thereby emulating the human process of acquiring and mastering new information. Our findings reveal the cognitive mechanisms behind LLMs' response generation, enhancing our understanding of their capabilities in cognitive psychology. Practically, self-trained models can provide faster responses to certain queries, reducing computational demands during inference.
Title: Make Graph-based Referring Expression Comprehension Great Again through Expression-guided Dynamic Gating and Regression
Authors: Jingcheng Ke, Dele Wang, Jun-Cheng Chen, I-Hong Jhuo, Chia-Wen Lin, Yen-Yu Lin
Copy Paste: [[2409.03385]] Make Graph-based Referring Expression Comprehension Great Again through Expression-guided Dynamic Gating and Regression(https://arxiv.org/abs/2409.03385)
Keywords: transformer
Abstract: One common belief is that with complex models and pre-training on large-scale datasets, transformer-based methods for referring expression comprehension (REC) perform much better than existing graph-based methods. We observe that since most graph-based methods adopt an off-the-shelf detector to locate candidate objects (i.e., regions detected by the object detector), they face two challenges that result in subpar performance: (1) the presence of significant noise caused by numerous irrelevant objects during reasoning, and (2) inaccurate localization outcomes attributed to the provided detector. To address these issues, we introduce a plug-and-adapt module guided by sub-expressions, called dynamic gate constraint (DGC), which can adaptively disable irrelevant proposals and their connections in graphs during reasoning. We further introduce an expression-guided regression strategy (EGR) to refine location prediction. Extensive experimental results on the RefCOCO, RefCOCO+, RefCOCOg, Flickr30K, RefClef, and Ref-reasoning datasets demonstrate the effectiveness of the DGC module and the EGR strategy in consistently boosting the performance of various graph-based REC methods. Without any pre-training, the proposed graph-based method achieves better performance than the state-of-the-art (SOTA) transformer-based methods.
Title: KAN See In the Dark
Authors: Aoxiang Ning, Minglong Xue, Jinhong He, Chengyun Song
Copy Paste: [[2409.03404]] KAN See In the Dark(https://arxiv.org/abs/2409.03404)
Keywords: interpretability, diffusion
Abstract: Existing low-light image enhancement methods struggle to fit the complex nonlinear relationship between normal and low-light images due to uneven illumination and noise effects. The recently proposed Kolmogorov-Arnold networks (KANs) feature spline-based convolutional layers and learnable activation functions, which can effectively capture nonlinear dependencies. In this paper, we design a KAN-Block based on KANs and innovatively apply it to low-light image enhancement. This method effectively alleviates the limitations of current methods constrained by linear network structures and lack of interpretability, further demonstrating the potential of KANs in low-level vision tasks. Given the poor perception of current low-light image enhancement methods and the stochastic nature of the inverse diffusion process, we further introduce frequency-domain perception for visually oriented enhancement. Extensive experiments demonstrate the competitive performance of our method on benchmark datasets. The code will be available at: this https URL.
Title: TG-LMM: Enhancing Medical Image Segmentation Accuracy through Text-Guided Large Multi-Modal Model
Authors: Yihao Zhao, Enhao Zhong, Cuiyun Yuan, Yang Li, Man Zhao, Chunxia Li, Jun Hu, Chenbin Liu
Copy Paste: [[2409.03412]] TG-LMM: Enhancing Medical Image Segmentation Accuracy through Text-Guided Large Multi-Modal Model(https://arxiv.org/abs/2409.03412)
Keywords: segmentation
Abstract: We propose TG-LMM (Text-Guided Large Multi-Modal Model), a novel approach that leverages textual descriptions of organs to enhance segmentation accuracy in medical images. Existing medical image segmentation methods face several challenges: current medical automatic segmentation models do not effectively utilize prior knowledge, such as descriptions of organ locations; previous text-visual models focus on identifying the target rather than improving the segmentation accuracy; prior models attempt to use prior knowledge to enhance accuracy but do not incorporate pre-trained models. To address these issues, TG-LMM integrates prior knowledge, specifically expert descriptions of the spatial locations of organs, into the segmentation process. Our model utilizes pre-trained image and text encoders to reduce the number of training parameters and accelerate the training process. Additionally, we designed a comprehensive image-text information fusion structure to ensure thorough integration of the two modalities of data. We evaluated TG-LMM on three authoritative medical image datasets, encompassing the segmentation of various parts of the human body. Our method demonstrated superior performance compared to existing approaches, such as MedSAM, SAM and nnUnet.
Title: mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
Authors: Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou
Abstract: Multimodal Large Language Models (MLLMs) have achieved promising OCR-free Document Understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory usage and slower inference times, particularly in multi-page document comprehension. In this work, to address these challenges, we propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens, guided by low-resolution global visual features. With this compression module, to strengthen multi-page document comprehension ability and balance both token efficiency and question-answering performance, we develop DocOwl2 under a three-stage training framework: Single-image Pretraining, Multi-image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%, demonstrating advanced capabilities in multi-page question answering, explanation with evidence pages, and cross-page structure understanding. Additionally, compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens. Our codes, models, and data are publicly available at this https URL.
Title: Weight Conditioning for Smooth Optimization of Neural Networks
Authors: Hemanth Saratchandran, Thomas X. Wang, Simon Lucey
Copy Paste: [[2409.03424]] Weight Conditioning for Smooth Optimization of Neural Networks(https://arxiv.org/abs/2409.03424)
Keywords: transformer
Abstract: In this article, we introduce a novel normalization technique for neural network weight matrices, which we term weight conditioning. This approach aims to narrow the gap between the smallest and largest singular values of the weight matrices, resulting in better-conditioned matrices. The inspiration for this technique partially derives from numerical linear algebra, where well-conditioned matrices are known to facilitate stronger convergence results for iterative solvers. We provide a theoretical foundation demonstrating that our normalization technique smoothens the loss landscape, thereby enhancing convergence of stochastic gradient descent algorithms. Empirically, we validate our normalization across various neural network architectures, including Convolutional Neural Networks (CNNs), Vision Transformers (ViT), Neural Radiance Fields (NeRF), and 3D shape modeling. Our findings indicate that our normalization method is not only competitive but also outperforms existing weight normalization techniques from the literature.
Title: UV-Mamba: A DCN-Enhanced State Space Model for Urban Village Boundary Identification in High-Resolution Remote Sensing Images
Authors: Lulin Li, Ben Chen, Xuechao Zou, Junliang Xing, Pin Tao
Copy Paste: [[2409.03431]] UV-Mamba: A DCN-Enhanced State Space Model for Urban Village Boundary Identification in High-Resolution Remote Sensing Images(https://arxiv.org/abs/2409.03431)
Keywords: extraction
Abstract: Owing to diverse geographical environments, intricate landscapes, and high-density settlements, the automatic identification of urban village boundaries using remote sensing images is a highly challenging task. This paper proposes a novel and efficient neural network model called UV-Mamba for accurate boundary detection in high-resolution remote sensing images. UV-Mamba mitigates the memory loss problem in long sequence modeling, which arises in state space models (SSMs) with increasing image size, by incorporating deformable convolutions (DCN). Its architecture uses an encoder-decoder framework, with an encoder of four deformable state space augmentation (DSSA) blocks for efficient multi-level semantic extraction and a decoder that integrates the extracted semantic information. We conducted experiments on the Beijing and Xi'an datasets, and the results show that UV-Mamba achieves state-of-the-art performance. Specifically, our model achieves 73.3% and 78.1% IoU on the Beijing and Xi'an datasets, respectively, representing improvements of 1.2% and 3.4% IoU over the previous best model, while also being 6x faster in inference speed and 40x smaller in parameter count. Source code and pre-trained models are available in the supplementary material.
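For readers unfamiliar with the IoU figures quoted in the abstract above, the following is a minimal sketch of how Intersection over Union is commonly computed for a pair of binary segmentation masks. It is not taken from the UV-Mamba code; the function name `iou`, the nested-list mask representation, and the convention for two empty masks are assumptions made for illustration.

```python
def iou(pred, target):
    """Intersection over Union of two equally sized binary masks (nested lists of 0/1)."""
    intersection = 0
    union = 0
    for row_pred, row_target in zip(pred, target):
        for p, t in zip(row_pred, row_target):
            if p and t:
                intersection += 1  # pixel is foreground in both masks
            if p or t:
                union += 1  # pixel is foreground in at least one mask
    # Assumed convention: two completely empty masks count as a perfect match.
    return intersection / union if union else 1.0


# Hypothetical 2x3 masks: 2 shared foreground pixels, 4 foreground pixels overall.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(iou(pred, target))  # 0.5
```

Benchmark numbers are usually averaged over a whole dataset and may be computed per class, so a single-pair score like this one is only the building block.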
Title: A Key-Driven Framework for Identity-Preserving Face Anonymization
Copy Paste: [[2409.03434]] A Key-Driven Framework for Identity-Preserving Face Anonymization(https://arxiv.org/abs/2409.03434)
Keywords: privacy, protect
Abstract: Virtual faces are crucial content in the metaverse. Recently, attempts have been made to generate virtual faces for privacy protection. Nevertheless, these virtual faces either permanently remove the identifiable information or map the original identity into a virtual one, which loses the original identity forever. In this study, we first attempt to address the conflict between privacy and identifiability in virtual faces, where a key-driven face anonymization and authentication recognition (KFAAR) framework is proposed. Concretely, the KFAAR framework consists of a head posture-preserving virtual face generation (HPVFG) module and a key-controllable virtual face authentication (KVFA) module. The HPVFG module uses a user key to project the latent vector of the original face into a virtual one. Then it maps the virtual vectors to obtain an extended encoding, based on which the virtual face is generated. By simultaneously adding a head posture and facial expression correction module, the virtual face has the same head posture and facial expression as the original face. During the authentication, we propose a KVFA module to directly recognize the virtual faces using the correct user key, which can obtain the original identity without exposing the original face image. We also propose a multi-task learning objective to train HPVFG and KVFA. Extensive experiments demonstrate the advantages of the proposed HPVFG and KVFA modules, which effectively achieve both facial anonymity and identifiability.
Title: Shuffle Vision Transformer: Lightweight, Fast and Efficient Recognition of Driver Facial Expression
Authors: Ibtissam Saadi, Douglas W. Cunningham, Taleb-ahmed Abdelmalik, Abdenour Hadid, Yassin El Hillali
Copy Paste: [[2409.03438]] Shuffle Vision Transformer: Lightweight, Fast and Efficient Recognition of Driver Facial Expression(https://arxiv.org/abs/2409.03438)
Keywords: transformer
Abstract: Existing methods for driver facial expression recognition (DFER) are often computationally intensive, rendering them unsuitable for real-time applications. In this work, we introduce a novel transfer learning-based dual architecture, named ShuffViT-DFER, which elegantly combines computational efficiency and accuracy. This is achieved by harnessing the strengths of two lightweight and efficient models using convolutional neural network (CNN) and vision transformers (ViT). We efficiently fuse the extracted features to enhance the performance of the model in accurately recognizing the facial expressions of the driver. Our experimental results on two benchmarking and public datasets, KMU-FED and KDEF, highlight the validity of our proposed method for real-time application with superior performance when compared to state-of-the-art methods.
Title: Rx Strategist: Prescription Verification using LLM Agents System
Authors: Phuc Phan Van, Dat Nguyen Minh, An Dinh Ngoc, Huy Phan Thanh
Abstract: To protect patient safety, modern pharmaceutical complexity demands strict prescription verification. We offer a new approach - Rx Strategist - that makes use of knowledge graphs and different search strategies to enhance the power of Large Language Models (LLMs) inside an agentic framework. This multifaceted technique allows for a multi-stage LLM pipeline and reliable information retrieval from a custom-built active ingredient database. Different facets of prescription verification, such as indication, dose, and possible drug interactions, are covered in each stage of the pipeline. We alleviate the drawbacks of monolithic LLM techniques by spreading reasoning over these stages, improving correctness and reliability while reducing memory demands. Our findings demonstrate that Rx Strategist surpasses many current LLMs, achieving performance comparable to that of a highly experienced clinical pharmacist. In the complicated world of modern medications, this combination of LLMs with organized knowledge and sophisticated search methods presents a viable avenue for reducing prescription errors and enhancing patient outcomes.
Title: Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities
Copy Paste: [[2409.03444]] Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities(https://arxiv.org/abs/2409.03444)
Keywords: large language model
Abstract: The advancement of Large Language Models (LLMs) for domain applications in fields such as materials science and engineering depends on the development of fine-tuning strategies that adapt models for specialized, technical capabilities. In this work, we explore the effects of Continued Pretraining (CPT), Supervised Fine-Tuning (SFT), and various preference-based optimization approaches, including Direct Preference Optimization (DPO) and Odds Ratio Preference Optimization (ORPO), on fine-tuned LLM performance. Our analysis shows how these strategies influence model outcomes and reveals that the merging of multiple fine-tuned models can lead to the emergence of capabilities that surpass the individual contributions of the parent models. We find that model merging leads to new functionalities that neither parent model could achieve alone, leading to improved performance in domain-specific assessments. Experiments with different model architectures are presented, including Llama 3.1 8B and Mistral 7B models, where similar behaviors are observed. Exploring whether the results also hold for much smaller models, we use a tiny LLM with 1.7 billion parameters and show that very small LLMs do not necessarily feature emergent capabilities under model merging, suggesting that model scaling may be a key component. In open-ended yet consistent chat conversations between a human and AI models, our assessment reveals detailed insights into how different model variants perform and shows that the smallest model achieves a high intelligence score across key criteria including reasoning depth, creativity, clarity, and quantitative precision. Other experiments include the development of image generation prompts based on disparate biological material design concepts, to create new microstructures, architectural concepts, and urban design based on biological materials-inspired construction principles.
Title: Automatic occlusion removal from 3D maps for maritime situational awareness
Authors: Felix Sattler, Borja Carrillo Perez, Maurice Stephan, Sarah Barnes
Copy Paste: [[2409.03451]] Automatic occlusion removal from 3D maps for maritime situational awareness(https://arxiv.org/abs/2409.03451)
Keywords: robust, generative, segmentation
Abstract: We introduce a novel method for updating 3D geospatial models, specifically targeting occlusion removal in large-scale maritime environments. Traditional 3D reconstruction techniques often face problems with dynamic objects, like cars or vessels, that obscure the true environment, leading to inaccurate models or requiring extensive manual editing. Our approach leverages deep learning techniques, including instance segmentation and generative inpainting, to directly modify both the texture and geometry of 3D meshes without the need for costly reprocessing. By selectively targeting occluding objects and preserving static elements, the method enhances both geometric and visual accuracy. This approach not only preserves structural and textural details of map data but also maintains compatibility with current geospatial standards, ensuring robust performance across diverse datasets. The results demonstrate significant improvements in 3D model fidelity, making this method highly applicable for maritime situational awareness and the dynamic display of auxiliary information.
Title: How Much Data is Enough Data? Fine-Tuning Large Language Models for In-House Translation: Performance Evaluation Across Multiple Dataset Sizes
Authors: Inacio Vieira, Will Allred, Seamus Lankford, Sheila Castilho Monteiro De Sousa, Andy Way
Copy Paste: [[2409.03454]] How Much Data is Enough Data? Fine-Tuning Large Language Models for In-House Translation: Performance Evaluation Across Multiple Dataset Sizes(https://arxiv.org/abs/2409.03454)
Keywords: large language model
Abstract: Decoder-only LLMs have shown impressive performance in MT due to their ability to learn from extensive datasets and generate high-quality translations. However, LLMs often struggle with the nuances and style required for organisation-specific translation. In this study, we explore the effectiveness of fine-tuning Large Language Models (LLMs), particularly Llama 3 8B Instruct, leveraging translation memories (TMs), as a valuable resource to enhance accuracy and efficiency. We investigate the impact of fine-tuning the Llama 3 model using TMs from a specific organisation in the software sector. Our experiments cover five translation directions across languages of varying resource levels (English to Brazilian Portuguese, Czech, German, Finnish, and Korean). We analyse diverse sizes of training datasets (1k to 207k segments) to evaluate their influence on translation quality. We fine-tune separate models for each training set and evaluate their performance based on automatic metrics, BLEU, chrF++, TER, and COMET. Our findings reveal improvement in translation performance with larger datasets across all metrics. On average, BLEU and COMET scores increase by 13 and 25 points, respectively, on the largest training set against the baseline model. Notably, there is a performance deterioration in comparison with the baseline model when fine-tuning on only 1k and 2k examples; however, we observe a substantial improvement as the training dataset size increases. The study highlights the potential of integrating TMs with LLMs to create bespoke translation models tailored to the specific needs of businesses, thus enhancing translation quality and reducing turn-around times. This approach offers a valuable insight for organisations seeking to leverage TMs and LLMs for optimal translation outcomes, especially in narrower domains.
Title: Data-free Distillation with Degradation-prompt Diffusion for Multi-weather Image Restoration
Copy Paste: [[2409.03455]] Data-free Distillation with Degradation-prompt Diffusion for Multi-weather Image Restoration(https://arxiv.org/abs/2409.03455)
Keywords: diffusion, data-free
Abstract: Multi-weather image restoration has witnessed incredible progress, while the increasing model capacity and expensive data acquisition impair its applications in memory-limited devices. Data-free distillation provides an alternative by allowing a lightweight student model to be learned from a pre-trained teacher model without relying on the original training data. The existing data-free learning methods mainly optimize the models with pseudo data generated by GANs or real data collected from the Internet. However, they inevitably suffer from unstable training or domain shifts with respect to the original data. In this paper, we propose a novel Data-free Distillation with Degradation-prompt Diffusion framework for multi-weather Image Restoration (D4IR). It replaces GANs with pre-trained diffusion models to avoid model collapse and incorporates a degradation-aware prompt adapter to facilitate content-driven conditional diffusion for generating domain-related images. Specifically, a contrast-based degradation prompt adapter is first designed to capture degradation-aware prompts from web-collected degraded images. Then, the collected unpaired clean images are perturbed to latent features of stable diffusion, and conditioned with the degradation-aware prompts to synthesize new domain-related degraded images for knowledge distillation. Experiments illustrate that our proposal achieves comparable performance to the model distilled with original training data, and is even superior to other mainstream unsupervised methods.
Title: LM-Gaussian: Boost Sparse-view 3D Gaussian Splatting with Large Model Priors
Copy Paste: [[2409.03456]] LM-Gaussian: Boost Sparse-view 3D Gaussian Splatting with Large Model Priors(https://arxiv.org/abs/2409.03456)
Keywords: robust, diffusion
Abstract: We aim to address sparse-view reconstruction of a 3D scene by leveraging priors from large-scale vision models. While recent advancements such as 3D Gaussian Splatting (3DGS) have demonstrated remarkable successes in 3D reconstruction, these methods typically necessitate hundreds of input images that densely capture the underlying scene, making them time-consuming and impractical for real-world applications. However, sparse-view reconstruction is inherently ill-posed and under-constrained, often resulting in inferior and incomplete outcomes. This is due to issues such as failed initialization, overfitting on input images, and a lack of details. To mitigate these challenges, we introduce LM-Gaussian, a method capable of generating high-quality reconstructions from a limited number of images. Specifically, we propose a robust initialization module that leverages stereo priors to aid in the recovery of camera poses and the reliable point clouds. Additionally, a diffusion-based refinement is iteratively applied to incorporate image diffusion priors into the Gaussian optimization process to preserve intricate scene details. Finally, we utilize video diffusion priors to further enhance the rendered images for realistic visual effects. Overall, our approach significantly reduces the data acquisition requirements compared to previous 3DGS methods. We validate the effectiveness of our framework through experiments on various public datasets, demonstrating its potential for high-quality 360-degree scene reconstruction. Visual results are on our website.
Title: Non-Uniform Illumination Attack for Fooling Convolutional Neural Networks
Abstract: Convolutional Neural Networks (CNNs) have made remarkable strides; however, they remain susceptible to vulnerabilities, particularly in the face of minor image perturbations that leave images easily recognizable to humans. This weakness to so-called 'attacks' underscores the limited robustness of CNNs and the need for research into fortifying their resistance against such manipulations. This study introduces a novel Non-Uniform Illumination (NUI) attack technique, where images are subtly altered using varying NUI masks. Extensive experiments are conducted on widely-accepted datasets including CIFAR10, TinyImageNet, and CalTech256, focusing on image classification with 12 different NUI attack models. The resilience of VGG, ResNet, MobilenetV3-small and InceptionV3 models against NUI attacks is evaluated. Our results show a substantial decline in the CNN models' classification accuracy when subjected to NUI attacks, indicating their vulnerability under non-uniform illumination. To mitigate this, a defense strategy is proposed that adds NUI-attacked images, generated through the new NUI transformation, to the training set. The results demonstrate a significant enhancement in CNN model performance when confronted with perturbed images affected by NUI attacks. This strategy seeks to bolster CNN models' resilience against NUI attacks.
Title: LowFormer: Hardware Efficient Design for Convolutional Transformer Backbones
Authors: Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni
Abstract: Research in efficient vision backbones is evolving into models that are a mixture of convolutions and transformer blocks. A smart combination of both, architecture-wise and component-wise, is mandatory to excel in the speed-accuracy trade-off. Most publications focus on maximizing accuracy and utilize MACs (multiply accumulate operations) as an efficiency metric. MACs, however, often do not accurately measure how fast a model actually is, due to factors like memory access cost and degree of parallelism. We analyzed common modules and architectural design choices for backbones not in terms of MACs, but rather in actual throughput and latency, as the combination of the latter two is a better representation of the efficiency of models in real applications. We applied the conclusions taken from that analysis to create a recipe for increasing hardware-efficiency in macro design. Additionally, we introduce a simple slimmed-down version of MultiHead Self-Attention that aligns with our analysis. We combine both macro and micro design to create a new family of hardware-efficient backbone networks called LowFormer. LowFormer achieves a remarkable speedup in terms of throughput and latency, while achieving similar or better accuracy than current state-of-the-art efficient backbones. In order to prove the generalizability of our hardware-efficient design, we evaluate our method on GPU, mobile GPU and ARM CPU. We further show that the downstream tasks object detection and semantic segmentation profit from our hardware-efficient architecture. Code and models are available at this https URL altair199797/LowFormer.
Title: Characterizing Massive Activations of Attention Mechanism in Graph Neural Networks
Authors: Lorenzo Bini, Marco Sorbi, Stephane Marchand-Maillet
Copy Paste: [[2409.03463]] Characterizing Massive Activations of Attention Mechanism in Graph Neural Networks(https://arxiv.org/abs/2409.03463)
Keywords: robust, transformer
Abstract: Graph Neural Networks (GNNs) have become increasingly popular for effectively modeling data with graph structures. Recently, attention mechanisms have been integrated into GNNs to improve their ability to capture complex patterns. This paper presents the first comprehensive study revealing a critical, unexplored consequence of this integration: the emergence of Massive Activations (MAs) within attention layers. We introduce a novel method for detecting and analyzing MAs, focusing on edge features in different graph transformer architectures. Our study assesses various GNN models using benchmark datasets, including ZINC, TOX21, and PROTEINS. Key contributions include (1) establishing the direct link between attention mechanisms and MA generation in GNNs, (2) developing a robust definition and detection method for MAs based on activation ratio distributions, and (3) introducing the Explicit Bias Term (EBT) as a potential countermeasure and exploring it as an adversarial framework to assess model robustness based on the presence or absence of MAs. Our findings highlight the prevalence and impact of attention-induced MAs across different architectures, such as GraphTransformer, GraphiT, and SAN. The study reveals the complex interplay between attention mechanisms, model architecture, dataset characteristics, and MA emergence, providing crucial insights for developing more robust and reliable graph models.
Title: Tyche: Collateral-Free Coalition-Resistant Multiparty Lotteries with Arbitrary Payouts
Abstract: We propose Tyche, a family of protocols for performing practically (as well as asymptotically) efficient multiparty lotteries, resistant against aborts and majority coalitions. Our protocols are based on a commit-and-reveal approach, requiring only a collision-resistant hash function. All our protocols use a blockchain as a public bulletin board and for buy-in collection and payout settlement. Importantly though, they do not rely on it or any other third party for providing randomness. Also, participants are not required to post any collateral beyond their buy-in. Any honest participant can eventually settle the lottery, and dishonest behavior never reduces the winning probability of any honest participant. Further, we adapt all three protocols into anonymous lotteries, where (under certain conditions) the winner is unlinkable to any particular participant. We show that our protocols are secure, fair, and some preserve the participants' privacy. Finally, we evaluate the performance of our protocols, particularly in terms of transaction fees, by implementing them on the Sui blockchain. There we see that per user transaction fees are reasonably low and our protocols could potentially support millions of participants.
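The commit-and-reveal idea mentioned in the abstract above can be illustrated with a generic sketch. This is not the Tyche protocol: it does not handle aborts, coalitions, buy-ins, or anonymity, which are the problems the paper says it addresses. SHA-256 stands in for "a collision-resistant hash function", and the function names `commit`, `verify`, and `pick_winner`, along with the rule of hashing the concatenated secrets, are assumptions made for illustration.

```python
import hashlib
import secrets


def commit(secret):
    """Commitment = hash of the participant's secret; published before any reveal."""
    return hashlib.sha256(secret).digest()


def verify(commitment, secret):
    """Check that a revealed secret matches the earlier commitment."""
    return hashlib.sha256(secret).digest() == commitment


def pick_winner(revealed_secrets):
    """Derive a winner index from all revealed secrets (illustrative rule only)."""
    digest = hashlib.sha256(b"".join(revealed_secrets)).digest()
    return int.from_bytes(digest, "big") % len(revealed_secrets)


# Phase 1: each participant draws a 32-byte secret and posts only its hash.
participant_secrets = [secrets.token_bytes(32) for _ in range(3)]
commitments = [commit(s) for s in participant_secrets]

# Phase 2: secrets are revealed and checked against the posted commitments.
assert all(verify(c, s) for c, s in zip(commitments, participant_secrets))

# Phase 3: the outcome depends on every participant's secret.
winner = pick_winner(participant_secrets)
print("winner index:", winner)
```

In this naive form, the last participant to reveal can see the outcome first and refuse to reveal, which is one reason plain commit-and-reveal is not by itself abort-resistant.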
Title: Improving Uncertainty-Error Correspondence in Deep Bayesian Medical Image Segmentation
Authors: Prerak Mody, Nicolas F. Chaves-de-Plaza, Chinmay Rao, Eleftheria Astrenidou, Mischa de Ridder, Nienke Hoekstra, Klaus Hildebrandt, Marius Staring
Copy Paste: [[2409.03470]] Improving Uncertainty-Error Correspondence in Deep Bayesian Medical Image Segmentation(https://arxiv.org/abs/2409.03470)
Keywords: segmentation
Abstract: Increased usage of automated tools like deep learning in medical image segmentation has alleviated the bottleneck of manual contouring. This has shifted manual labour to quality assessment (QA) of automated contours, which involves detecting errors and correcting them. A potential solution to semi-automated QA is to use deep Bayesian uncertainty to recommend potentially erroneous regions, thus reducing time spent on error detection. Previous work has investigated the correspondence between uncertainty and error; however, no work has been done on improving the "utility" of Bayesian uncertainty maps such that uncertainty is only present in inaccurate regions and not in the accurate ones. Our work trains the FlipOut model with the Accuracy-vs-Uncertainty (AvU) loss, which promotes uncertainty to be present only in inaccurate regions. We apply this method on datasets of two radiotherapy body sites, namely head-and-neck CT and prostate MR scans. Uncertainty heatmaps (i.e. predictive entropy) are evaluated against voxel inaccuracies using Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves. Numerical results show that when compared to the Bayesian baseline the proposed method successfully suppresses uncertainty for accurate voxels, with similar presence of uncertainty for inaccurate voxels. Code to reproduce experiments is available at this https URL
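The abstract above names predictive entropy as the uncertainty measure. As a minimal sketch of that quantity for a single voxel, assuming class probabilities are obtained by averaging several stochastic forward passes (a common practice for Bayesian networks, not confirmed here as the authors' exact procedure), the computation looks like this. The names `mean_probs` and `predictive_entropy` and the sample values are hypothetical.

```python
import math


def mean_probs(samples):
    """Average class probabilities over several stochastic forward passes."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution; 0 * log(0) is treated as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical softmax outputs for one voxel from three sampled forward passes.
voxel_samples = [
    [0.9, 0.1],
    [0.6, 0.4],
    [0.3, 0.7],
]
averaged = mean_probs(voxel_samples)
print(averaged, predictive_entropy(averaged))
```

Entropy is zero when the averaged prediction is fully confident and reaches its maximum, log of the number of classes, when the prediction is uniform.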
Title: ScreenMark: Watermarking Arbitrary Visual Content on Screen
Copy Paste: [[2409.03487]] ScreenMark: Watermarking Arbitrary Visual Content on Screen(https://arxiv.org/abs/2409.03487)
Keywords: protect, robust, watermark, diffusion
Abstract: Digital watermarking has demonstrated its effectiveness in protecting multimedia content. However, existing watermarking methods are predominantly tailored for specific media types, rendering them less effective for the protection of content displayed on computer screens, which is often multimodal and dynamic. Visual Screen Content (VSC) is particularly susceptible to theft and leakage via screenshots, a vulnerability that current watermarking methods fail to adequately address. To tackle these challenges, we propose ScreenMark, a robust and practical watermarking method designed specifically for arbitrary VSC protection. ScreenMark utilizes a three-stage progressive watermarking framework. Initially, inspired by diffusion principles, we initialize the mutual transformation between regular watermark information and irregular watermark patterns. Subsequently, these patterns are integrated with screen content using a pre-multiplication alpha blending technique, supported by a pre-trained screen decoder for accurate watermark retrieval. The progressively complex distorter enhances the robustness of the watermark in real-world screenshot scenarios. Finally, the model undergoes fine-tuning guided by a joint-level distorter to ensure optimal performance. To validate the effectiveness of ScreenMark, we compiled a dataset comprising 100,000 screenshots from various devices and resolutions. Extensive experiments across different datasets confirm the method's superior robustness, imperceptibility, and practical applicability.
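The alpha blending step mentioned in the abstract above can be illustrated with the standard "over" compositing operator on pre-multiplied colors. This is a generic per-pixel sketch, not ScreenMark's implementation; the function names, the float RGB representation in [0, 1], and the example values are assumptions made for illustration.

```python
def premultiply(rgb, alpha):
    """Scale each color channel by alpha, giving a pre-multiplied color."""
    return tuple(channel * alpha for channel in rgb)


def blend_over(premultiplied_rgb, alpha, background_rgb):
    """Composite a pre-multiplied foreground pixel over an opaque background pixel."""
    return tuple(
        fg + (1.0 - alpha) * bg
        for fg, bg in zip(premultiplied_rgb, background_rgb)
    )


# Hypothetical values: a faint watermark pixel laid over a screen pixel.
watermark_rgb = (1.0, 0.0, 0.0)
screen_rgb = (0.2, 0.4, 0.6)
alpha = 0.05  # low opacity keeps the overlay hard to notice

blended = blend_over(premultiply(watermark_rgb, alpha), alpha, screen_rgb)
print(blended)
```

With pre-multiplied colors the foreground term needs no further scaling at blend time, which is why the formula is a single add of the foreground and the attenuated background.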
Title: Towards Data-Centric Face Anti-Spoofing: Improving Cross-domain Generalization via Physics-based Data Synthesis
Authors: Rizhao Cai, Cecelia Soh, Zitong Yu, Haoliang Li, Wenhan Yang, Alex Kot
Copy Paste: [[2409.03501]] Towards Data-Centric Face Anti-Spoofing: Improving Cross-domain Generalization via Physics-based Data Synthesis(https://arxiv.org/abs/2409.03501)
Keywords: attack, transformer
Abstract: Face Anti-Spoofing (FAS) research is challenged by the cross-domain problem, where there is a domain gap between the training and testing data. While recent FAS works are mainly model-centric, focusing on developing domain generalization algorithms for improving cross-domain performance, data-centric research for face anti-spoofing, improving generalization from data quality and quantity, is largely ignored. Therefore, our work starts with data-centric FAS by conducting a comprehensive investigation from the data perspective for improving cross-domain generalization of FAS models. More specifically, at first, based on physical procedures of capturing and recapturing, we propose task-specific FAS data augmentation (FAS-Aug), which increases data diversity by synthesizing data of artifacts, such as printing noise, color distortion, moiré pattern, etc. Our experiments show that using our FAS augmentation can surpass traditional image augmentation in training FAS models to achieve better cross-domain performance. Nevertheless, we observe that models may rely on the augmented artifacts, which are not environment-invariant, and using FAS-Aug may have a negative effect. As such, we propose Spoofing Attack Risk Equalization (SARE) to prevent models from relying on certain types of artifacts and improve the generalization performance. Last but not least, our proposed FAS-Aug and SARE with recent Vision Transformer backbones can achieve state-of-the-art performance on the FAS cross-domain generalization protocols. The implementation is available at this https URL.
Title: Blended Latent Diffusion under Attention Control for Real-World Video Editing
Copy Paste: [[2409.03514]] Blended Latent Diffusion under Attention Control for Real-World Video Editing(https://arxiv.org/abs/2409.03514)
Keywords: diffusion
Abstract: Due to the lack of fully publicly available text-to-video models, current video editing methods tend to build on pre-trained text-to-image generation models; however, they still face grand challenges in dealing with the local editing of video with temporal information. First, although existing methods attempt to focus on local area editing by a pre-defined mask, the preservation of the outside-area background is non-ideal due to the spatially entire generation of each frame. In addition, requiring the user to provide a mask is an additional costly undertaking, so an autonomous masking strategy integrated into the editing process is desirable. Last but not least, an image-level pretrained model has not learned temporal information across frames of a video, which is vital for expressing the motion and dynamics. In this paper, we propose to adapt an image-level blended latent diffusion model to perform local video editing tasks. Specifically, we leverage DDIM inversion to acquire the latents as background latents instead of the randomly noised ones to better preserve the background information of the input video. We further introduce an autonomous mask manufacture mechanism derived from cross-attention maps in diffusion steps. Finally, we enhance the temporal consistency across video frames by transforming the self-attention blocks of U-Net into temporal-spatial blocks. Through extensive experiments, our proposed approach demonstrates effectiveness in different real-world video editing tasks.
Title: LMLT: Low-to-high Multi-Level Vision Transformer for Image Super-Resolution
Abstract: Recent Vision Transformer (ViT)-based methods for Image Super-Resolution have demonstrated impressive performance. However, they suffer from significant complexity, resulting in high inference times and memory usage. Additionally, ViT models using Window Self-Attention (WSA) face challenges in processing regions outside their windows. To address these issues, we propose the Low-to-high Multi-Level Transformer (LMLT), which employs attention with varying feature sizes for each head. LMLT divides image features along the channel dimension, gradually reduces spatial size for lower heads, and applies self-attention to each head. This approach effectively captures both local and global information. By integrating the results from lower heads into higher heads, LMLT overcomes the window boundary issues in self-attention. Extensive experiments show that our model significantly reduces inference time and GPU memory usage while maintaining or even surpassing the performance of state-of-the-art ViT-based Image Super-Resolution methods. Our code is available at this https URL.
Title: FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation
Authors: Xi Chen, Haosen Yang, Sheng Jin, Xiatian Zhu, Hongxun Yao
Copy Paste: [[2409.03525]] FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation(https://arxiv.org/abs/2409.03525)
Keywords: transformer, segmentation
Abstract: Open-vocabulary segmentation poses significant challenges, as it requires segmenting and recognizing objects across an open set of categories in unconstrained environments. Building on the success of powerful vision-language (ViL) foundation models, such as CLIP, recent efforts sought to harness their zero-shot capabilities to recognize unseen categories. Despite notable performance improvements, these models still encounter the critical issue of generating precise mask proposals for unseen categories and scenarios, ultimately resulting in inferior segmentation performance. To address this challenge, we introduce a novel approach, FrozenSeg, designed to integrate spatial knowledge from a localization foundation model (e.g., SAM) and semantic knowledge extracted from a ViL model (e.g., CLIP), in a synergistic framework. Taking the ViL model's visual encoder as the feature backbone, we inject the space-aware feature into the learnable queries and CLIP features within the transformer decoder. In addition, we devise a mask proposal ensemble strategy for further improving the recall rate and mask quality. To fully exploit pre-trained knowledge while minimizing training overhead, we freeze both foundation models, focusing optimization efforts solely on a lightweight transformer decoder for mask proposal generation, which is the performance bottleneck. Extensive experiments demonstrate that FrozenSeg advances state-of-the-art results across various segmentation benchmarks, trained exclusively on COCO panoptic data, and tested in a zero-shot manner. Code is available at this https URL.
Title: Use of triplet loss for facial restoration in low-resolution images
Copy Paste: [[2409.03530]] Use of triplet loss for facial restoration in low-resolution images(https://arxiv.org/abs/2409.03530)
Keywords: biometric
Abstract: In recent years, facial recognition (FR) models have become the most widely used biometric tool, achieving impressive results on numerous datasets. However, inherent hardware challenges or shooting distances often result in low-resolution images, which significantly impact the performance of FR models. To address this issue, several solutions have been proposed, including super-resolution (SR) models that generate highly realistic faces. Despite these efforts, significant improvements in FR algorithms have not been achieved. We propose a novel SR model FTLGAN, which focuses on generating high-resolution images that preserve individual identities rather than merely improving image quality, thereby maximizing the performance of FR models. The results are compelling, demonstrating a mean value of d' 21% above the best current state-of-the-art models, specifically having a value of d' = 1.099 and AUC = 0.78 for 14x14 pixels, d' = 2.112 and AUC = 0.92 for 28x28 pixels, and d' = 3.049 and AUC = 0.98 for 56x56 pixels. The contributions of this study are significant in several key areas. Firstly, a notable improvement in facial recognition performance has been achieved in low-resolution images, specifically at resolutions of 14x14, 28x28, and 56x56 pixels. Secondly, the enhancements demonstrated by FTLGAN show a consistent response across all resolutions, delivering outstanding performance uniformly, unlike other comparative models. Thirdly, an innovative approach has been implemented using triplet loss logic, enabling the training of the super-resolution model solely with real images, contrasting with current models, and expanding potential real-world applications. Lastly, this study introduces a novel model that specifically addresses the challenge of improving classification performance in facial recognition systems by integrating facial recognition quality as a loss during model training.
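The triplet loss referred to in the abstract above has a standard form that can be sketched as follows. This is a generic version with Euclidean distance on toy 2-D embeddings and a hypothetical margin of 0.2 (some formulations use squared distances instead); how FTLGAN wires the loss into super-resolution training is not specified in the abstract.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin`.
    """
    return max(
        0.0,
        euclidean(anchor, positive) - euclidean(anchor, negative) + margin,
    )


# Toy embeddings: same identity close to the anchor, other identity far away.
anchor, positive = (0.0, 0.0), (0.1, 0.0)
easy_negative, hard_negative = (1.0, 0.0), (0.2, 0.0)

print(triplet_loss(anchor, positive, easy_negative))
print(triplet_loss(anchor, positive, hard_negative))
```

The easy negative already satisfies the margin and contributes no loss, while the hard negative yields a positive loss that pushes embeddings of different identities apart.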
Title: Risk-based Calibration for Probabilistic Classifiers
Authors: Aritz Pérez, Carlos Echegoyen, Guzmán Santafé
Copy Paste: [[2409.03542]] Risk-based Calibration for Probabilistic Classifiers(https://arxiv.org/abs/2409.03542)
Keywords: generative
Abstract: We introduce a general iterative procedure called risk-based calibration (RC) designed to minimize the empirical risk under the 0-1 loss (empirical error) for probabilistic classifiers. These classifiers are based on modeling probability distributions, including those constructed from the joint distribution (generative) and those based on the class conditional distribution (conditional). RC can be particularized to any probabilistic classifier provided a specific learning algorithm that computes the classifier's parameters in closed form using data statistics. RC reinforces the statistics aligned with the true class while penalizing those associated with other classes, guided by the 0-1 loss. The proposed method has been empirically tested on 30 datasets using naïve Bayes, quadratic discriminant analysis, and logistic regression classifiers. RC improves the empirical error of the original closed-form learning algorithms and, more notably, consistently outperforms the gradient descent approach with the three classifiers.
Title: Prediction Accuracy & Reliability: Classification and Object Localization under Distribution Shift
Copy Paste: [[2409.03543]] Prediction Accuracy & Reliability: Classification and Object Localization under Distribution Shift(https://arxiv.org/abs/2409.03543)
Keywords: robust
Abstract: Natural distribution shift causes a deterioration in the perception performance of convolutional neural networks (CNNs). This comprehensive analysis for real-world traffic data addresses: 1) investigating the effect of natural distribution shift and weather augmentations on both detection quality and confidence estimation, 2) evaluating model performance for both classification and object localization, and 3) benchmarking two common uncertainty quantification methods - Ensembles and different variants of Monte-Carlo (MC) Dropout - under natural and close-to-natural distribution shift. For this purpose, a novel dataset has been curated from publicly available autonomous driving datasets. The in-distribution (ID) data is based on cutouts of a single object, for which both class and bounding box annotations are available. The six distribution-shift datasets cover adverse weather scenarios, simulated rain and fog, corner cases, and out-of-distribution data. A granular analysis of CNNs under distribution shift allows quantifying the impact of different types of shifts on both task performance and confidence estimation: ConvNeXt-Tiny is more robust than EfficientNet-B0; heavy rain degrades classification more strongly than localization, contrary to heavy fog; integrating MC-Dropout into selected layers only has the potential to enhance task performance and confidence estimation, whereby the identification of these layers depends on the type of distribution shift and the considered task.
Title: CTMBIDS: Convolutional Tsetlin Machine Based Intrusion Detection System for DDoS attacks in an SDN environment
Authors: Rasoul Jafari Gohari, Laya Aliahmadipour, Marjan Kuchaki Rafsanjani
Copy Paste: [[2409.03544]] CTMBIDS: Convolutional Tsetlin Machine Based Intrusion Detection System for DDoS attacks in an SDN environment(https://arxiv.org/abs/2409.03544)
Keywords: security, attack, interpretability
Abstract: Software Defined Networks (SDN) face many security challenges today. A great deal of research has been done within the field of Intrusion Detection Systems (IDS) in these networks. Yet, numerous approaches still rely on deep learning algorithms. These algorithms suffer from complexity in implementation, high processing power and high memory consumption. In addition to security issues, firstly, the number of datasets that are based on SDN protocols is very small. Secondly, the ones that are available encompass numerous attacks in the network and do not focus on a single attack. For this reason, to introduce an SDN-based IDS with a focus on Distributed Denial of Service (DDoS) attacks, it is necessary to generate a DDoS-oriented dataset whose features can train a high-quality IDS. In this work, in order to address two important challenges in SDNs, initially, we generate three DDoS attack datasets based on three common and different network topologies. In the second step, using the Convolutional Tsetlin Machine (CTM), we introduce a lightweight IDS for DDoS attacks dubbed CTMBIDS. The lightweight nature of the CTMBIDS stems from its low memory consumption and also its interpretability compared to the existing complex deep learning models. The low usage of system resources for the CTMBIDS makes it an ideal choice for software that consumes the least amount of the SDN controller's memory. Also, in order to ascertain the quality of the generated datasets, we compare the CTMBIDS empirical results with the DDoS attacks of the KDDCup99 benchmark dataset as well. Since the main focus of this work is on a lightweight IDS, the results show the CTMBIDS performs much more efficiently than deep learning based approaches. Furthermore, the results also show that, in most datasets, the proposed method has relatively equal or better accuracy and also consumes much less memory than the existing methods.
Title: DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture
Authors: Qianlong Xiang, Miao Zhang, Yuzhang Shang, Jianlong Wu, Yan Yan, Liqiang Nie
Copy Paste: [[2409.03550]] DKDM: Data-Free Knowledge Distillation for Diffusion Models with Any Architecture(https://arxiv.org/abs/2409.03550)
Keywords: diffusion, data-free, generative
Abstract: Diffusion models (DMs) have demonstrated exceptional generative capabilities across various areas, while they are hindered by slow inference speeds and high computational demands during deployment. The most common way to accelerate DMs involves reducing the number of denoising steps during generation, achieved through faster sampling solvers or knowledge distillation (KD). In contrast to prior approaches, we propose a novel method that transfers the capability of large pretrained DMs to faster architectures. Specifically, we employ KD in a distinct manner to compress DMs by distilling their generative ability into more rapid variants. Furthermore, considering that the source data is either inaccessible or too enormous to store for current generative models, we introduce a new paradigm for their distillation without source data, termed Data-Free Knowledge Distillation for Diffusion Models (DKDM). Generally, our established DKDM framework comprises two main components: 1) a DKDM objective that uses synthetic denoising data produced by pretrained DMs to optimize faster DMs without source data, and 2) a dynamic iterative distillation method that flexibly organizes the synthesis of denoising data, preventing it from slowing down the optimization process as the generation is slow. To our knowledge, this is the first attempt at using KD to distill DMs into any architecture in a data-free manner. Importantly, our DKDM is orthogonal to most existing acceleration methods, such as denoising step reduction, quantization and pruning. Experiments show that our DKDM is capable of deriving 2x faster DMs with performance remaining on par with the baseline. Notably, our DKDM enables pretrained DMs to function as "datasets" for training new DMs.
Title: Organized Grouped Discrete Representation for Object-Centric Learning
Authors: Rongzhen Zhao, Vivienne Wang, Juho Kannala, Joni Pajarinen
Copy Paste: [[2409.03553]] Organized Grouped Discrete Representation for Object-Centric Learning(https://arxiv.org/abs/2409.03553)
Keywords: diffusion, transformer, segmentation
Abstract: Object-Centric Learning (OCL) represents dense image or video pixels as sparse object features. Representative methods utilize discrete representation composed of Variational Autoencoder (VAE) template features to suppress pixel-level information redundancy and guide object-level feature aggregation. The most recent advancement, Grouped Discrete Representation (GDR), further decomposes these template features into attributes. However, its naive channel grouping as decomposition may erroneously group channels belonging to different attributes together and discretize them as sub-optimal template attributes, which loses information and harms expressivity. We propose Organized GDR (OGDR) to organize channels belonging to the same attributes together for correct decomposition from features into attributes. In unsupervised segmentation experiments, OGDR is fully superior to GDR in augmenting classical transformer-based OCL methods; it even improves state-of-the-art diffusion-based ones. Codebook PCA and representation similarity analyses show that compared with GDR, our OGDR eliminates redundancy and preserves information better for guiding object representation learning. The source code is available in the supplementary material.
Title: Enabling Practical and Privacy-Preserving Image Processing
Abstract: Fully Homomorphic Encryption (FHE) enables computations on encrypted data, preserving confidentiality without the need for decryption. However, FHE is often hindered by significant performance overhead, particularly for high-precision and complex data like images. Due to serious efficiency issues, traditional FHE methods often encrypt images by monolithic data blocks (such as pixel rows), instead of pixels. However, this strategy compromises the advantages of homomorphic operations and disables pixel-level image processing. In this study, we address these challenges by proposing and implementing a pixel-level homomorphic encryption approach, iCHEETAH, based on the CKKS scheme. To enhance computational efficiency, we introduce three novel caching mechanisms to pre-encrypt radix values or frequently occurring pixel values, substantially reducing redundant encryption operations. Extensive experiments demonstrate that our approach achieves up to a 19-fold improvement in encryption speed compared to the original CKKS, while maintaining high image quality. Additionally, real-world image applications such as mean filtering, brightness enhancement, image matching and watermarking are tested based on FHE, showcasing up to a 91.53% speed improvement. We also proved that our method is IND-CPA (Indistinguishability under Chosen Plaintext Attack) secure, providing strong encryption security. These results underscore the practicality and efficiency of iCHEETAH, marking a significant advancement in privacy-preserving image processing at scale.
Title: Costs Estimation in Unit Commitment Problems using Simulation-Based Inference
Authors: Matthias Pirlet, Adrien Bolland, Gilles Louppe, Damien Ernst
Copy Paste: [[2409.03588]] Costs Estimation in Unit Commitment Problems using Simulation-Based Inference(https://arxiv.org/abs/2409.03588)
Keywords: robust
Abstract: The Unit Commitment (UC) problem is a key optimization task in power systems to forecast the generation schedules of power units over a finite time period by minimizing costs while meeting demand and technical constraints. However, many parameters required by the UC problem are unknown, such as the costs. In this work, we estimate these unknown costs using simulation-based inference on an illustrative UC problem, which provides an approximated posterior distribution of the parameters given observed generation schedules and demands. Our results highlight that the learned posterior distribution effectively captures the underlying distribution of the data, providing a range of possible values for the unknown parameters given a past observation. This posterior allows for the estimation of past costs using observed past generation schedules, enabling operators to better forecast future costs and make more robust generation scheduling forecasts. We present avenues for future research to address overconfidence in posterior estimation, enhance the scalability of the methodology and apply it to more complex UC problems modeling the network constraints and renewable energy sources.
Title: A practical approach to evaluating the adversarial distance for machine learning classifiers
Authors: Georg Siedel, Ekagra Gupta, Andrey Morozov
Copy Paste: [[2409.03598]] A practical approach to evaluating the adversarial distance for machine learning classifiers(https://arxiv.org/abs/2409.03598)
Keywords: protect, attack, robust
Abstract: Robustness is critical for machine learning (ML) classifiers to ensure consistent performance in real-world applications where models may encounter corrupted or adversarial inputs. In particular, assessing the robustness of classifiers to adversarial inputs is essential to protect systems from vulnerabilities and thus ensure safety in use. However, methods to accurately compute adversarial robustness have been challenging for complex ML models and high-dimensional data. Furthermore, evaluations typically measure adversarial accuracy on specific attack budgets, limiting the informative value of the resulting metrics. This paper investigates the estimation of the more informative adversarial distance using iterative adversarial attacks and a certification approach. Combined, the methods provide a comprehensive evaluation of adversarial robustness by computing estimates for the upper and lower bounds of the adversarial distance. We present visualisations and ablation studies that provide insights into how this evaluation method should be applied and parameterised. We find that our adversarial attack approach is effective compared to related implementations, while the certification method falls short of expectations. The approach in this paper should encourage a more informative way of evaluating the adversarial robustness of ML classifiers.
Title: TCDiff: Triple Condition Diffusion Model with 3D Constraints for Stylizing Synthetic Faces
Authors: Bernardo Biesseck, Pedro Vidal, Luiz Coelho, Roger Granada, David Menotti
Copy Paste: [[2409.03600]] TCDiff: Triple Condition Diffusion Model with 3D Constraints for Stylizing Synthetic Faces(https://arxiv.org/abs/2409.03600)
Keywords: privacy, robust, diffusion
Abstract: A robust face recognition model must be trained using datasets that include a large number of subjects and numerous samples per subject under varying conditions (such as pose, expression, age, noise, and occlusion). Due to ethical and privacy concerns, large-scale real face datasets have been discontinued, such as MS1MV3, and synthetic face generators have been proposed, utilizing GANs and Diffusion Models, such as SYNFace, SFace, DigiFace-1M, IDiff-Face, DCFace, and GANDiffFace, aiming to supply this demand. Some of these methods can produce high-fidelity realistic faces, but with low intra-class variance, while others generate high-variance faces with low identity consistency. In this paper, we propose a Triple Condition Diffusion Model (TCDiff) to improve face style transfer from real to synthetic faces through 2D and 3D facial constraints, enhancing face identity consistency while keeping the necessary high intra-class variance. Face recognition experiments using 1k, 2k, and 5k classes of our new dataset for training outperform state-of-the-art synthetic datasets in real face benchmarks such as LFW, CFP-FP, AgeDB, and BUPT. Our source code is available at: this https URL.
Title: SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing
Copy Paste: [[2409.03605]] SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing(https://arxiv.org/abs/2409.03605)
Keywords: generative, segmentation
Abstract: Audio-driven talking face generation aims to synthesize video with lip movements synchronized to input audio. However, current generative techniques face challenges in preserving intricate regional textures (skin, teeth). To address the aforementioned challenges, we propose a novel framework called SegTalker to decouple lip movements and image textures by introducing segmentation as an intermediate representation. Specifically, given the mask of the image obtained from a parsing network, we first leverage the speech to drive the mask and generate talking segmentation. Then we disentangle semantic regions of the image into style codes using a mask-guided encoder. Ultimately, we inject the previously generated talking segmentation and style codes into a mask-guided StyleGAN to synthesize the video frame. In this way, most textures are fully preserved. Moreover, our approach can inherently achieve background separation and facilitate mask-guided facial local editing. In particular, by editing the mask and swapping the region textures from a given reference image (e.g. hair, lip, eyebrows), our approach enables facial editing seamlessly when generating talking face video. Experiments demonstrate that our proposed approach can effectively preserve texture details and generate temporally consistent video while remaining competitive in lip synchronization. Quantitative and qualitative results on the HDTF and MEAD datasets illustrate the superior performance of our method over existing methods.
Title: VFLGAN-TS: Vertical Federated Learning-based Generative Adversarial Networks for Publication of Vertically Partitioned Time-Series Data
Copy Paste: [[2409.03612]] VFLGAN-TS: Vertical Federated Learning-based Generative Adversarial Networks for Publication of Vertically Partitioned Time-Series Data(https://arxiv.org/abs/2409.03612)
Keywords: privacy, protect, federate, generative
Abstract: In the current artificial intelligence (AI) era, the scale and quality of the dataset play a crucial role in training a high-quality AI model. However, often original data cannot be shared due to privacy concerns and regulations. A potential solution is to release a synthetic dataset with a similar distribution to the private dataset. Nevertheless, in some scenarios, the attributes required to train an AI model are distributed among different parties, and the parties cannot share the local data for synthetic data construction due to privacy regulations. In PETS 2024, we recently introduced the first Vertical Federated Learning-based Generative Adversarial Network (VFLGAN) for publishing vertically partitioned static data. However, VFLGAN cannot effectively handle time-series data, which presents both temporal and attribute dimensions. In this article, we propose VFLGAN-TS, which combines the ideas of attribute discriminator and vertical federated learning to generate synthetic time-series data in the vertically partitioned scenario. The performance of VFLGAN-TS is close to that of its counterpart, which is trained in a centralized manner and represents the upper limit for VFLGAN-TS. To further protect privacy, we apply a Gaussian mechanism to make VFLGAN-TS satisfy $(\epsilon,\delta)$-differential privacy. Besides, we develop an enhanced privacy auditing scheme to evaluate the potential privacy breach through the framework of VFLGAN-TS and synthetic datasets.
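For reference, the classical Gaussian mechanism behind the $(\epsilon,\delta)$-differential privacy claim above can be sketched as follows. It uses the textbook calibration sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon, which holds for 0 < epsilon < 1; the function names, the L2 sensitivity of 1.0, and the budget values are assumptions for illustration, and the abstract does not state where VFLGAN-TS injects the noise or how it calibrates it.

```python
import math
import random


def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale of the classical Gaussian mechanism (needs 0 < epsilon < 1)."""
    if not 0.0 < epsilon < 1.0 or not 0.0 < delta < 1.0:
        raise ValueError("classical bound needs 0 < epsilon < 1 and 0 < delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def add_gaussian_noise(values, sensitivity, epsilon, delta, rng=None):
    """Add i.i.d. zero-mean Gaussian noise to each coordinate of `values`."""
    rng = rng or random.Random()
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    return [v + rng.gauss(0.0, sigma) for v in values]


# Hypothetical privacy budget and L2 sensitivity.
sigma = gaussian_sigma(sensitivity=1.0, epsilon=0.5, delta=1e-5)
noisy = add_gaussian_noise(
    [0.0, 1.0, 2.0], 1.0, 0.5, 1e-5, rng=random.Random(0)
)
print(round(sigma, 3), len(noisy))
```

A smaller epsilon or delta increases sigma, which is the usual trade-off between privacy and the utility of the released values.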
Title: Attend First, Consolidate Later: On the Importance of Attention in Different LLM Layers
Copy Paste: [[2409.03621]] Attend First, Consolidate Later: On the Importance of Attention in Different LLM Layers(https://arxiv.org/abs/2409.03621)
Keywords: transformer
Abstract: In decoder-based LLMs, the representation of a given layer serves two purposes: as input to the next layer during the computation of the current token; and as input to the attention mechanism of future tokens. In this work, we show that the importance of the latter role might be overestimated. To show that, we start by manipulating the representations of previous tokens; e.g. by replacing the hidden states at some layer k with random vectors. Our experiments with four LLMs and four tasks show that this operation often leads to a small to negligible drop in performance. Importantly, this happens if the manipulation occurs in the top part of the model, i.e., when k is in the final 30-50% of the layers. In contrast, doing the same manipulation in earlier layers might lead to chance-level performance. We continue by switching the hidden state of certain tokens with hidden states of other tokens from another prompt; e.g., replacing the word "Italy" with "France" in "What is the capital of Italy?". We find that when applying this switch in the top 1/3 of the model, the model ignores it (answering "Rome"). However, if we apply it before, the model conforms to the switch ("Paris"). Our results hint at a two-stage process in transformer-based LLMs: the first part gathers input from previous tokens, while the second mainly processes that information internally.
Title: On the Compliance of Self-Sovereign Identity with GDPR Principles: A Critical Review
Copy Paste: [[2409.03624]] On the Compliance of Self-Sovereign Identity with GDPR Principles: A Critical Review(https://arxiv.org/abs/2409.03624)
Keywords: privacy, protect, federate
Abstract: Identity Management Systems (IdMs) have complemented how users are identified, authenticated, and authorised on e-services. Among the methods used for this purpose are traditional IdMs (isolated, centralised and federated) that mostly rely on identity providers (IdPs) to broker trust between a user and service-providers (SPs). An IdP also identifies and authenticates a user on behalf of the SP, who then determines the authorisation of the user. In these processes, both SP and IdP collect, process or store users' private data, which can be prone to breach. One approach to address the data breach is to relieve the IdP, and return control and storage of personal data to the owner. Self-sovereign identity (SSI) was introduced as an IdM model to reduce the possibility of data breaches by offering control of personal data to the owner. SSI is a decentralised IdM, where the data owner has sovereign control of personal data stored in their digital wallet. Since SSI is an emerging technology, its components and methods require careful evaluation. This paper traces the evolution of IdMs and reviews the state-of-the-art SSI frameworks. We explored articles in the literature that reviewed blockchain solutions for General Data Protection Regulation (GDPR). We systematically searched recent SSI and blockchain proposals, evaluated the compliance of the retrieved documents with the GDPR privacy principles, and discussed their potentials, constraints, and limitations. This work identifies potential research gaps and opportunities.
Title: Beyond Model Interpretability: Socio-Structural Explanations in Machine Learning
Copy Paste: [[2409.03632]] Beyond Model Interpretability: Socio-Structural Explanations in Machine Learning(https://arxiv.org/abs/2409.03632)
Keywords: interpretability
Abstract: What is it to interpret the outputs of an opaque machine learning model? One approach is to develop interpretable machine learning techniques. These techniques aim to show how machine learning models function by providing either model-centric local or global explanations, which can be based on mechanistic interpretations revealing the inner working mechanisms of models or nonmechanistic approximations showing relationships between input features and output data. In this paper, we draw on social philosophy to argue that interpreting machine learning outputs in certain normatively salient domains could require appealing to a third type of explanation that we call sociostructural explanation. The relevance of this explanation type is motivated by the fact that machine learning models are not isolated entities but are embedded within and shaped by social structures. Sociostructural explanations aim to illustrate how social structures contribute to and partially explain the outputs of machine learning models. We demonstrate the importance of sociostructural explanations by examining a racially biased healthcare allocation algorithm. Our proposal highlights the need for transparency beyond model interpretability: understanding the outputs of machine learning systems could require a broader analysis that extends beyond the understanding of the machine learning model itself.
Title: CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation
Authors: Bin Wang, Fan Wu, Linke Ouyang, Zhuangcheng Gu, Rui Zhang, Renqiu Xia, Bo Zhang, Conghui He
Copy Paste: [[2409.03643]] CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation(https://arxiv.org/abs/2409.03643)
Keywords: extraction, fair
Abstract: Formula recognition presents significant challenges due to the complicated structure and varied notation of mathematical expressions. Despite continuous advancements in formula recognition models, the evaluation metrics employed by these models, such as BLEU and Edit Distance, still exhibit notable limitations. They overlook the fact that the same formula has diverse representations and are highly sensitive to the distribution of training data, thereby causing unfairness in formula recognition evaluation. To this end, we propose a Character Detection Matching (CDM) metric, ensuring evaluation objectivity by designing an image-level rather than LaTeX-level metric score. Specifically, CDM renders both the model-predicted LaTeX and the ground-truth LaTeX formulas into image-formatted formulas, then employs visual feature extraction and localization techniques for precise character-level matching, incorporating spatial position information. Such a spatially-aware and character-matching method offers a more accurate and equitable evaluation compared with previous BLEU and Edit Distance metrics that rely solely on text-based character matching. Experimentally, we evaluated various formula recognition models using CDM, BLEU, and ExpRate metrics. The results demonstrate that CDM aligns more closely with human evaluation standards and provides a fairer comparison across different models by eliminating discrepancies caused by diverse formula representations.
Title: RealisHuman: A Two-Stage Approach for Refining Malformed Human Parts in Generated Images
Authors: Benzhi Wang, Jingkai Zhou, Jingqi Bai, Yang Yang, Weihua Chen, Fan Wang, Zhen Lei
Copy Paste: [[2409.03644]] RealisHuman: A Two-Stage Approach for Refining Malformed Human Parts in Generated Images(https://arxiv.org/abs/2409.03644)
Keywords: diffusion, generative
Abstract: In recent years, diffusion models have revolutionized visual generation, outperforming traditional frameworks like Generative Adversarial Networks (GANs). However, generating images of humans with realistic semantic parts, such as hands and faces, remains a significant challenge due to their intricate structural complexity. To address this issue, we propose a novel post-processing solution named RealisHuman. The RealisHuman framework operates in two stages. First, it generates realistic human parts, such as hands or faces, using the original malformed parts as references, ensuring consistent details with the original image. Second, it seamlessly integrates the rectified human parts back into their corresponding positions by repainting the surrounding areas to ensure smooth and realistic blending. The RealisHuman framework significantly enhances the realism of human generation, as demonstrated by notable improvements in both qualitative and quantitative metrics. Code is available at this https URL.
Title: Limited but consistent gains in adversarial robustness by co-training object recognition models with human EEG
Authors: Manshan Guo, Bhavin Choksi, Sari Sadiya, Alessandro T. Gifford, Martina G. Vilas, Radoslaw M. Cichy, Gemma Roig
Copy Paste: [[2409.03646]] Limited but consistent gains in adversarial robustness by co-training object recognition models with human EEG(https://arxiv.org/abs/2409.03646)
Keywords: attack, robust
Abstract: In contrast to human vision, artificial neural networks (ANNs) remain relatively susceptible to adversarial attacks. To address this vulnerability, efforts have been made to transfer inductive bias from human brains to ANNs, often by training the ANN representations to match their biological counterparts. Previous works relied on brain data acquired in rodents or primates using invasive techniques, from specific regions of the brain, under non-natural conditions (anesthetized animals), and with stimulus datasets lacking diversity and naturalness. In this work, we explored whether aligning model representations to human EEG responses to a rich set of real-world images increases the robustness of ANNs. Specifically, we trained ResNet50-backbone models on a dual task of classification and EEG prediction, and evaluated their EEG prediction accuracy and robustness to adversarial attacks. We observed significant correlation between the networks' EEG prediction accuracy, often highest around 100 ms post stimulus onset, and their gains in adversarial robustness. Although effect size was limited, effects were consistent across different random initializations and robust across architectural variants. We further teased apart the data from individual EEG channels and observed strongest contribution from electrodes in the parieto-occipital regions. The demonstrated utility of human EEG for such tasks opens up avenues for future efforts that scale to larger datasets under diverse stimulus conditions with the promise of stronger effects.
Title: Unsupervised Anomaly Detection and Localization with Generative Adversarial Networks
Copy Paste: [[2409.03657]] Unsupervised Anomaly Detection and Localization with Generative Adversarial Networks(https://arxiv.org/abs/2409.03657)
Keywords: generative
Abstract: We propose a novel unsupervised anomaly detection approach using generative adversarial networks and SOP-derived spectrograms. Demonstrating remarkable efficacy, our method achieves over 97% accuracy on SOP datasets from both submarine and terrestrial fiber links, all without the need for labelled data.
Title: LLM-based multi-agent poetry generation in non-cooperative environments
Copy Paste: [[2409.03659]] LLM-based multi-agent poetry generation in non-cooperative environments(https://arxiv.org/abs/2409.03659)
Keywords: large language model
Abstract: Despite substantial progress of large language models (LLMs) for automatic poetry generation, the generated poetry lacks diversity while the training process differs greatly from human learning. Under the rationale that the learning process of the poetry generation systems should be more human-like and their output more diverse and novel, we introduce a framework based on social learning where we emphasize non-cooperative interactions besides cooperative interactions to encourage diversity. Our experiments are the first attempt at LLM-based multi-agent systems in non-cooperative environments for poetry generation employing both TRAINING-BASED agents (GPT-2) and PROMPTING-BASED agents (GPT-3 and GPT-4). Our evaluation based on 96k generated poems shows that our framework benefits the poetry generation process for TRAINING-BASED agents, resulting in a 3.0-3.7 percentage point (pp) increase in diversity and a 5.6-11.3 pp increase in novelty according to distinct and novel n-grams. The generated poetry from TRAINING-BASED agents also exhibits group divergence in terms of lexicons, styles and semantics. PROMPTING-BASED agents in our framework also benefit from non-cooperative environments, and a more diverse ensemble of models with non-homogeneous agents has the potential to further enhance diversity, with an increase of 7.0-17.5 pp according to our experiments. However, PROMPTING-BASED agents show a decrease in lexical diversity over time and do not exhibit the group-based divergence intended in the social network. Our paper argues for a paradigm shift in creative tasks such as automatic poetry generation to include social learning processes (via LLM-based agent modeling) similar to human interaction.
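Note: The abstract above reports diversity "according to distinct and novel n-grams" without giving a formula. As a rough illustration only, the sketch below computes the commonly used distinct-n ratio (unique n-grams divided by total n-grams). Whitespace tokenization and the helper name `distinct_n` are assumptions made here, not details taken from the paper.

```python
from typing import Iterable


def distinct_n(texts: Iterable[str], n: int) -> float:
    """Ratio of unique n-grams to total n-grams over a collection of texts.

    Hypothetical helper: tokenization is plain whitespace splitting,
    which is an assumption and may differ from the paper's setup.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()
        # Slide a window of length n over the token sequence.
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    # No n-grams at all (empty or too-short texts) is reported as 0.0.
    return len(unique) / total if total else 0.0
```

A higher ratio means fewer repeated n-grams across the generated poems; a gain of a few percentage points, as reported above, would correspond to a small absolute increase in this ratio.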
Title: The representation landscape of few-shot learning and fine-tuning in large language models
Authors: Diego Doimo, Alessandro Serra, Alessio Ansuini, Alberto Cazzaniga
Copy Paste: [[2409.03662]] The representation landscape of few-shot learning and fine-tuning in large language models(https://arxiv.org/abs/2409.03662)
Keywords: large language model
Abstract: In-context learning (ICL) and supervised fine-tuning (SFT) are two common strategies for improving the performance of modern large language models (LLMs) on specific tasks. Despite their different natures, these strategies often lead to comparable performance gains. However, little is known about whether they induce similar representations inside LLMs. We approach this problem by analyzing the probability landscape of their hidden representations in the two cases. More specifically, we compare how LLMs solve the same question-answering task, finding that ICL and SFT create very different internal structures, in both cases undergoing a sharp transition in the middle of the network. In the first half of the network, ICL shapes interpretable representations hierarchically organized according to their semantic content. In contrast, the probability landscape obtained with SFT is fuzzier and semantically mixed. In the second half of the model, the fine-tuned representations develop probability modes that better encode the identity of answers, while the landscape of ICL representations is characterized by less defined peaks. Our approach reveals the diverse computational strategies developed inside LLMs to solve the same task across different conditions, allowing us to take a step towards designing optimal methods to extract information from language models.
Title: A Fused Large Language Model for Predicting Startup Success
Authors: Abdurahman Maarouf, Stefan Feuerriegel, Nicolas Pröllochs
Copy Paste: [[2409.03668]] A Fused Large Language Model for Predicting Startup Success(https://arxiv.org/abs/2409.03668)
Keywords: large language model
Abstract: Investors are continuously seeking profitable investment opportunities in startups and, hence, for effective decision-making, need to predict a startup's probability of success. Nowadays, investors can use not only various fundamental information about a startup (e.g., the age of the startup, the number of founders, and the business sector) but also textual descriptions of a startup's innovation and business model, which are widely available through online venture capital (VC) platforms such as Crunchbase. To support the decision-making of investors, we develop a machine learning approach with the aim of locating successful startups on VC platforms. Specifically, we develop, train, and evaluate a tailored, fused large language model to predict startup success. Thereby, we assess to what extent self-descriptions on VC platforms are predictive of startup success. Using 20,172 online profiles from Crunchbase, we find that our fused large language model can predict startup success, with textual self-descriptions being responsible for a significant part of the predictive power. Our work provides a decision support tool for investors to find profitable investment opportunities.
Title: Wind turbine condition monitoring based on intra- and inter-farm federated learning
Authors: Albin Grataloup, Stefan Jonas, Angela Meyer
Copy Paste: [[2409.03672]] Wind turbine condition monitoring based on intra- and inter-farm federated learning(https://arxiv.org/abs/2409.03672)
Keywords: privacy, federate
Abstract: As wind energy adoption is growing, ensuring the efficient operation and maintenance of wind turbines becomes essential for maximizing energy production and minimizing costs and downtime. Many AI applications in wind energy, such as in condition monitoring and power forecasting, may benefit from using operational data not only from individual wind turbines but from multiple turbines and multiple wind farms. Collaborative distributed AI which preserves data privacy holds a strong potential for these applications. Federated learning has emerged as a privacy-preserving distributed machine learning approach in this context. We explore federated learning in wind turbine condition monitoring, specifically for fault detection using normal behaviour models. We investigate various federated learning strategies, including collaboration across different wind farms and turbine models, as well as collaboration restricted to the same wind farm and turbine model. Our case study results indicate that federated learning across multiple wind turbines consistently outperforms models trained on a single turbine, especially when training data is scarce. Moreover, the amount of historical data necessary to train an effective model can be significantly reduced by employing a collaborative federated learning strategy. Finally, our findings show that extending the collaboration to multiple wind farms may result in inferior performance compared to restricting learning within a farm, specifically when faced with statistical heterogeneity and imbalanced datasets.
Title: A Different Level Text Protection Mechanism With Differential Privacy
Copy Paste: [[2409.03707]] A Different Level Text Protection Mechanism With Differential Privacy(https://arxiv.org/abs/2409.03707)
Keywords: privacy, protect
Abstract: The article introduces a method for extracting words of different degrees of importance based on the BERT pre-training model and proves the effectiveness of this method. The article also discusses the impact of maintaining the same perturbation results for words of different importance on the overall text utility. This method can be applied to long text protection.
Title: RAG based Question-Answering for Contextual Response Prediction System
Copy Paste: [[2409.03708]] RAG based Question-Answering for Contextual Response Prediction System(https://arxiv.org/abs/2409.03708)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown versatility in various Natural Language Processing (NLP) tasks, including their potential as effective question-answering systems. However, to provide precise and relevant information in response to specific customer queries in industry settings, LLMs require access to a comprehensive knowledge base to avoid hallucinations. Retrieval Augmented Generation (RAG) emerges as a promising technique to address this challenge. Yet, developing an accurate question-answering framework for real-world applications using RAG entails several challenges: 1) data availability issues, 2) evaluating the quality of generated content, and 3) the costly nature of human evaluation. In this paper, we introduce an end-to-end framework that employs LLMs with RAG capabilities for industry use cases. Given a customer query, the proposed system retrieves relevant knowledge documents and leverages them, along with previous chat history, to generate response suggestions for customer service agents in the contact centers of a major retail company. Through comprehensive automated and human evaluations, we show that this solution outperforms the current BERT-based algorithms in accuracy and relevance. Our findings suggest that RAG-based LLMs can be an excellent support to human customer service representatives by lightening their workload.
Title: Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation
Authors: Slava Elizarov, Ciara Rowles, Simon Donné
Copy Paste: [[2409.03718]] Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation(https://arxiv.org/abs/2409.03718)
Keywords: diffusion
Abstract: Generating high-quality 3D objects from textual descriptions remains a challenging problem due to computational cost, the scarcity of 3D data, and complex 3D representations. We introduce Geometry Image Diffusion (GIMDiffusion), a novel Text-to-3D model that utilizes geometry images to efficiently represent 3D shapes using 2D images, thereby avoiding the need for complex 3D-aware architectures. By integrating a Collaborative Control mechanism, we exploit the rich 2D priors of existing Text-to-Image models such as Stable Diffusion. This enables strong generalization even with limited 3D training data (allowing us to use only high-quality training data) as well as retaining compatibility with guidance techniques such as IPAdapter. In short, GIMDiffusion enables the generation of 3D assets at speeds comparable to current Text-to-Image models. The generated objects consist of semantically meaningful, separate parts and include internal structures, enhancing both usability and versatility.
Title: Confidential Computing Transparency
Authors: Ceren Kocaoğullar, Tina Marjanov, Ivan Petrov, Ben Laurie, Al Cutter, Christoph Kern, Alice Hutchings, Alastair R. Beresford
Abstract: Confidential Computing is a security paradigm designed to protect data in use by leveraging hardware-based Trusted Execution Environments (TEEs). While TEEs offer significant security benefits, the need for user trust remains a challenge, as attestation alone cannot guarantee the absence of vulnerabilities or backdoors. To address this, we propose a Confidential Computing Transparency framework with progressive levels of transparency. This framework goes beyond current measures like open-source code and audits by incorporating accountability for reviewers and robust technical safeguards, creating a comprehensive trust chain. Our tiered approach provides a practical pathway to achieving transparency in complex, real-world systems. Through a user study with 400 participants, we demonstrate that higher levels of transparency are associated with increased user comfort, particularly for sensitive data types.
Title: Planning In Natural Language Improves LLM Search For Code Generation
Copy Paste: [[2409.03733]] Planning In Natural Language Improves LLM Search For Code Generation(https://arxiv.org/abs/2409.03733)
Keywords: large language model
Abstract: While scaling training compute has led to remarkable improvements in large language models (LLMs), scaling inference compute has not yet yielded analogous gains. We hypothesize that a core missing component is a lack of diverse LLM outputs, leading to inefficient search due to models repeatedly sampling highly similar, yet incorrect generations. We empirically demonstrate that this lack of diversity can be mitigated by searching over candidate plans for solving a problem in natural language. Based on this insight, we propose PLANSEARCH, a novel search algorithm which shows strong results across HumanEval+, MBPP+, and LiveCodeBench (a contamination-free benchmark for competitive coding). PLANSEARCH generates a diverse set of observations about the problem and then uses these observations to construct plans for solving the problem. By searching over plans in natural language rather than directly over code solutions, PLANSEARCH explores a significantly more diverse range of potential solutions compared to baseline search methods. Using PLANSEARCH on top of Claude 3.5 Sonnet achieves a state-of-the-art pass@200 of 77.0% on LiveCodeBench, outperforming both the best score achieved without search (pass@1 = 41.4%) and using standard repeated sampling (pass@200 = 60.6%). Finally, we show that, across all models, search algorithms, and benchmarks analyzed, we can accurately predict performance gains due to search as a direct function of the diversity over generated ideas.
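Note: The pass@1 and pass@200 figures above are instances of the pass@k metric. The abstract does not say how it was estimated; the sketch below shows the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass. The function name `pass_at_k` and the argument checks are illustrative assumptions, not the paper's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples generated for a problem
    c: number of those samples that pass the tests
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples are incorrect, every size-k subset
    # must contain at least one correct sample.
    if n - c < k:
        return 1.0
    # 1 minus the chance that a random size-k subset is all incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score is then the mean of this value over all problems. For example, with n = 10 samples of which c = 5 pass, `pass_at_k(10, 5, 1)` evaluates to 0.5.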
Title: Safety vs. Performance: How Multi-Objective Learning Reduces Barriers to Market Entry
Authors: Meena Jagadeesan, Michael I. Jordan, Jacob Steinhardt
Copy Paste: [[2409.03734]] Safety vs. Performance: How Multi-Objective Learning Reduces Barriers to Market Entry(https://arxiv.org/abs/2409.03734)
Keywords: large language model
Abstract: Emerging marketplaces for large language models and other large-scale machine learning (ML) models appear to exhibit market concentration, which has raised concerns about whether there are insurmountable barriers to entry in such markets. In this work, we study this issue from both an economic and an algorithmic point of view, focusing on a phenomenon that reduces barriers to entry. Specifically, an incumbent company risks reputational damage unless its model is sufficiently aligned with safety objectives, whereas a new company can more easily avoid reputational damage. To study this issue formally, we define a multi-objective high-dimensional regression framework that captures reputational damage, and we characterize the number of data points that a new company needs to enter the market. Our results demonstrate how multi-objective considerations can fundamentally reduce barriers to entry -- the required number of data points can be significantly smaller than the incumbent company's dataset size. En route to proving these results, we develop scaling laws for high-dimensional linear regression in multi-objective environments, showing that the scaling rate becomes slower when the dataset size is large, which could be of independent interest.
Title: LLM-CI: Assessing Contextual Integrity Norms in Language Models
Authors: Yan Shvartzshnaider, Vasisht Duddu, John Lacalamita
Copy Paste: [[2409.03735]] LLM-CI: Assessing Contextual Integrity Norms in Language Models(https://arxiv.org/abs/2409.03735)
Keywords: privacy, large language model
Abstract: Large language models (LLMs), while memorizing parts of their training data scraped from the Internet, may also inadvertently encode societal preferences and norms. As these models are integrated into sociotechnical systems, it is crucial that the norms they encode align with societal expectations. These norms could vary across models, hyperparameters, optimization techniques, and datasets. This is especially challenging due to prompt sensitivity: small variations in prompts yield different responses, rendering existing assessment methodologies unreliable. There is a need for a comprehensive framework covering various models, optimization, and datasets, along with a reliable methodology to assess encoded norms. We present LLM-CI, the first open-sourced framework to assess privacy norms encoded in LLMs. LLM-CI uses a Contextual Integrity-based factorial vignette methodology to assess the encoded norms across different contexts and LLMs. We propose the multi-prompt assessment methodology to address prompt sensitivity by assessing the norms from only the prompts that yield consistent responses across multiple variants. Using LLM-CI and our proposed methodology, we comprehensively evaluate LLMs using IoT and COPPA vignette datasets from prior work, examining the impact of model properties (e.g., hyperparameters, capacity) and optimization strategies (e.g., alignment, quantization).
Title: Understanding Data Importance in Machine Learning Attacks: Does Valuable Data Pose Greater Harm?
Copy Paste: [[2409.03741]] Understanding Data Importance in Machine Learning Attacks: Does Valuable Data Pose Greater Harm?(https://arxiv.org/abs/2409.03741)
Abstract: Machine learning has revolutionized numerous domains, playing a crucial role in driving advancements and enabling data-centric processes. The significance of data in training models and shaping their performance cannot be overstated. Recent research has highlighted the heterogeneous impact of individual data samples, particularly the presence of valuable data that significantly contributes to the utility and effectiveness of machine learning models. However, a critical question remains unanswered: are these valuable data samples more vulnerable to machine learning attacks? In this work, we investigate the relationship between data importance and machine learning attacks by analyzing five distinct attack types. Our findings reveal notable insights. For example, we observe that high importance data samples exhibit increased vulnerability in certain attacks, such as membership inference and model stealing. By analyzing the linkage between membership inference vulnerability and data importance, we demonstrate that sample characteristics can be integrated into membership metrics by introducing sample-specific criteria, therefore enhancing the membership inference performance. These findings emphasize the urgent need for innovative defense mechanisms that strike a balance between maximizing utility and safeguarding valuable data against potential exploitation.
Title: Libra: Architectural Support For Principled, Secure And Efficient Balanced Execution On High-End Processors (Extended Version)
Authors: Hans Winderix, Marton Bognar, Lesly-Ann Daniel, Frank Piessens
Copy Paste: [[2409.03743]] Libra: Architectural Support For Principled, Secure And Efficient Balanced Execution On High-End Processors (Extended Version)(https://arxiv.org/abs/2409.03743)
Keywords: secure, attack
Abstract: Control-flow leakage (CFL) attacks enable an attacker to expose control-flow decisions of a victim program via side-channel observations. Linearization (i.e., elimination) of secret-dependent control flow is the main countermeasure against these attacks, yet it comes at a non-negligible cost. Conversely, balancing secret-dependent branches often incurs a smaller overhead, but is notoriously insecure on high-end processors. Hence, linearization has been widely believed to be the only effective countermeasure against CFL attacks. In this paper, we challenge this belief and investigate an unexplored alternative: how to securely balance secret-dependent branches on high-end processors? We propose Libra, a generic and principled hardware-software codesign to efficiently address CFL on high-end processors. We perform a systematic classification of hardware primitives leaking control flow from the literature, and provide guidelines to handle them with our design. Importantly, Libra enables secure control-flow balancing without the need to disable performance-critical hardware such as the instruction cache and the prefetcher. We formalize the semantics of Libra and propose a code transformation algorithm for securing programs, which we prove correct and secure. Finally, we implement and evaluate Libra on an out-of-order RISC-V processor, showing performance overhead on par with insecure balanced code, and outperforming state-of-the-art linearized code by 19.3%.
Title: ArtiFade: Learning to Generate High-quality Subject from Blemished Images
Authors: Shuya Yang, Shaozhe Hao, Yukang Cao, Kwan-Yee K. Wong
Copy Paste: [[2409.03745]] ArtiFade: Learning to Generate High-quality Subject from Blemished Images(https://arxiv.org/abs/2409.03745)
Keywords: diffusion, generative
Abstract: Subject-driven text-to-image generation has witnessed remarkable advancements in its ability to learn and capture characteristics of a subject using only a limited number of images. However, existing methods commonly rely on high-quality images for training and may struggle to generate reasonable images when the input images are blemished by artifacts. This is primarily attributed to the inadequate capability of current techniques in distinguishing subject-related features from disruptive artifacts. In this paper, we introduce ArtiFade to tackle this issue and successfully generate high-quality artifact-free images from blemished datasets. Specifically, ArtiFade exploits fine-tuning of a pre-trained text-to-image model, aiming to remove artifacts. The elimination of artifacts is achieved by utilizing a specialized dataset that encompasses both unblemished images and their corresponding blemished counterparts during fine-tuning. ArtiFade also ensures the preservation of the original generative capabilities inherent within the diffusion model, thereby enhancing the overall performance of subject-driven methods in generating high-quality and artifact-free images. We further devise evaluation benchmarks tailored for this task. Through extensive qualitative and quantitative experiments, we demonstrate the generalizability of ArtiFade in effective artifact removal under both in-distribution and out-of-distribution scenarios.
Title: Attention Heads of Large Language Models: A Survey
Authors: Zifan Zheng, Yezhaohui Wang, Yuxin Huang, Shichao Song, Bo Tang, Feiyu Xiong, Zhiyu Li
Copy Paste: [[2409.03752]] Attention Heads of Large Language Models: A Survey(https://arxiv.org/abs/2409.03752)
Keywords: interpretability, large language model
Abstract: Since the advent of ChatGPT, Large Language Models (LLMs) have excelled in various tasks but remain largely as black-box systems. Consequently, their development relies heavily on data-driven approaches, limiting performance enhancement through changes in internal architecture and reasoning pathways. As a result, many researchers have begun exploring the potential internal mechanisms of LLMs, aiming to identify the essence of their reasoning bottlenecks, with most studies focusing on attention heads. Our survey aims to shed light on the internal reasoning processes of LLMs by concentrating on the interpretability and underlying mechanisms of attention heads. We first distill the human thought process into a four-stage framework: Knowledge Recalling, In-Context Identification, Latent Reasoning, and Expression Preparation. Using this framework, we systematically review existing research to identify and categorize the functions of specific attention heads. Furthermore, we summarize the experimental methodologies used to discover these special heads, dividing them into two categories: Modeling-Free methods and Modeling-Required methods. Also, we outline relevant evaluation methods and benchmarks. Finally, we discuss the limitations of current research and propose several potential future directions. Our reference list is open-sourced at this https URL.
Title: Foundation Model or Finetune? Evaluation of few-shot semantic segmentation for river pollution
Copy Paste: [[2409.03754]] Foundation Model or Finetune? Evaluation of few-shot semantic segmentation for river pollution(https://arxiv.org/abs/2409.03754)
Keywords: segmentation
Abstract: Foundation models (FMs) are a popular topic of research in AI. Their ability to generalize to new tasks and datasets without retraining or needing an abundance of data makes them an appealing candidate for applications on specialist datasets. In this work, we compare the performance of FMs to finetuned pre-trained supervised models in the task of semantic segmentation on an entirely new dataset. We see that finetuned models consistently outperform the FMs tested, even in cases where data is scarce. We release the code and dataset for this work on GitHub.
Title: DC-Solver: Improving Predictor-Corrector Diffusion Sampler via Dynamic Compensation
Authors: Wenliang Zhao, Haolin Wang, Jie Zhou, Jiwen Lu
Abstract: Diffusion probabilistic models (DPMs) have shown remarkable performance in visual synthesis but are computationally expensive due to the need for multiple evaluations during the sampling. Recent predictor-corrector diffusion samplers have significantly reduced the required number of function evaluations (NFE), but inherently suffer from a misalignment issue caused by the extra corrector step, especially with a large classifier-free guidance scale (CFG). In this paper, we introduce a new fast DPM sampler called DC-Solver, which leverages dynamic compensation (DC) to mitigate the misalignment of the predictor-corrector samplers. The dynamic compensation is controlled by compensation ratios that are adaptive to the sampling steps and can be optimized on only 10 datapoints by pushing the sampling trajectory toward a ground truth trajectory. We further propose a cascade polynomial regression (CPR) which can instantly predict the compensation ratios on unseen sampling configurations. Additionally, we find that the proposed dynamic compensation can also serve as a plug-and-play module to boost the performance of predictor-only samplers. Extensive experiments on both unconditional sampling and conditional sampling demonstrate that our DC-Solver can consistently improve the sampling quality over previous methods on different DPMs with a wide range of resolutions up to 1024×1024. Notably, we achieve 10.38 FID (NFE=5) on unconditional FFHQ and 0.394 MSE (NFE=5, CFG=7.5) on Stable-Diffusion-2.1. Code is available at this https URL
Title: Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
Copy Paste: [[2409.03757]] Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding(https://arxiv.org/abs/2409.03757)
Keywords: diffusion, segmentation
Abstract: Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations in language-related tasks. These insights challenge some conventional understandings, provide novel perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks.