Starting from the identification of a drawback in the Isolation Forest (IF) algorithm that limits its use in the scope of anomaly detection, we propose two extensions: the first overcomes the aforementioned limitation, and the second provides the algorithm with some supervised learning capability. The resulting Hybrid Isolation Forest (HIF) is first evaluated on a synthetic dataset to analyze the effect of the newly introduced meta-parameters and to verify that the addressed limitation of the IF algorithm is effectively overcome. We then compare the two algorithms on the ISCX benchmark dataset in the context of a network intrusion detection application. Our experiments show that HIF not only outperforms IF but also challenges the one-class and two-class SVM baselines while remaining computationally efficient.
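For context, a minimal sketch of the baseline IF scoring that HIF extends, using scikit-learn's IsolationForest; the hybrid extensions described above are not part of this public API, so only the standard anomaly score is shown, on assumed synthetic data.

```python
# Minimal sketch of baseline Isolation Forest scoring (the starting point
# that HIF extends); the hybrid/supervised extensions are NOT in this API.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = rng.normal(size=(1000, 4))                  # nominal data
X_test = np.vstack([rng.normal(size=(10, 4)),         # nominal samples
                    rng.uniform(-6, 6, size=(10, 4))])  # likely anomalies

forest = IsolationForest(n_estimators=100, random_state=0).fit(X_train)
# score_samples is higher for more "normal" points; negate it to obtain
# an anomaly score where larger means more anomalous.
anomaly_score = -forest.score_samples(X_test)
print(anomaly_score)
```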
Anomaly Detection is an unsupervised learning task aimed at detecting anomalous behaviours with respect to historical data. In particular, multivariate Anomaly Detection plays an important role in many applications, thanks to its capability of summarizing the status of a complex system or observed phenomenon with a single indicator (typically called an 'Anomaly Score') and thanks to the unsupervised nature of the task, which does not require human tagging. The Isolation Forest is one of the most commonly adopted algorithms in the field of Anomaly Detection, due to its proven effectiveness and low computational complexity. A major problem affecting the Isolation Forest is its lack of interpretability, an effect of the inherent randomness governing the splits performed by the Isolation Trees, the building blocks of the Isolation Forest. In this paper we propose effective, yet computationally inexpensive, methods to define feature importance scores at both the global and the local level for the Isolation Forest. Moreover, we define a procedure to perform unsupervised feature selection for Anomaly Detection problems based on our interpretability method; this procedure also serves the purpose of tackling the challenging task of feature importance evaluation in unsupervised anomaly detection. We assess the performance on several synthetic and real-world datasets, including comparisons against state-of-the-art interpretability techniques, and make the code publicly available to enhance reproducibility and foster research in the field.
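To illustrate the general idea of path-based local feature importance for an Isolation Forest, the sketch below credits the features used along a point's isolation paths, weighting short paths (easy isolation, the hallmark of an anomaly) more heavily. This is a simplified illustration of the concept, not the scoring method proposed in the paper.

```python
# Hedged sketch: local feature importance from isolation paths. Walks a
# point down every tree of a fitted sklearn IsolationForest and credits the
# split features it passes, weighted by 1/path_length. Illustrative only.
import numpy as np
from sklearn.ensemble import IsolationForest

def local_importance(forest, x):
    scores = np.zeros(x.shape[0])
    for est in forest.estimators_:
        t, node, path_feats = est.tree_, 0, []
        while t.feature[node] >= 0:          # feature == -2 marks a leaf
            f = t.feature[node]
            path_feats.append(f)
            node = (t.children_left[node] if x[f] <= t.threshold[node]
                    else t.children_right[node])
        for f in path_feats:                 # short paths weigh more
            scores[f] += 1.0 / len(path_feats)
    return scores / scores.sum()

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 5))
forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
outlier = np.array([0.0, 0.0, 8.0, 0.0, 0.0])  # anomalous in feature 2
print(local_importance(forest, outlier))
```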
This paper proposes an intrusion detection and prediction system based on uncertain and imprecise inference networks, together with its implementation. Given a history of sessions, the goal is a supervised learning method, coupled with a classifier, that extracts the knowledge needed to determine whether a session contains an intrusion and, if so, to recognize its type and predict the intrusions likely to follow it. The proposed system accounts for the uncertainty and imprecision that can affect the statistical data of the session history. Systematically representing this kind of knowledge with a single probability distribution presupposes subjective information that is too rich and risks being partly arbitrary. One of the first objectives of this work was therefore to ensure consistency between the way information is represented and the information actually available.
Huge datasets in cyber security, such as network traffic logs, can be analyzed using machine learning and data mining methods. However, the amount of collected data keeps increasing, which makes analysis more difficult. Many machine learning methods were not designed for big datasets and are consequently slow and difficult to understand. We address the issue of efficient network traffic classification by creating an intrusion detection framework that applies dimensionality reduction and conjunctive rule extraction. The system can perform unsupervised anomaly detection and use this information to create conjunctive rules that classify huge amounts of traffic in real time. We test the implemented system with the widely used KDD Cup 99 dataset and real-world network logs to confirm that the performance is satisfactory. The system is transparent rather than a black box, making it intuitive for domain experts such as network administrators.
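A hedged sketch of the pipeline shape described above: project traffic features into a low-dimensional space, flag anomalies there without supervision, and express the flagged region as a conjunctive rule (an AND of per-dimension interval tests) that can be matched against new traffic cheaply. The synthetic data and the 97th-percentile cut-off are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch: dimensionality reduction + conjunctive rule extraction.
# Illustrative data and thresholds; not the paper's exact method.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
normal = rng.normal(size=(1000, 20))           # stand-in traffic features
attack = rng.normal(loc=4.0, size=(30, 20))    # stand-in attack traffic
X = np.vstack([normal, attack])

Z = PCA(n_components=2).fit_transform(X)       # dimensionality reduction
dist = np.linalg.norm(Z - Z.mean(axis=0), axis=1)
flagged = Z[dist > np.percentile(dist, 97)]    # unsupervised anomaly flags

# Conjunctive rule: (lower_i <= z_i <= upper_i) AND-ed over all dimensions.
lower, upper = flagged.min(axis=0), flagged.max(axis=0)
rule = np.all((Z >= lower) & (Z <= upper), axis=1)
print(f"rule matches {rule.sum()} of {len(Z)} records")
```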
Ubiquitous anomalies constantly endanger the security of our systems. They may cause irreversible damage to the system and leak private information, so detecting them promptly is of vital importance. Traditional supervised methods such as Decision Trees and Support Vector Machines (SVM) are used to classify normal and abnormal behaviour. However, abnormal instances are often far rarer than normal ones, which biases the decisions of these methods. Generative adversarial networks (GANs) have been proposed to handle this case: thanks to their strong generative ability, they only need to learn the distribution of normal data and can identify abnormal samples by their gap from the learned distribution. Nevertheless, existing GAN-based models are ill-suited to data with discrete values, which severely degrades detection performance. To cope with discrete features, in this paper we propose an efficient GAN-based model with a specifically designed loss function. Experimental results show that our model outperforms state-of-the-art models on discrete datasets and remarkably reduces the overhead.
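One common way to let a GAN generator emit discrete (categorical) features is a Gumbel-softmax relaxation of its output, sketched below in PyTorch. This illustrates the general technique only; the dimensions (NOISE_DIM, N_CATEGORIES) are assumptions and the paper's specifically designed loss is not reproduced here.

```python
# Hedged sketch: a GAN generator for discrete features via Gumbel-softmax,
# which keeps the sampling step differentiable so the discriminator's
# gradients can reach the generator. Sizes below are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

NOISE_DIM, N_CATEGORIES = 16, 10   # hypothetical sizes

class DiscreteGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(NOISE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, N_CATEGORIES))

    def forward(self, z, tau=0.5):
        logits = self.net(z)
        # Differentiable one-hot sample over the discrete categories.
        return F.gumbel_softmax(logits, tau=tau, hard=True)

G = DiscreteGenerator()
z = torch.randn(4, NOISE_DIM)
print(G(z))   # 4 one-hot rows, usable by a discriminator end-to-end
```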
Internet of Things (IoT) devices are becoming increasingly popular and are influencing many application domains such as healthcare and transportation. These devices are used for real-world applications such as sensor monitoring and real-time control. In this work, we look at differentially private (DP) neural network (NN) based network intrusion detection systems (NIDS) to detect intrusion attacks on networks of such IoT devices. Existing NN training solutions in this domain either ignore privacy considerations or assume that the privacy requirements are homogeneous across all users. We show that the performance of existing differentially private stochastic methods degrades for clients with non-identical data distributions when clients' privacy requirements are heterogeneous. We define a cohort-based $(\epsilon,\delta)$-DP framework that models the more practical setting of IoT device cohorts with non-identical clients and heterogeneous privacy requirements. We propose two novel continual-learning based DP training methods that are designed to improve model performance in the aforementioned setting. To the best of our knowledge, ours is the first system that employs a continual learning-based approach to handle heterogeneity in client privacy requirements. We evaluate our approach on real datasets and show that our techniques outperform the baselines. We also show that our methods are robust to hyperparameter changes. Lastly, we show that one of our proposed methods can easily adapt to post-hoc relaxations of client privacy requirements.
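For reference, a minimal sketch of the core DP-SGD step that $(\epsilon,\delta)$-DP training methods build on: per-example gradient clipping followed by calibrated Gaussian noise. The model, clip norm, and noise multiplier are illustrative placeholders; the cohort-based continual-learning variants proposed in the paper are not reproduced here.

```python
# Hedged sketch of one DP-SGD step: clip each example's gradient to
# CLIP_NORM, sum, add Gaussian noise, then update. Hypothetical settings.
import torch
import torch.nn as nn

CLIP_NORM, NOISE_MULT, LR = 1.0, 1.1, 0.05   # hypothetical settings
model = nn.Linear(10, 2)                     # stand-in for the NIDS model
loss_fn = nn.CrossEntropyLoss()

def dp_sgd_step(batch_x, batch_y):
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(batch_x, batch_y):       # per-example gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, CLIP_NORM / (norm + 1e-12))  # clip to CLIP_NORM
        for s, g in zip(summed, grads):
            s += g * scale
    with torch.no_grad():                    # noisy averaged update
        for p, s in zip(model.parameters(), summed):
            noise = torch.randn_like(s) * NOISE_MULT * CLIP_NORM
            p -= LR * (s + noise) / len(batch_x)

dp_sgd_step(torch.randn(8, 10), torch.randint(0, 2, (8,)))
```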