Optimal Feature Selection for Network Intrusion Detection: a Data Mining Approach

Date

2011-06

Abstract

The traditional approach to securing computer systems against cyber threats is to design mechanisms such as firewalls, authentication tools, and virtual private networks that create a protective shield, which almost always has vulnerabilities. This has led to the development of Intrusion Detection Systems (IDS) that complement these traditional approaches. However, with the advancement of computer technology, the behavior of intrusions has become so complex that network security experts find intrusions hard to analyze and detect. Data mining techniques have become a possible solution to these challenges. However, the performance of data mining algorithms suffers when they are not provided with optimized features. This is because complex relationships can exist between the features and the intrusion classes, which contributes to high computational costs and subsequently leads to delays in identifying intrusions. Feature selection is therefore important in intrusion detection, because it allows the data mining system to focus on what is really important. Research on data mining has focused on inducing models with low expected error while ignoring the cost incurred by misclassification and by feature selection when the data distribution between classes is skewed. In reality, for many problem domains, the requirement is not merely to predict the most probable class label, since different types of errors carry different costs. For example, the cost of allowing unauthorized access can be much greater than that of wrongly denying access to authorized individuals. Similarly, the cost of not selecting features that contain unauthorized-access profiles is much greater than that of not selecting features that contain probing profiles. Cost sensitive classifiers, which account for cost during model building and feature selection either by modifying the algorithm (direct) or without modifying it (indirect), are a rising research interest for handling this problem, and attempts have been made.
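The point that different errors carry different costs can be made concrete with a minimal sketch. The cost matrix, class names, and function names below are hypothetical and chosen only for illustration; they are not taken from the study. The sketch shows how a minimum-expected-cost rule can choose a different label from the most probable one, and how an average misclassification cost could be computed.

```python
import numpy as np

# Hypothetical cost matrix (illustrative values, not from the study):
# COST[i, j] is the cost of predicting class j when the true class is i.
CLASSES = ["normal", "attack"]
COST = np.array([
    [0.0, 1.0],    # true normal: a false alarm costs 1
    [10.0, 0.0],   # true attack: a missed attack costs 10
])


def min_cost_label(probs):
    """Return the index of the class with the lowest expected cost."""
    # Expected cost of predicting j is sum_i P(i) * COST[i, j].
    expected = np.asarray(probs, dtype=float) @ COST
    return int(np.argmin(expected))


def average_misclassification_cost(y_true, y_pred):
    """Mean per-instance cost, looked up in COST by (true, predicted)."""
    return float(np.mean(COST[np.asarray(y_true), np.asarray(y_pred)]))


if __name__ == "__main__":
    probs = [0.8, 0.2]  # the classifier believes "normal" is most likely
    print(CLASSES[int(np.argmax(probs))])    # most probable label
    print(CLASSES[min_cost_label(probs)])    # minimum-expected-cost label
```

With these assumed costs, an instance that is 80% likely to be normal is still labeled as an attack, because missing an attack is ten times more expensive than raising a false alarm.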
However, little attention has been given to evaluating the performance of direct and indirect cost sensitive classifiers under a cost sensitive feature selection approach. In this research, we propose a filter approach to selecting important features, using IGR and CFS, to illustrate the significance of feature selection in classifying the NSL-KDD intrusion detection dataset. The central idea is that the feature sets of the minority class, those which have low values, can be ranked at the top by attaining high information gain and correlation values, while those that score low are ranked at the bottom in the WEKA tool, on the assumption that some of the features are redundant or contribute little to the detection process. The selected features are tested repeatedly, with features added to the final selected feature set as long as performance does not decrease, and models are then constructed with two algorithms, CS-CM4 (direct) and C4.5 (indirect), using the TANAGRA tool. Experiments show that CFS and IGR select fewer than half of the 41 total features with equal or better performance in most cases. Comparatively, the approach suits the indirect cost sensitive C4.5 better than the direct cost sensitive CS-CM4. Overall, the study indicates that both CS-CM4 and C4.5 perform considerably better under the proposed approach: with fewer features they require less storage and time to identify new attacks, and they achieve better detection rate, overall classification accuracy, average misclassification cost, and false positive rate. Keywords: Cost sensitive feature selection, Cost insensitive feature selection, IGR, CFS.
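The filter-style ranking described in the abstract can be sketched as follows, assuming IGR denotes the information gain ratio and that features are discrete (continuous features would first need discretization). This is not the WEKA implementation used in the study; the function names and the toy feature names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, divided by split info."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    # Class entropy remaining after splitting on the feature's values.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    # A constant feature has zero split info and carries no information.
    return info_gain / split_info if split_info > 0 else 0.0


def rank_features(columns, labels):
    """Return (name, score) pairs sorted from highest to lowest gain ratio."""
    scores = {name: gain_ratio(col, labels) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    labels = ["attack", "attack", "normal", "normal"]
    columns = {  # hypothetical toy features, not NSL-KDD attributes
        "f_useful": ["x", "x", "y", "y"],
        "f_noise": ["x", "y", "x", "y"],
    }
    for name, score in rank_features(columns, labels):
        print(name, round(score, 3))
```

A ranking like this only orders the features. The selection step described above would then add features from the top of the list for as long as classifier performance does not decrease.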

Keywords

Cost sensitive feature selection, Cost insensitive feature selection, IGR, CFS