Optimal Feature Selection for Network Intrusion Detection: a Data Mining Approach
Date
2011-06
Abstract
The traditional approach to securing computer systems against cyber threats is to design
mechanisms such as firewalls, authentication tools, and virtual private networks that
create a protective shield, one that almost always has vulnerabilities. This has led to the
development of Intrusion Detection Systems (IDS) that complement traditional approaches.
However, with the advancement of computer technology, the behavior of intrusions has
become so complex that network security experts find it hard to analyze and
detect them. To address these challenges, data mining techniques have
become a possible solution. However, the performance of data mining algorithms
suffers when the features provided are not optimized. This is because complex relationships
can exist between the features and the intrusion classes, contributing to high
computational costs in processing tasks and, subsequently, to delays in identifying
intrusions. Feature selection is therefore important in intrusion detection, because it
allows the data mining system to focus on what is really important.
Research on data mining has focused on the induction of models with low expected
error while ignoring the costs incurred through misclassification and
feature selection when the data distribution between classes is skewed. In reality, for many
problem domains, it is not enough to predict the most probable class label,
since different types of errors carry different costs. For example, the cost of allowing
unauthorized access can be much greater than that of wrongly denying access to
authorized individuals. Similarly, the cost of not selecting features that contain
unauthorized access profiles is much higher than that for probing profiles. Implementing
cost sensitive classifiers that incorporate cost either by modifying the algorithm (direct)
or without modifying it (indirect) during model building and feature selection is a growing
research interest for handling this problem, and attempts have been made.
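
To illustrate why the most probable class is not always the best prediction, the following minimal sketch picks the class with the lowest expected misclassification cost. The two-class setup, the cost matrix, and the probabilities are hypothetical values chosen for illustration and are not taken from the study.

```python
def min_expected_cost_class(probs, cost):
    """Return (best_class, expected_costs).

    probs[i]   : estimated probability that the true class is i
    cost[i][j] : cost of predicting class j when the true class is i
    """
    n = len(probs)
    # Expected cost of predicting class j, averaged over the possible true classes.
    expected = [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]
    best = min(range(n), key=expected.__getitem__)
    return best, expected


# Hypothetical classes: 0 = normal, 1 = attack.
# Missing an attack (true 1, predicted 0) is assumed to cost 10x a false alarm.
cost = [[0, 1],
        [10, 0]]
probs = [0.8, 0.2]  # the classifier believes "normal" is more likely

label, expected = min_expected_cost_class(probs, cost)
print(label, expected)
```

With these assumed numbers, predicting "normal" has an expected cost of about 2.0 and predicting "attack" about 0.8, so the cost sensitive decision is "attack" even though "normal" is the more probable label.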
However, little attention has been given to evaluating the performance of direct and indirect
cost sensitive classifiers under a cost sensitive feature selection approach. In this research,
we propose a filter approach, using IGR and CFS, to select important features and to
illustrate the significance of feature selection in classifying the NSL-KDD intrusion
detection dataset. The central idea is that the feature sets of the minority classes, which
have low values, can be ranked at the top by attaining high information gain values and
correlation percentages, while those that score low are ranked at the bottom in the WEKA
tool, on the assumption that some of the features are redundant or contribute little to the
detection process.
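
The ranking idea can be sketched as follows, assuming IGR denotes the information gain ratio (information gain divided by the entropy of the feature itself). This is a simplified illustration for discrete features only, not the WEKA implementation used in the study, and the toy feature names and values are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, normalized by the feature's entropy."""
    total = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    # Class entropy remaining after splitting on the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: four connections, two candidate features.
labels = ["normal", "normal", "attack", "attack"]
features = {
    "protocol": ["tcp", "tcp", "udp", "udp"],  # separates the classes perfectly
    "flag":     ["x", "y", "x", "y"],          # carries no class information
}

scores = {name: gain_ratio(vals, labels) for name, vals in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

On this toy data "protocol" scores 1.0 and "flag" scores 0.0, so "protocol" is ranked first and "flag" would be a candidate for removal.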
The selected features were tested repeatedly, with features added to the final
selected feature set as long as performance did not decrease, and models were then
constructed with two algorithms, CS-CM4 (direct) and C4.5 (indirect), using the TANAGRA
tool. Experiments show that CFS and IGR select fewer than half of the 41 features with
equal or better performance in most cases. Comparatively, the approach suits the
indirect cost sensitive C4.5 better than the direct cost sensitive CS-CM4. Overall, the study
indicates that the CS-CM4 and C4.5 algorithms perform far better under the proposed
approach: with fewer features they require less storage and time to identify new attacks,
and they achieve better performance in terms of detection rate, overall classification
accuracy, average misclassification cost and false positive rate.
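
The four evaluation measures named above can be sketched from confusion counts as follows. This is a binary (attack vs. normal) simplification with hypothetical counts and costs; the abstract does not give the study's own class setup or cost matrix, so none of these numbers are its results.

```python
def evaluate(tp, fn, fp, tn, cost_fn=1.0, cost_fp=1.0):
    """Compute detection metrics from binary confusion counts.

    tp: attacks detected          fn: attacks missed
    fp: normal flagged as attack  tn: normal correctly passed
    cost_fn / cost_fp: assumed cost of one missed attack / one false alarm
    """
    total = tp + fn + fp + tn
    return {
        "detection_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / total,
        "avg_misclassification_cost": (fn * cost_fn + fp * cost_fp) / total,
    }


# Hypothetical counts; a missed attack is assumed to cost 10x a false alarm.
metrics = evaluate(tp=90, fn=10, fp=5, tn=95, cost_fn=10.0, cost_fp=1.0)
print(metrics)
```

With these assumed counts the detection rate is 0.9, the false positive rate 0.05, the accuracy 0.925, and the average misclassification cost 0.525 per instance.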
Keywords: Cost sensitive feature selection, Cost insensitive feature selection, IGR, CFS.