Flow-Based E-mail Spam Detection

Date

2011-11

Authors

Hailu, Zelalem

Publisher

Addis Ababa University

Abstract

The volume of unsolicited commercial e-mail, also known as spam, is increasing so rapidly that over 90% of all e-mail messages are spam. An average of 200 billion spam e-mails are sent each day. The problem is exacerbated by the fact that many of these messages contain some sort of malicious code for attack. In addition to wasting users' time and posing attack threats, the huge amount of spam also consumes bandwidth and storage space illegally. There have been efforts over the years to combat spam messages. The most popular ones are based on e-mail content analysis and IP address reputation. Techniques based on e-mail content analysis are falling behind because spammers can trick such filters by using words typical of legitimate e-mail in their content. The introduction of image and PDF spam is another headache for content-based filters. Filters based on IP address reputation are also not coping well with spammers because of the dynamic nature of IP addresses and the difficulty of hunting down malicious addresses before significant damage is done. Our approach is to filter out spam messages before they are delivered to the user's inbox, based on packet flow characteristics. This is a complementary approach that can be used with other techniques to reduce the number of spam messages reaching users' inboxes. Our approach is based on over 55,000 packet flow records. We have identified nine features that best differentiate spam from legitimate e-mail. Based on these attributes and a classification model with an accuracy of 99.5% and a false-positive rate of 2.6%, we have developed a ranking algorithm that scores a given flow into one of five categories. Based on these scores, a given packet flow is accepted, rejected, or passed on for further examination by other techniques. In addition to the advantage of not relying on e-mail content or IP address to filter spam, our method also avoids the waste of resources such as bandwidth and storage space by spam messages.
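The score-then-decide step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the category names, the score thresholds, or the mapping from category to action, so everything below is a hypothetical placeholder: the `Rank` names, the `RANK_UPPER_BOUNDS` values, and the rule that only the two extreme categories are accepted or rejected outright. The input `spam_probability` is assumed to come from some flow-level classifier; the thesis's own model and its nine features are not reproduced here.

```python
from enum import IntEnum


class Rank(IntEnum):
    """Five score categories; the names are illustrative, not from the thesis."""
    VERY_LIKELY_LEGITIMATE = 1
    LIKELY_LEGITIMATE = 2
    UNCERTAIN = 3
    LIKELY_SPAM = 4
    VERY_LIKELY_SPAM = 5


# Hypothetical upper bounds on the spam probability for ranks 1 to 4.
# Anything at or above the last bound falls into rank 5.
RANK_UPPER_BOUNDS = (0.1, 0.3, 0.7, 0.9)


def rank_flow(spam_probability: float) -> Rank:
    """Map a classifier's spam probability for one flow to a rank."""
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must be in [0, 1]")
    # Iterating over Rank yields members in definition order (1, 2, 3, 4, 5);
    # zip stops after the four bounds, so rank 5 is the fallback.
    for rank, bound in zip(Rank, RANK_UPPER_BOUNDS):
        if spam_probability < bound:
            return rank
    return Rank.VERY_LIKELY_SPAM


def decide(rank: Rank) -> str:
    """Assumed policy: act only on the extremes, defer everything else."""
    if rank == Rank.VERY_LIKELY_LEGITIMATE:
        return "accept"
    if rank == Rank.VERY_LIKELY_SPAM:
        return "reject"
    return "examine"  # pass to content or reputation filters
```

Under these assumed thresholds, `decide(rank_flow(0.95))` rejects the flow and `decide(rank_flow(0.5))` passes it on for further examination. Deferring the middle categories reflects the abstract's framing of flow-based filtering as a complement to other techniques, not a replacement for them.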

Keywords

Network flow, E-mail spam, Feature selection, Classification, Ranking algorithm
