Flow-Based E-mail Spam Detection
Date
2011-11
Authors
Hailu, Zelalem
Publisher
Addis Ababa University
Abstract
The volume of unsolicited commercial e-mail, also known as spam, is increasing so
rapidly that over 90% of all e-mail messages are now spam. An average of 200 billion
spam e-mails are sent each day. The problem is exacerbated by the fact that many of
these messages carry some form of malicious code. In addition to wasting users' time
and posing attack threats, this huge volume of spam illegitimately consumes bandwidth
and storage space. There have been efforts over the years to combat spam. The most
popular techniques are based on e-mail content analysis and IP address reputation.
Techniques based on content analysis are falling behind because spammers can trick
such filters by padding their messages with words typical of legitimate e-mail. The
introduction of image and PDF spam is a further problem for content-based filters.
Filters based on IP address reputation are also not coping well, because IP addresses
are dynamic and malicious addresses are difficult to hunt down before significant
damage is done. Our approach is to filter out spam messages before they are delivered
to the user's inbox, based on packet flow characteristics. It is a complementary
approach that can be used alongside other techniques to reduce the number of spam
messages reaching users' inboxes. Our approach is based on over 55,000 packet flow
records. We have identified nine features that best differentiate spam from legitimate
e-mail. Based on these features and a classification model with an accuracy of 99.5%
and a false-positive rate of 2.6%, we have developed a ranking algorithm that scores
a given flow into one of five categories. Based on these scores, a given packet flow
is accepted, rejected, or passed on for further examination by other techniques. In
addition to not relying on e-mail content or IP address to filter spam, our method
also avoids the waste of resources such as bandwidth and storage space on spam
messages.
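
The score-then-decide step described above can be illustrated with a minimal Python sketch. It is not the thesis's ranking algorithm: the abstract does not give the nine features, the category boundaries, or the mapping from categories to actions, so the names `CATEGORY_BOUNDS`, `categorize` and `decide`, the equal-width thresholds, and the rule "accept category 1, reject category 5, examine the rest" are all assumptions made for illustration. The input is assumed to be a spam probability in [0, 1] produced by some flow-level classifier.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    EXAMINE = "examine"  # hand the flow to other techniques


# Hypothetical upper bounds of the five score categories (assumption:
# equal-width bins; the thesis abstract does not state the real boundaries).
CATEGORY_BOUNDS = [0.2, 0.4, 0.6, 0.8, 1.0]


def categorize(spam_probability: float) -> int:
    """Map a classifier's spam probability to a category from 1 to 5."""
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must lie in [0, 1]")
    for category, upper in enumerate(CATEGORY_BOUNDS, start=1):
        if spam_probability <= upper:
            return category
    return len(CATEGORY_BOUNDS)


def decide(category: int) -> Action:
    """Assumed policy: only the two extreme categories are decided outright."""
    if category == 1:
        return Action.ACCEPT
    if category == 5:
        return Action.REJECT
    return Action.EXAMINE


if __name__ == "__main__":
    for p in (0.05, 0.5, 0.95):
        print(p, categorize(p), decide(categorize(p)).value)
```

Under these assumed thresholds, only flows the classifier is most certain about are accepted or rejected at the flow level, and the uncertain middle categories are left to content-based or reputation-based filters.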
Keywords
Network flow, E-mail spam, Feature selection, Classification, Ranking algorithm