Comparative Study of Machine Learning Algorithms for Smishing SMS Detection Model from CDR Dataset

Yalemzewd Negash (PhD)
Samson Akele
2023-12-05
2023-05
http://etd.aau.edu.et/handle/123456789/246

Phishing is becoming a significant threat to online security. It spreads through a variety of channels, such as email, SMS, and even phone calls, to gather crucial profile data about victims. Although numerous anti-phishing measures have been created to halt its spread, it remains an unresolved issue. Smishing is a phishing attack that uses a mobile device's Short Messaging Service (SMS) to obtain the victim's credentials. To alleviate this critical problem, an automated detection system can improve identification and stop attacks before they affect targeted companies and third parties. A smishing SMS detection framework based on Call Detail Record (CDR) data helps monitoring experts and service providers screen for this kind of phishing attack early, improves accuracy, automates detection, and keeps individuals safe. Every year, many mobile phone users are victimized because they misinterpret the lures. An accurate smishing detection system is therefore helpful for organizations and related third parties that are highly affected by smishing. This thesis compares six supervised machine learning classifiers for the smishing SMS detection model: K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Decision Tree (DT), Naive Bayes (NB), Random Forest (RF), and Logistic Regression (LR). These algorithms are widely recommended by scholars, and the results obtained show that they are efficient at detecting smishing. 10-fold cross-validation based on correlation algorithms is used for classification and implementation.
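The comparison described above can be illustrated with a minimal sketch, assuming scikit-learn and a synthetic dataset. The data below is generated with `make_classification` and is not the thesis's CDR data. The 33-column shape only mirrors the initial feature count, and the hyperparameters, scaling steps, and the `compare` helper are illustrative assumptions rather than the author's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a CDR feature matrix (assumption: NOT the thesis data).
X, y = make_classification(
    n_samples=500, n_features=33, n_informative=10, random_state=0
)

# The six classifier families named in the abstract, with default-like settings.
# Distance- and margin-based models are scaled first (an assumption of this sketch).
classifiers = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "DT": DecisionTreeClassifier(random_state=0),
    "NB": GaussianNB(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

# 10-fold stratified cross-validation, shuffled with a fixed seed.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)


def compare(models, X, y, cv):
    """Return the mean cross-validated accuracy for each named model."""
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in models.items()
    }


results = compare(classifiers, X, y, cv)

# Print the models from highest to lowest mean accuracy.
for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

The ranking printed on synthetic data says nothing about the ranking on real CDR data. The sketch only shows how six models can be scored under the same 10-fold split so that their accuracies are directly comparable.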
The research collected Call Detail Record (CDR) data and initially extracted 33 distinct features. Relevant features were then selected, unnecessary and irrelevant information was eliminated, and different preprocessing methods, such as feature selection and data shaping, were applied. As a result, the RF algorithm with cross-validation (CV), which scored 90.1% accuracy, is determined to be the best classifier. KNN and DT come next, scoring 89.6% and 88.8%, respectively. Under cross-validation, the SVM algorithm performs inaccurately and exceeds the desired detection delay by more than an hour during training. This outcome is attributed to the RF algorithm's superior capacity to handle vast amounts of data accurately, form decision trees at random, and prevent overfitting by employing random subsets of features to create smaller trees.

en-US
Thesis