Content Based Multi-Class Amharic Short Message Service Classification using Machine Learning

Date

2023-08-31

Publisher

Addis Ababa University

Abstract

Short message service (SMS) is one of the most common communication methods. As SMS usage has grown, so has the volume of Amharic-language spam and smishing messages. Spam SMS messages are unwanted messages received on a mobile phone, such as advertisements, promotions, and information from organizations. Smishing messages are more dangerous and can critically harm users and service providers; they typically involve free-fee offers, rewards, fake lottery tickets, or malicious links. Both types of SMS cause poor customer experiences and reduced revenue for operators. Therefore, to retain and win customers and avoid revenue loss, many studies have addressed SMS classification using foreign-language SMS datasets. However, the features those studies used are not relevant for Amharic SMS classification because SMS characteristics differ across languages. This paper studies a model that classifies Amharic SMS texts into three classes (ham, spam, and smishing) using a machine learning technique. The model was trained on 1844 labeled messages. The features were prepared as follows: two relevant features were selected from English spam detection approaches; three new features were created; and 162 keywords were extracted from the dataset and vectorized using TF-IDF. A Random Forest (RF) classifier was then trained on the prepared features using 10-fold cross-validation. Finally, the prepared features with RF outperformed existing approaches developed for foreign languages by 6% and achieved an F1-score of 0.99 in classifying Amharic SMS.
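The keyword vectorization step mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the tokenizer, the function names (tokenize, fit_idf, tfidf_vector), the smoothed IDF formula, the L2 normalization, and the toy English tokens standing in for the 162 Amharic keywords are all assumptions made for illustration. The handcrafted features and the Random Forest training are not shown because the abstract does not describe them in detail.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization. A real Amharic pipeline
    # would likely normalize characters and strip punctuation first.
    return text.lower().split()


def fit_idf(messages, keywords):
    """Compute a smoothed inverse document frequency for each keyword."""
    n = len(messages)
    doc_sets = [set(tokenize(m)) for m in messages]
    idf = {}
    for kw in keywords:
        df = sum(1 for tokens in doc_sets if kw in tokens)
        # Smoothing keeps the value finite for keywords that never occur.
        idf[kw] = math.log((1 + n) / (1 + df)) + 1.0
    return idf


def tfidf_vector(message, keywords, idf):
    """Map one message to an L2-normalized TF-IDF vector over the keywords."""
    counts = Counter(tokenize(message))
    vec = [counts[kw] * idf[kw] for kw in keywords]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


# Toy data (hypothetical): English placeholders for Amharic messages/keywords.
messages = ["free prize click link", "meeting at noon", "free offer today"]
keywords = ["free", "prize", "link"]

idf = fit_idf(messages, keywords)
vec = tfidf_vector("free prize", keywords, idf)
empty = tfidf_vector("meeting at noon", keywords, idf)
```

In this sketch, a message containing none of the keywords maps to the all-zero vector, and rarer keywords receive a larger weight than common ones. In a full pipeline, such vectors would presumably be concatenated with the handcrafted features before being passed to the classifier.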
