Author Identification of Amharic Online Text Using Stylometry and N-Gram Features and Different Classification Techniques

Date

2021-06-25

Publisher

Addis Ababa University

Abstract

Users in cyberspace generate vast amounts of text while hiding their identities, and anonymous online writers spread misinformation throughout the world. In Ethiopia, too, the number of writers who hide their identity keeps growing. Such writers use different languages and different social media accounts. Amharic is one of more than 80 Ethiopian languages in which misinformation is spread by anonymous online writers. Author identification is a scientific method of identifying the author of an anonymous text by recognizing and extracting features of the author's writing style. To our knowledge, there is no author identification model or published work for identifying anonymous Amharic writers so that the necessary measures can be taken. This thesis therefore explores the development of a model for identifying the authors of Amharic texts using stylometric features, n-gram features, or both, together with three classification algorithms: support vector machine (SVM), Naive Bayes, and a neural network multilayer perceptron (NN-MLP). In addition, the research investigates how the number of articles per author and the number of authors affect the performance of the author identification model. To achieve this aim, an experimental research methodology was followed. The data needed to train the model (Amharic online texts) were collected and pre-processed, and features were extracted and selected. The effects of increasing the number of authors and the number of articles per author were investigated in two experiments. The discrimination capability of the features and models was then tested using an anonymous Amharic online text and a list of suspected authors. In the first experiment, accuracy, precision, recall, and F1-score decreased as the number of authors increased. Conversely, these performance metrics increased as the number of articles per author increased, as the results of the experiments show.
The research findings indicate that merged features outperform the individual feature sets for almost all models. NN-MLP-logistics achieved 90.47% accuracy and a 90% model performance score with merged features and 27 authors. SVM Linear achieved 97.52% accuracy and a 98% model performance score with merged features and 100 articles per author. Based on these results, we conclude that neural network models are preferable to the other classification models for authorship identification when only a small number of online texts per author is available; their results are also stable and show the best identification capability across the numbers of suspects tested. Because the experiments were conducted with a limited number of authors, we recommend that further studies cover a larger number of Amharic online text authors.
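
To make the idea of merged features concrete, the following is a minimal Python sketch using only the standard library. It is not the thesis's implementation. The choice of character bigrams, the two stylometric measurements (rounded average word length and a count of three Ethiopic punctuation marks), the toy sentences, and every function and class name are assumptions made purely for illustration. The classifier is a small hand-written multinomial Naive Bayes, chosen because it is one of the three algorithm families named in the abstract and fits in a few lines.

```python
import math
from collections import Counter, defaultdict

# Assumption: a small set of Ethiopic punctuation marks used as a style cue.
ETHIOPIC_PUNCT = "።፣፤"  # full stop, comma, semicolon


def char_ngrams(text, n=2):
    """Count overlapping character n-grams, prefixed to keep them distinct."""
    return Counter("ng:" + text[i:i + n] for i in range(len(text) - n + 1))


def stylometric_tokens(text):
    """Turn two illustrative style measurements into countable tokens."""
    words = text.split()
    avg_len = round(sum(len(w) for w in words) / max(len(words), 1))
    punct = sum(text.count(p) for p in ETHIOPIC_PUNCT)
    return Counter({f"sty:avg_word_len={avg_len}": 1, "sty:punct": punct})


def merged_features(text):
    """Merge both feature families into one bag of counts."""
    return char_ngrams(text) + stylometric_tokens(text)


class NaiveBayesAuthorModel:
    """Multinomial Naive Bayes with add-one smoothing (hypothetical helper)."""

    def fit(self, texts, authors):
        self.feature_counts = defaultdict(Counter)
        self.doc_counts = Counter(authors)
        for text, author in zip(texts, authors):
            self.feature_counts[author].update(merged_features(text))
        self.vocab = set().union(*self.feature_counts.values())
        return self

    def predict(self, text):
        feats = merged_features(text)
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for author, counts in self.feature_counts.items():
            total = sum(counts.values())
            # Log prior from the share of training texts per author.
            score = math.log(self.doc_counts[author] / n_docs)
            for feat, count in feats.items():
                if feat in self.vocab:  # ignore features never seen in training
                    prob = (counts[feat] + 1) / (total + vocab_size)
                    score += count * math.log(prob)
            scores[author] = score
        return max(scores, key=scores.get)


# Toy corpus invented for this sketch; it is not data from the thesis.
TOY_TEXTS = ["ሰላም ሰላም። ሰላም።", "ሰላም ነው።", "ጤና ይስጥልኝ፣ ጤና፣", "ጤና ነው፣"]
TOY_AUTHORS = ["author_a", "author_a", "author_b", "author_b"]

model = NaiveBayesAuthorModel().fit(TOY_TEXTS, TOY_AUTHORS)
suspect = model.predict("ሰላም ሰላም።")  # most likely author among the suspects
```

Encoding the stylometric measurements as tokens is a shortcut that lets a single count-based model consume both feature families. A study like the one described would more plausibly build numeric feature vectors and compare several classifiers on held-out texts.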

Keywords

Author Identification, Amharic Online Text, Stylometry, N-Gram Features, Different Classification Techniques
