Automatic Classification of Afaan Oromo News Text: The Case of Radio Fana

Ejigu, Dejene (PhD); Diriba, Abera
2018-11-23; 2023-11-29; 2009-03
http://etd.aau.edu.et/handle/123456789/14463

The vast growth of information and communication technology has produced a huge volume of information, much of which is stored as unstructured text. The presence of so much text in electronic form is a challenge to natural language processing. As the volume of electronic information increases, there is growing interest in developing tools to help people better find, filter, and manage these resources. Arguably, the only way for humans to cope with the information explosion is to exploit computational techniques that can sift through huge bodies of text. Currently, news agencies in Ethiopia, which process a large amount of news from all available sources every day, classify news items manually in their daily activities, even though they use computerized systems to store and edit those items. Radio Fana is one of these agencies. The objective of this research is to develop and adopt processing tools for Afaan Oromo text classification and to investigate the application of machine learning techniques to the automatic classification of Afaan Oromo news items. The data source for this research is the Afaan Oromo news items obtained from Radio Fana Share Company. In this research, tools for pre-processing Afaan Oromo news items (tokenization, removal of extraneous characters, removal of stop-words, and removal of affixes from words) are prepared to facilitate the experiments with the automatic classifiers. Four automatic classifiers that are applicable to high-dimensional data were tested on the final data: the Sequential Minimal Optimization (SMO) algorithm from Support Vector Machines, NaiveBayesMultinomial (NBM) from the Bayesian classifiers, the J48 algorithm from the decision trees, and K-Nearest Neighbor (KNN) from the lazy learners.
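The pre-processing steps named above (tokenization, removal of extraneous characters, stop-word removal, and affix removal) can be illustrated with a minimal Python sketch. It is not the tooling built in the thesis: the stop-word set and suffix list are small hypothetical placeholders, the function names are invented for illustration, and only suffixes (not prefixes) are stripped.

```python
import re

# Hypothetical placeholders: the thesis's actual stop-word and affix
# lists are not given in this record.
STOP_WORDS = {"fi", "kan", "akka", "irra", "keessa"}
SUFFIXES = ("oota", "tti")


def tokenize(text):
    """Lowercase the text and keep only runs of letters.

    An apostrophe is allowed inside a word; digits, punctuation and
    other extraneous characters are dropped.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def strip_suffix(token, min_stem=3):
    """Remove the longest matching suffix if a stem of min_stem letters remains."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop-words, then strip suffixes from the remaining tokens."""
    return [strip_suffix(t) for t in tokenize(text) if t not in STOP_WORDS]
```

Under these assumptions, `preprocess` returns a list of reduced tokens that could then be turned into term counts for a classifier.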
The data, the pre-processed Afaan Oromo news items, are organized into categories of four classes, seven classes, and all eleven classes for the experiments, which use 10-fold stratified cross-validation for training and test data. For the SMO and NBM classifiers, which have the best accuracy, the detailed accuracy by class and the confusion matrix are shown, whereas for the J48 and KNN classifiers the average accuracy on each category is presented in this thesis. The results are encouraging: the best accuracy for the SMO and NBM classifiers, 95.82% and 96.58% respectively, is obtained on the four-class category, where the number of instance documents is approximately equal across the classes. Lower accuracy is seen for J48: 79.69% on the seven-class category and 82.05% on the eleven-class category. SMO tends to have better accuracy than the other classifiers for Afaan Oromo news item classification. For all classifiers, an uneven distribution of document instances across classes tends to decrease accuracy when the classes are taken together, i.e., in the experiment on all eleven categories, while an increase in the number of instances in a given class tends to increase the accuracy for that class. Accordingly, the results of this research show that a machine learning approach can be applied to the Afaan Oromo news item classification task; nevertheless, additional work is recommended to obtain the best results.

Key Words: Natural language processing, machine learning, text classification, document indexing, Classifier algorithms, Afaan Oromo news
Language: en
Type: Thesis
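The evaluation protocol described in the abstract (10-fold stratified cross-validation, with accuracy read from a confusion matrix) can be sketched as follows. This is an illustrative standard-library sketch, not the software used in the thesis; the function names are hypothetical, and the round-robin fold assignment is one simple way to keep class proportions roughly equal across folds.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k=10):
    """Split instance indices into k folds with similar class proportions.

    Instances are grouped by class and dealt round-robin into the folds,
    so each class is spread as evenly as possible.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    position = 0  # keeps running across classes so fold sizes stay balanced
    for label in sorted(by_class):
        for idx in by_class[label]:
            folds[position % k].append(idx)
            position += 1
    return folds


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))


def accuracy(matrix):
    """Fraction of instances on the diagonal of the confusion matrix."""
    total = sum(matrix.values())
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / total if total else 0.0
```

In each of the k rounds, one fold would serve as test data and the remaining folds as training data; pooling the per-fold predictions into one confusion matrix gives the overall and per-class accuracy.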