Phrase Based Amharic News Text Classification

Abebaw Zeleke

Files

Zeleke Abebaw.pdf (789.06 KB)

Date

2010-06

Authors

Abebaw Zeleke

Publisher

Addis Ababa University

Abstract

The recent growth of Information and Communication Technologies (ICT) infrastructure in Ethiopia is resulting in an exponential increase of digital information in local languages, including Amharic. Large and growing volumes of data are available in Amharic, as can be seen in the growing number of online newspapers, websites, and digital archives of the Ethiopian News Agency (ENA). To tackle the agency's news text management problems, a number of studies have been conducted on automatic processing of Amharic news texts using the bag-of-words feature representation approach. However, using single words as features can lose the intended meaning when a concept is expressed by two or more sequential words. To preserve such concepts, a phrase-based approach is proposed and implemented in this research. Preprocessing, feature representation, and testing were the major steps of the study. Preprocessing (character normalization, stop word removal, and stemming) is applied before the datasets are fed into the classifier. For feature representation, two forms of phrase structure (bigrams and trigrams) were developed and tested. After the features were represented by these phrase structures and their weights computed using TF-IDF schemes, a phrase matrix was generated and saved in CSV format. The CSV files were imported into the LibSVM classifier through the GUI of the WEKA application package. Finally, testing was performed for both bigram and trigram phrase structures at four, eight, and twelve news category levels. Using bigram phrase structures, the best accuracy (95.3%) was obtained with four categories, followed by 81.3% with eight categories, and the lowest accuracy (72.01%) was obtained with twelve categories.
For trigram phrase structures, the best accuracy (72.9%) was obtained with four categories, followed by 69.7% with eight categories, and the lowest accuracy (56.4%) was obtained with twelve categories. These results show that bigram phrase structures perform better (72.01%) than trigram phrase structures (56.4%) when all twelve news categories are used.
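
The feature representation step described in the abstract (bigram or trigram phrases, TF-IDF weights, a phrase matrix saved as CSV) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the particular TF-IDF variant (relative term frequency times log(N/df)), the "category" column name, and the English placeholder tokens are all assumptions made for illustration. The input is assumed to be already normalized, stop-word-filtered, and stemmed.

```python
import csv
import io
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-word phrases in a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_matrix(docs, n):
    """Build a document-by-phrase TF-IDF matrix for n-word phrases.

    Assumed weighting (one common variant, not necessarily the thesis's):
    tf = phrase count / number of phrases in the document,
    idf = log(number of documents / documents containing the phrase).
    """
    phrase_counts = [Counter(ngrams(doc, n)) for doc in docs]
    vocab = sorted(set().union(*phrase_counts))
    doc_freq = Counter(p for counts in phrase_counts for p in counts)
    n_docs = len(docs)
    rows = []
    for counts in phrase_counts:
        total = sum(counts.values()) or 1  # avoid division by zero
        rows.append([(counts[p] / total) * math.log(n_docs / doc_freq[p])
                     for p in vocab])
    return vocab, rows


def to_csv(vocab, rows, labels):
    """Serialize the phrase matrix as CSV text, one row per document."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(vocab + ["category"])  # hypothetical class column name
    for row, label in zip(rows, labels):
        writer.writerow([f"{w:.4f}" for w in row] + [label])
    return buf.getvalue()


# Placeholder English tokens standing in for preprocessed Amharic news text.
docs = [["addis", "ababa", "university", "news"],
        ["addis", "ababa", "football", "news"]]
labels = ["education", "sport"]

vocab, rows = tfidf_matrix(docs, n=2)  # n=3 would give trigram phrases
csv_text = to_csv(vocab, rows, labels)
```

A phrase that occurs in every document, such as "addis ababa" in this toy input, receives a weight of zero under this variant, so only phrases that distinguish documents contribute to the matrix. The resulting CSV text could then be written to a file for loading into a classifier.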

Keywords

Text categorization/classification, Machine Learning, Support Vector Machines, Phrase Based Feature Representations

URI

http://etd.aau.edu.et/handle/12345678/14857

Collections

Information Sciences
