School of Information Science
Browsing School of Information Science by Author "Abebaw Zeleke"
Phrase Based Amharic News Text Classification (Addis Ababa University, 2010-06) Abebaw Zeleke; M.Wondwosson (PhD)
The recent growth of Information and Communication Technologies (ICT) infrastructure in Ethiopia is producing an exponential increase in digital information in local languages, including Amharic. Large and growing volumes of Amharic data are available, as seen in the growing number of online newspapers, websites, and the digital storage of the Ethiopian News Agency (ENA). To tackle the agency's news text management problems, a number of studies have been conducted on automatic processing of Amharic news texts using the bag-of-words feature representation approach. However, using single words as features can lose the intended meaning when a concept is expressed by two or more sequential words. To preserve such concepts, a phrase based approach is proposed and implemented in this research. Preprocessing, feature representation, and testing were the major steps of the study. The data are preprocessed (character normalization, stop word removal, and stemming) before the datasets are fed into the classifier. For feature representation, two forms of phrase structure (bigrams and trigrams) were developed and tested. After the features were represented by these phrase structures and their weights computed using TFIDF schemes, a phrase matrix was generated and saved in CSV file format. The CSV files were imported into the LibSVM classifier using the GUI of the WEKA application package. Finally, testing was performed for both bigram and trigram phrase structures at four, eight, and twelve news category levels. Using bigram phrase structures, the best accuracy (95.3%) was obtained at four categories, followed by 81.3% for eight categories and 72.01%, the lowest, for twelve categories.
For trigram phrase structures, on the other hand, the best accuracy (72.9%) was obtained at four categories, followed by 69.7% for eight categories and 56.4%, the lowest, for twelve categories. These results show that bigram phrase structures outperform trigram phrase structures at every category level tested, for example 72.01% versus 56.4% at twelve categories.
Keywords: Text categorization/classification, Machine Learning, Support Vector Machines, Phrase Based Feature Representations
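The feature representation step described in the abstract (phrase extraction, TFIDF weighting, CSV export) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names `ngrams`, `tfidf_matrix`, and `to_csv` are hypothetical, the toy documents are English stand-ins for already-preprocessed Amharic tokens, and the weighting (raw term count times log(N/df)) is an assumed variant, since the abstract says only "TFIDF schemes".

```python
import csv
import io
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-word phrases of a token list, words joined with '_'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_matrix(docs, n):
    """Build a document-by-phrase TF-IDF matrix from tokenized documents.

    Assumed weighting: tf = raw phrase count, idf = log(N / df).
    Returns (sorted phrase vocabulary, one weight row per document).
    """
    counts = [Counter(ngrams(doc, n)) for doc in docs]
    # Document frequency: number of documents containing each phrase.
    df = Counter(phrase for c in counts for phrase in c)
    vocab = sorted(df)
    n_docs = len(docs)
    rows = [[c[p] * math.log(n_docs / df[p]) for p in vocab] for c in counts]
    return vocab, rows


def to_csv(vocab, rows, labels):
    """Serialize the matrix as CSV text with a trailing 'category' column."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(vocab + ["category"])
    for row, label in zip(rows, labels):
        writer.writerow([f"{w:.4f}" for w in row] + [label])
    return out.getvalue()


# Hypothetical toy corpus standing in for preprocessed news items.
docs = [
    ["prime", "minister", "visit", "region"],
    ["prime", "minister", "open", "school"],
    ["football", "team", "win", "match"],
]
labels = ["politics", "politics", "sport"]

vocab, rows = tfidf_matrix(docs, 2)  # use n=3 for trigram phrases
csv_text = to_csv(vocab, rows, labels)
```

A phrase such as `prime_minister` that appears in two of the three documents receives a lower weight than a phrase unique to one document, which is the effect TFIDF weighting is meant to have. The resulting CSV text has one row per document and could be written to a file for a classifier tool to load.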