Phrase Based Amharic News Text Classification

Abebaw Zeleke

dc.contributor.advisor	M.Wondwosson (PhD)
dc.contributor.author	Abebaw Zeleke
dc.date.accessioned	2018-12-05T07:39:19Z
dc.date.accessioned	2023-11-18T12:44:16Z
dc.date.available	2018-12-05T07:39:19Z
dc.date.available	2023-11-18T12:44:16Z
dc.date.issued	2010-06
dc.description.abstract	The recent growth of Information and Communication Technologies (ICT) infrastructure in Ethiopia is producing an exponential increase in digital information in local languages, including Amharic. Large and growing volumes of Amharic data are available, as can be seen in the growing number of online newspapers and websites and in the digital storage of the Ethiopian News Agency (ENA). To tackle the agency's news text management problems, a number of studies have been conducted on automatic processing of Amharic news texts using the bag-of-words feature representation approach. However, using single words as features can lose the intended meaning when a concept is expressed by two or more sequential words. To preserve such concepts, a phrase based approach is proposed and implemented in this research. Preprocessing, feature representation, and testing were the major steps of the study. The data are preprocessed (character normalization, stop word removal, and stemming) before the datasets are fed into the classifier. For feature representation, two forms of phrase structure (bigrams and trigrams) were developed and tested. After the features were represented by these phrase structures and their weights were computed using TF-IDF schemes, a phrase matrix was generated and saved in CSV file format. The CSV files were imported into the LibSVM classifier using the GUI of the WEKA application package. Finally, testing was performed for both bigram and trigram phrase structures at four, eight, and twelve news category levels. Using bigram phrase structures, the best accuracy (95.3%) was obtained with four categories, followed by 81.3% with eight categories, and the lowest accuracy (72.01%) was obtained with twelve categories. 
For trigram phrase structures, the best accuracy (72.9%) was also obtained with four categories, followed by 69.7% with eight categories, and the lowest accuracy (56.4%) was obtained with twelve categories. These results show that bigram phrase structures perform better (72.01%) than trigram phrase structures (56.4%) across all twelve news categories. Keywords: Text categorization/classification, Machine Learning, Support Vector Machines, Phrase Based Feature Representations	en_US
dc.identifier.uri	http://etd.aau.edu.et/handle/12345678/14857
dc.language.iso	en	en_US
dc.publisher	Addis Ababa University	en_US
dc.subject	Text categorization/classification	en_US
dc.subject	Machine Learning	en_US
dc.subject	Support Vector Machines	en_US
dc.subject	Phrase Based Feature Representations	en_US
dc.title	Phrase Based Amharic News Text Classification	en_US
dc.type	Thesis	en_US
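
The feature representation step described in the abstract (bigram or trigram phrases, TF-IDF weights, a phrase matrix saved as CSV) can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes the documents are already normalized, stop-word filtered, and stemmed, and it assumes one common TF-IDF variant (raw term frequency times log(N / df)), since the record does not say which scheme was used. The function names, the underscore phrase separator, and the "category" column are hypothetical.

```python
import csv
import io
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-word phrases (joined by '_') in token order."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_matrix(docs, n):
    """Build a document-by-phrase TF-IDF matrix.

    docs: list of token lists, assumed already preprocessed.
    Weight = raw phrase count * log(N / df); this variant is an
    assumption, not necessarily the scheme used in the thesis.
    """
    counts = [Counter(ngrams(doc, n)) for doc in docs]
    vocab = sorted(set().union(*counts))
    # Document frequency: number of documents containing each phrase.
    df = Counter(phrase for c in counts for phrase in c)
    total = len(docs)
    rows = [[c[p] * math.log(total / df[p]) for p in vocab] for c in counts]
    return vocab, rows


def to_csv(vocab, rows, labels):
    """Serialize the matrix as CSV text with a trailing class column."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(vocab + ["category"])
    for row, label in zip(rows, labels):
        writer.writerow([f"{w:.4f}" for w in row] + [label])
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder tokens stand in for stemmed Amharic words.
    docs = [["w1", "w2", "w3"], ["w1", "w2", "w4"]]
    vocab, rows = tfidf_matrix(docs, 2)  # use n=3 for trigrams
    print(to_csv(vocab, rows, ["sport", "economy"]))
```

A phrase that occurs in every document gets weight zero under this variant, which is one reason a study may prefer a smoothed form. The resulting CSV text has one row per document and one column per phrase, which is the general shape a tool such as WEKA expects when importing a feature matrix.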

Files

Original bundle

Name: Zeleke Abebaw.pdf
Size: 789.06 KB
Format: Adobe Portable Document Format

Collections

Information Sciences