Phrase Based Amharic News Text Classification

dc.contributor.advisor: M.Wondwosson (PhD)
dc.contributor.author: Abebaw Zeleke
dc.date.accessioned: 2018-12-05T07:39:19Z
dc.date.accessioned: 2023-11-18T12:44:16Z
dc.date.available: 2018-12-05T07:39:19Z
dc.date.available: 2023-11-18T12:44:16Z
dc.date.issued: 2010-06
dc.description.abstract: The recent growth of Information and Communication Technologies (ICT) infrastructure in Ethiopia is producing an exponential increase in digital information in local languages, including Amharic. Large and growing volumes of Amharic data are available, as seen in the growing number of online newspapers, websites, and the digital archives of the Ethiopian News Agency (ENA). To tackle the agency’s news text management problems, a number of studies have been conducted on automatic processing of Amharic news texts using the bag-of-words feature representation approach. However, using single words as features can lose the intended meaning when a concept is expressed by two or more sequential words. To preserve such concepts, a phrase-based approach has been proposed and implemented in this research. The major steps of the study were preprocessing, feature representation, and testing. The data are preprocessed (character normalization, stop word removal, and stemming) before the datasets are fed into the classifier. For feature representation, two forms of phrase structure (bigrams and trigrams) have been developed and tested. After the features are represented by these phrase structures and their weights are computed using TF-IDF schemes, a phrase matrix is generated and saved in CSV file format. The CSV files are imported into the LibSVM classifier using the GUI of the WEKA application package. Finally, testing was performed for both bigram and trigram phrase structures at four, eight, and twelve news category levels. Using bigram phrase structures, the best accuracy (95.3%) was obtained at four categories, followed by 81.3% at eight categories, and the lowest accuracy (72.01%) was obtained at twelve categories. For trigram phrase structures, the best accuracy (72.9%) was obtained at four categories, followed by 69.7% at eight categories, and the lowest accuracy (56.4%) was obtained at twelve categories. These results show that bigram phrase structures perform better (72.01%) than trigram phrase structures (56.4%) across all twelve news categories. Keywords: Text categorization/classification, Machine Learning, Support Vector Machines, Phrase Based Feature Representations [en_US]
dc.identifier.uri: http://etd.aau.edu.et/handle/12345678/14857
dc.language.iso: en [en_US]
dc.publisher: Addis Ababa University [en_US]
dc.subject: Text categorization/classification [en_US]
dc.subject: Machine Learning [en_US]
dc.subject: Support Vector Machines [en_US]
dc.subject: Phrase Based Feature Representations [en_US]
dc.title: Phrase Based Amharic News Text Classification [en_US]
dc.type: Thesis [en_US]
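
The feature-representation step described in the abstract (bigram and trigram phrases, TF-IDF weights, a phrase matrix saved as CSV) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the toy English tokens standing in for preprocessed Amharic text, and the specific weighting (raw term frequency times log(N / df)) are all assumptions, since the abstract does not state which TF-IDF variant was used. Classification itself is left out; the thesis performed it with LibSVM through the WEKA GUI.

```python
import csv
import io
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-word phrases in a token list, each joined by '_'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_matrix(docs, n):
    """Build a document-by-phrase TF-IDF matrix for n-gram phrases.

    Assumed weighting (not taken from the thesis):
    tf = raw phrase count in the document, idf = log(N / df).
    """
    counts = [Counter(ngrams(doc, n)) for doc in docs]
    vocab = sorted(set().union(*counts))
    # Document frequency: number of documents containing each phrase.
    df = Counter(phrase for c in counts for phrase in c)
    total = len(docs)
    rows = [[c[p] * math.log(total / df[p]) for p in vocab] for c in counts]
    return vocab, rows


def to_csv(vocab, rows, labels):
    """Serialize the matrix as CSV text with a trailing 'category' column."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(vocab + ["category"])
    for row, label in zip(rows, labels):
        writer.writerow([f"{w:.4f}" for w in row] + [label])
    return buf.getvalue()


# Hypothetical toy corpus: tokens are assumed to be already normalized,
# stop-word filtered, and stemmed. Category labels are invented.
docs = [
    ["addis", "ababa", "university", "news"],
    ["addis", "ababa", "sport", "news"],
    ["football", "sport", "news", "today"],
]
labels = ["education", "sport", "sport"]

bigram_vocab, bigram_rows = tfidf_matrix(docs, 2)
trigram_vocab, trigram_rows = tfidf_matrix(docs, 3)
bigram_csv = to_csv(bigram_vocab, bigram_rows, labels)
```

In this sketch the resulting CSV text has one row per document and one column per phrase, which is the general shape a tool such as WEKA expects when importing a feature matrix with a class column.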

Files

Original bundle
Name:
Zeleke Abebaw.pdf
Size:
789.06 KB
Format:
Adobe Portable Document Format
License bundle
Name:
license.txt
Size:
1.71 KB
Format:
Plain Text