Phrase Based Amharic News Text Classification
Date
2010-06
Publisher
Addis Ababa University
Abstract
The recent growth of Information and Communication Technologies (ICT) infrastructure in
Ethiopia is producing an exponential increase in digital information in local languages,
including Amharic. Large and growing volumes of Amharic data are available, as can be seen in
the growing number of online newspapers and websites and in the digital archives of the
Ethiopian News Agency (ENA). To tackle the agency’s news text management problems, a number of
studies have been conducted on the automatic processing of Amharic news texts using the
bag-of-words feature representation approach.
However, using single words as features can lose the intended meaning when a concept is
expressed by two or more consecutive words. To preserve such concepts, a phrase-based
approach has been proposed and implemented in this research.
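The idea of keeping consecutive words together can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name `extract_phrases` is hypothetical, and the English tokens are a stand-in for a preprocessed Amharic sentence.

```python
def extract_phrases(tokens, n):
    """Return every run of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical English stand-in for a normalized, stemmed Amharic headline.
tokens = ["prime", "minister", "visit", "school"]

bigrams = extract_phrases(tokens, 2)   # phrases of two consecutive words
trigrams = extract_phrases(tokens, 3)  # phrases of three consecutive words

print(bigrams)
print(trigrams)
```

As single words, "prime" and "minister" would be counted as separate features; as a bigram they stay together as one feature.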
Preprocessing, feature representation, and testing were the major steps of this study. The
data were preprocessed (character normalization, stop word removal, and stemming) before
being fed into the classifier. For feature representation, two forms of phrase structure
(bigrams and trigrams) were developed and tested. After the features were represented by
these phrase structures and weighted using TF-IDF schemes, a phrase matrix was generated and
saved in CSV format. The CSV files were imported into the LibSVM classifier through the GUI
of the WEKA application package. Finally, both bigram and trigram phrase structures were
tested at four, eight, and twelve news category levels. With bigram phrase structures, the
best accuracy (95.3%) was obtained at four categories, followed by 81.3% at eight categories,
and the lowest accuracy (72.01%) at twelve categories. With trigram phrase structures, the
best accuracy (72.9%) was likewise obtained at four categories, followed by 69.7% at eight
categories, and the lowest (56.4%) at twelve categories. These results show that bigram
phrase structures perform better (72.01%) than trigram phrase structures (56.4%) when all
twelve news categories are used.
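The feature-weighting step described above can be sketched as follows. This is a rough illustration under stated assumptions rather than the code used in the study. The abstract does not say which TF-IDF variant was applied, so the sketch assumes raw term frequency multiplied by log(N / document frequency). The function names, the toy English documents, the category labels, and the choice to put the label in a final `category` column are all hypothetical.

```python
import csv
import io
import math
from collections import Counter


def phrases(tokens, n):
    """Return every run of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_matrix(docs, n):
    """Build a document-by-phrase matrix of TF-IDF weights.

    docs is a list of token lists. The weight is assumed to be
    tf * log(N / df); other TF-IDF variants exist.
    """
    counts = [Counter(phrases(doc, n)) for doc in docs]
    # Document frequency: number of documents containing each phrase.
    doc_freq = Counter(p for c in counts for p in c)
    vocab = sorted(doc_freq)
    total = len(docs)
    rows = [[c[p] * math.log(total / doc_freq[p]) for p in vocab]
            for c in counts]
    return vocab, rows


def to_csv(vocab, rows, labels):
    """Serialize the matrix as CSV text, one document per row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(vocab + ["category"])  # hypothetical label column
    for row, label in zip(rows, labels):
        writer.writerow([f"{w:.4f}" for w in row] + [label])
    return buf.getvalue()


# Toy English stand-ins for preprocessed Amharic news items.
docs = [
    ["football", "team", "win", "cup"],
    ["football", "team", "lose", "match"],
    ["bank", "raise", "interest", "rate"],
]
labels = ["sport", "sport", "economy"]

vocab, rows = tfidf_matrix(docs, 2)
print(to_csv(vocab, rows, labels))
```

Passing `n=3` instead of `n=2` would produce the trigram matrix. Under this weighting, a phrase that occurs in every document receives a weight of zero.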
Keywords: Text categorization/classification, Machine Learning, Support Vector Machines,
Phrase Based Feature Representations