Automatic Categorization of Amharic News Text: a Machine Learning Approach
Date
2003-07
Abstract
Newspaper companies and news agencies in Ethiopia currently categorize Amharic news articles
manually in their day-to-day activities, even though they use computer systems to store and
dispatch information.
The objective of this research was to investigate the application of machine learning techniques
to the automatic categorization of Amharic news items. A collection of 11,024 news articles was
used. To obtain good results, the text was prepared and preprocessed: stop-words and words that
occur in three or fewer documents were removed from the collection. Thirty-three percent of the
data was used for testing. Two machine learning techniques, the Naïve Bayes and k Nearest
Neighbor (kNN) classifiers, were used to categorize the Amharic news items.
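The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the research: the abstract does not state which Naïve Bayes variant, which similarity measure, or which value of k was used, so the multinomial model with add-one smoothing, the cosine similarity over raw term counts, and k=3 are all assumptions. The function names and the English toy documents are hypothetical stand-ins for the Amharic collection.

```python
import math
import random
from collections import Counter, defaultdict

def filter_vocabulary(docs, stop_words, max_df_removed=3):
    """Drop stop-words and terms occurring in `max_df_removed` or fewer documents."""
    df = Counter(term for tokens in docs for term in set(tokens))
    keep = {t for t, n in df.items() if n > max_df_removed and t not in stop_words}
    return [[t for t in tokens if t in keep] for tokens in docs]

def split(docs, labels, test_fraction=0.33, seed=0):
    """Shuffle and hold out `test_fraction` of the documents for testing."""
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    cut = round(len(idx) * test_fraction)
    pick = lambda ids: ([docs[i] for i in ids], [labels[i] for i in ids])
    return pick(idx[cut:]), pick(idx[:cut])  # (train, test)

def train_naive_bayes(docs, labels):
    """Collect per-category document counts and term counts (multinomial model, assumed)."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
    vocab = {t for counts in term_counts.values() for t in counts}
    return class_docs, term_counts, vocab

def predict_naive_bayes(model, tokens):
    """Return the category with the highest log posterior, using add-one smoothing."""
    class_docs, term_counts, vocab = model
    total = sum(class_docs.values())

    def log_posterior(label):
        denom = sum(term_counts[label].values()) + len(vocab)
        score = math.log(class_docs[label] / total)
        for t in tokens:
            if t in vocab:  # terms unseen in training are ignored
                score += math.log((term_counts[label][t] + 1) / denom)
        return score

    return max(class_docs, key=log_posterior)

def cosine(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def predict_knn(train_docs, train_labels, tokens, k=3):
    """Majority vote among the k training documents most similar to `tokens`."""
    ranked = sorted(zip(train_docs, train_labels),
                    key=lambda pair: cosine(tokens, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy collection: five documents per category, one rare term.
docs = [["the", "team", "goal", "match"] for _ in range(5)] + \
       [["the", "bank", "market", "price"] for _ in range(5)]
docs[0] = docs[0] + ["rare"]
labels = ["sport"] * 5 + ["economy"] * 5

filtered = filter_vocabulary(docs, stop_words={"the"})
(train_docs, train_labels), (test_docs, test_labels) = split(filtered, labels)
model = train_naive_bayes(train_docs, train_labels)
predictions = [predict_naive_bayes(model, d) for d in test_docs]
```

On a real collection, accuracy would be the fraction of held-out documents whose predicted category matches the assigned one, computed separately for each classifier.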
The results indicate that such classifiers can be applied to classify Amharic news items
automatically. However, the classifiers work well when news items are almost evenly distributed
across the categories. The best results for the naïve Bayes and kNN classifiers were obtained on
the three-category data (95.80% vs. 89.61%, respectively), and the weakest on the 16-category
data (78.48% vs. 64.50%, respectively). The 16-category data is more unevenly distributed than
the three-category data, and an uneven distribution of documents over the categories was found
to decrease the performance of both classifiers, with k Nearest Neighbor degrading far more
sharply than naïve Bayes. The research therefore indicates that Naïve Bayes is the more
applicable of the two classifiers to automatic categorization of Amharic news items.
The results of this research are promising. Nevertheless, further work is recommended in order
to obtain better results.
Keywords: Text categorization, machine learning, naïve Bayes, k Nearest Neighbor