Automatic Classification of Afaan Oromo News Text: the Case of Radio Fana
Date
2009-03
Publisher
Addis Ababa University
Abstract
The rapid growth of information and communication technology has produced a huge
volume of information, a large share of which is stored as unstructured text. The
presence of so much text in electronic form is a challenge for natural language
processing. As the volume of electronic information increases, there is growing interest
in developing tools to help people better find, filter, and manage these resources.
Arguably, the only way for humans to cope with the information explosion is to exploit
computational techniques that can sift through huge bodies of text.
Currently, news agencies in Ethiopia, which process large amounts of news from all
available sources every day, classify news items manually in their daily activities,
even though they use computerized systems to store and edit those items. Radio Fana
is one of these agencies.
The objective of this research is to develop and adopt processing tools for Afaan Oromo
text classification and investigate the application of machine learning techniques for
automatic classification of Afaan Oromo news items.
The data source for this research is the Afaan Oromo news items obtained from Radio
Fana Share Company.
In this research, tools for pre-processing Afaan Oromo news items, covering tokenization,
removal of extraneous characters, removal of stop-words and removal of affixes from
words, were prepared to facilitate the experiments with the automatic classifiers.
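
The four pre-processing steps named above can be sketched roughly as follows. This is a minimal illustration, not the tooling built in the research: the stop-word set, the suffix list and the minimum stem length are hypothetical placeholders chosen only to show the flow, and the function names are assumptions.

```python
import re

# Hypothetical placeholders: the actual stop-word and affix lists
# used in the research are not given in this abstract.
STOP_WORDS = {"fi", "kan", "akka"}
SUFFIXES = ("oota", "tti")  # tried in this order


def tokenize(text):
    """Lowercase, replace extraneous characters with spaces, split on whitespace."""
    # Keep ASCII letters and the apostrophe, which can occur inside words.
    cleaned = re.sub(r"[^a-z'\s]", " ", text.lower())
    tokens = (t.strip("'") for t in cleaned.split())
    return [t for t in tokens if t]


def strip_suffix(token, min_stem=3):
    """Remove the first matching suffix if a stem of min_stem letters remains."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop-words, then strip suffixes."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return [strip_suffix(t) for t in tokens]
```

Each step is a separate function so that a real stop-word list or a fuller affix-removal routine could replace the placeholder without touching the rest of the pipeline.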
Among the automatic classifiers applicable to high-dimensional data, four were
evaluated on the final data: the Sequential Minimal Optimization (SMO) algorithm from
Support Vector Machines, Naïve Bayes Multinomial (NBM) from Bayesian classifiers, the
J48 algorithm from decision trees, and K-Nearest Neighbor (KNN) from lazy learners.
The data, the pre-processed Afaan Oromo news items, is organized into groupings of
four classes, seven classes and all (eleven) classes for the experiments, which use
10-fold stratified cross-validation to split training and test data.
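
To make the evaluation protocol concrete, the sketch below pairs a stratified k-fold split with a small multinomial Naïve Bayes classifier using add-one smoothing. It is an assumption-laden illustration in plain Python, not the software used in the research; all function names are hypothetical, and only one of the four classifier families is shown.

```python
import math
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign each document index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, i in enumerate(indices):
            folds[j % k].append(i)  # deal each class round-robin
    return folds


def train_nb(docs, labels):
    """Collect class counts, per-class token counts and the vocabulary."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, y in zip(docs, labels):
        token_counts[y].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest smoothed log-probability."""
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_score = None, -math.inf
    for y, n in class_counts.items():
        denom = sum(token_counts[y].values()) + len(vocab)
        score = math.log(n / total_docs)
        for t in tokens:
            if t in vocab:  # tokens unseen in training are skipped
                score += math.log((token_counts[y][t] + 1) / denom)
        if score > best_score:
            best, best_score = y, score
    return best


def cross_validate(docs, labels, k=10):
    """Accuracy over k stratified folds; each fold is held out once."""
    correct = 0
    for test_idx in stratified_folds(labels, k):
        held_out = set(test_idx)
        train = [i for i in range(len(docs)) if i not in held_out]
        model = train_nb([docs[i] for i in train], [labels[i] for i in train])
        correct += sum(predict_nb(model, docs[i]) == labels[i] for i in test_idx)
    return correct / len(docs)
```

Here `docs` is assumed to be a list of token lists (for example, the output of a pre-processing step) and `labels` the matching news categories. Dealing each class round-robin keeps the category proportions in every fold close to those of the full collection, which is the point of stratification.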
Keywords
Afaan Oromo News Text