Automatic Classification of Amharic News Items: The Case of Ethiopian News Agency
Date
2001-07
Publisher
Addis Ababa University
Abstract
To organize its news stock efficiently and to facilitate the storage and retrieval of news
items, the Ethiopian News Agency (ENA) uses a classification scheme developed in-house.
Given the large volume of news items produced each year, ENA has difficulty
classifying them in a timely manner. This research developed the Amharic News
Classifier (ANC), which automatically classifies Amharic news items into
predefined classes based on their content.
Developing an automatic document classification system involves several
steps, and different methods can be used at each step. This research used
statistical techniques of automatic classification at every step. The steps are
document analysis, generation of document and class vectors
based on document and class representatives, and matching of document and class
vectors to determine the class to which a document belongs.
Document analysis requires preprocessing activities such as
stemming and stop word removal, which are language dependent. In this research, the
key terms are stemmed using a simple depluralization and suffix and prefix removal
program developed for this purpose. A stop word database, containing the
most frequently occurring Amharic words, was also developed. In addition, problems
related to the Amharic script were considered during text processing.
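The preprocessing described above can be sketched as follows. This is only a minimal illustration, not the program developed in the research: the function names, the transliterated stop words, and the affix lists are all hypothetical stand-ins, and real Amharic text in Ethiopic script would need script-aware handling that this sketch does not attempt.

```python
def strip_affixes(token, prefixes, suffixes, min_stem_len=2):
    """Remove at most one prefix and one suffix, trying the longest first."""
    for prefix in sorted(prefixes, key=len, reverse=True):
        if token.startswith(prefix) and len(token) - len(prefix) >= min_stem_len:
            token = token[len(prefix):]
            break
    for suffix in sorted(suffixes, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem_len:
            token = token[:-len(suffix)]
            break
    return token


def preprocess(text, stopwords, prefixes, suffixes):
    """Split on whitespace, drop stop words, then stem the remaining tokens."""
    return [
        strip_affixes(token, prefixes, suffixes)
        for token in text.split()
        if token not in stopwords
    ]


# Hypothetical, transliterated toy resources (assumptions for illustration,
# not the stop word database or affix rules built in the research).
STOPWORDS = {"ena", "wede"}
PREFIXES = ["ye"]
SUFFIXES = ["och", "woch"]

terms = preprocess("yebetoch ena zenawoch", STOPWORDS, PREFIXES, SUFFIXES)
```

The `min_stem_len` guard is a design choice of this sketch: it keeps very short tokens from being reduced to nothing when they happen to look like an affix.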
To identify document representatives, the tf × idf weighting technique is used. Class
vectors, also called centroid vectors, are generated by averaging the
document vectors of each class. After class representatives are identified from the learning
data set, the cosine function is used as the matching technique to automatically classify a
test data set that played no part in the construction of the class vectors.
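The weighting, centroid, and matching steps can be sketched in a few lines. The abstract does not give the exact tf × idf formula, so this sketch assumes raw term frequency multiplied by log(N / document frequency). The function names and the English toy tokens are illustrative stand-ins for stemmed Amharic terms, not data from the research.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Weight each term by term frequency times log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def centroid(vectors):
    """Average the document vectors of one class, term by term."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(doc, idf, centroids):
    """Assign the class whose centroid is most similar to the document."""
    vec = {t: tf * idf[t] for t, tf in Counter(doc).items() if t in idf}
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy learning set: (stemmed tokens, class label). Purely illustrative.
train = [
    (["goal", "match", "team"], "sport"),
    (["team", "coach", "match"], "sport"),
    (["bank", "loan", "market"], "economy"),
    (["market", "price", "bank"], "economy"),
]

vectors, idf = tfidf_vectors([tokens for tokens, _ in train])
by_class = defaultdict(list)
for vec, (_, label) in zip(vectors, train):
    by_class[label].append(vec)
centroids = {label: centroid(vecs) for label, vecs in by_class.items()}

predicted = classify(["match", "goal", "price"], idf, centroids)
```

Terms in a test document that never occurred in the learning set are ignored here, since they have no idf value. That is an assumption of this sketch rather than something the abstract states.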
Overall, the results show that statistical techniques can be used to
analyze Amharic news items and classify them automatically into predefined classes.
After training, the classifier correctly classified 273 of 321 news items (about 85%).
The result is very promising; however, further work is recommended
before the system is implemented.
Keywords
Information Science