Automatic Classification of Amharic News Items: the Case of Ethiopian News Agency

Date

2001-07

Publisher

Addis Ababa University

Abstract

To organize its news stock efficiently and to facilitate the storage and retrieval of news items, the Ethiopian News Agency (ENA) uses a classification scheme developed in-house. Because of the large volume of news items it produces each year, ENA has difficulty classifying them in a timely manner. This research developed the Amharic News Classifier (ANC), which automatically classifies Amharic news items into predefined classes based on their content. The development of an automatic document classification system passes through different steps, and different methods can be used at each step. This research used statistical techniques of automatic classification in all the steps. The steps in automatic classification include document analysis, generation of document and class vectors based on document and class representatives, and matching of document and class vectors to determine the class to which a document belongs. Document analysis requires preprocessing activities such as stemming and stopword removal, which are language dependent. In this research, the key terms are stemmed using a simple depluralization and suffix and prefix removal program developed for this purpose. A database of stopwords, containing the most frequently occurring Amharic words, was also developed. In addition, problems related to the Amharic script were considered during text processing. To identify document representatives, the tf × idf weighting technique is used. Class vectors, also called centroid vectors, are generated by computing the average of the document vectors in each class. After class representatives were identified from the learning data set, the cosine function was used as a matching technique to automatically classify the test data set, which had played no part in the construction of the class vectors.
The overall result of this research shows that statistical techniques can be used to analyze Amharic news items and classify them automatically into predefined classes. After training the classifier, 273 out of 321 news items (about 85%) were correctly classified by the system. The result is very promising; however, additional work is recommended before the system is implemented.
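
The pipeline the abstract describes (tf × idf document vectors, class centroids obtained by averaging, and cosine matching) can be illustrated with a minimal Python sketch. This is not the ANC implementation: the function names, the logarithmic idf variant, and the toy English tokens are all assumptions made for illustration, and the input is assumed to be already stemmed and stripped of stopwords.

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(docs):
    """Turn token lists into sparse tf x idf vectors (dicts of term -> weight).
    Assumption: idf = log(N / df); the thesis may use a different variant."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in docs
    ]
    return vectors, idf

def centroid(vectors):
    """Class (centroid) vector: the average of the class's document vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(docs, labels):
    """Build one centroid per class from the learning data set."""
    vectors, idf = tfidf_vectors(docs)
    by_class = defaultdict(list)
    for vec, label in zip(vectors, labels):
        by_class[label].append(vec)
    return {c: centroid(vs) for c, vs in by_class.items()}, idf

def classify(tokens, centroids, idf):
    """Assign a document to the class whose centroid is most similar.
    Terms unseen during training are ignored."""
    vec = {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))

# Toy stand-in data (hypothetical; real input would be preprocessed Amharic tokens).
train_docs = [
    ["goal", "match", "team"],
    ["team", "coach", "match"],
    ["bank", "loan", "market"],
    ["market", "price", "bank"],
]
train_labels = ["sport", "sport", "economy", "economy"]

centroids, idf = train(train_docs, train_labels)
print(classify(["match", "coach"], centroids, idf))
```

Sparse dictionaries are used here only to keep the sketch dependency-free; the test documents are kept separate from the training documents, mirroring the abstract's point that the test set took no part in building the class vectors.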

Keywords

Information Science
