Automatic Amharic Text News Classification: A Neural Networks Approach

Date

2009-09

Publisher

Addis Ababa University

Abstract

Text classification is one of the methods used to organize massive volumes of textual information in a meaningful context so that the information can be used to the fullest. Automatic text classification is the preferred approach when the volume of information is large. Research on automatic classification is flourishing for other languages, whereas research on automatic Amharic text classification is in its infancy and very few attempts have been made so far. This study contributes to automatic Amharic text classification. Before the classifier is constructed, the data is preprocessed to make it ready for the learning algorithm: Amharic characters that share the same sound are mapped to one common form, word variants are stemmed, and stop words, punctuation marks and numbers are removed. A Document Frequency (DF) threshold is then applied to select the features of the news items. Two weighting schemes, Term Frequency (TF) and Term Frequency by Inverse Document Frequency (TF*IDF), are used to weight the features in the news documents and construct a news-by-features matrix, which is fed to the learning algorithm. The study considers a neural network learning method called Learning Vector Quantization (LVQ) to assess its suitability for automatic Amharic news text classification. The TF weighting scheme is found to outperform the TF*IDF weighting scheme by 3.54 percentage points on average. Using TF weights, accuracies of 94.81%, 61.61% and 70.08% are obtained in the three-, six- and nine-category experiments respectively, an average of 75.5%. In the same experiments, TF*IDF weights give accuracies of 69.63%, 78.22% and 68.03%, an average of 71.96%.
Previous research on Amharic text classification shows that accuracy decreases consistently as the number of categories increases. The results of this study show that accuracy does not depend on the number of news items and categories considered; rather, accuracy is determined by representing each category with a sufficient number of subclasses. Therefore, finding the optimum number of subclasses is the major direction for further research on Amharic news text classification using LVQ.
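
The pipeline summarized in the abstract (DF-threshold feature selection, TF or TF*IDF weighting, then LVQ) can be illustrated with a minimal sketch. It is not the study's implementation. Several things here are assumptions: the abstract does not say which LVQ variant, learning rate, IDF formula or initialization was used, so the sketch uses the basic LVQ1 update, a linearly decaying rate, IDF = log(N / DF), and codebook vectors seeded from the first examples of each class. All function names are hypothetical, and the English tokens are placeholders standing in for news items that have already been normalized, stemmed and stripped of stop words.

```python
import math
from collections import Counter

def select_features(docs, min_df=2):
    """Keep terms that occur in at least min_df documents (DF threshold)."""
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(t for t, n in df.items() if n >= min_df)
    return vocab, df

def weight_matrix(docs, vocab, df, scheme="tf"):
    """Build a news-by-features matrix with TF or TF*IDF weights."""
    n_docs = len(docs)
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = []
        for term in vocab:
            tf = float(counts[term])
            # Assumed IDF form: log(N / DF); the source does not give its formula.
            row.append(tf * math.log(n_docs / df[term]) if scheme == "tfidf" else tf)
        rows.append(row)
    return rows

def init_codebook(X, y, per_class=1):
    """Seed per_class codebook vectors (subclasses) per category (assumed scheme)."""
    codebook, used = [], Counter()
    for x, label in zip(X, y):
        if used[label] < per_class:
            codebook.append((list(x), label))
            used[label] += 1
    return codebook

def nearest(codebook, x):
    """Index of the codebook vector closest to x (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(codebook[i][0], x)))

def train_lvq1(X, y, codebook, alpha=0.3, epochs=20):
    """LVQ1: pull the winning vector toward x if labels match, push it away otherwise."""
    for epoch in range(epochs):
        rate = alpha * (1 - epoch / epochs)  # linearly decaying learning rate
        for x, label in zip(X, y):
            i = nearest(codebook, x)
            w, w_label = codebook[i]
            sign = 1.0 if w_label == label else -1.0
            codebook[i] = ([wj + sign * rate * (xj - wj) for wj, xj in zip(w, x)], w_label)
    return codebook

def predict(codebook, x):
    """Assign x the label of its nearest codebook vector."""
    return codebook[nearest(codebook, x)][1]

# Placeholder tokens, not data from the study.
docs = [["goal", "match", "team"], ["team", "goal", "coach"],
        ["market", "price", "bank"], ["bank", "price", "loan"]]
labels = ["sport", "sport", "economy", "economy"]

vocab, df = select_features(docs, min_df=2)
X_tf = weight_matrix(docs, vocab, df, scheme="tf")
X_tfidf = weight_matrix(docs, vocab, df, scheme="tfidf")
codebook = train_lvq1(X_tf, labels, init_codebook(X_tf, labels, per_class=1))
predictions = [predict(codebook, x) for x in X_tf]
```

In this sketch the `per_class` argument plays the role of the number of subclasses per category, the quantity the abstract identifies as the main factor behind accuracy; raising it gives each category more codebook vectors to represent it.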


Citation