Automatic Amharic Text News Classification: A Neural Networks Approach

Date

2009-10

Publisher

Addis Ababa University

Abstract

Text classification is one of the methods used to organize the massive amount of available textual information in a meaningful context and so maximize its utilization. Automatic text classification is the preferred method for classifying large volumes of information. Research on automatic classification is flourishing for other languages, whereas research on automatic Amharic text classification is in its infancy, and very few attempts have been made so far. This study contributes to automatic Amharic text classification. Before the classifier was constructed, the data were preprocessed to make them ready for the learning algorithm: Amharic characters with the same sound were mapped to one common form, word variants were stemmed, and stop words, punctuation marks and numbers were removed. A Document Frequency (DF) threshold was then applied to select features of the news items. Two weighting schemes, Term Frequency (TF) and Term Frequency by Inverse Document Frequency (TF*IDF), were used to weight the features in the news documents and construct a news-by-features matrix, which is fed to the learning algorithm. The study considers one neural network learning method, Learning Vector Quantization (LVQ), to assess its suitability for automatic Amharic news text classification. The TF weighting scheme was found to outperform the TF*IDF weighting scheme by 3.54 percentage points on average. Using TF weighting, accuracies of 94.81%, 61.61% and 70.08% were obtained in the three-, six- and nine-category experiments respectively, with an average accuracy of 75.5%. For similar experiments, TF*IDF weighting resulted in accuracies of 69.63%, 78.22% and 68.03%, with an average of 71.96%. Previous research on Amharic text classification shows that accuracy decreases consistently as the number of categories increases.
The results of this study show that accuracy does not depend on the number of news items and categories considered; rather, representing each category with a sufficient number of subclasses determines accuracy. Therefore, further work on finding the optimum number of subclasses is the major research direction for Amharic news text classification using LVQ.
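
The pipeline summarized in the abstract (DF-threshold feature selection, TF or TF*IDF weighting into a news-by-features matrix, then LVQ) can be illustrated with a small sketch. This is not the thesis implementation. The function names, the `min_df` and `per_class` parameters, the natural-log IDF formula, and the choice of the basic LVQ1 update rule are all assumptions made for illustration, and the English tokens are placeholders for Amharic tokens that have already been normalized, stemmed and stripped of stop words. The `per_class` parameter plays the role of the number of subclasses (prototypes) per category.

```python
import math
from collections import Counter

def build_matrix(docs, min_df=2, use_idf=False):
    """Build a news-by-features matrix from tokenized documents.

    Terms appearing in fewer than min_df documents are dropped (DF threshold).
    Weights are raw TF, or TF * ln(N / DF) when use_idf is True (assumed variant).
    """
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(term for term, n in df.items() if n >= min_df)
    n_docs = len(docs)
    rows = []
    for doc in docs:
        tf = Counter(doc)
        row = []
        for term in vocab:
            weight = float(tf[term])
            if use_idf:
                weight *= math.log(n_docs / df[term])
            row.append(weight)
        rows.append(row)
    return vocab, rows

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(prototypes, x):
    """Index of the prototype closest to x."""
    return min(range(len(prototypes)), key=lambda i: sq_dist(prototypes[i][0], x))

def train_lvq1(rows, labels, per_class=1, lr=0.3, epochs=20):
    """Train LVQ1 with per_class prototypes (subclasses) for each category."""
    prototypes = []
    for category in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == category]
        # Initialize each prototype from an example of its own category.
        prototypes += [(list(m), category) for m in members[:per_class]]
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)  # linearly decaying learning rate
        for x, y in zip(rows, labels):
            i = nearest(prototypes, x)
            vec, category = prototypes[i]
            # Pull the winner toward x if its label matches, push it away otherwise.
            sign = 1.0 if category == y else -1.0
            moved = [v + sign * rate * (xv - v) for v, xv in zip(vec, x)]
            prototypes[i] = (moved, category)
    return prototypes

def predict(prototypes, x):
    """Label of the nearest prototype."""
    return prototypes[nearest(prototypes, x)][1]

# Toy corpus with placeholder tokens (hypothetical data, not from the study).
docs = [
    ["goal", "match", "team", "goal"],
    ["team", "match", "coach"],
    ["bank", "market", "price"],
    ["market", "price", "bank", "price"],
]
labels = ["sport", "sport", "economy", "economy"]

vocab, tf_rows = build_matrix(docs, min_df=2)
_, tfidf_rows = build_matrix(docs, min_df=2, use_idf=True)
model = train_lvq1(tf_rows, labels, per_class=1)
predictions = [predict(model, row) for row in tf_rows]
```

Raising `per_class` gives each category more prototypes, which is one way to explore the abstract's claim that accuracy depends on representing each category with enough subclasses.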

Keywords

Information Science
