Automatic Amharic Text News Classification: A Neural Networks Approach
Date
2009-10
Publisher
Addis Ababa University
Abstract
Text classification is one of the methods used to organize the massive amount of available textual
information in a meaningful context to maximize utilization of information. Automatic text
classification is the preferred method for classifying large volumes of information. Research
on automatic classification is flourishing in the context of other languages, whereas
research on automatic Amharic text classification is in its infancy, and very few attempts
have been made so far. This study contributes to automatic Amharic text classification.
Before the classifier is constructed, the data are preprocessed to make them ready for
the learning algorithm. Preprocessing includes replacing Amharic characters that share the same
sound with one common form, stemming word variants, and removing stop words, punctuation marks
and numbers. A Document Frequency (DF) threshold is then applied to select the features of news items.
Two weighting schemes, Term Frequency (TF) and Term Frequency by Inverse Document
Frequency (TF*IDF), are used to weight the features in news documents and construct a
news-by-features matrix, which is fed to the learning algorithm. This study considers a neural
network learning method called Learning Vector Quantization (LVQ) to assess its suitability for
automatic Amharic text news classification. In the course of this study, the TF weighting
scheme is found to outperform the TF*IDF weighting scheme by 3.54% on average. Using the TF
weighting method, accuracies of 94.81%, 61.61% and 70.08% are obtained in the three-, six- and
nine-category experiments respectively, with an average accuracy of 75.5%. For similar experiments,
the TF*IDF weighting method resulted in accuracies of 69.63%, 78.22% and 68.03%,
with an average accuracy of 71.96%.
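
The feature selection and weighting steps described above can be illustrated with a minimal sketch. It is not the implementation used in the study: the function name build_matrix, the min_df value, the toy English tokens, and the IDF form log(N/df) are all assumptions made for illustration, since the abstract does not state the exact threshold or formula.

```python
import math
from collections import Counter

def build_matrix(docs, min_df=2, scheme="tf"):
    """Build a documents-by-features matrix from tokenized documents.

    docs: list of token lists, assumed already normalized, stemmed,
    and stripped of stop words, punctuation and numbers.
    """
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # DF threshold (assumed form): keep terms found in at least min_df documents.
    vocab = sorted(t for t, n in df.items() if n >= min_df)
    n_docs = len(docs)
    matrix = []
    for doc in docs:
        tf = Counter(doc)
        row = []
        for term in vocab:
            weight = float(tf[term])  # raw term frequency
            if scheme == "tfidf":
                # Assumed IDF variant: log(N / df); other variants exist.
                weight *= math.log(n_docs / df[term])
            row.append(weight)
        matrix.append(row)
    return vocab, matrix

# Toy documents (hypothetical tokens standing in for stemmed Amharic words).
docs = [
    ["sport", "team", "goal"],
    ["sport", "goal", "goal"],
    ["bank", "market", "team"],
]
vocab, tf_matrix = build_matrix(docs, min_df=2, scheme="tf")
_, tfidf_matrix = build_matrix(docs, min_df=2, scheme="tfidf")
```

With these toy documents, "bank" and "market" occur in only one document each and are dropped by the threshold, so each row has three features. Each row of the resulting matrix is the vector that would be passed to the learning algorithm.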
Previous research on Amharic text classification shows that accuracy decreases
consistently as the number of categories increases. The result of this study shows that accuracy
does not depend on the number of news items and categories considered; rather, representing each
category with a sufficient number of subclasses determines accuracy. Therefore, finding the
optimum number of subclasses is the major direction for further research on Amharic text news
classification using LVQ.
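
To make the role of subclasses concrete, the following is a minimal sketch of the basic LVQ1 update rule with several prototype vectors per category. It is an illustration only: the abstract does not say which LVQ variant, learning rate, or initialization the study used, so train_lvq1, per_class, lr, epochs and the toy data are all hypothetical, and "subclasses" is read here as prototypes per category.

```python
import random

def nearest(prototypes, x):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(
        range(len(prototypes)),
        key=lambda i: sum((p - v) ** 2 for p, v in zip(prototypes[i][0], x)),
    )

def train_lvq1(X, y, per_class=2, lr=0.1, epochs=30, seed=0):
    """Train LVQ1 with per_class prototypes (subclasses) for each category."""
    rng = random.Random(seed)
    prototypes = []
    # Initialise each category's prototypes from its own training samples.
    for label in sorted(set(y)):
        members = [x for x, t in zip(X, y) if t == label]
        for x in rng.sample(members, min(per_class, len(members))):
            prototypes.append((list(x), label))
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)  # linearly decaying learning rate
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            vec, label = prototypes[nearest(prototypes, X[i])]
            # Pull the winner toward the sample if labels match, else push it away.
            sign = 1.0 if label == y[i] else -1.0
            for j in range(len(vec)):
                vec[j] += sign * rate * (X[i][j] - vec[j])
    return prototypes

def predict(prototypes, x):
    """Return the label of the nearest prototype."""
    return prototypes[nearest(prototypes, x)][1]

# Toy TF vectors (hypothetical) for two categories.
X = [[2, 1, 0], [3, 0, 0], [2, 2, 0], [0, 0, 2], [0, 1, 3], [0, 0, 3]]
y = ["sport", "sport", "sport", "economy", "economy", "economy"]
model = train_lvq1(X, y, per_class=2)
print(predict(model, [3, 1, 0]), predict(model, [0, 0, 4]))
```

Varying per_class while holding the data fixed is one simple way to explore how the number of subclasses per category affects accuracy, which is the open question the abstract identifies.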
Keywords
Information Science