A Two step Approach for Tigrigna Text Categorization
Date
2011-06
Publisher
Addis Ababa University
Abstract
Tigrigna is a Semitic language spoken by the Tigray people in northern Ethiopia and in
Eritrea, with more than six million speakers worldwide. Large collections of Tigrigna
documents are available on the web, in addition to hard-copy documents in libraries and
documentation centers. As the number of documents grows, identifying the documents
relevant to a specific topic becomes increasingly difficult. A text categorization
mechanism is therefore required for finding, filtering, and managing this rapidly growing
body of online information.
Considerable research has been done on text categorization, especially news text
classification, using different machine learning approaches, and good results have been
reported. However, as text corpora grow, classifying text against predefined categories
becomes extremely costly and time-consuming. Classifiers that can learn from unlabeled
data are therefore needed. Hence, this study attempts to design a two-step Tigrigna text
categorization system. First, clustering is used to find natural groupings of the
unlabeled Tigrigna text documents. Repeated bisection and direct k-means clustering
algorithms are applied to the Tigrigna data set for this purpose. The repeated bisection
algorithm outperforms the direct k-means algorithm, so its results are selected for the
classification task.
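The clustering step can be illustrated with a minimal sketch of repeated bisection, assuming a simplified variant: the largest cluster is split in two by a 2-means pass over cosine similarity until k clusters exist. The function names, the deterministic seeding rule, the whitespace tokenizer, and the Latin-script placeholder documents are all assumptions made for illustration; they are not taken from the study, which does not describe its implementation here.

```python
import math
from collections import Counter

def tf_vector(text):
    """Unit-length term-frequency vector for a whitespace-tokenized document."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {t: c / norm for t, c in counts.items()}

def cosine(a, b):
    """Dot product of two sparse vectors (the cosine when both are unit length)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def centroid(vectors):
    """Unit-length mean direction of a list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}

def bisect(indices, vectors, max_iters=20):
    """Split one cluster in two with a 2-means pass (hypothetical seeding rule)."""
    center = centroid([vectors[i] for i in indices])
    # Seed A: document least similar to the cluster centroid.
    a = min(indices, key=lambda i: cosine(vectors[i], center))
    # Seed B: document least similar to seed A.
    b = min(indices, key=lambda i: cosine(vectors[i], vectors[a]))
    centers = [vectors[a], vectors[b]]
    previous = None
    halves = [[], []]
    for _ in range(max_iters):
        halves = [[], []]
        for i in indices:
            closer_to_b = cosine(vectors[i], centers[1]) > cosine(vectors[i], centers[0])
            halves[int(closer_to_b)].append(i)
        if halves == previous or not halves[0] or not halves[1]:
            break  # converged, or the split degenerated
        centers = [centroid([vectors[i] for i in h]) for h in halves]
        previous = halves
    return halves

def repeated_bisection(vectors, k):
    """Bisect the largest cluster repeatedly until k clusters exist."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        if len(largest) < 2:
            clusters.append(largest)
            break
        left, right = bisect(largest, vectors)
        if not left or not right:
            clusters.append(largest)  # cannot be split further
            break
        clusters += [left, right]
    return clusters

# Placeholder documents; real Tigrigna text would need its own tokenization.
docs = [
    "sport football goal", "football match goal", "sport match football",
    "bank market price", "market price trade", "bank trade market",
]
clusters = repeated_bisection([tf_vector(d) for d in docs], k=2)
```

Direct k-means, by contrast, would place all k centers at once and refine them together; the bisecting variant only ever solves two-way splits.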
For the classification task, decision tree and support vector machine techniques are
used. The SMO support vector machine classifier performs better than the J48 decision
tree classifier, with SMO achieving 82.4% correct classification. However, designing a
Tigrigna text categorization system poses challenges. Notable among them are the mismatch
encountered between the clustering and classification algorithms, and the ambiguity of
the Tigrigna language, which calls for further research on applying ontology-based
hierarchical text categorization.
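The hand-off between the two steps, where cluster IDs from step one serve as class labels for step two, can be sketched as follows. This is only an illustration: a nearest-centroid classifier stands in for the SMO and J48 classifiers named above, and the function names, placeholder documents, and cluster IDs are hypothetical. The score computed at the end corresponds to the "percent correct classification" measure, not to the study's reported figure.

```python
import math
from collections import Counter, defaultdict

def tf_vector(text):
    """Unit-length term-frequency vector for a whitespace-tokenized document."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {t: c / norm for t, c in counts.items()}

def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def train_centroids(texts, cluster_ids):
    """Step 2 training: treat each cluster ID from step 1 as a class label."""
    sums = defaultdict(Counter)
    for text, label in zip(texts, cluster_ids):
        for term, weight in tf_vector(text).items():
            sums[label][term] += weight
    return dict(sums)

def predict(centroids, text):
    """Assign the label whose centroid has the highest cosine with the text."""
    vec = tf_vector(text)

    def score(label):
        c = centroids[label]
        norm = math.sqrt(sum(w * w for w in c.values())) or 1.0
        return dot(vec, c) / norm

    return max(centroids, key=score)

def percent_correct(centroids, texts, labels):
    """Share of documents whose predicted label matches the given label."""
    hits = sum(predict(centroids, t) == y for t, y in zip(texts, labels))
    return 100.0 * hits / len(texts)

# Hypothetical output of the clustering step: documents plus cluster IDs.
train_texts = ["sport football goal", "football match goal",
               "bank market price", "market price trade"]
train_ids = [0, 0, 1, 1]

model = train_centroids(train_texts, train_ids)
held_out = ["football goal match", "bank trade price"]
held_out_ids = [0, 1]
accuracy = percent_correct(model, held_out, held_out_ids)
```

Because the labels come from clustering rather than from human annotators, any grouping errors in step one carry over into the training data for step two, which is one way a mismatch between the two stages can arise.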
Keywords
Information Retrieval