A Two step Approach for Tigrigna Text Categorization

Assefa, Gebrehiwot

dc.contributor.advisor	Meshesha, Million(PhD)
dc.contributor.author	Assefa, Gebrehiwot
dc.date.accessioned	2018-11-16T08:34:00Z
dc.date.accessioned	2023-11-18T12:44:01Z
dc.date.available	2018-11-16T08:34:00Z
dc.date.available	2023-11-18T12:44:01Z
dc.date.issued	2011-06
dc.description.abstract	Tigrigna is a Semitic language spoken by the Tigray people in northern Ethiopia and Eritrea, with more than six million speakers worldwide. Large collections of Tigrigna documents are available on the web, in addition to hard-copy documents in libraries and documentation centers. As the number of documents grows, identifying the documents relevant to a specific topic becomes increasingly difficult, so a text categorization mechanism is required for finding, filtering, and managing the rapidly growing body of online information. Several studies have addressed text categorization, especially news text classification, using different machine learning approaches, and have reported good results. However, as a text corpus grows, classifying text against predefined categories becomes extremely costly and time-consuming, so classifiers that can learn from unlabeled data are needed. Hence, this study attempts to design a two-step Tigrigna text categorization system. First, clustering is used to find natural groupings of the unlabeled Tigrigna text documents, using the repeated bisection and direct k-means clustering algorithms. Repeated bisection outperforms direct k-means, so its results are selected for the classification task. Second, decision tree and support vector machine techniques are used for classification. The SMO support vector machine classifier performs better than the J48 decision tree classifier, registering 82.4% correct classification.
However, designing a Tigrigna text categorization system poses challenges; worth mentioning are the mismatch encountered between the clustering and classification algorithms, and the ambiguity of the Tigrigna language, which calls for further research into ontology-based hierarchical text categorization.	en_US
dc.identifier.uri	http://etd.aau.edu.et/handle/12345678/14345
dc.language.iso	en	en_US
dc.publisher	Addis Ababa University	en_US
dc.subject	Information Retrieval	en_US
dc.title	A Two step Approach for Tigrigna Text Categorization	en_US
dc.type	Thesis	en_US
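
The two-step idea described in the abstract (cluster unlabeled documents, then train a classifier on the resulting cluster labels) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: it uses plain k-means seeded with the first k documents instead of repeated bisection, a simple perceptron as a stand-in for the SMO and J48 classifiers, and hypothetical English placeholder tokens instead of Tigrigna text. All function names are illustrative.

```python
import math
from collections import Counter

def build_vocab(docs):
    """Sorted list of all whitespace-separated tokens in the corpus."""
    return sorted({tok for doc in docs for tok in doc.split()})

def vectorize(doc, vocab):
    """Unit-length term-frequency vector; tokens outside vocab are ignored."""
    counts = Counter(doc.split())
    vec = [float(counts.get(tok, 0)) for tok in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Step 1: plain k-means. Assumption: the first k documents are the seeds."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every document to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def train_perceptron(vectors, labels, epochs=10):
    """Step 2: binary perceptron trained on cluster labels (0 or 1)."""
    w, b = [0.0] * len(vectors[0]), 0.0
    for _ in range(epochs):
        for x, lab in zip(vectors, labels):
            y = 1 if lab == 1 else -1
            if y * (dot(w, x) + b) <= 0:  # misclassified: update weights
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def predict(x, w, b):
    return 1 if dot(w, x) + b > 0 else 0

# Hypothetical toy corpus: two topics with disjoint vocabularies.
docs = [
    "football match goal team",
    "election vote parliament party",
    "team goal league football",
    "party vote election minister",
    "match league team goal",
    "parliament minister vote party",
]
vocab = build_vocab(docs)
X = [vectorize(d, vocab) for d in docs]
cluster_labels = kmeans(X, k=2)             # step 1: label by clustering
w, b = train_perceptron(X, cluster_labels)  # step 2: learn from those labels
print(cluster_labels, predict(vectorize("goal team", vocab), w, b))
```

The point of the design is that no document is labeled by hand: the classifier only ever sees labels produced by the clustering step, so any clustering error carries over into classification.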

Files

Original bundle

Name: Gebrehiwot Assefa.pdf
Size: 933.24 KB
Format: Adobe Portable Document Format

Collections

Information Sciences