
Browsing by Author "Bekele, Rediat"

Now showing 1 - 1 of 1
  • Item
    A Comparative Study of Automatic Language Identification of Ethio-Semitic Languages
    (Addis Ababa University, 2018-06-06) Bekele, Rediat; Mulugeta, Wondwossen (PhD)
    The dominant languages in the Ethio-Semitic family are Amharic, Geez, Guragigna and Tigrigna. Language identification studies on European languages conclude that most classifiers reach accuracies close to 100%, yet local and international studies confirm that the Naïve Bayes Classifier (NBC) does not reach 100% accuracy, especially on shorter test strings. Comparative language identification studies on European languages show that Cumulative Frequency Addition (CFA) approaches 100% accuracy and outperforms the NBC classifier. The purpose of this study is to assess the performance of CFA compared to NBC on Ethio-Semitic languages, to validate the reported findings on the two classifiers, and to recommend the classifier, language model, evaluation context and optimal value of N that perform best in language identification. We employed an experimental study to measure the performance of the CFA and NBC classifiers. We developed training and test corpora from online bibles written in Amharic, Geez, Guragigna and Tigrigna to generate five different character-based n-gram language models, and measured the classifiers' performance under two different evaluation contexts using 10-fold cross validation, with F-score as the measure for comparing them. Both classifiers exhibited higher performance as the length of the test phrase grew from a single word to two, three and more words, reaching F-scores above 99%, and they performed similarly under each context across the language models and n-grams tested. The fixed-length character n-gram language model with location features achieved the highest F-score for both classifiers under each evaluation context, even on test strings as short as one word. N=5 is the optimal value of N for the fixed-length character n-gram model with location features, whereas N=2 is optimal for the remaining language models, for both classifiers and both evaluation contexts. Based on our findings, CFA performs better than NBC, as it rests on sound theoretical assumptions and shows stronger performance in language identification.
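
    To make the two classifiers named in the abstract concrete, below is a minimal sketch (in Python, which the thesis does not specify) of character n-gram language models scored by Cumulative Frequency Addition and by a Naïve Bayes classifier. It is an illustration under stated assumptions, not the thesis's implementation: it covers only the plain character n-gram model, omits the fixed-length-with-location features, the two evaluation contexts, 10-fold cross validation and F-score computation, and trains on placeholder strings rather than the bible corpora used in the research.

    ```python
    from collections import Counter
    from math import log


    def char_ngrams(text, n):
        """Return the overlapping character n-grams of a string."""
        return [text[i:i + n] for i in range(len(text) - n + 1)]


    def train(corpus_by_language, n):
        """Build a relative-frequency table of character n-grams per language."""
        models = {}
        for lang, text in corpus_by_language.items():
            counts = Counter(char_ngrams(text, n))
            total = sum(counts.values())
            models[lang] = {gram: count / total for gram, count in counts.items()}
        return models


    def classify_cfa(models, text, n):
        """CFA: sum the relative frequencies of the test string's n-grams in each
        language model and pick the language with the largest sum."""
        grams = char_ngrams(text, n)
        scores = {lang: sum(freqs.get(g, 0.0) for g in grams)
                  for lang, freqs in models.items()}
        return max(scores, key=scores.get)


    def classify_nbc(models, text, n, floor=1e-9):
        """NBC: sum log relative frequencies (i.e. multiply probabilities); the
        small floor stands in for smoothing of n-grams unseen in training."""
        grams = char_ngrams(text, n)
        scores = {lang: sum(log(freqs.get(g, floor)) for g in grams)
                  for lang, freqs in models.items()}
        return max(scores, key=scores.get)


    if __name__ == "__main__":
        # Toy, transliterated placeholder corpora; the study itself trains on
        # online bibles written in Amharic, Geez, Guragigna and Tigrigna.
        corpus = {
            "lang_a": "selam new endet nesh dehna negn ameseginalehu",
            "lang_b": "kemey aleka dehan do alekum yeqenyeley",
        }
        models = train(corpus, n=2)
        test = "dehna negn"
        print("CFA picks:", classify_cfa(models, test, n=2))
        print("NBC picks:", classify_nbc(models, test, n=2))
    ```

    Note the structural difference: CFA adds frequencies, so an unseen n-gram simply contributes nothing, whereas NBC multiplies probabilities and therefore needs smoothing for unseen n-grams, which is one plausible reason the literature reviewed in the abstract reports weaker NBC accuracy on short test strings.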
