A Comparative Study of Automatic Language Identification of Ethio-Semitic Languages
dc.contributor.advisor | Mulugeta, Wondwossen (PhD) | |
dc.contributor.author | Bekele, Rediat | |
dc.date.accessioned | 2019-07-29T14:07:26Z | |
dc.date.accessioned | 2023-11-18T12:44:25Z | |
dc.date.available | 2019-07-29T14:07:26Z | |
dc.date.available | 2023-11-18T12:44:25Z | |
dc.date.issued | 2018-06-06 | |
dc.description.abstract | The dominant languages in the Ethio-Semitic family are Amharic, Geez, Guragigna and Tigrigna. Language identification studies on European languages have concluded that most classifiers reach an accuracy of 100%. Local and global studies have confirmed that the Naïve Bayes Classifier (NBC) does not reach 100% accuracy in language identification, especially on shorter test strings. Comparative language identification studies on European languages show that Cumulative Frequency Addition (CFA) achieves accuracies close to 100% and outperforms the NBC. The purpose of our study is to assess the performance of CFA compared to NBC on Ethio-Semitic languages, to validate the research findings on the CFA and NBC classifiers, and to recommend the classifier, language model, evaluation context and value of N that perform best in language identification. In this research we employed an experimental study to measure the performance of the CFA and NBC classifiers. We developed a training and test corpus from online Bibles written in Amharic, Geez, Guragigna and Tigrigna and used it to generate 5 different character-based n-gram language models. We measured the classifiers' performance under two different evaluation contexts using 10-fold cross-validation. F-score is used as the performance measure for comparing the classifiers. Both classifiers generally performed better as the test phrase grew from a single word to two, three and more words, reaching an F-score above 99%. Both classifiers performed similarly under each context for the language models and n-grams tested. The fixed-length character n-gram language model with location features achieved the highest F-score for both classifiers under each evaluation context, on test strings as short as one word.
N=5 is the optimal value of N for the fixed-length character n-gram language model with location features, whereas N=2 is the optimal value for the remaining language models, for both the CFA and NBC classifiers and both evaluation contexts. Based on our findings, CFA is the better classifier compared to NBC, on account of its sound theoretical foundations and its performance in language identification. | en_US |
dc.identifier.uri | http://etd.aau.edu.et/handle/12345678/18691 | |
dc.language.iso | en | en_US |
dc.publisher | Addis Ababa University | en_US |
dc.subject | Language Identification | en_US |
dc.subject | Naïve Bayes | en_US |
dc.subject | N-Gram | en_US |
dc.subject | Cumulative Frequency Addition | en_US |
dc.subject | Fixed Length Character N-Grams | en_US |
dc.subject | Infiniti N-Grams | en_US |
dc.subject | N-Gram From Text String | en_US |
dc.title | A Comparative Study of Automatic Language Identification of Ethio-Semitic Languages | en_US |
dc.type | Thesis | en_US |
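
The abstract compares CFA and NBC over character n-gram language models. The following is a minimal sketch of that comparison, not the thesis's implementation. It assumes that CFA scores a test string by summing the relative frequencies of its n-grams in each language, and that NBC sums their log probabilities with uniform priors and add-one smoothing. The function names, the labels `lang_a` and `lang_b`, and the toy Latin-script training strings are hypothetical and stand in for the Bible corpus described above.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(corpus, n):
    """Build one n-gram count table per language from {language: text}."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in corpus.items()}


def classify_cfa(models, text, n):
    """CFA (simplified): add up the relative frequency of each test n-gram."""
    scores = {}
    for lang, counts in models.items():
        total = sum(counts.values())
        scores[lang] = sum(counts[g] / total for g in char_ngrams(text, n))
    return max(scores, key=scores.get)


def classify_nbc(models, text, n):
    """Naive Bayes (simplified): uniform priors, add-one smoothing, log space."""
    vocab = set().union(*models.values())
    scores = {}
    for lang, counts in models.items():
        # The extra +1 reserves probability mass for n-grams never seen in training.
        denom = sum(counts.values()) + len(vocab) + 1
        scores[lang] = sum(
            math.log((counts[g] + 1) / denom) for g in char_ngrams(text, n)
        )
    return max(scores, key=scores.get)


# Hypothetical toy training data; the thesis used a Bible corpus instead.
corpus = {
    "lang_a": "the quick brown fox jumps over the lazy dog",
    "lang_b": "el rapido zorro marron salta sobre el perro perezoso",
}
models = train(corpus, n=2)
print(classify_cfa(models, "the dog", n=2))
print(classify_nbc(models, "el perro", n=2))
```

The two classifiers differ only in how per-n-gram evidence is combined: CFA adds frequencies, so an unseen n-gram contributes zero, while NBC multiplies probabilities (adds logs), so it needs smoothing to keep a single unseen n-gram from ruling out a language.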