A Comparative Study of Automatic Language Identification of Ethio-Semitic Languages

dc.contributor.advisor: Mulugeta, Wondwossen (PhD)
dc.contributor.author: Bekele, Rediat
dc.date.accessioned: 2019-07-29T14:07:26Z
dc.date.accessioned: 2023-11-18T12:44:25Z
dc.date.available: 2019-07-29T14:07:26Z
dc.date.available: 2023-11-18T12:44:25Z
dc.date.issued: 2018-06-06
dc.description.abstract: The dominant languages in the Ethio-Semitic family are Amharic, Geez, Guragigna and Tigrigna. Findings from language identification studies on European languages conclude that most classifiers reach an accuracy of 100%. Local and global studies, however, confirm that the Naïve Bayes Classifier (NBC) does not reach 100% accuracy in language identification, especially on shorter test strings. Comparative language identification studies on European languages show that Cumulative Frequency Addition (CFA) achieves accuracies close to 100% and outperforms the NBC classifier. The purpose of our study is to assess the performance of CFA as compared to NBC on Ethio-Semitic languages, to validate the research findings on the CFA and NBC classifiers, and to recommend the classifier, language model, evaluation context and optimal value of N that perform best in language identification. In this research we employed an experimental study to measure the performance of the CFA and NBC classifiers. We developed a training and test corpus from online bibles written in Amharic, Geez, Guragigna and Tigrigna to generate five different character-based n-gram language models. We measured the classifiers' performance under two different evaluation contexts using 10-fold cross validation. F-score is used as the measure of performance for comparing the classifiers. Both classifiers exhibited higher performance as the length of the test phrase grew from a single word to two, three and more words, reaching F-scores beyond 99%. Both classifiers performed similarly under each context across the language models and n-grams tested. The language model of fixed-length character n-grams with location features exhibited the highest F-score for both classifiers under each evaluation context, on test strings as short as one word. N=5 is the optimal value of N for the fixed-length character n-grams with location features language model, whereas N=2 is the optimal value for the remaining language models, for both classifiers and both evaluation contexts. Based on our findings, CFA is the better-performing classifier compared to NBC, as it is founded on sound theoretical assumptions and shows stronger performance in language identification.
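To make the comparison in the abstract concrete, the sketch below illustrates, under simplifying assumptions, how character n-gram counts can drive the two scoring rules: Naïve Bayes as a sum of smoothed log conditional probabilities, and Cumulative Frequency Addition as a sum of relative n-gram frequencies. This is not the thesis code; the toy corpora, N=2, add-one smoothing and all function names are illustrative choices only.

# Minimal sketch (assumed, not from the thesis) of NBC vs. CFA scoring
# over character n-gram counts. Real models would be built from the
# Amharic, Geez, Guragigna and Tigrigna bible corpora described above.
import math
from collections import Counter

def char_ngrams(text, n=2):
    """Overlapping character n-grams of a string (N=2 shown; the thesis tests several N)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(corpora, n=2):
    """corpora: dict mapping language -> training text. Returns per-language n-gram counts."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in corpora.items()}

def nbc_score(test, counts, n=2):
    """Naive Bayes: sum of log P(ngram | language) with add-one smoothing."""
    total = sum(counts.values()) + len(counts)
    return sum(math.log((counts[g] + 1) / total) for g in char_ngrams(test, n))

def cfa_score(test, counts, n=2):
    """Cumulative Frequency Addition: sum of relative n-gram frequencies (no multiplication)."""
    total = sum(counts.values())
    return sum(counts[g] / total for g in char_ngrams(test, n))

def identify(test, models, scorer, n=2):
    """Pick the language whose model gives the test string the highest score."""
    return max(models, key=lambda lang: scorer(test, models[lang], n))

# Toy usage: two tiny "corpora" standing in for real training data.
models = train({"amh": "ሰላም ለዓለም", "tir": "ሰላም ንዓለም"}, n=2)
print(identify("ሰላም", models, cfa_score), identify("ሰላም", models, nbc_score))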
dc.identifier.uri: http://etd.aau.edu.et/handle/12345678/18691
dc.language.iso: en
dc.publisher: Addis Ababa University
dc.subject: Language Identification
dc.subject: Naïve Bayes
dc.subject: N-Gram
dc.subject: Cumulative Frequency Addition
dc.subject: Fixed Length Character N-Grams
dc.subject: Infinity N-Grams
dc.subject: N-Gram From Text String
dc.title: A Comparative Study of Automatic Language Identification of Ethio-Semitic Languages
dc.type: Thesis

Files

Original bundle
Name: Rediat Bekele 2018.pdf
Size: 2.55 MB
Format: Adobe Portable Document Format