A Comparative Study of Automatic Language Identification of Ethio-Semitic Languages

Bekele, Rediat

dc.contributor.advisor	Mulugeta, Wondwossen (PhD)
dc.contributor.author	Bekele, Rediat
dc.date.accessioned	2019-07-29T14:07:26Z
dc.date.accessioned	2023-11-18T12:44:25Z
dc.date.available	2019-07-29T14:07:26Z
dc.date.available	2023-11-18T12:44:25Z
dc.date.issued	2018-06-06
dc.description.abstract	The dominant languages in the Ethio-Semitic family are Amharic, Geez, Guragigna and Tigrigna. Language identification studies on European languages have concluded that most classifiers reach an accuracy of 100%. Local and international studies have confirmed that the Naïve Bayes Classifier (NBC) does not reach 100% accuracy in language identification, especially on shorter test strings. Comparative language identification studies on European languages show that Cumulative Frequency Addition (CFA) achieves accuracies close to 100% and outperforms NBC. The purpose of our study is to assess the performance of CFA compared to NBC on Ethio-Semitic languages, to validate the research findings on the CFA and NBC classifiers, and to recommend the classifier, language model, evaluation context and value of N that perform best in language identification. In this research we employed an experimental study to measure the performance of the CFA and NBC classifiers. We developed a training and test corpus from online Bibles written in Amharic, Geez, Guragigna and Tigrigna to generate 5 different character-based n-gram language models. We measured the classifiers' performance under two different evaluation contexts using 10-fold cross-validation. F-score is used as the performance measure for comparing the classifiers. Both classifiers generally exhibited higher performance as the length of the test phrase grew from a single word to 2, 3 and more words, reaching an F-score above 99%. Both classifiers performed similarly under each evaluation context for the language models and n-grams tested. The language model using fixed-length character n-grams with location features exhibited the highest F-score for both classifiers under each evaluation context, on test strings as short as one word.
For the fixed-length character n-grams with location features language model, N=5 is the optimal value of N, whereas N=2 is optimal for the remaining language models, for both the CFA and NBC classifiers and both evaluation contexts. Based on our findings, CFA is the better classifier compared to NBC, both because it is founded on sound theoretical assumptions and because of its performance in language identification.	en_US
dc.identifier.uri	http://etd.aau.edu.et/handle/12345678/18691
dc.language.iso	en	en_US
dc.publisher	Addis Ababa University	en_US
dc.subject	Language Identification	en_US
dc.subject	Naïve Bayes	en_US
dc.subject	N-Gram	en_US
dc.subject	Cumulative Frequency Addition	en_US
dc.subject	Fixed Length Character N-Grams	en_US
dc.subject	Infiniti N-Grams	en_US
dc.subject	N-Gram From Text String	en_US
dc.title	A Comparative Study of Automatic Language Identification of Ethio-Semitic Languages	en_US
dc.type	Thesis	en_US
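
The two classifiers compared in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the toy corpus with placeholder labels `lang_a` and `lang_b`, the within-language relative-frequency normalization used for CFA, and the add-one smoothing used for NBC are all assumptions made for illustration. It covers only plain fixed-length character n-grams, not the location-feature model, and it is meant only to show the core difference, namely that CFA adds n-gram frequencies while NBC multiplies probabilities (here summed in log space).

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(corpus, n):
    """Build one n-gram count table per language from {language: [strings]}."""
    return {
        lang: Counter(g for s in strings for g in char_ngrams(s, n))
        for lang, strings in corpus.items()
    }


def cfa_score(counts, grams):
    """CFA (assumed normalization): add up the relative frequencies of the n-grams."""
    total = sum(counts.values())
    return sum(counts[g] / total for g in grams)


def nbc_score(counts, grams, vocab_size):
    """NBC: sum of log-probabilities, with add-one smoothing (an assumption)."""
    total = sum(counts.values())
    return sum(math.log((counts[g] + 1) / (total + vocab_size)) for g in grams)


def identify(models, text, n, method="cfa"):
    """Return the language whose model gives `text` the highest score."""
    grams = char_ngrams(text, n)
    vocab_size = len(set().union(*models.values()))  # distinct n-grams overall
    if method == "cfa":
        scores = {lang: cfa_score(c, grams) for lang, c in models.items()}
    else:
        scores = {lang: nbc_score(c, grams, vocab_size) for lang, c in models.items()}
    return max(scores, key=scores.get)


# Toy placeholder data (hypothetical, not drawn from the thesis corpus).
toy_corpus = {
    "lang_a": ["abab abba", "baba"],
    "lang_b": ["cdcd dccd", "dcdc"],
}
models = train(toy_corpus, n=2)
print(identify(models, "abba", n=2, method="cfa"))
print(identify(models, "dccd", n=2, method="nbc"))
```

In this sketch an unseen n-gram simply contributes zero to the CFA sum, whereas NBC needs smoothing so that a single unseen n-gram does not zero out the whole product.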

Files

Original bundle

Name: Rediat Bekele 2018.pdf
Size: 2.55 MB
Format: Adobe Portable Document Format

Collections

Information Sciences