Application of Part-of-Speech Tagged Corpus to improve the Performance of Word Sense Disambiguation: the case of Amharic

Date

2015-06-05

Publisher

Addis Ababa University

Abstract

Natural language inherently involves polysemy: words that can be interpreted in multiple ways depending on the context in which they occur. Although the human brain can identify the sense of a polysemous word spontaneously from a given context, ambiguity in natural language hinders users from utilizing information technology to the fullest. Hence, it is of paramount importance to handle it computationally. Word Sense Disambiguation (WSD), one of the open research areas in NLP, is the task of determining the intended meaning of a polysemous word in context. This study investigates how a POS-tagged corpus can be applied to improve the performance of WSD. A corpus-based approach was used, involving supervised, unsupervised and semi-supervised machine learning paradigms. Five ambiguous Amharic words (bela, tenesa, derese, ale and eTena), with about 1031 sentences covering two senses of each ambiguous word, were used after a POS tag was added to each word in the text corpus. In addition, two unsupervised algorithms (EM and Simple K-means) and five classification algorithms (AdaboostM1, Bagging, ADtree, SMO and Naïve Bayes) were used. Among the three machine learning paradigms, the semi-supervised paradigm achieved a score of 92.66% using ADtree, 92.33% using AdaboostM1, 89.92% using SMO, 80.98% using Bagging and 60.62% using Naïve Bayes. Furthermore, using one seed word was found to yield better accuracy for WSD with the above-mentioned algorithms. An optimal average window size of 6-6 was found to be sufficient when POS tag information is included for WSD in Amharic.
For WSD in Amharic using the semi-supervised machine learning paradigm, including POS tag information for each word in the corpus was found to improve performance over the baseline by 4.2% using ADtree, 8.4% using AdaboostM1, 1.1% using Bagging, 2.5% using SMO and 12.6% using Naïve Bayes. Lastly, the researcher recommends that further research be conducted on other ambiguous words, and with different approaches, to better address the problem of WSD.
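As a rough illustration of how POS tags can enter the feature set for a windowed WSD classifier, the sketch below collects the words and tags surrounding a target word. It is not the thesis's implementation: the function name `window_features`, the feature-key format, the tag labels and the placeholder tokens `w1` to `w3` are all hypothetical, and it assumes that a "6-6" window means six tokens on each side of the ambiguous word.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag); the tag set here is hypothetical


def window_features(
    sentence: List[Token],
    target_index: int,
    left: int = 6,   # assumed reading of "6-6": six tokens to the left...
    right: int = 6,  # ...and six tokens to the right of the target
    use_pos: bool = True,
) -> Dict[str, str]:
    """Collect context words (and optionally POS tags) around the target word."""
    features: Dict[str, str] = {}
    start = max(0, target_index - left)
    end = min(len(sentence), target_index + right + 1)
    for i in range(start, end):
        if i == target_index:
            continue  # the ambiguous word itself is not a context feature
        offset = i - target_index
        word, tag = sentence[i]
        features[f"w[{offset:+d}]"] = word
        if use_pos:
            features[f"pos[{offset:+d}]"] = tag
    return features


# Placeholder sentence: only "bela" comes from the abstract; the rest is invented.
sentence = [("w1", "N"), ("w2", "ADJ"), ("bela", "V"), ("w3", "N")]
with_pos = window_features(sentence, target_index=2)
without_pos = window_features(sentence, target_index=2, use_pos=False)
```

Comparing the `use_pos=False` output (a word-only baseline) with the `use_pos=True` output mirrors the kind of comparison the abstract reports, where the same classifiers are run with and without POS information.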

Keywords

Ambiguity, Machine Learning Paradigms, NLP, POS Tagged Corpus, Polysemy, WSD

Citation