Application of Part-of-Speech Tagged Corpus to improve the Performance of Word Sense Disambiguation: the case of Amharic
Date
2015-06-05
Publisher
Addis Ababa University
Abstract
Natural language inherently involves polysemy: words that can be interpreted in multiple
ways depending on the context in which they occur. Although the human brain can
identify the sense of a polysemous word spontaneously from a given context, ambiguity in
natural language hinders users from utilizing information technology to the fullest. Hence,
it is of paramount importance to handle it computationally. Word Sense Disambiguation (WSD), one of
the open research areas in NLP, is the task of determining the intended meaning of a
polysemous word in context. This study therefore investigates whether a part-of-speech (POS)
tagged corpus improves the performance of WSD.
The study used a corpus-based approach involving supervised, unsupervised and
semi-supervised machine learning paradigms. Five ambiguous Amharic words (bela, tenesa,
derese, ale and eTena), with about 1031 sentences covering two senses of each ambiguous word,
were used after a POS tag was added to each word in the text corpus. Two
unsupervised algorithms (EM and Simple K-means) and five classification algorithms
(AdaboostM1, Bagging, ADtree, SMO and Naïve Bayes) were used. Among the three machine
learning paradigms, the semi-supervised paradigm achieved scores of 92.66% using ADtree, 92.33%
using AdaboostM1, 89.92% using SMO, 80.98% using Bagging and 60.62% using Naïve Bayes.
In addition, using one seed word was found to yield better accuracy for WSD
with the above-mentioned algorithms. An average window size of 6-6 was
found to be sufficient when POS tag information is included for WSD in Amharic.
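As an illustration of how a context window with POS tags can be turned into classifier features, here is a minimal Python sketch. It is not the study's implementation: the function name `window_features`, the feature-key scheme, the padding symbol and the example tags are all hypothetical, and it assumes that "6-6" means six words to the left and six to the right of the ambiguous word.

```python
from typing import Dict, List, Tuple

# A sentence is assumed to be a list of (word, POS tag) pairs.
TaggedToken = Tuple[str, str]

PAD = "<PAD>"  # hypothetical filler for positions outside the sentence


def window_features(
    sentence: List[TaggedToken],
    target_index: int,
    left: int = 6,
    right: int = 6,
) -> Dict[str, str]:
    """Collect the words and POS tags around the ambiguous word.

    Keys look like "w-2" (word two positions to the left) and
    "t+1" (tag one position to the right). The target word itself
    is skipped, since it is the item being disambiguated.
    """
    if not 0 <= target_index < len(sentence):
        raise IndexError("target_index is outside the sentence")

    features: Dict[str, str] = {}
    for offset in range(-left, right + 1):
        if offset == 0:
            continue
        position = target_index + offset
        if 0 <= position < len(sentence):
            word, tag = sentence[position]
        else:
            word, tag = PAD, PAD
        features[f"w{offset:+d}"] = word
        features[f"t{offset:+d}"] = tag
    return features


if __name__ == "__main__":
    # Placeholder tokens and tags, for illustration only.
    example = [("word1", "N"), ("bela", "V"), ("word3", "ADJ")]
    print(window_features(example, target_index=1, left=2, right=2))
```

Dropping the `t...` keys from the result gives a words-only feature set, which is one way a comparison against a baseline without POS information could be set up.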
For WSD in Amharic using the semi-supervised machine learning paradigm, adding
POS tag information to each word in the corpus was found to improve performance
by 4.2% using ADtree, 8.4% using AdaboostM1, 1.1% using Bagging, 2.5% using
SMO and 12.6% using Naïve Bayes over the baseline scores. Lastly,
the researcher recommends that further research be conducted on other ambiguous words
and with different approaches to better address the problem of WSD.
Keywords
Ambiguity, Machine Learning Paradigms, NLP, POS Tagged Corpus, Polysemy, WSD