Word Sense Disambiguation for Tigrigna Language Using Semi-Supervised Machine Learning Approach

Date

2018-06-02

Publisher

Addis Ababa University

Abstract

Natural language applications make communication between humans and computers easier. Most natural languages contain words whose meaning depends on context. Such words make communication between computers and humans difficult, because computers cannot identify the proper meaning of an ambiguous word on their own. Word Sense Disambiguation (WSD) enables computers to identify the proper meaning of an ambiguous word from its surrounding context. In this study, we built a WSD prototype model for the Tigrigna language, because WSD is an intermediate task for other NLP tasks such as Machine Translation, Information Retrieval, Information Extraction, and Speech Processing. WSD can be developed using corpus-based, knowledge-based, or hybrid approaches. Of these, we used a corpus-based approach, which relies on machine learning methods. Corpus-based approaches can be classified as supervised, semi-supervised, or unsupervised. Because semi-supervised learning reduces the weaknesses of both supervised and unsupervised learning by exploiting a large amount of unlabeled data together with a small amount of labeled data, we used a semi-supervised method to build the WSD prototype model for Tigrigna. We conducted our experiments on five ambiguous Tigrigna words, collecting a total of 1250 sentences. The five ambiguous words are kefele (ከፈለ), OareQe (ዓረቐ), seOare (ሰዓረ), Halefe (ሓለፈ), and medeb (መደብ). The first three words have two senses each, and the fourth and fifth words have four and three senses respectively.
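As an illustration of how a small labeled set can be combined with unlabeled sentences, the sketch below gives each unlabeled context the sense of the labeled seed context with which it shares the most words. It is a deliberately simplified stand-in and not the procedure used in the study. The function names (`overlap`, `propagate_labels`) and the English example contexts are hypothetical.

```python
def overlap(a, b):
    """Count the distinct words shared by two context word lists."""
    return len(set(a) & set(b))


def propagate_labels(labeled, unlabeled):
    """Label each unlabeled context with the sense of its most similar seed.

    labeled:   list of (context_words, sense) pairs (the small seed set)
    unlabeled: list of context_words lists
    Returns a list of (context_words, sense) pairs; sense is None when a
    context shares no word with any seed. Ties go to the earliest seed.
    """
    result = []
    for ctx in unlabeled:
        best_sense, best_score = None, 0
        for seed_ctx, sense in labeled:
            score = overlap(ctx, seed_ctx)
            if score > best_score:  # strict '>' keeps the first best seed
                best_sense, best_score = sense, score
        result.append((ctx, best_sense))
    return result


# Hypothetical English contexts for the ambiguous word "bank"
# (illustrative only; not data from the study).
seeds = [
    (["river", "water"], "shore"),
    (["money", "loan"], "finance"),
]
pool = [["water", "fish"], ["loan", "interest"], ["sky"]]
newly_labeled = propagate_labels(seeds, pool)
```

Contexts that receive a sense could then be added to the training set for a classifier, while those left as `None` remain unlabeled.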
We applied four clustering algorithms (EM, Simple K-Means, FarthestFirst, and HierarchicalClusterer) and five classification algorithms (ADTree, AdaBoostM1, Bagging, SMO, and Naïve Bayes) to cluster and classify the sentences into their senses. Because all of these algorithms are available in WEKA 3.8, we used this tool in our study. We compared the three machine learning methods and found that the semi-supervised method achieved the best performance. We achieved an average performance of 93.6636%, 91.9224%, 85.35%, 79.1917%, and 70.8968% using the ADTree, SMO, AdaBoostM1, Bagging, and Naïve Bayes algorithms respectively. A window size of 1-1 was optimal for identifying the meaning of the selected ambiguous Tigrigna words with all of the selected classification algorithms.
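To make the window-size notion concrete, the following minimal sketch extracts the context words around a target word. It assumes that "1-1" means one word to the left and one word to the right of the ambiguous word, which the abstract does not spell out. The function name `context_window` and the English example sentence are hypothetical.

```python
def context_window(tokens, target, left=1, right=1):
    """Return (left_context, right_context) around the first occurrence
    of `target` in `tokens`.

    `left` and `right` give the number of words taken on each side.
    If `target` does not occur, both contexts are empty lists.
    """
    if target not in tokens:
        return [], []
    i = tokens.index(target)
    left_ctx = tokens[max(0, i - left):i]    # clipped at sentence start
    right_ctx = tokens[i + 1:i + 1 + right]  # clipped at sentence end
    return left_ctx, right_ctx


# Illustrative English sentence (not from the study's corpus).
sentence = "he sat on the bank yesterday".split()
window_1_1 = context_window(sentence, "bank")
window_2_2 = context_window(sentence, "bank", left=2, right=2)
```

The words returned for each sentence would serve as the features passed to the clustering and classification algorithms; a wider window supplies more context words per sentence.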

Keywords

Word Sense Disambiguation, Tigrigna Language, Semi-Supervised Machine Learning