Amharic Wordnet Construction Using Word Embedding
No Thumbnail Available
Date
2020-05-29
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Addis Ababa University
Abstract
A big amount of data is produced on the web and this data is available in the online data portals. By any means, people always need to access, analyze, and organize those data easily. To access and analyze those data effectively there must be an automated system that can understand human language as it is spoken. This is possible by using natural language processing applications. However, most of the natural language applications such as sentiment analysis, information retrieval, question answering, word sense disambiguation, etc. use WordNet as a resource. Some natural language applications like information retrieval can be done using electronic thesaurus and dictionary, but the coverage of such resources is limited. WordNet solves such a problem and it is used as a resource for many other natural language processing applications. A WordNet resource can be constructed using manual, semi-automated, and fully automated methods from the text data. However, while the manual method is time consuming and semi-automated methods are not effective methods since the resource includes different relations in addition with a large dataset. So, using these methods is tiresome and time-consuming. Semi-automated and automated methods can be effective for languages which have sufficient resources like thesaurus, bilingual dictionary, monolingual text corpus, effective machine translator, etc. So, automatically constructing a WordNet resource from unlabeled text data is the best way for languages like Amharic which have limited resource.
In this study, we propose Automatic Amharic WordNet construction using word embedding. The proposed model includes different tasks. The first task is text pre-processing which consists of commonly used text pre-processing tasks in many natural language processing applications. We perform text pre-processing in Amharic text document and train the document using a word embedding gensim library (word2vec) in order to generate word embedding model. The embedding result provides a contextually similar word for every words in the training set. Most of contextual similar words belong to a relation r. The trained word vector model captures different patterns. After training the data we take the trained model as input and discover different patterns that used to extract WordNet relations like: hypernym/hyponym, synonym, and antonym. Conceptual synonym of a word is extracted based on cosine similarity. We use an additional distance supervision method for near-synonym (like meaning exist in dictionary) relation extraction. So, for this method we perform feature extraction task based on given sample seed words (synonym pairs). In the other hand, we extracted hypernym/hyponym relation from the trained model by taking the advantage of mutual information concept and measuring similarity (based on cosine distance). Whereas antonym relation of words are extracted from the trained word2vec model based on the concept of word analogy.
The common evaluation metrics such as recall and precision were used to measure our proposed model performance. Amharic WordNet prototype is developed and used to tests the system performance of using the collected Amharic text document. Finally, this study shows a result of 78.3% recall and a precision of 53.9. We also evaluate using Spearman’s correlation, and achieve +0.79 correlation coefficient.
Description
Keywords
Amharic Wordnet, Word Embedding, Distance Supervision, Natural Language Processing, Word Analogy, Hypernym, Hyponym, Synonym, Antonym, Word2vec, Gensim