Amharic Wordnet Construction Using Word Embedding

dc.contributor.advisorAssabie, Yaregal (PhD)
dc.contributor.authorGetaneh, Mulat
dc.date.accessioned2020-08-10T09:04:40Z
dc.date.accessioned2023-11-29T04:06:16Z
dc.date.available2020-08-10T09:04:40Z
dc.date.available2023-11-29T04:06:16Z
dc.date.issued2020-05-29
dc.description.abstractA large amount of data is produced on the web and made available through online data portals. People need to access, analyze, and organize this data easily. Doing so effectively requires an automated system that can understand human language as it is spoken, which natural language processing applications make possible. Many natural language applications, such as sentiment analysis, information retrieval, question answering, and word sense disambiguation, use WordNet as a resource. Some applications, like information retrieval, can rely on an electronic thesaurus or dictionary, but the coverage of such resources is limited. WordNet addresses this limitation and serves as a resource for many other natural language processing applications. A WordNet resource can be constructed from text data using manual, semi-automated, or fully automated methods. However, the manual method is time-consuming, and semi-automated methods are not effective because the resource includes several different relations over a large dataset, so both are tiresome to apply. Semi-automated and automated methods can be effective for languages that have sufficient resources, such as a thesaurus, a bilingual dictionary, a monolingual text corpus, or an effective machine translator. For languages like Amharic, which have limited resources, automatically constructing a WordNet resource from unlabeled text data is therefore the best approach. In this study, we propose automatic Amharic WordNet construction using word embedding. The proposed model includes several tasks. The first task is text pre-processing, which consists of the pre-processing steps commonly used in many natural language processing applications.
We pre-process the Amharic text documents and train on them using the word2vec implementation in the gensim library to generate a word embedding model. The embedding provides contextually similar words for every word in the training set. Most contextually similar words belong to a relation r. The trained word vector model captures different patterns. After training, we take the trained model as input and discover patterns that are used to extract WordNet relations such as hypernym/hyponym, synonym, and antonym. The conceptual synonym of a word is extracted based on cosine similarity. We use an additional distance supervision method to extract near-synonym relations (such as the meanings that exist in a dictionary). For this method, we perform feature extraction based on given sample seed words (synonym pairs). On the other hand, we extract hypernym/hyponym relations from the trained model by using the concept of mutual information and measuring similarity (based on cosine distance). Antonym relations are extracted from the trained word2vec model based on the concept of word analogy. Common evaluation metrics, recall and precision, were used to measure the performance of the proposed model. An Amharic WordNet prototype was developed and used to test system performance on the collected Amharic text documents. Finally, this study reports a recall of 78.3% and a precision of 53.9%. We also evaluate using Spearman’s correlation and achieve a correlation coefficient of +0.79.en_US
dc.identifier.urihttp://etd.aau.edu.et/handle/123456789/22036
dc.language.isoenen_US
dc.publisherAddis Ababa Universityen_US
dc.subjectAmharic Wordneten_US
dc.subjectWord Embeddingen_US
dc.subjectDistance Supervisionen_US
dc.subjectNatural Language Processingen_US
dc.subjectWord Analogyen_US
dc.subjectHypernymen_US
dc.subjectHyponymen_US
dc.subjectSynonymen_US
dc.subjectAntonymen_US
dc.subjectWord2vecen_US
dc.subjectGensimen_US
dc.titleAmharic Wordnet Construction Using Word Embeddingen_US
dc.typeThesisen_US
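
The abstract describes extracting synonym candidates by cosine similarity and antonym candidates by word analogy over a trained word2vec model. The following is a minimal sketch of those two operations, not the thesis implementation: it uses hand-made toy vectors and English placeholder words instead of a model trained on Amharic text, and the names `VECTORS`, `cosine`, `most_similar`, and `analogy` are hypothetical helpers introduced only for illustration.

```python
import math

# Toy 3-dimensional vectors standing in for trained word2vec embeddings.
# The English words and all values are invented for illustration; a real
# model would hold vectors learned from the Amharic corpus.
VECTORS = {
    "hot": [1.0, 0.0, 1.0],
    "warm": [0.9, 0.1, 0.8],
    "cold": [1.0, 0.0, -1.0],
    "big": [0.0, 1.0, 1.0],
    "small": [0.0, 1.0, -1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, topn=1):
    """Rank all other words by cosine similarity to `word`."""
    scored = [(other, cosine(VECTORS[word], vec))
              for other, vec in VECTORS.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


def analogy(a, b, c):
    """Return the word d whose vector is closest to b - a + c.

    With (a, b) a known antonym pair, d is a candidate antonym of c.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))


# Synonym candidate: nearest neighbour by cosine similarity.
synonym_candidate = most_similar("hot", topn=1)[0][0]
# Antonym candidate: hot is to cold as big is to ?
antonym_candidate = analogy("hot", "cold", "big")
print(synonym_candidate, antonym_candidate)
```

With a model trained in gensim, the same analogy query would typically be written as `model.wv.most_similar(positive=[b, c], negative=[a])`; whether the thesis uses exactly this call is an assumption.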

Files

Original bundle
Name: Mulat Getaneh 2020.pdf
Size: 2.03 MB
Format: Adobe Portable Document Format