Amharic Wordnet Construction Using Word Embedding

Getaneh, Mulat

dc.contributor.advisor	Assabie, Yaregal (PhD)
dc.contributor.author	Getaneh, Mulat
dc.date.accessioned	2020-08-10T09:04:40Z
dc.date.accessioned	2023-11-29T04:06:16Z
dc.date.available	2020-08-10T09:04:40Z
dc.date.available	2023-11-29T04:06:16Z
dc.date.issued	2020-05-29
dc.description.abstract	A large amount of data is produced on the web and made available through online data portals. People need to access, analyze, and organize this data easily. Doing so effectively requires automated systems that can understand human language as it is spoken, which is the goal of natural language processing applications. Most such applications, including sentiment analysis, information retrieval, question answering, and word sense disambiguation, use WordNet as a resource. Some applications, such as information retrieval, can rely on an electronic thesaurus or dictionary, but the coverage of such resources is limited. WordNet addresses this limitation and serves as a resource for many other natural language processing applications. A WordNet can be constructed from text data using manual, semi-automated, or fully automated methods. However, the manual method is time-consuming, and semi-automated methods are not effective because the resource includes many different relations over a large dataset, so using these methods is tiresome. Semi-automated and automated methods can be effective for languages that have sufficient resources, such as a thesaurus, a bilingual dictionary, a monolingual text corpus, and an effective machine translator. For languages with limited resources, such as Amharic, automatically constructing a WordNet from unlabeled text data is therefore the best approach. In this study, we propose automatic Amharic WordNet construction using word embedding. The proposed model includes several tasks. The first is text pre-processing, which consists of the pre-processing steps commonly used in natural language processing applications.
We pre-process the Amharic text documents and train a word embedding model on them using the word2vec implementation in the gensim library. The resulting embedding provides contextually similar words for every word in the training set, and most contextually similar words belong to some relation r. The trained word vector model captures different patterns. After training, we take the trained model as input and discover patterns that are used to extract WordNet relations such as hypernym/hyponym, synonym, and antonym. Conceptual synonyms of a word are extracted based on cosine similarity. We use an additional distance supervision method to extract near-synonym relations (words with like meanings that exist in a dictionary); for this method we perform feature extraction based on given sample seed words (synonym pairs). On the other hand, we extract hypernym/hyponym relations from the trained model by taking advantage of the concept of mutual information and by measuring similarity (based on cosine distance). Antonym relations are extracted from the trained word2vec model based on the concept of word analogy. Common evaluation metrics, recall and precision, were used to measure the performance of the proposed model. An Amharic WordNet prototype was developed and used to test system performance on the collected Amharic text documents. Finally, this study reports a recall of 78.3% and a precision of 53.9%. We also evaluate using Spearman’s correlation and achieve a correlation coefficient of +0.79.	en_US
dc.identifier.uri	http://etd.aau.edu.et/handle/123456789/22036
dc.language.iso	en	en_US
dc.publisher	Addis Ababa University	en_US
dc.subject	Amharic Wordnet	en_US
dc.subject	Word Embedding	en_US
dc.subject	Distance Supervision	en_US
dc.subject	Natural Language Processing	en_US
dc.subject	Word Analogy	en_US
dc.subject	Hypernym	en_US
dc.subject	Hyponym	en_US
dc.subject	Synonym	en_US
dc.subject	Antonym	en_US
dc.subject	Word2vec	en_US
dc.subject	Gensim	en_US
dc.title	Amharic Wordnet Construction Using Word Embedding	en_US
dc.type	Thesis	en_US
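
The abstract mentions three operations on trained word vectors: cosine-similarity lookup for synonym candidates, word analogy for antonym candidates, and precision/recall for evaluation. The following is a minimal sketch of those ideas, not the thesis implementation. The English toy words, the 3-dimensional vectors, and the function names (`cosine`, `most_similar`, `analogy`, `precision_recall`) are all assumptions made for illustration; a real system would read vectors from a trained word2vec model instead of a hand-written dictionary.

```python
import math

# Toy 3-dimensional vectors standing in for a trained word2vec model.
# The words and values are invented for illustration only.
VECTORS = {
    "big":   [0.9, 0.1, 0.8],
    "large": [0.85, 0.15, 0.75],
    "small": [0.9, 0.1, -0.8],
    "hot":   [0.1, 0.9, 0.8],
    "cold":  [0.1, 0.9, -0.8],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def most_similar(word, topn=1):
    """Rank all other words by cosine similarity to `word` (synonym candidates)."""
    scores = [
        (other, cosine(VECTORS[word], vec))
        for other, vec in VECTORS.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


def analogy(a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'.

    Uses the vector offset b - a + c and picks the nearest remaining word.
    """
    target = [vb - va + vc for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [
        (word, cosine(target, vec))
        for word, vec in VECTORS.items()
        if word not in (a, b, c)
    ]
    return max(candidates, key=lambda pair: pair[1])[0]


def precision_recall(predicted, gold):
    """Precision and recall of predicted items against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    print(most_similar("big"))            # nearest neighbour of "big"
    print(analogy("big", "small", "hot"))  # antonym candidate for "hot"
    print(precision_recall(["x", "y"], ["x", "z"]))
```

With a model trained in gensim, the same two lookups are usually done through the model's keyed vectors, whose `most_similar` method accepts `positive` and `negative` word lists; how the thesis combines these results with seed pairs and mutual information is not specified in this record.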

Files

Original bundle

Name: Mulat Getaneh 2020.pdf
Size: 2.03 MB
Format: Adobe Portable Document Format

Collections

Environmental Science