Afaan Oromo Wordnet Construction Using Sense Embedding
Date
2021-10-01
Publisher
Addis Ababa University
Abstract
One of the primary goals of natural language processing is to create a high-quality WordNet that can be used in many domains. The main area in which WordNet construction methods typically fall short is handling polysemy. A word is polysemous when it has multiple meanings (e.g., the word bank in a financial context versus an ecological context). Automatic WordNet construction methods that rely on word embedding models fail to handle polysemy because such models train just one embedding for all meanings of a word. Words take on different meanings (i.e., senses) depending on context, and disambiguating the correct sense is an important and challenging task for natural language processing. Contextualized models represent the meanings of words in context, which enables them to capture some of the vast array of linguistic phenomena that occur above the word level.
In this study, we propose automatic Afaan Oromo WordNet construction using sense embeddings. The proposed model comprises several tasks. We pre-process an Afaan Oromo text document and train on it with the spaCy sense-embedding library (sense2vec) and Facebook's fastText library to generate a sense embedding model. The resulting embeddings provide contextually similar words for every word in the training set, and the trained sense vector model captures different patterns. After training, we take the trained model as input and discover the patterns that are used to extract WordNet relations.
We use a POS-tagged Afaan Oromo corpus to model the WordNet. The WordNets built with fastText and sense2vec showed that words that are similar or analogous to each other occur close together in the vector space: related Afaan Oromo words were found near each other, with morphological relatedness accounting for the largest share. The sense embeddings also learned analogical vector representations; for example, "moti (king) - dhira (man) + dubara (woman)" results in a vector close to the word "gifti (queen)". Out-of-vocabulary words were also handled. We obtained a Spearman's correlation score of Rs = 0.74 for each relation type, and multi-class text classification on the model attained a 92.6% F1-score, with results fluctuating depending on the parameters.
Keywords
Afaan Oromo Wordnet, Word Sense Induction, Polysemy, Word Analogy, Hypernym, Hyponym, Synonym, Antonym, Sense Embeddings, Word Embedding, fastText