Afaan Oromo Named Entity Recognition Using Neural Word Embeddings

dc.contributor.advisor: Assabie, Yaregal (PhD)
dc.contributor.author: Kasu, Mekonini
dc.date.accessioned: 2021-08-05T12:36:41Z
dc.date.accessioned: 2023-11-29T04:06:31Z
dc.date.available: 2021-08-05T12:36:41Z
dc.date.available: 2023-11-29T04:06:31Z
dc.date.issued: 2020-10-26
dc.description.abstract: Named Entity Recognition (NER) is a canonical sequence-tagging task that assigns a named entity label to each word in a sequence. It is important for a wide range of downstream applications in natural language processing. Two previous attempts at Afaan Oromo NER automatically identified and classified proper names in text into predefined semantic types such as person, location, organization, and miscellaneous. However, that work relied heavily on hand-designed features. We propose a deep neural network architecture for Afaan Oromo NER based on a context encoder and a tag decoder, using Bi-directional Long Short-Term Memory (BiLSTM) and Conditional Random Fields (CRF) respectively. In the proposed approach, we first generate neural word embeddings automatically using skip-gram with negative subsampling from an unlabeled corpus of 50,284 KB. The generated embeddings represent words as semantic vectors, which serve as input features for the encoder and decoder model. Likewise, character-level representations are generated automatically using a BiLSTM from the annotated corpus of 768 KB. Because it uses character-level representations, the proposed model is robust to out-of-vocabulary words. For this study, we manually prepared an annotated dataset of 768 KB for Afaan Oromo NER. We split this dataset into 80% for training, 5% for testing, and 15% for validation. In total we prepared 12,963 named entities, so these shares of 80%, 5%, and 15% correspond to approximately 10,370, 648, and 1,944 entities respectively. Experimental results show that combining BiLSTM-CRF with pre-trained word embeddings, character-level representations, and dropout regularization performs better than the other models, such as BiLSTM alone or BiLSTM-CRF with only character-level representations or only word embeddings.
Using the BiLSTM-CRF model with pre-trained word embeddings and character-level representations significantly improved Afaan Oromo NER, with an average F-score of 93.26% and an accuracy of 98.87%. (en_US)
dc.identifier.uri: http://etd.aau.edu.et/handle/123456789/27612
dc.language.iso: en (en_US)
dc.publisher: Addis Ababa University (en_US)
dc.subject: Afaan Oromo NER (en_US)
dc.subject: Context Encoder and Tag Decoder (en_US)
dc.subject: Distributed Representation (en_US)
dc.subject: Deep Neural Networks (en_US)
dc.title: Afaan Oromo Named Entity Recognition Using Neural Word Embeddings (en_US)
dc.type: Thesis (en_US)
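
The fractional figures quoted in the abstract (10,370.4, 648.15, and 1,944.45) appear to be the stated split percentages applied to the 12,963 named entities rather than percentages themselves. The following is a minimal sketch of that arithmetic only; the function and variable names are hypothetical and not taken from the thesis, and the mapping of shares to splits follows the sentence that states the 80%/5%/15% split.

```python
# Illustrative check of the split arithmetic in the abstract.
# All names here are hypothetical; only the numbers come from the record.
TOTAL_ENTITIES = 12_963  # total named entities stated in the abstract

# Shares in percent, as stated in the abstract's split sentence.
SPLIT_PERCENT = {"training": 80, "testing": 5, "validation": 15}


def split_counts(total, percent_by_split):
    """Return each split's share of `total` (values may be fractional)."""
    return {name: total * pct / 100 for name, pct in percent_by_split.items()}


counts = split_counts(TOTAL_ENTITIES, SPLIT_PERCENT)
for name, value in counts.items():
    # Fractional values are shown as-is; a real split would round to whole entities.
    print(f"{name}: {value:.2f} (about {round(value)} entities)")
```

Because entity counts must be whole numbers, an actual split would round these values; how the thesis rounded them is not stated in this record.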

Files

Original bundle
Name: Mekonini Kasu 2020.pdf
Size: 1.52 MB
Format: Adobe Portable Document Format