Afaan Oromo Named Entity Recognition Using Neural Word Embeddings

No Thumbnail Available

Date

10/26/2020

Journal Title

Journal ISSN

Volume Title

Publisher

Addis Ababa University

Abstract

Named Entity Recognition (NER) is one of the canonical examples of sequence tagging that assigns a named entity label to each of a sequence of words. This task is important for a wide range of downstream applications in natural languages processing. Two attempts have been conducted for Afaan Oromo NER that automatically identifies and classifies the proper names in text into predefined semantic types like a person, location, and organizations and miscellaneous. However, their work heavily relied on hand design feature. We proposed a deep neural network architecture for Afaan Oromo Named Entity Recognition, based on context encoder and decoder models using Bi-directional Long Short Term Memory and Conditional Random Fields respectively. In the proposed approach, initially, we generated neural word embeddings automatically using skip-gram with negative subsampling from an unsupervised corpus size of 50,284KB. The generated word embeddings represent words in semantic vectors which are further used as an input feature for encoder and decoder model. Likewise, character level representation is generated automatically using BiLSTM from the supervised corpus size of 768KB. Because of the use of character level representation, the proposed model is robust for the out-of-vocabulary words. In this study, we manually prepared annotated dataset size of 768KB for Afaan Oromo Named Entity Recognition. We split this dataset into 80% for training, 5% for testing and 15% for validation. We prepared totally 12,963 named entities from these 10,370.4 %, 648.15% and 1,944.45% are used for training, validation and test set respectively. Experimental results show that the combination of BiLSTM-CRF algorithms with pre-trained word embedding and character level representation and regularization techniques (dropout) perform better as compared to the other models such as Bi-LSTM, BiLSTM-CRF with only character level representation or word embeddings. Using Bi-LSTM-CRF model with pre-trained word embeddings and character level representation significantly improved Afaan Oromo Named Entity Recognition with an average of 93.26 % F-Score and 98.87 % accuracy.

Description

Keywords

Afaan Oromo NER, Context Encoder and Tag Decoder, Distributed Representation, Deep Neural Networks

Citation

Collections