Named Entity Recognition for Afan Oromo

Date

2010-10

Publisher

Addis Ababa University

Abstract

This thesis describes the development of a Named Entity Recognition (NER) system for the Afan Oromo language. NER is an information extraction task that identifies named entities (NEs) in a sentence, paragraph or document and classifies them into predefined categories. Much NER research has been conducted for European and Asian languages, whereas this work is the first of its kind for Afan Oromo, the language with the largest number of native speakers in Ethiopia. A new Afan Oromo NER solution is proposed based on a hybrid approach that combines machine learning and rule-based components. An Afan Oromo NE corpus of more than 23,000 words has been developed based on the CoNLL 2002 BIO tagging scheme. Four NE categories have been identified and used in the study: person, location, organization and miscellaneous. The miscellaneous category includes date/time, monetary value and percentage. The components of the system include an NE Chunker, a Feature Extractor and a Model Builder. The NE Chunker groups a sequence of tokens belonging to the same NE category into a single chunk. The Feature Extractor extracts features from the training data; the position, word shape, POS, normalization, prefix and suffix of a token were used as features. The Model Builder estimates the model's parameters using stochastic gradient descent. We have also developed an Afan Oromo part-of-speech tagging (POST) model that generates the POS feature. Evaluation results from our system were promising. We obtained an average performance of 77.41% recall, 75.80% precision and 76.60% F1-measure across two major experimentation scenarios: increasing the size of the training data and examining the influence of features. The results show that the choice of features plays a more important role than increasing the training data size. Examining the influence of features confirmed that the system was developed with the best-performing combination of features.
In general, the algorithms and techniques used in this study achieved good performance when compared with results reported for resource-rich languages such as English.
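
The abstract names the components but this record does not include their implementation. As a rough illustration only, the sketch below shows one way an NE chunker over BIO tags and a per-token feature extractor could look. Everything in it is an assumption rather than the thesis's code: the function names (`chunk_bio`, `word_shape`, `token_features`), the affix length of 3, the shape alphabet, the POS labels, and the example sentence, which is not taken from the thesis corpus.

```python
from typing import Dict, List, Optional, Tuple


def chunk_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, category) chunks."""
    chunks: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        starts_entity = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_entity:
            # A B- tag (or a stray I- tag of a different type) opens a new chunk.
            if current:
                chunks.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the chunk that is currently open.
            current.append(token)
        else:
            # "O": outside any entity, so close the open chunk if there is one.
            if current:
                chunks.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        chunks.append((" ".join(current), current_type))
    return chunks


def word_shape(token: str) -> str:
    """Map uppercase to 'X', lowercase to 'x', digits to 'd'; keep other characters."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


def token_features(
    tokens: List[str], pos_tags: List[str], i: int, affix_len: int = 3
) -> Dict[str, object]:
    """Build the feature types listed in the abstract for the token at index i."""
    token = tokens[i]
    return {
        "position": i,
        "shape": word_shape(token),
        "pos": pos_tags[i],
        "normalized": token.lower(),  # assumed normalization: lowercasing
        "prefix": token[:affix_len],
        "suffix": token[-affix_len:],
    }


# Hypothetical example sentence with illustrative tags (not from the thesis corpus).
tokens = ["Caalaa", "Yuunivarsiitii", "Finfinnee", "jira"]
tags = ["B-PER", "B-ORG", "I-ORG", "O"]
pos_tags = ["NN", "NN", "NN", "VB"]  # placeholder POS labels

entities = chunk_bio(tokens, tags)
features = token_features(tokens, pos_tags, 2)
```

Here `entities` holds one person chunk and one two-token organization chunk, and `features` is the dictionary for "Finfinnee". In a sequence-labeling setup, such dictionaries would be produced for every token and passed to the model builder; how the thesis actually encodes position and normalization is not stated in this record.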

Keywords

Named Entity Recognition; Named Entities; Conditional Random Fields; Afan Oromo.
