Development of Part of Speech Tagger for Ge’ez Language
Date
2017-10-05
Publisher
Addis Ababa University
Abstract
Part-of-speech (POS) tagging is the process of assigning a part of speech or other lexical class marker to each word in a sentence or text. Many other natural language processing tasks and applications depend heavily on it. Much of the research in natural language processing has been dedicated to resource-rich languages such as English, French, and other major European and Asian languages. Languages for which POS taggers have been developed include Tigrigna, Amharic, Kafi-Noonoo, Arabic, and Afaan-Oromo. The objective of this research is to develop a POS tagger for Ge’ez using a hybrid approach that combines the Trigrams'n'Tags (TnT) tagger, human-written rules, regular expressions, and unknown-word guessing.
Among the available statistical taggers, we adopt the TnT tagger as the statistical component of the hybrid tagger, because it enables the tagger to perform morphological analysis and maintains several internal frequency distribution and conditional frequency distribution instances based on the training data. Although TnT is the preferred statistical tagger for Ge’ez, it still has a shortcoming: it does not handle the prefix patterns of unknown words. A regular expression tagger partly compensates for this drawback. However, the combination of TnT and a regular expression tagger is still not sufficient to reach acceptable accuracy, because Ge’ez is a morphologically complex language with free word order: a sentence can follow subject-object-verb, object-subject-verb, or subject-verb-object order without changing its meaning. Consequently, human-written rules and unknown-word guessing are also combined into the hybrid tagger. The hybrid tagger performs better than any of its component taggers taken alone.
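The layering described above can be illustrated with a minimal Python sketch, under several assumptions. A most-frequent-tag lookup stands in for the TnT component (TnT itself is a trigram-based statistical tagger, which this sketch does not implement). The regular expressions, the tag names (N, V, CONJ, PREP), and the tokens are hypothetical placeholders, not actual Ge’ez morphology and not the tag set used in this work.

```python
import re
from collections import Counter, defaultdict


def train_lexicon(tagged_sentences):
    """Map each known word to its most frequent tag in the training data.

    This is a simplified stand-in for the statistical (TnT) component.
    """
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


# Hypothetical prefix and suffix patterns for unknown words.
# They are placeholders and do not describe real Ge'ez affixes.
PATTERNS = [
    (re.compile(r"^wa"), "CONJ"),
    (re.compile(r"^ba"), "PREP"),
    (re.compile(r"at$"), "N"),
]


def tag_sentence(words, lexicon, default_tag="N"):
    """Tag known words from the lexicon, then back off to the regex
    patterns, and finally to a default guess for unknown words."""
    tagged = []
    for word in words:
        if word in lexicon:
            tag = lexicon[word]
        else:
            tag = next((t for p, t in PATTERNS if p.search(word)), default_tag)
        tagged.append((word, tag))
    return tagged
```

A word seen in training receives its learned tag, an unseen word that matches a pattern receives that pattern's tag, and any other unseen word receives the default guess. The ordering mirrors the motivation given above: each later stage handles only the words that the earlier stage cannot.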
There was no ready-made standard corpus for Ge’ez. Therefore, a tag set of 26 broad tags was identified, and 15,154 words from around 1,305 sentences were collected from a single genre, the Holy Bible. These words were then manually tagged by Ge’ez language professionals for training and testing purposes. Experiments were conducted with three taggers: the TnT tagger, the TnT with Regex tagger, and the hybrid tagger. They achieved 77.87%, 82.23%, and 94.32% performance, respectively. We therefore conclude that the hybrid tagger performs better than both the TnT tagger and the TnT with Regex tagger.
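Assuming the reported percentages are token-level accuracies (the abstract says only "performances"), the metric can be sketched as the share of tokens whose predicted tag matches the manually assigned tag. The function name and the data layout, lists of (word, tag) pairs per sentence, are illustrative choices and are not taken from this work.

```python
def tagging_accuracy(gold_sentences, predicted_sentences):
    """Return the fraction of tokens whose predicted tag equals the gold tag.

    Both arguments are lists of sentences, each a list of (word, tag) pairs
    aligned token by token.
    """
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            total += 1
            if gold_tag == predicted_tag:
                correct += 1
    return correct / total if total else 0.0
```

Multiplying the returned value by 100 gives a percentage that can be compared across taggers evaluated on the same held-out test sentences.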
Keywords
Ge’ez, POS Tagger for Ge’ez, NLP, TnT, Hybrid POS Tagger