Part-of-Speech Tagging for Afaan Oromo Language
Date
2009-01
Publisher
Addis Ababa University
Abstract
Most natural language processing systems use a part-of-speech (POS) tagger as a separate
module in their architecture. In particular, a tagger is important for developing parsers, machine
translators, speech recognizers, and search engines. Tagging is the process of assigning part-of-speech
tags to the words of a text so that contextual information can be obtained from the word
labels. The main aim of this study is to develop a part-of-speech tagger for the Afaan Oromo language.
After reviewing the literature on Afaan Oromo grammar and identifying the tag set and word
categories, the study adopted the Hidden Markov Model (HMM) approach and implemented
unigram and bigram models of the Viterbi algorithm. The unigram model is used to understand word
ambiguity in the language, while the bigram model is used to undertake contextual analysis of
words. For training and testing, a manually annotated sample corpus of 159 sentences (1621 words in total)
is used. The corpus was collected from different public Afaan Oromo
newspapers and bulletins to make the sample corpus balanced. A database of lexical
probabilities (LexProb) and transitional probabilities (TransProb) was developed from this
annotated corpus. From these two sets of probabilities the tagger learns to tag the sequence of
words in a sentence. The performance of the prototype Afaan Oromo tagger was tested using ten-fold
cross-validation. The results show that the unigram and bigram models achieve 87.58% and
91.97% accuracy, respectively. Based on the experimental analysis, concluding
remarks and recommendations are forwarded.
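The bigram HMM approach described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the toy corpus, the placeholder tokens (`w1`, `w2`, `w3`), the tag names, the add-one smoothing, and all function names are assumptions made for illustration only. The sketch estimates lexical probabilities (word given tag) and transitional probabilities (tag given previous tag) from a tagged corpus, then uses Viterbi decoding to pick the most probable tag sequence.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start marker

# Toy annotated corpus with placeholder tokens and tags (not data from the thesis).
corpus = [
    [("w1", "NN"), ("w2", "VB")],
    [("w1", "NN"), ("w3", "JJ"), ("w2", "VB")],
    [("w3", "NN"), ("w2", "VB")],
]

def train(sentences):
    """Count word-given-tag (lexical) and tag-given-previous-tag (transitional) events."""
    lex = defaultdict(Counter)    # tag -> Counter of words
    trans = defaultdict(Counter)  # previous tag -> Counter of tags
    for sent in sentences:
        prev = START
        for word, tag in sent:
            lex[tag][word] += 1
            trans[prev][tag] += 1
            prev = tag
    return lex, trans

def log_prob(counter, key, n_outcomes, alpha=1.0):
    """Add-alpha smoothed log probability (smoothing is an assumption of this sketch)."""
    total = sum(counter.values())
    return math.log((counter[key] + alpha) / (total + alpha * n_outcomes))

def viterbi(words, lex, trans, tags, vocab_size):
    """Return the most probable tag sequence for `words` under the bigram HMM."""
    # best[i][t] = (log score of the best path ending in tag t at position i, previous tag)
    best = [{}]
    for t in tags:
        score = (log_prob(trans[START], t, len(tags))
                 + log_prob(lex[t], words[0], vocab_size))
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            emit = log_prob(lex[t], words[i], vocab_size)
            best[i][t] = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, len(tags)) + emit, p)
                for p in tags
            )
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

lex, trans = train(corpus)
tags = sorted(lex)  # tag set observed in the corpus
vocab_size = len({w for s in corpus for w, _ in s}) + 1  # +1 slot for unknown words

print(viterbi(["w1", "w3", "w2"], lex, trans, tags, vocab_size))
```

Working in log space avoids numerical underflow on longer sentences, and the smoothing keeps unseen word-tag or tag-tag pairs from receiving zero probability; a real tagger would likely need a more careful treatment of unknown words.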
Keywords: Natural language processing, part-of-speech tagging, Hidden Markov Model, N-gram.