Automatic Part-Of-Speech Tagger For Tigrigna Language Using Hybrid Approach
Date
2016-10-01
Publisher
Addis Ababa University
Abstract
Tagging is the process of annotating the words of a corpus with word-class (part-of-speech) markers
as additional information. It can serve as a pre-processing step for higher-level language
technology applications, such as developing stemming algorithms and preparing annotated corpora.
Tagging Tigrigna is challenging because of the morphological complexity of the language,
the scarcity of resources, and the difficulty of compiling Tigrigna texts.
The study uses a balanced (not domain-specific) corpus containing 3,100 sentences, 10,000 distinct
words and 56,151 total tokens. A tag set of 22 coarse-grained morpho-syntactic tags was adapted
to annotate the corpus using a semi-supervised approach. Because the corpus is normalized,
processed and annotated, it can also be used for other language processing tasks.
The work describes an experimental study on improving Tigrigna tagging performance by
combining the outputs of two sequence taggers. A rule-based tagger, an averaged perceptron
tagger, and a hybrid of the two are investigated. The hybrid tagger is constructed as a sequence
of the two: the averaged perceptron tagger followed by the rule-based tagger. The models are
trained on 75% of the corpus and tested on the remaining 25% to assess their robustness and
effectiveness. Several different experiments were conducted for each model.
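The 75%/25% evaluation setup can be sketched as follows. This is a minimal illustration, not the study's actual procedure: the function names, the random shuffle, and the fixed seed are assumptions, and the abstract does not say how the split was drawn. Sentences are assumed to be lists of (word, tag) pairs, and accuracy is assumed to be measured per token.

```python
import random


def split_corpus(sentences, train_fraction=0.75, seed=0):
    """Shuffle tagged sentences and split them into train and test parts.

    Shuffling with a fixed seed is an assumption made for reproducibility;
    the source only states the 75%/25% proportions.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def token_accuracy(gold_sentences, predicted_sentences):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0
```

With a corpus of 3,100 sentences, this split yields 2,325 training sentences and 775 test sentences.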
The experimental results show that a reasonable tagger is obtained with a modified rule-based
tagger together with three combined initial-state annotators. The averaged perceptron tagger
achieves state-of-the-art tagging accuracy for morphologically rich languages, particularly Tigrigna.
The rule-based tagger reached an accuracy of 94.8%, while the averaged perceptron tagger achieved
95.5%. Thus the two taggers perform comparably; however, the hybrid tagger improves the accuracy
to 96.3%. The hybrid tagger runs the averaged perceptron tagger first and then the rule-based
tagger, which acts as an error detection and correction step. Between the trained averaged
perceptron tagger and the rule-based tagger there is an output analyzer that uses a threshold
value to validate the output and act as decision maker.
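The hybrid sequence described above might look roughly like the sketch below. Everything in it is an assumption made for illustration: the abstract does not specify how the output analyzer applies its threshold, so here the statistical tagger is assumed to return a confidence score per token, and only tokens scoring below a hypothetical threshold are handed to the rule-based corrector. The names `hybrid_tag`, `statistical_tagger`, `rule_corrector` and the default value 0.9 are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# (tag, confidence in [0, 1]) as assumed output of the statistical tagger
Scored = Tuple[str, float]


def hybrid_tag(
    tokens: Sequence[str],
    statistical_tagger: Callable[[Sequence[str]], List[Scored]],
    rule_corrector: Callable[[Sequence[str], List[str], int], str],
    threshold: float = 0.9,  # hypothetical value, not taken from the study
) -> List[Tuple[str, str]]:
    """Tag with the statistical model, then let rules revise doubtful tokens."""
    scored = statistical_tagger(tokens)
    tags = [tag for tag, _ in scored]
    for i, (_, confidence) in enumerate(scored):
        # Output analyzer (assumed behaviour): accept confident tags as they
        # are and pass low-confidence ones to the rule-based corrector.
        if confidence < threshold:
            tags[i] = rule_corrector(tokens, tags, i)
    return list(zip(tokens, tags))
```

Under this reading, the rule-based tagger only overrides decisions the statistical model is unsure about, which is one plausible way for the second stage to work as error detection and correction.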
Therefore, the hybrid approach, which combines the rule-based and averaged perceptron taggers,
yields a reasonable PoS tagger for Tigrigna.
Keywords
Tigrigna Language Using Hybrid Approach