Part of Speech Tagger for Tigrigna Language

Date

2010-11

Publisher

Addis Ababa University

Abstract

Advanced technologies such as the Internet have turned the world into a single village. A vast amount of digitized information is generated, propagated, exchanged, stored and accessed every day across the world through the Internet and other media such as mobile networks. The accumulation of digital data is making information acquisition increasingly difficult, with natural language becoming a critical obstacle. Natural Language Processing (NLP) is a step towards tackling this obstacle, and part of speech (POS) tagging is one of the preliminary steps used for information acquisition and other advanced NLP applications. It is a technique of labeling each word in a text or sentence with the part of speech category that best suits the definition of the word as well as its context in the particular position of the sentence in which it is used. To the best of the researcher's knowledge, no part of speech tagger has been developed for the Tigrigna language, although many taggers have been developed using different approaches for languages such as English, Arabic and Amharic. Therefore, this work proposes a hybrid approach to Tigrigna part of speech tagging: an HMM tagger combined with a rule based tagger. The Tigrigna literature on grammar and morphology is reviewed to understand the nature of the language and to identify possible tags. As a result, 36 broad tags were identified, and 26,000 words from around 1000 sentences, containing 8000 distinct words, were tagged for training and testing. Since there is no ready-made standard corpus, manually tagging a corpus for this work was challenging; it is therefore recommended that a standard corpus be prepared. Raw Tigrigna text is first tagged by the HMM tagger; afterwards, the rule based tagger is used as a corrector of the HMM tagger's output.
The Viterbi algorithm and Brill's transformation-based error-driven learning are adapted for the HMM and rule based taggers respectively. Experiments are conducted for the HMM based, rule based and hybrid approaches, with 25% of the data held out for testing. The HMM and rule based approaches achieve accuracies of 89.13% and 91.8% respectively, whereas the hybrid model improves the accuracy to 95.88%. Hence, the hybrid of the two taggers is found to outperform the individual taggers.
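
The two-stage idea described in the abstract (HMM tagging with Viterbi decoding, followed by rule based correction) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the toy English words, the three tags, the hand-set probabilities, the function names, and the single hand-written correction rule. In the work itself the rules are learned with Brill's transformation-based error-driven learning from a tagged Tigrigna corpus, which this sketch does not attempt.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Most probable tag sequence for `words` under a bigram HMM.

    Missing probabilities are replaced by a small `floor` (an assumed,
    very crude form of smoothing) so that logs stay finite.
    """
    if not words:
        return []

    def lp(p):
        return math.log(p if p > 0 else floor)

    # best[i][t] = (log-prob of the best path ending in tag t at word i, previous tag)
    best = [{t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            score, prev = max(
                (best[i - 1][p][0] + lp(trans_p[p].get(t, 0)), p) for p in tags
            )
            column[t] = (score + lp(emit_p[t].get(words[i], 0)), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


def apply_rules(tag_seq, rules):
    """Apply Brill-style rules in order.

    Each rule is (from_tag, to_tag, prev_tag): change from_tag to to_tag
    when the preceding tag is prev_tag.
    """
    out = list(tag_seq)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(out)):
            if out[i] == from_tag and out[i - 1] == prev_tag:
                out[i] = to_tag
    return out


def hybrid_tag(words, model, rules):
    """Stage 1: HMM tagging. Stage 2: rule based correction of the HMM output."""
    return apply_rules(viterbi(words, *model), rules)


# Hypothetical toy model (English words, hand-set numbers), for illustration only.
TAGS = ["DET", "N", "V"]
START = {"DET": 0.8, "N": 0.1, "V": 0.1}
TRANS = {
    "DET": {"N": 0.7, "V": 0.3},
    "N": {"V": 0.6, "N": 0.3, "DET": 0.1},
    "V": {"DET": 0.5, "N": 0.4, "V": 0.1},
}
EMIT = {
    "DET": {"the": 1.0},
    "N": {"dog": 0.5, "walk": 0.1, "park": 0.4},
    "V": {"walk": 0.8, "barks": 0.2},
}
MODEL = (TAGS, START, TRANS, EMIT)

# Hand-written rule: a verb directly after a determiner is re-tagged as a noun.
RULES = [("V", "N", "DET")]

hmm_only = viterbi(["the", "walk"], *MODEL)      # ['DET', 'V']
corrected = hybrid_tag(["the", "walk"], MODEL, RULES)  # ['DET', 'N']
```

In this toy setting the HMM alone prefers the verb reading of "walk" because its emission probability dominates, and the rule stage repairs that error using the left context. This mirrors the corrector role the abstract assigns to the rule based tagger, though the sketch says nothing about the accuracies reported above.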

Keywords

Tigrigna, POS Tagger, NLP, Brill Tagger, Hidden Markov Model, Hybrid POS Tagger
