Development of Stemming Algorithm for Tigrigna Text

Date

2011-06

Publisher

Addis Ababa University

Abstract

This paper presents the development of a rule-based stemming algorithm for Tigrigna. The algorithm is simple yet effective: it consists of a sequence of steps, each made up of a collection of rules. Each rule specifies the affix to be removed, the minimum length allowed for the resulting stem, and a list of exceptions. Tigrigna has many exceptions to any stemming rule, and the researcher took these into account in designing the stemmer. The work required a detailed study of Tigrigna grammar and an analysis of the inflectional and derivational affixes of the language. Earlier research on the Tigrigna language and on Tigrigna stemming was taken into consideration, but such work is limited, which made this research necessary. Based on the analysis of Tigrigna grammatical rules, the researcher classified words according to their affixes, chose to remove both inflectional and derivational affixes, and designed a new rule set for the stemmer. The goal of the research was to develop and document a new rule-based stemmer for the Tigrigna language. The stemmer was implemented in the Python programming language. The algorithm follows a simple structure in which x small rule sets, each grouping similar affixes, are applied to the input words. The stemmer was evaluated with the error counting method, which counts actual understemming and overstemming errors, using a total of 5437 word variants derived from two data sets. The results show 85.8% accuracy on the first data set and 86.3% on the second, for an average accuracy of 86.1% and an average error rate of about 13.9%. The errors were analyzed and classified into two categories, overstemming and understemming; most of them were due to overstemming.
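
The rule structure and the evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the stemmer from the thesis: the names `Rule`, `stem`, `count_errors`, and `RULE_SETS` are hypothetical, the affixes are English placeholders rather than Tigrigna affixes, and classifying an error as overstemming or understemming by comparing stem lengths is a simplifying assumption made here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One affix-removal rule (hypothetical structure, not taken from the thesis)."""
    affix: str                          # affix to remove
    kind: str                           # "prefix" or "suffix"
    min_stem_len: int                   # shortest stem the rule may leave
    exceptions: frozenset = frozenset() # words the rule must not touch

    def apply(self, word):
        """Return the word with the affix removed, or unchanged if the rule does not fire."""
        if word in self.exceptions:
            return word
        if self.kind == "suffix" and word.endswith(self.affix):
            candidate = word[: len(word) - len(self.affix)]
        elif self.kind == "prefix" and word.startswith(self.affix):
            candidate = word[len(self.affix):]
        else:
            return word
        return candidate if len(candidate) >= self.min_stem_len else word


# Placeholder rule sets using English affixes (assumption: NOT the Tigrigna rules).
RULE_SETS = [
    [Rule("un", "prefix", 3)],
    [Rule("ing", "suffix", 3, frozenset({"string"})), Rule("s", "suffix", 3)],
]


def stem(word, rule_sets=RULE_SETS):
    """Apply the rule sets in order; in each set, the first rule that changes the word wins."""
    for rules in rule_sets:
        for rule in rules:
            result = rule.apply(word)
            if result != word:
                word = result
                break
    return word


def count_errors(stemmer, gold):
    """Count stemming errors against a gold mapping of word -> expected stem.

    Assumption: a wrong stem shorter than the expected one is counted as
    overstemming; any other wrong stem is counted as understemming.
    """
    over = under = 0
    for word, expected in gold.items():
        got = stemmer(word)
        if got == expected:
            continue
        if len(got) < len(expected):
            over += 1
        else:
            under += 1
    accuracy = (len(gold) - over - under) / len(gold)
    return {"overstemming": over, "understemming": under, "accuracy": accuracy}


# Tiny hand-made gold standard (illustrative data, not from the thesis).
GOLD = {
    "unlocking": "lock",
    "cats": "cat",
    "string": "string",
    "under": "under",   # the placeholder "un" rule wrongly strips this word
    "walked": "walk",   # no placeholder rule removes "ed"
}

report = count_errors(stem, GOLD)
print(report)
```

In this toy run, `under` is cut too far and `walked` is not cut at all, so the sketch reports one error of each kind. An exception list or a higher minimum stem length on the prefix rule would remove the first error, which is the role the abstract assigns to those two rule fields.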

Keywords

Stemming Algorithm