Development of Stemming Algorithm for Tigrigna Text
Date
2011-06
Publisher
Addis Ababa University
Abstract
This paper presents the development of a rule-based stemming algorithm for Tigrigna.
The algorithm is simple yet highly effective; it consists of a sequence of steps, each
composed of a collection of rules. Each rule specifies the affixes to be removed, the
minimum length allowed for the stem, and a list of exceptions. In Tigrigna there are many
exceptions to any stemming rule, and the researcher has taken these exceptions into
account in designing the stemmer.
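The rule structure described above (affix to remove, minimum stem length, exception list, rules grouped into steps) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the names `Rule`, `stem`, and `STEPS` are hypothetical, and the affixes `zz` and `qq` are artificial placeholders rather than real Tigrigna affixes. The choice to apply at most one rule per step is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    """One affix-removal rule (hypothetical structure, not from the thesis)."""
    affix: str
    is_prefix: bool = False
    min_stem_len: int = 2
    exceptions: frozenset = field(default_factory=frozenset)

    def apply(self, word):
        # Words on the exception list are never touched by this rule.
        if word in self.exceptions:
            return word
        if self.is_prefix:
            if not word.startswith(self.affix):
                return word
            stem = word[len(self.affix):]
        else:
            if not word.endswith(self.affix):
                return word
            stem = word[:len(word) - len(self.affix)]
        # Reject the removal if the remaining stem would be too short.
        return stem if len(stem) >= self.min_stem_len else word


def stem(word, steps):
    """Run the word through each step; at most one rule fires per step."""
    for rules in steps:
        for rule in rules:
            result = rule.apply(word)
            if result != word:
                word = result
                break
    return word


# Placeholder rule sets: "zz" and "qq" are artificial, NOT Tigrigna affixes.
STEPS = [
    [Rule("zz", is_prefix=True, min_stem_len=3)],
    [Rule("qq", min_stem_len=3, exceptions=frozenset({"wxyzqq"}))],
]

print(stem("zzabcdqq", STEPS))  # prefix and suffix both removed
print(stem("abqq", STEPS))      # unchanged: stem would be too short
print(stem("wxyzqq", STEPS))    # unchanged: listed as an exception
```

Grouping similar affixes into small per-step rule sets keeps each rule easy to inspect, and the exception list gives a place to record words that a rule would otherwise stem incorrectly.
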
A detailed study of Tigrigna grammar, together with an analysis of the inflectional and
derivational affixes of the language, was necessary for this thesis work.
The stemmer was designed around a new classification of words according to their affixes.
Stemming is performed by a rule-based algorithm that removes these affixes.
Previous research on the Tigrigna language and on Tigrigna stemmers was taken into
consideration. This research was necessary because past work on Tigrigna
stemming is limited. After analyzing the grammatical rules of Tigrigna, the
researcher decided to remove both inflectional and derivational affixes and designed a
new rule set for the Tigrigna stemmer.
The goal of the research was to develop and document a new rule-based stemmer for the
Tigrigna language. The stemmer was developed in the Python programming
language. The researcher tried to follow a simple structure in the algorithm, creating
small rule sets for similar affixes, which are applied to the input words.
The stemmer was evaluated using an error-counting method. The system was tested and
evaluated by counting the actual understemming and overstemming errors on
a total of 5437 word variants derived from two datasets. The results show that the stemmer
achieves 85.8% accuracy on the first dataset and 86.3% on the second, for an
average accuracy of 86.1% and an average error rate of about 13.9%.
The errors were analyzed and classified into two
categories, overstemming and understemming. Most of the errors were due to
overstemming.
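
The error-counting evaluation can be sketched along the following lines. This is an illustration under stated assumptions, not the procedure used in the thesis: the function name `count_errors` is hypothetical, the input pairs are toy placeholder strings, and classifying an error by comparing the length of the produced stem with the expected stem is a simplifying heuristic.

```python
def count_errors(pairs):
    """Count stemming errors from (produced_stem, expected_stem) pairs.

    Assumed heuristic: a produced stem shorter than the expected one is
    counted as overstemming, a longer one as understemming, and any other
    mismatch as "other".
    """
    counts = {"correct": 0, "overstemming": 0, "understemming": 0, "other": 0}
    for produced, expected in pairs:
        if produced == expected:
            counts["correct"] += 1
        elif len(produced) < len(expected):
            counts["overstemming"] += 1
        elif len(produced) > len(expected):
            counts["understemming"] += 1
        else:
            counts["other"] += 1
    total = len(pairs)
    # Accuracy and error rate as percentages of all evaluated words.
    counts["accuracy"] = 100 * counts["correct"] / total if total else 0.0
    counts["error_rate"] = 100 - counts["accuracy"] if total else 0.0
    return counts


# Toy data only; these strings are placeholders, not Tigrigna words.
sample = [("abc", "abc"), ("ab", "abc"), ("abcd", "abc"), ("abc", "abc")]
print(count_errors(sample))
```

Accuracy and error rate computed this way always sum to 100%, which matches the relationship between the reported 86.1% average accuracy and 13.9% average error rate.
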
Keywords
Stemming Algorithm