Development of Stemming Algorithm for Wolaytta Text
Date
2003-07
Publisher
Addis Ababa University
Abstract
This study describes the design of a stemming algorithm for the Wolaytta language. To give a solid
background for the thesis, the literature on conflation in general and on stemming algorithms in
particular was reviewed. Since the nature and characteristics of suffixation guide the
development of a stemmer, Wolaytta morphology was studied and described in order
to model the language and develop an automatic procedure for conflation. The inflectional and
derivational morphologies of the language are discussed. It is indicated that suffixation is the
main word formation process in the Wolaytta language. The study also attempts to show that the language
is morphologically complex and uses extensive concatenation of suffixes.
The result of the study is a prototype context-sensitive iterative stemmer for the Wolaytta language.
An error-counting technique was employed to evaluate the performance of this stemmer. The
stemmer was trained on 3537 words (80% of the sample text), and the improved version achieved
an accuracy of 90.6% on the training set. Overstemmed and understemmed words
on the training set amounted to 8.6% (304 words) and 0.8% (28 words), respectively. When the stemmer
ran on the unseen sample of 884 words (20% of the sample text), it performed with an accuracy
of 86.9%. The percentages of errors recorded as understemmed and overstemmed on this unseen
sample (the test set) were 9% and 4.1%, respectively. Moreover, a dictionary reduction of 38.92% was
attained on the test set. The major sources of errors are also reported, along with
recommendations for further improving the performance of the stemmer and for further
research.
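
The error-counting figures reported above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only and is not the thesis's own evaluation code: the function names are hypothetical, and the dictionary reduction formula (the relative drop in distinct entries once words are replaced by their stems) is assumed to be the usual definition, since the abstract does not state it. The toy word list is English and serves only to show the calculation.

```python
def error_counts(total, over, under):
    """Return accuracy and error rates (percent, one decimal) from raw error counts.

    Hypothetical helper: a word counts as correct when it is neither
    overstemmed nor understemmed.
    """
    correct = total - over - under

    def pct(n):
        return round(100.0 * n / total, 1)

    return {
        "accuracy": pct(correct),
        "overstemmed": pct(over),
        "understemmed": pct(under),
    }


def dictionary_reduction(words, stems):
    """Percentage drop in distinct entries when words are replaced by stems.

    Assumed definition: 100 * (N - S) / N, with N distinct words and
    S distinct stems.
    """
    n_words = len(set(words))
    n_stems = len(set(stems))
    return 100.0 * (n_words - n_stems) / n_words


# Training-set counts taken from the abstract: 3537 words, 304 over, 28 under.
train = error_counts(total=3537, over=304, under=28)
print(train)

# Toy English example (not Wolaytta data): 6 distinct words collapse to 2 stems.
toy_words = ["connect", "connected", "connecting", "connection", "cat", "cats"]
toy_stems = ["connect", "connect", "connect", "connect", "cat", "cat"]
print(round(dictionary_reduction(toy_words, toy_stems), 2))
```

Applied to the training-set counts, this gives back the 90.6%, 8.6% and 0.8% quoted in the abstract. The test-set word counts per error type are not given, so only the training figures are checked here.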
Keywords
Information Science