Designing a Stemming Algorithm for Silt’e Language

Date

2012-06

Publisher

Addis Ababa University

Abstract

Variant word forms encountered in indexing and retrieval are one cause of the problems that arise in free-text retrieval systems. The variant word forms used in indexing and searching need to be taken into account when determining the relevance of a document to a user query that specifies just a single form. Reducing variant words to one form improves the performance of an IR system; this can be achieved by conflation techniques, usually stemming, which is the approach adopted in this work. Stemmers are used in information retrieval to reduce as many related words and word forms as possible to a standard form, which can then be used in the retrieval process. This research explores the possibility of developing a stemmer to conflate variant words of the Silt’e language. Silt’e belongs to the Semitic language group. These languages share a grammatical system based on a root-pattern structure: consonants bear the basic meaning, while vowels form different patterns. Stems are built from consonantal roots before other word forms are built. Silt’e uses affixation and reduplication to derive different word forms from stems. Common affix types are prefixes, suffixes, and infixes. Silt’e concatenates affixes extensively, which can produce relatively long words that often carry as much semantic information as a whole English phrase, clause, or sentence. As a result of this complex morphological structure, a single Silt’e word can have a very large number of variants. To design the stemmer, a sample text was collected from different sources, and a research paper that explains the morphology of the Silt’e language was also used. Affixes and stopwords were collected from this research paper and from the sample text to develop the stemmer. The stemmer developed in this study is iterative and uses context-sensitive and recoding rules that remove prefixes, suffixes, and reduplication of letters (type 1 and type 2).
In this experiment, the stripping procedures were applied in order: prefix, suffix, and finally letter reduplication. The stemmer was tested on a sample of 1486 words selected randomly from the sample texts. The results show that the stemmer achieves an accuracy of 85.71% and a dictionary reduction of 34.99% for stem words. Finally, conclusions and recommendations for future work are reported.
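
The stripping order described in the abstract (prefix, then suffix, then letter reduplication, each applied iteratively) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The affix lists are made-up placeholders rather than real Silt’e morphemes, the minimum stem length stands in for the context-sensitive conditions, the reduplication rule is a single assumed stand-in for the type 1 and type 2 rules (which the abstract does not define), and recoding rules are omitted. The dictionary reduction formula is likewise an assumed definition: the percentage drop from distinct words to distinct stems.

```python
import re

# Hypothetical placeholder affixes for illustration only; NOT real Silt'e morphemes.
PREFIXES = ("ba", "ya")
SUFFIXES = ("an", "ko")
MIN_STEM_LEN = 3  # assumed context-sensitive condition: never strip below this length


def strip_prefix(word):
    """Remove one listed prefix if the remainder is long enough."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            return word[len(prefix):]
    return word


def strip_suffix(word):
    """Remove one listed suffix if the remainder is long enough."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[:-len(suffix)]
    return word


def collapse_reduplication(word):
    """Assumed stand-in rule: collapse an immediately repeated two-letter sequence."""
    collapsed = re.sub(r"(..)\1", r"\1", word)
    return collapsed if len(collapsed) >= MIN_STEM_LEN else word


def stem(word):
    """Apply each step repeatedly, in the order prefix, suffix, reduplication."""
    for step in (strip_prefix, strip_suffix, collapse_reduplication):
        while True:
            reduced = step(word)
            if reduced == word:
                break
            word = reduced
    return word


def dictionary_reduction(words):
    """Assumed definition: percentage drop from distinct words to distinct stems."""
    distinct_words = set(words)
    distinct_stems = {stem(w) for w in distinct_words}
    return 100.0 * (len(distinct_words) - len(distinct_stems)) / len(distinct_words)
```

With these toy lists, `stem("balamko")` strips the placeholder prefix `ba` and suffix `ko`, while a short word such as `bat` is left unchanged because stripping would leave fewer than three letters.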

Keywords

Algorithm