A Stemming Algorithm Development for Tigrigna Language Text Documents

Date

2001-06

Publisher

Addis Ababa University

Abstract

Variant word forms encountered in indexing and retrieval are one cause of the problems involved in using free-text retrieval systems. The variant forms used in indexing and searching are likely to be of comparable importance in determining the relevance of a document to a user query that specifies just a single form. Reducing the variant words to one form improves the performance of an IR system, and this can be achieved by a conflation technique, usually stemming, which is the approach developed in this work. Stemmers are used in information retrieval to reduce as many related words and word forms as possible to a common form, which can then be used in the retrieval process. This research explores the possibility of developing a stemmer to conflate variant words of the Tigrigna language for use in IR for the language.
Tigrigna belongs to the Semitic language group. These languages share a common grammatical system based on a root-pattern structure. Consonants bear the basic meaning, while vowels form different patterns. Stems are built from consonantal roots, and other words are then built from stems. Tigrigna uses affixation to derive different word forms from stems. Common affixations are prefix, suffix, prefix-suffix pair and reduplication. Tigrigna concatenates affixes extensively, which can result in relatively long words that often carry as much semantic information as a whole English phrase, clause or sentence. Due to this complex morphological structure, a single Tigrigna word can have a thousand variants. To design the stemmer, a sample text was collected from three different sources. The word-distribution experiment on the sample data shows that words occur in their variant forms across the text and that singleton words constitute a large percentage of the text. This resulted in a low word ratio and a deviation from Zipf's law.
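The word-distribution measurements mentioned above can be illustrated with a minimal Python sketch. It is not the procedure used in the thesis. The function name `word_distribution` is hypothetical, "word ratio" is assumed here to mean tokens per distinct word, and the singleton share is assumed to be measured over distinct words. Under Zipf's law the product of rank and frequency stays roughly constant, so that product is listed for inspection.

```python
from collections import Counter


def word_distribution(tokens):
    """Summarise a token list: vocabulary size, singletons, rank-frequency table.

    Assumption: "word ratio" is read as tokens per distinct word, and the
    singleton share is taken over distinct words, not over running text.
    """
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    # Singletons are words that occur exactly once in the sample.
    singletons = sum(1 for c in counts.values() if c == 1)
    # Rank words by descending frequency; Zipf's law predicts that
    # rank * frequency is approximately constant across ranks.
    rank_freq = [
        (rank, freq, rank * freq)
        for rank, (_, freq) in enumerate(counts.most_common(), start=1)
    ]
    return {
        "tokens": n_tokens,
        "types": n_types,
        "tokens_per_type": n_tokens / n_types if n_types else 0.0,
        "singleton_share": singletons / n_types if n_types else 0.0,
        "rank_freq": rank_freq,
    }


if __name__ == "__main__":
    sample = "a b a c a b d".split()  # toy data, not Tigrigna text
    print(word_distribution(sample))
```

A sample with many singletons yields a low tokens-per-type value, and a rank-frequency product that drifts strongly across ranks indicates a deviation from Zipf's law.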
An iterative stemmer was developed that uses context-sensitive rules to remove prefixes, suffixes, prefix-suffix pairs, and reduplication of single and double letters. A semi-automated procedure was used to compile stop words and affixes. The stemmer was tested on a sample of 1568 words selected randomly from the sample texts. In this experiment the stripping procedures were applied in the following order: prefix-suffix pair, double-letter reduplication, prefix, suffix, and single-letter reduplication. The results show that the stemmer performs at an accuracy of 84% and achieves a dictionary reduction of 32.40% and 54.6% for stems and roots, respectively.
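The stripping order described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the thesis implementation. The affix lists are hypothetical Latin-letter placeholders rather than real Tigrigna affixes. The only context-sensitive condition modelled is an assumed minimum stem length (`MIN_STEM`). Reduplication is approximated as an immediately repeated one- or two-letter sequence. Root extraction is not covered.

```python
import re

# Hypothetical placeholder affixes for illustration only (NOT real Tigrigna).
PREFIXES = ["zi", "te"]
SUFFIXES = ["at", "na"]
PREFIX_SUFFIX_PAIRS = [("mi", "ti")]
MIN_STEM = 3  # assumed context condition: never strip below this length


def strip_pair(w):
    """Remove a matching prefix-suffix pair in one step."""
    for p, s in PREFIX_SUFFIX_PAIRS:
        if w.startswith(p) and w.endswith(s) and len(w) - len(p) - len(s) >= MIN_STEM:
            return w[len(p):len(w) - len(s)]
    return w


def strip_prefix(w):
    for p in PREFIXES:
        if w.startswith(p) and len(w) - len(p) >= MIN_STEM:
            return w[len(p):]
    return w


def strip_suffix(w):
    for s in SUFFIXES:
        if w.endswith(s) and len(w) - len(s) >= MIN_STEM:
            return w[:len(w) - len(s)]
    return w


def _collapse(pattern, w):
    """Collapse the first immediately repeated sequence matched by pattern."""
    out = re.sub(pattern, r"\1", w, count=1)
    return out if len(out) >= MIN_STEM else w


def strip_double_redup(w):
    return _collapse(r"(..)\1", w)  # e.g. "baba" -> "ba"


def strip_single_redup(w):
    return _collapse(r"(.)\1", w)  # e.g. "bb" -> "b"


# Order taken from the abstract: pair, double-letter reduplication,
# prefix, suffix, single-letter reduplication.
STEPS = [strip_pair, strip_double_redup, strip_prefix, strip_suffix, strip_single_redup]


def stem(word, max_passes=10):
    """Apply all steps repeatedly until the word stops changing."""
    for _ in range(max_passes):
        before = word
        for step in STEPS:
            word = step(word)
        if word == before:
            break
    return word


def dictionary_reduction(words):
    """Percentage drop in distinct entries after stemming."""
    unique_words = set(words)
    unique_stems = {stem(w) for w in unique_words}
    return 100.0 * (len(unique_words) - len(unique_stems)) / len(unique_words)


if __name__ == "__main__":
    toy = ["sebar", "zisebar", "sebarat", "sebabar", "sebbar", "misebarti"]
    print({w: stem(w) for w in toy})
    print(dictionary_reduction(toy))
```

With the placeholder affixes, all six toy variants conflate to one form. On real data, the affix lists and context rules would come from the semi-automated compilation the abstract describes.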

Keywords

Information Science