A Stemming Algorithm Development for Tigrigna Language Text Documents
Date
2001-06
Publisher
Addis Ababa University
Abstract
Variant word forms that are likely to be encountered in indexing and retrieval are one of the
causes of the problems involved in the use of free-text retrieval systems. The variant
word forms used in indexing and searching are likely to be of comparable importance in
determining the relevance of a document to a user query that specifies just a single form.
Reducing the variant words to one form improves the performance of an IR system, and this can be
achieved by a conflation technique, usually stemming, which is the technique developed in this work.
Stemmers are used in information retrieval to reduce as many related words and word forms
as possible to a common form, which can then be used in the retrieval process.
This research explores the possibility of developing a stemmer to conflate variant words of
the Tigrigna language for use in IR for that language. Tigrigna belongs to the Semitic language
group. These languages have a common grammatical system based on a root-pattern structure.
Consonants bear the basic meanings while vowels form different patterns. Stems are built
from consonantal roots before other words are built from stems. Tigrigna uses affixation to
derive different word forms from stems. Common affixations are prefix, suffix, prefix-suffix
pair and reduplication. Tigrigna uses extensive concatenation of affixes and can result in
relatively long words, which often contain an amount of semantic information equivalent to a
whole English phrase, clause or sentence. Due to this complex morphological structure, a
single Tigrigna word can have thousands of variants.
To design the stemmer, a sample text was collected from three different sources. The
word-distribution experiment on the sample data shows that words occur in their variant forms
across the text and that singleton words constitute a large percentage of the text. This resulted in a low
word ratio and a deviation from Zipf's law.
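
The kind of word-distribution measurement described above can be sketched as follows. This is a minimal illustration, not the procedure used in the thesis: the function name word_distribution, the choice of tokens per distinct word as the "word ratio", and the singleton share computed over distinct words are all assumptions, since the abstract does not define them. Under Zipf's law the product of rank and frequency stays roughly constant, so the last column of the table gives a quick check for deviation.

```python
from collections import Counter


def word_distribution(tokens):
    """Summarise how words are distributed in a tokenised text.

    Assumption: `tokens` is a non-empty list of already-tokenised words.
    """
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)

    # Singletons are words that occur exactly once in the sample.
    singletons = sum(1 for freq in counts.values() if freq == 1)

    # Rank words by descending frequency; under Zipf's law
    # rank * frequency should be roughly constant.
    rank_freq = [
        (rank, freq, rank * freq)
        for rank, (_, freq) in enumerate(counts.most_common(), start=1)
    ]

    return {
        "tokens": n_tokens,
        "types": n_types,
        # Assumed definition of "word ratio": running words per distinct word.
        "tokens_per_type": n_tokens / n_types,
        "singleton_share_of_types": singletons / n_types,
        "rank_freq": rank_freq,
    }
```

A text dominated by singletons gives a tokens-per-type value close to 1 and a rank-frequency product that is far from constant, which is the pattern the abstract reports.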
The stemmer developed is iterative and uses context-sensitive rules that remove
prefixes, suffixes, prefix-suffix pairs, and reduplication of single and double letters. A semi-automated
procedure was used to compile stop words and affixes. The stemmer was tested on
sample data of 1568 words, which were selected randomly from the sample texts. In this
experiment the stripping procedures were applied in the order of prefix-suffix, double-letter
reduplication, prefix, suffix and single-letter reduplication. The result of the experiment shows
that the stemmer performs at an accuracy of 84% and brings a dictionary reduction of 32.40%
and 54.6% for stem and root respectively.
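
The stripping order and the dictionary-reduction measure described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis's stemmer: the affix lists are placeholder strings in Latin letters chosen for illustration and not taken from the thesis, the only context condition modelled is a minimum remaining length (MIN_STEM), reduplication is approximated as an immediately repeated character or character pair, and dictionary reduction is assumed to mean the percentage drop from distinct words to distinct stems.

```python
import re

# Placeholder affixes for illustration only; NOT the lists compiled in the thesis.
PREFIXES = ("zey", "te", "ye")
SUFFIXES = ("kum", "na", "tat")
PREFIX_SUFFIX_PAIRS = (("te", "na"),)
MIN_STEM = 3  # assumed context condition: never strip below this length


def strip_pair(word):
    """Remove a matching prefix-suffix pair in one step."""
    for pre, suf in PREFIX_SUFFIX_PAIRS:
        rest = len(word) - len(pre) - len(suf)
        if word.startswith(pre) and word.endswith(suf) and rest >= MIN_STEM:
            return word[len(pre):len(word) - len(suf)]
    return word


def strip_double_redup(word):
    """Collapse one repeated two-character sequence (xyxy -> xy)."""
    out = re.sub(r"(..)\1", r"\1", word, count=1)
    return out if len(out) >= MIN_STEM else word


def strip_prefix(word):
    for pre in PREFIXES:
        if word.startswith(pre) and len(word) - len(pre) >= MIN_STEM:
            return word[len(pre):]
    return word


def strip_suffix(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= MIN_STEM:
            return word[:-len(suf)]
    return word


def strip_single_redup(word):
    """Collapse one repeated single character (xx -> x)."""
    out = re.sub(r"(.)\1", r"\1", word, count=1)
    return out if len(out) >= MIN_STEM else word


# Order stated in the abstract: prefix-suffix, double-letter reduplication,
# prefix, suffix, single-letter reduplication.
STEPS = (strip_pair, strip_double_redup, strip_prefix, strip_suffix,
         strip_single_redup)


def stem(word):
    """Apply all steps repeatedly until the word stops changing."""
    while True:
        before = word
        for step in STEPS:
            word = step(word)
        if word == before:
            return word


def dictionary_reduction(words):
    """Percentage drop from distinct words to distinct stems (assumed definition)."""
    unique_words = set(words)
    unique_stems = {stem(w) for w in unique_words}
    return 100.0 * (len(unique_words) - len(unique_stems)) / len(unique_words)
```

With these placeholder lists, the made-up strings "tesebarna", "sebarkum", "zeysebar", "sebabar" and "sebbar" all reduce to "sebar". A real stemmer for Tigrigna would operate on Ethiopic script and would need the compiled affix and stop-word lists that the abstract mentions.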
Keywords
Information Science