N-Gram-Based Automatic Indexing for Amharic Text
Date
2002-07
Authors
Publisher
Addis Ababa University
Abstract
This research explored the applicability of the n-gram method for indexing text written in the
Amharic language. One hundred documents (Amharic news articles written in the Visual Ge'ez font
and obtained from Walta Information Center) and 24 queries (collected from people who
frequently read newspapers) were selected and used for the test. The values of n used were
n=2 (bi-grams) and n=3 (tri-grams). For comparison purposes, unstemmed words were also
used as index terms. The Vector Space Model (VSM) was used for document representation and
retrieval. The individual words, bi-grams, and tri-grams were identified for the collection. These unique
terms were then weighted using the TF-IDF weighting technique used in the VSM. The term
vectors were generated from these calculated weights for each type of term, i.e. unstemmed
word, bi-gram, and tri-gram. The query terms (words, bi-grams, and tri-grams) were also
identified and weighted. A different weighting formula was used for the query terms. The
vectors of terms were then formed.

To retrieve relevant documents, similarity calculations were performed between each
document-query vector pair. The ranked results from this calculation were then used to
calculate precision and recall measures that are used in the VSM to test or compare retrieval
effectiveness. The relevance information that was used to determine recall and precision was
stored in a table. Recall and precision values for the queries for each type of index (word, bi-gram,
and tri-gram) were calculated and compared.

The results showed that although word indexes are better in overall indexing performance, bi-grams
and tri-grams are also of value for indexing, with performance comparable to that of words.
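
The steps summarized in the abstract (n-gram extraction, term weighting, similarity ranking, and precision/recall evaluation) can be illustrated with a minimal Python sketch. It is not the implementation used in the study. The abstract does not state the exact weighting formulas, so the sketch assumes the common tf * log(N/df) weight for both documents and queries (the study used a different, unspecified formula for queries). It also assumes that n-grams are cut within word boundaries and that cosine similarity is the similarity measure. All function names and the transliterated toy strings are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Overlapping character n-grams taken inside each whitespace-separated word.

    Assumption: n-grams do not span word boundaries; words shorter than n
    contribute no n-grams.
    """
    grams = []
    for word in text.split():
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def tfidf_vectors(docs_terms):
    """Weight every term of every document by tf * log(N / df) (assumed formula)."""
    n_docs = len(docs_terms)
    df = Counter(t for terms in docs_terms for t in set(terms))
    idf = {t: math.log(n_docs / d) for t, d in df.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(terms).items()}
        for terms in docs_terms
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query_terms, doc_vectors, idf):
    """Indices of documents with non-zero similarity, best match first."""
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query_terms).items()}
    scores = [(cosine(q, d), i) for i, d in enumerate(doc_vectors)]
    return [i for s, i in sorted(scores, key=lambda p: (-p[0], p[1])) if s > 0]


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against known relevant documents."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy collection (hypothetical transliterated strings, not data from the study).
docs = ["selam alem", "alem bunna", "tej"]
doc_vectors, idf = tfidf_vectors([char_ngrams(d, 2) for d in docs])
ranked = rank(char_ngrams("bunna", 2), doc_vectors, idf)
print(ranked, precision_recall(ranked, relevant=[1]))
```

Replacing char_ngrams(d, 2) with char_ngrams(d, 3) or with d.split() gives the tri-gram and unstemmed-word indexes, so the three index types can be compared on the same queries and relevance judgments.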
Keywords
Information Science