|Title: ||N-gram-Based Automatic Indexing for Amharic Text|
|Authors: ||BETHLEHEM, MENGISTU|
|Advisors: ||Dr. Abebe G/Tsadik; Ato Werkshet Lamenew; Wzt. Saba Amsalu|
|Copyright: ||2002 |
|Date Added: ||14-May-2008 |
|Publisher: ||Addis Ababa University|
|Abstract: ||This research explored the applicability of the n-gram method for indexing text written in
the Amharic language. The test collection consisted of 100 documents (Amharic news articles written in the Visual
Ge’ez font, obtained from Walta Information Center) and 24 queries (collected from
people who frequently read newspapers). The values
of n used were n=2 (bi-grams) and n=3 (tri-grams). For comparison, unstemmed
words were also used as index terms.
The Vector Space Model (VSM) was used for document representation and retrieval.
First, the unique words, bi-grams, and tri-grams in the collection were identified.
These unique terms were then weighted using the TF-IDF weighting technique used in
the VSM. Term vectors were generated from the calculated weights for each type of
term, i.e. unstemmed word, bi-gram, and tri-gram. The query terms (words, bi-grams,
and tri-grams) were also identified and weighted, using a different weighting formula
from the one used for the documents. The query vectors were then formed.
To retrieve relevant documents, the similarity between each document vector and
each query vector was calculated. The ranked results of this calculation were then used
to compute precision and recall, the measures used to evaluate and compare
retrieval effectiveness. The relevance information needed to determine recall and
precision was stored in a table. Recall and precision values were calculated for every query
under each type of index (word, bi-gram, and tri-gram) and then compared.
The results showed that although word indexes perform better overall,
bi-grams and tri-grams have indexing value comparable to that of words.|
|Description: ||A thesis submitted to the School of Graduate Studies of Addis Ababa University in
partial fulfillment of the requirements for the Degree of Master of Science in Information Science|
|Appears in:||Thesis - Information Science|
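The pipeline outlined in the abstract (n-gram extraction, TF-IDF weighting, vector similarity ranking, and precision/recall evaluation) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. It assumes character n-grams taken within word boundaries, a plain tf * log(N/df) weight, and cosine similarity. The thesis used a different, unspecified weighting formula for queries; here the same weighting is reused for simplicity. All function names and the toy English documents are hypothetical stand-ins for the Amharic collection.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return overlapping character n-grams taken from each word of text."""
    grams = []
    for word in text.split():
        if len(word) < n:
            grams.append(word)  # assumption: words shorter than n are kept whole
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def tfidf_vectors(term_lists):
    """Weight each term by tf * log(N / df), one common TF-IDF variant."""
    n_docs = len(term_lists)
    df = Counter(term for terms in term_lists for term in set(terms))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(terms).items()}
        for terms in term_lists
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query_terms, doc_vectors, idf):
    """Return indices of documents with nonzero similarity, best match first."""
    # Assumption: the query reuses the document weighting (the thesis did not).
    query = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query_terms).items()}
    scores = [(cosine(query, doc), i) for i, doc in enumerate(doc_vectors)]
    ordered = sorted(scores, key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in ordered if score > 0]


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against a relevant set."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy collection (hypothetical data, not from the thesis).
docs = ["market price rises", "football match result", "price of coffee"]
doc_vectors, idf = tfidf_vectors([char_ngrams(d, 2) for d in docs])

ranking = rank(char_ngrams("coffee price", 2), doc_vectors, idf)
print(ranking)                           # [2, 0]
print(precision_recall(ranking, {2}))    # (0.5, 1.0)
```

Changing the second argument of `char_ngrams` to 3 gives the tri-gram index, and replacing `char_ngrams(d, 2)` with `d.split()` gives the unstemmed-word baseline, so the three index types can be compared with the same ranking and evaluation code.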
Items in the AAUL Digital Library are protected by copyright, with all rights reserved, unless otherwise indicated.