N-Gram-Based Automatic Indexing for Amharic text

Mengistum Bethelhem

dc.contributor.advisor	G/Tsadik Abebe (PhD)
dc.contributor.advisor	Lamenew Workshet (PhD)
dc.contributor.advisor	Amsalu Saba
dc.contributor.author	Mengistum Bethelhem
dc.date.accessioned	2018-11-15T08:17:04Z
dc.date.accessioned	2023-11-18T12:43:58Z
dc.date.available	2018-11-15T08:17:04Z
dc.date.available	2023-11-18T12:43:58Z
dc.date.issued	2002-06
dc.description.abstract	This research explored the applicability of the n-gram method for indexing text written in the Amharic language. The test collection consisted of 100 documents (Amharic news articles written in the Visual Ge’ez font, obtained from Walta Information Center) and 24 queries (collected from people who frequently read newspapers). The values of n used were n=2 (bi-grams) and n=3 (tri-grams). For comparison, unstemmed words were also used as index terms. The Vector Space Model (VSM) was used for document representation and retrieval. The unique words, bi-grams, and tri-grams in the collection were identified and weighted using the TF/IDF weighting technique of the VSM, and document term vectors were generated from these weights for each type of term (unstemmed word, bi-gram, and tri-gram). The query terms (words, bi-grams, and tri-grams) were likewise identified, weighted with a different weighting formula, and formed into query vectors. To retrieve relevant documents, the similarity between each document-query vector pair was calculated. The ranked results were then used to calculate precision and recall, the measures used to test and compare retrieval effectiveness; the relevance information needed to determine them was stored in a table. Recall and precision values for the queries were calculated and compared for each type of index (word, bi-gram, and tri-gram). The results showed that although word indexes performed better overall, bi-grams and tri-grams achieved indexing performance comparable to that of words.	en_US
dc.identifier.uri	http://etd.aau.edu.et/handle/12345678/14238
dc.language.iso	en	en_US
dc.publisher	Addis Ababa University	en_US
dc.subject	Automatic Indexing	en_US
dc.title	N-Gram-Based Automatic Indexing for Amharic text	en_US
dc.type	Thesis	en_US
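
The pipeline summarized in the abstract (n-gram extraction, TF/IDF weighting, vector similarity, and precision/recall) can be sketched roughly as follows. This is an illustrative sketch, not the thesis implementation: the function names, the choice of cosine similarity, the rule of keeping words shorter than n whole, and the Latin-script toy documents are all assumptions. The abstract states that a different weighting formula was used for queries but does not give it, so the sketch simply reuses tf × idf for the query as a stand-in.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Split text into words and return the overlapping character n-grams of each word."""
    grams = []
    for word in text.split():
        if len(word) < n:
            grams.append(word)  # assumption: words shorter than n are kept whole
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def tfidf_vectors(term_lists):
    """Return one sparse {term: tf * log(N / df)} vector per document, plus the idf table."""
    n_docs = len(term_lists)
    df = Counter(term for terms in term_lists for term in set(terms))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(terms).items()}
        for terms in term_lists
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors (assumed similarity measure)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(docs, query, n):
    """Return document indices ordered by decreasing similarity to the query."""
    doc_vectors, idf = tfidf_vectors([char_ngrams(d, n) for d in docs])
    # Stand-in query weighting: the thesis used a different, unspecified formula.
    query_tf = Counter(char_ngrams(query, n))
    query_vector = {t: tf * idf.get(t, 0.0) for t, tf in query_tf.items()}
    scores = [(cosine(query_vector, dv), i) for i, dv in enumerate(doc_vectors)]
    return [i for _, i in sorted(scores, key=lambda pair: (-pair[0], pair[1]))]


def precision_recall(ranked_ids, relevant_ids, k):
    """Precision and recall over the top-k ranked documents."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Toy placeholder collection (not data from the thesis).
docs = ["addis ababa news", "ababa city council", "sports news today"]
ranking = rank(docs, "ababa news", n=2)
print(ranking, precision_recall(ranking, {0}, k=1))
```

Calling `rank` with `n=3` gives the tri-gram variant, and replacing `char_ngrams` with a plain `text.split()` gives the unstemmed-word baseline used for comparison.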

Files

Original bundle

Name: Bethelhem Mengistu.pdf
Size: 632.76 KB
Format: Adobe Portable Document Format

Collections

Information Sciences