The Application of Websom for Amharic Text Retrieval
Date
2003-06
Publisher
Addis Ababa University
Abstract
This research explored the applicability of WEBSOM (Web-Based Self-Organizing Map) for retrieving texts
written in the Amharic language. The method applies a neural network's self-organizing algorithm to
generate a map display. The map display detects complex relationships among the given documents and
reveals them through the arrangement of terms extracted from the documents.
To conduct the experiment, 330 Amharic news articles of three classes were collected from the Ethiopian
News Agency. Of these, 248 were taken as a training set and the remaining 82 as a test set. The Vector
Space Model was used for document representation. Non-content-bearing terms were removed from the lists
of terms identified from the headline and slug parts of the news articles, and a suffix/prefix-stripping
technique was applied to the remaining list. After terms with different written forms were converted into
one common form, terms with a total frequency above 70 or below 3 were discarded from the list. Matrices
for both the training and test sets were then constructed on the remaining 142 terms. A normalized weight
was assigned to each term in a given news article using the TF-IDF (Term Frequency-Inverse Document
Frequency) weighting technique, and the vector matrices were prepared in the format required by the tool
to be used.
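The term filtering and weighting steps described above can be illustrated with a minimal sketch. It is not the study's actual procedure: the abstract does not state which TF-IDF variant or normalization was used, so the raw-count TF, the logarithmic IDF, and the unit-length normalization below are assumptions, and the function names are hypothetical. Documents are assumed to be already tokenized, stop-word filtered, and stemmed.

```python
import math
from collections import Counter


def build_vocabulary(docs, min_total=3, max_total=70):
    """Keep terms whose total frequency over all docs is within [min_total, max_total].

    The defaults mirror the thresholds in the abstract (discard above 70 or below 3).
    """
    totals = Counter(term for doc in docs for term in doc)
    return sorted(t for t, n in totals.items() if min_total <= n <= max_total)


def tfidf_matrix(docs, vocab):
    """Return one unit-length TF-IDF row per document (assumed variant: tf * log(N / df))."""
    n_docs = len(docs)
    vocab_set = set(vocab)
    df = Counter()
    for doc in docs:
        df.update(set(doc) & vocab_set)  # count each term once per document

    rows = []
    for doc in docs:
        tf = Counter(doc)
        row = [tf[t] * math.log(n_docs / df[t]) if df[t] else 0.0 for t in vocab]
        norm = math.sqrt(sum(w * w for w in row))
        # Documents with no weighted terms stay as all-zero vectors.
        rows.append([w / norm if norm else 0.0 for w in row])
    return rows
```

With the thresholds from the abstract, `build_vocabulary(docs)` would be called on the tokenized headline and slug terms, and `tfidf_matrix(docs, vocab)` would then yield one row per news article.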
Using Nenet (Neural Network Tool), the SOM was trained with the 248 articles in the training set and
tested with three test sets selected from the three classes of news articles. From the distribution of these
articles on the map, it was observed that the map placed similar articles near each other. The results
of the three tests indicated that the clustering capability of the SOM for Amharic documents is
promising.
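For readers unfamiliar with how a self-organizing map places similar vectors near each other, the following is a minimal pure-Python sketch of the general SOM training loop. It does not reproduce Nenet or the settings used in the study: the grid size, epoch count, linearly decaying learning rate, and Gaussian neighborhood are all illustrative assumptions, and the function names are hypothetical.

```python
import math
import random


def best_matching_unit(weights, vec):
    """Return (row, col) of the map node whose weight vector is closest to vec."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((a - b) ** 2 for a, b in zip(w, vec))
            if d < best_dist:
                best, best_dist = (r, c), d
    return best


def train_som(data, rows=4, cols=4, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols SOM on data (a list of equal-length vectors)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    radius0 = max(rows, cols) / 2
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for vec in rng.sample(data, len(data)):  # shuffled pass over the data
            frac = step / total_steps
            lr = lr0 * (1 - frac)                      # assumed linear decay
            radius = max(radius0 * (1 - frac), 0.5)    # shrinking neighborhood
            br, bc = best_matching_unit(weights, vec)
            for r in range(rows):
                for c in range(cols):
                    dist2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-dist2 / (2 * radius ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        # Pull the node toward the input, scaled by grid proximity to the BMU.
                        w[i] += lr * h * (vec[i] - w[i])
            step += 1
    return weights
```

After training, calling `best_matching_unit(weights, vec)` for a test article's vector gives the map cell it lands on, which is how the distribution of test articles over the map can be inspected.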
Lastly, a map was constructed for all 330 news articles, and an HTML-based prototype browsing interface
was developed for the map and labeled with descriptive terms that convey the properties of each area. A
link to the actual database was also made through Active Server Pages so that users can browse the map
for relevant articles.
Keywords
Retrieval