The Application of Websom for Amharic Text Retrieval

Date

2003-06

Publisher

Addis Ababa University

Abstract

This research explored the applicability of WEBSOM (Web-based Self-Organizing Map) for retrieving texts written in Amharic. The method applies a neural network's self-organizing algorithm to generate a map display. The map display detects complex relationships among the given documents and reveals them through the arrangement of terms abstracted from the documents. To conduct the experiment, 330 Amharic news articles from three classes were collected from the Ethiopian News Agency; 248 of the articles were used as a training set and the remaining 82 as a test set. For document representation, the Vector Space Model was used. Non-content-bearing terms were removed from the list of terms identified from the headline and slug parts of the news articles, and a suffix- and prefix-stripping technique was applied to the remaining terms. After terms with variant written forms were converted to one common form, terms with a total frequency above 70 or below 3 were discarded. A matrix was then constructed for both the training and test sets over the remaining 142 terms. Each term in a given news article was assigned a normalized weight based on the TF-IDF (Term Frequency-Inverse Document Frequency) weighting technique, and the vector matrices were prepared in the format required by the tool. Using Nenet (Neural Network Tool), the SOM was trained with the 248 articles in the training set and tested with three test sets selected from the three classes of news articles. The distribution of these articles on the map showed that the map placed similar articles near each other. The results of the three tests indicate that the clustering capability of the SOM for Amharic documents is promising.
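The term selection and weighting steps described above can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the function names `filter_terms` and `tfidf_matrix`, the logarithmic IDF form, the L2 normalization, and the placeholder tokens are all assumptions made for illustration, since the abstract does not state which TF-IDF variant or normalization was applied.

```python
import math
from collections import Counter


def filter_terms(docs, min_total=3, max_total=70):
    """Keep terms whose total frequency over all documents lies in
    [min_total, max_total]; the defaults mirror the thresholds quoted
    in the abstract (discard above 70 and below 3)."""
    totals = Counter(term for doc in docs for term in doc)
    return sorted(t for t, n in totals.items() if min_total <= n <= max_total)


def tfidf_matrix(docs, vocab):
    """Return one TF-IDF row per document over vocab.
    Assumed variant: raw term frequency times log(N / df), then L2-normalized."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        row = [tf[t] * math.log(n_docs / df[t]) if df[t] else 0.0 for t in vocab]
        norm = math.sqrt(sum(w * w for w in row))
        rows.append([w / norm if norm else 0.0 for w in row])
    return rows


# Hypothetical toy corpus: each document is a list of already-stemmed terms.
docs = [
    ["term_a", "term_b", "term_b"],
    ["term_a", "term_c"],
    ["term_a", "term_b"],
]
vocab = filter_terms(docs, min_total=2, max_total=3)
matrix = tfidf_matrix(docs, vocab)
```

With this IDF form, a term that occurs in every document receives a weight of zero, which is one reason very frequent terms contribute little to the document vectors.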
Lastly, a map was constructed for all 330 news articles, and an HTML-based prototype browsing interface was developed and labeled with descriptive terms that convey the properties of each map area. The map was also linked to the actual database through Active Server Pages so that users can browse the map for relevant articles.
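The way a self-organizing map places similar document vectors near each other can be sketched in a few lines of plain Python. This is a generic SOM written for illustration, not the Nenet tool used in the study; the grid size, learning-rate schedule, Gaussian neighborhood, and the names `train_som` and `best_matching_unit` are assumptions.

```python
import math
import random


def best_matching_unit(grid, v):
    """Return the (row, col) of the unit whose weights are closest to v."""
    return min(grid, key=lambda u: sum((a - b) ** 2 for a, b in zip(grid[u], v)))


def train_som(vectors, rows=4, cols=4, epochs=50, lr0=0.5, seed=0):
    """Train a rectangular SOM and return a dict mapping (row, col)
    to that unit's weight vector."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # learning rate decays linearly
        radius = max(radius0 * (1 - frac), 0.5)  # neighborhood shrinks over time
        for v in rng.sample(vectors, len(vectors)):  # shuffled presentation order
            br, bc = best_matching_unit(grid, v)
            for (r, c), w in grid.items():
                d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-d2 / (2 * radius ** 2))  # Gaussian neighborhood
                for i in range(dim):
                    w[i] += lr * h * (v[i] - w[i])
    return grid


# Hypothetical document vectors from two dissimilar groups.
group_a = [[1.0, 0.0, 0.0]] * 5
group_b = [[0.0, 0.0, 1.0]] * 5
som = train_som(group_a + group_b, rows=3, cols=3, epochs=30)
unit_a = best_matching_unit(som, group_a[0])
unit_b = best_matching_unit(som, group_b[0])
```

In a browsing interface of the kind described, each map unit would list the articles whose vectors it matches best, so a user who opens one unit also finds related articles in its neighbors.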

Keywords

Retrieval
