Automatic Thesaurus Construction From Wolaytta Text

Addis Ababa University


A thesaurus is a set of terms used to classify documents during indexing and to expand queries during searching, with the aim of enhancing retrieval effectiveness. Information retrieval systems face a major problem: on the one hand, users are required to describe their information need explicitly to the system; on the other hand, the system itself often retrieves irrelevant documents because of vocabulary mismatch between query terms and index terms. Because an information retrieval system compares query terms and index terms at the lexical level, this mismatch is pronounced enough to affect retrieval performance. A thesaurus addresses the problem by providing a precise, controlled vocabulary of terms for indexing and searching, thereby resolving the vocabulary mismatch. Wolaytta is an official language of literacy in Ethiopia. Since the Latin script was introduced into its writing system in 1993, the language has evolved significantly, from purely verbal communication to a medium of instruction and then to a source of information. For the language to be used as a source of information, a retrieval system should be designed with enhanced capability to resolve whatever mismatches arise between query terms and index terms. This thesis develops an automatic association thesaurus from Wolaytta text, either as a starting point for an enhanced retrieval system or as a framework for the development of a cross-language retrieval system. The developed system constructs the association thesaurus automatically from a document corpus, based on term-to-term co-occurrence. To obtain reasonably good performance, the system incorporates manually compiled stop-word and suffix lists, and with them it achieved better results in generating related concepts.
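
The pipeline described above (stop-word removal, suffix stripping, then term-to-term co-occurrence) can be illustrated with a minimal Python sketch. It is not the thesis implementation. The stop-word and suffix lists are English placeholders standing in for the manually compiled Wolaytta lists, the function names are hypothetical, and the Dice coefficient over document-level co-occurrence is an assumed association measure, since the abstract does not name the one actually used.

```python
from collections import Counter
from itertools import combinations

# Placeholder lists (assumption): the thesis compiles Wolaytta stop words
# and suffixes manually; these English stand-ins only illustrate the flow.
STOP_WORDS = {"the", "of"}
SUFFIXES = ["ing", "s"]


def strip_suffix(token, suffixes=SUFFIXES, min_stem=3):
    """Remove the longest matching suffix, keeping at least min_stem characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[:-len(suffix)]
    return token


def preprocess(text):
    """Lowercase, split on whitespace, drop stop words, and stem each token."""
    tokens = text.lower().split()
    return {strip_suffix(t) for t in tokens if t not in STOP_WORDS}


def build_thesaurus(docs, top_k=3):
    """Map each term to its top_k associated terms by Dice coefficient."""
    doc_freq = Counter()   # number of documents containing each term
    pair_freq = Counter()  # number of documents containing both terms
    for text in docs:
        terms = preprocess(text)
        doc_freq.update(terms)
        pair_freq.update(combinations(sorted(terms), 2))

    related = {}
    for (a, b), n_ab in pair_freq.items():
        # Dice coefficient (assumed measure): 2 * df(a, b) / (df(a) + df(b))
        score = 2 * n_ab / (doc_freq[a] + doc_freq[b])
        related.setdefault(a, []).append((b, score))
        related.setdefault(b, []).append((a, score))

    # Rank by descending score, breaking ties alphabetically.
    return {
        term: sorted(pairs, key=lambda p: (-p[1], p[0]))[:top_k]
        for term, pairs in related.items()
    }
```

Calling build_thesaurus on a list of document strings returns, for each stemmed term, a ranked list of (related term, score) pairs. Such a list could then feed query expansion by adding the top-ranked associates of each query term before matching against the index.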



