Linguistically Motivated Amharic IR (LM-IR)






Addis Ababa University


Information Retrieval (IR) is an essential tool for knowledge acquisition in every society. The challenge of designing effective IR for Amharic stems from linguistic characteristics that are specific to the language. Detailed studies of Amharic point to two core features that make it difficult to apply IR models that are effective for English: the first is the syllabic nature of the writing system, and the second is the morphological nature of word formation. These characteristics produce a large number of morphological variants and considerable linguistic ambiguity, which is why applying existing IR models causes document silence and noise during retrieval. Adopted statistical preprocessing models fail to give enough attention to these core characteristics of the language. In this research, an attempt is made to develop a new Linguistic Analyzer (LA) for word preprocessing, using morphosyntactic analysis (MSA) to resolve the challenges related to linguistic ambiguity and linguistic variation. Morphological variation has been a major challenge for Amharic IR systems because it causes document silence during retrieval. This research addresses the problem by introducing an incremental index file structure. Incremental indexing can store linguistic inflections related to gender, number, tense, and other forms. This indexing structure helps to preserve precision while increasing the recall of the retrieval system. The LA preprocessor was built using 74,000 words found in the Amharic Bible. When the newly designed LA was used to preprocess 5,000 words, it achieved a performance of 82%. On the same test, the statistical preprocessor with stemming delivered a maximum of only 30%. The LM-IR system, which is built on top of the LA, has an incremental index file structure and delivers an average F-measure of 83%.
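The idea of an index that keeps inflected forms instead of collapsing them can be illustrated with a minimal sketch. This is not the thesis implementation: the class name IncrementalIndex, its methods, and the two-level layout (root, then surface form, then document ids) are assumptions made for illustration only. The root of each word is supplied by hand, standing in for the output of the Linguistic Analyzer, and English placeholder words are used in place of Amharic.

```python
from collections import defaultdict


class IncrementalIndex:
    """Hypothetical two-level index: root -> surface form -> document ids.

    Keeping each inflected surface form under its root means a query can
    match one exact form (favouring precision) or widen to every form of
    the root (favouring recall), without discarding the inflection.
    """

    def __init__(self):
        self._postings = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, root, surface):
        # The root is assumed to come from a morphological analyzer;
        # here the caller passes it in directly.
        self._postings[root][surface].add(doc_id)

    def search_exact(self, root, surface):
        # Documents containing exactly this inflected form.
        return set(self._postings.get(root, {}).get(surface, set()))

    def search_root(self, root):
        # Documents containing any inflected form of the root.
        docs = set()
        for doc_ids in self._postings.get(root, {}).values():
            docs |= doc_ids
        return docs


# Placeholder data (not from the thesis).
index = IncrementalIndex()
index.add(1, "walk", "walked")
index.add(2, "walk", "walking")
index.add(3, "talk", "talked")

exact_hits = index.search_exact("walk", "walked")
root_hits = index.search_root("walk")
```

In this sketch a plain stemmed index would keep only the root and lose the distinction between the two inflected forms, whereas the two-level layout answers both the narrow and the broad query from the same structure.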
It was possible to maintain a recall of 88% while keeping precision at no less than 76%. The comparison of the LA and the statistical word preprocessor shows a significant difference in effectiveness; the LA approach therefore benefits Amharic IR design. In addition, the incremental indexing structure prevents the semantic loss on index words that occurs with statistical index structures. The incremental indexing structure helps to increase recall and precision at the same time. This research also shows that it is possible to design Amharic IR using linguistic techniques. Further research, especially on the searching component of the linguistic approach to Amharic IR, would therefore be expected to yield even better results.
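For readers relating the reported recall, precision, and F-measure figures, the following sketch computes the standard F-measure from a precision and recall pair. Whether the thesis uses the balanced form (beta = 1) is an assumption, and the function name f_measure is introduced here for illustration only.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Figures taken from the abstract: recall of 88%, precision floor of 76%.
f1_at_floor = f_measure(precision=0.76, recall=0.88)
```

Because 76% is stated as a lower bound on precision, the value computed at that floor is itself only a lower bound for that recall level; an average F-measure taken over individual queries need not equal the F-measure of the averaged precision and recall.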



Information Science