Linguistically Motivated Amharic IR (LM-IR)
Date
2013-06
Publisher
Addis Ababa University
Abstract
Information Retrieval (IR) is an essential tool for acquiring knowledge in
every society. The challenge of designing effective IR for Amharic stems from
linguistic characteristics that are specific to the language. Detailed studies
of Amharic point to two core features that make it difficult to apply IR models
that are effective for English: the syllabic nature of the writing system and
the morphological nature of word formation. These characteristics produce
extensive morphological variation and linguistic ambiguity. As a result,
applying existing IR models causes document silence and noise during
retrieval.
Adopted statistical preprocessing models fail to give enough attention to the
core characteristics of the language. This research therefore attempts to
develop a new Linguistic Analyzer (LA), a word preprocessor that uses
morpho-syntactic analysis (MSA) to resolve the challenges of linguistic
ambiguity and linguistic variation.
Morphological variation has been a major challenge for Amharic IR systems
because it causes document silence during retrieval. This research addresses
the problem by introducing an incremental index file structure. Incremental
indexing can store linguistic inflections related to gender, number, tense,
and other forms. This indexing structure helps to maintain precision while
increasing the recall of the retrieval system.
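The idea of an index that keeps inflected forms alongside their shared base form can be illustrated with a minimal Python sketch. This is not the thesis's actual file structure: the class `IncrementalIndex`, the lookup table `TOY_ANALYSES` (a stand-in for the Linguistic Analyzer), and the feature labels are all hypothetical names introduced here for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for the Linguistic Analyzer (LA):
# surface form -> (base form, inflection label).
# A real analyzer would derive this through morpho-syntactic analysis.
TOY_ANALYSES = {
    "ቤት": ("ቤት", "singular"),   # "house"
    "ቤቶች": ("ቤት", "plural"),    # "houses"
}


class IncrementalIndex:
    """Toy index: base form -> surface form -> set of document ids."""

    def __init__(self):
        self.entries = defaultdict(lambda: defaultdict(set))
        self.features = {}  # surface form -> inflection label

    def add(self, doc_id, tokens):
        for token in tokens:
            # Unknown tokens are indexed under themselves (an assumption).
            base, feature = TOY_ANALYSES.get(token, (token, "unanalyzed"))
            self.entries[base][token].add(doc_id)
            self.features[token] = feature

    def search(self, term, exact=False):
        base, _ = TOY_ANALYSES.get(term, (term, "unanalyzed"))
        forms = self.entries.get(base, {})
        if exact:
            # Precision-oriented: only documents with this exact form.
            return set(forms.get(term, set()))
        # Recall-oriented: documents containing any inflection of the base.
        result = set()
        for docs in forms.values():
            result |= docs
        return result


index = IncrementalIndex()
index.add(1, ["ቤት"])
index.add(2, ["ቤቶች"])
print(index.search("ቤት"))              # documents with any inflection
print(index.search("ቤት", exact=True))  # documents with the exact form
```

Because each surface form keeps its own posting set under the shared base form, a query can be widened to all inflections to raise recall or narrowed to one form to protect precision, which is the trade-off the paragraph above describes.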
The LA preprocessor was built using 74,000 words found in the Amharic Bible.
When 5,000 words were preprocessed with the newly designed LA, it achieved a
performance of 82%. On the same test, the statistical preprocessor with
stemming delivered a maximum of only 30%. The LM-IR, which is built on top of
the LA, has an incremental indexing file structure and delivers an average
F-measure of 83%. It was possible to maintain a recall of 88% while precision
did not fall below 76%.
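For readers who want to relate the reported figures, the short sketch below computes the standard F-measure (the harmonic mean of precision and recall) for the stated recall of 88% and the stated precision floor of 76%. It assumes the abstract uses the balanced F1 form; the abstract does not say how the 83% average was aggregated, so the function name `f_measure` and the pairing of these two values are illustrative only.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Recall of 0.88 paired with the lowest reported precision, 0.76.
lower_bound = f_measure(0.76, 0.88)
print(round(lower_bound, 3))  # 0.816
```

Under that assumption, the precision floor corresponds to an F1 of about 0.816, which is consistent with a reported average of 83% when precision is usually above its minimum.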
The comparison of the LA and the statistical word preprocessor shows a
significant difference in effectiveness; the LA approach therefore benefits
Amharic IR design. In addition, the incremental indexing structure prevents
the semantic loss on index words that occurs with statistical index
structures, and it helps to increase recall and precision at the same time.
This research also shows that it is possible to design Amharic IR using
linguistic techniques. Further research, especially on the searching component
of the linguistic approach to Amharic IR, would therefore be expected to yield
even better results.
Keywords
Information Science