Automatic Morphological Analyzer for Amharic an Experiment Employing Unsupervised Learning and Autosegmental Analysis Approaches






Addis Ababa University


Automatic understanding of natural languages requires a set of language processing tools. A morphological analyzer, which parses words into their component morphemes, is one of these tools. This thesis reports an attempt to develop such a tool for Amharic. Word formation in Amharic involves three levels of morphological operations: stem formation, affixation and cliticization. Since affixation and cliticization in Amharic are similar to those in Indo-European languages, a language-independent system already tested on those languages is used for these two levels. The system, called Linguistica2001, creates a morphological dictionary (called a signature) by extracting prefixes, stems and suffixes from a given corpus. It uses a modified version of Harris's successor frequency algorithm to detect plausible word break points. Additional heuristics are used to improve the word breaks produced. The Minimum Description Length (MDL) test serves as the benchmark for accepting a signature as part of the morphology of a given language. For the stem-internal operations, another approach, based on the principles of autosegmental phonology, is used. This approach represents the phonemic features of a word on different tiers and uses association lines to maintain the relationships between them. It is used to design the algorithms and data structures required for extracting and representing stem components. A prototype system, called Amharic Stems Morphological Analyzer (ASMA), is developed to test the algorithms. Though the two systems are tested separately, ASMA is designed to work in an integrated manner by accepting as its input the stems identified by Linguistica2001. The experiment is conducted using corpora prepared in this study. The experimental results are encouraging. Linguistica2001 successfully parses 87% of the words in the test data (433 of 500 words). This result corresponds to a precision of 95% and a recall of 90%. The second system, ASMA, correctly analyses 241 (94%) of the 255 sample stems.
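
The successor frequency idea mentioned in the abstract can be illustrated with a minimal sketch. This is the basic textbook form of Harris's heuristic, not the modified version used in Linguistica2001 and not the thesis's code: for each prefix of a word, count how many distinct symbols follow that prefix anywhere in the lexicon, and propose a break where the count peaks. The function names, the end-of-word marker `#`, the peak rule, and the English toy lexicon are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy lexicon (English, for readability); not data from the thesis.
LEXICON = ["walk", "walks", "walked", "walking", "talk", "talks", "talked"]


def successor_frequencies(word, lexicon):
    """For each proper prefix of `word`, count the distinct symbols that
    follow that prefix in the lexicon. The end of a word counts as the
    symbol '#', an assumption of this sketch."""
    followers = defaultdict(set)
    for w in lexicon:
        for i in range(1, len(w) + 1):
            nxt = w[i] if i < len(w) else "#"
            followers[w[:i]].add(nxt)
    return [len(followers[word[:i]]) for i in range(1, len(word))]


def split_points(word, lexicon):
    """Propose a break after every prefix whose successor frequency is a
    local peak greater than 1 (a simple peak rule, chosen for illustration)."""
    sf = successor_frequencies(word, lexicon)
    breaks = []
    for idx, value in enumerate(sf):
        left = sf[idx - 1] if idx > 0 else 0
        right = sf[idx + 1] if idx + 1 < len(sf) else 0
        if value > 1 and value >= left and value > right:
            breaks.append(idx + 1)  # break falls after this many characters
    return breaks


if __name__ == "__main__":
    word = "walked"
    print(successor_frequencies(word, LEXICON))
    for cut in split_points(word, LEXICON):
        print(word[:cut], "+", word[cut:])
```

In this toy lexicon the prefix "walk" is followed by several different symbols while every other prefix of "walked" has a single continuation, so the sketch proposes the break "walk" + "ed". On a real corpus the raw counts are noisy, which is presumably why the system described above adds further heuristics and an MDL test on top of the break points.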



Natural language processing