Automatic Morphological Analyzer for Amharic: An Experiment Employing Unsupervised Learning and Autosegmental Analysis Approaches
Date
2002-06
Publisher
Addis Ababa University
Abstract
Automatic understanding of natural languages requires a set of language processing tools. A
morphological analyzer, which parses words into their morphemic components, is one of these tools.
This thesis reports an attempt to develop such a tool for Amharic.
Word formation in Amharic involves three levels of morphological operations – stem formation,
affixation and cliticization. Since affixation and cliticization are similar to those in Indo-European
languages, a language-independent system already tested on these languages is used. The system, called
Linguistica2001, creates a morphological dictionary (called a signature) by extracting prefixes, stems and
suffixes from a given corpus. The system uses a modified version of Harris's successor frequency
algorithm to detect plausible word break points. Additional heuristics are used to improve the word
breaks produced. A Minimum Description Length (MDL) test serves as the criterion for accepting a
signature as part of the morphology of a given language.
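As a rough illustration of the successor frequency idea, the minimal Python sketch below counts, for each prefix of a word, how many distinct letters follow that prefix in a word list, and proposes a break wherever that count reaches a threshold. It shows the plain technique only, not the modified version or the additional heuristics used by Linguistica2001. The function names, the threshold value, and the toy English lexicon are assumptions made for this example and do not come from the thesis.

```python
def successor_frequencies(word, lexicon):
    """Count distinct next letters after each proper prefix of `word`."""
    counts = []
    for i in range(1, len(word)):
        prefix = word[:i]
        # Letters that follow this prefix in any lexicon word.
        successors = {w[i] for w in lexicon if len(w) > i and w.startswith(prefix)}
        counts.append(len(successors))
    return counts


def candidate_breaks(word, lexicon, threshold=2):
    """Return split positions whose successor frequency reaches `threshold`."""
    counts = successor_frequencies(word, lexicon)
    return [i + 1 for i, n in enumerate(counts) if n >= threshold]


# Toy English lexicon, chosen only for illustration (not data from the thesis).
LEXICON = ["walk", "walks", "walked", "walking",
           "talk", "talks", "talked", "talking"]

cut = candidate_breaks("walking", LEXICON)[0]
print("walking"[:cut], "+", "walking"[cut:])
```

With this toy lexicon the only prefix followed by more than one distinct letter is "walk", so the sketch splits the word into "walk" and "ing". This simplified count ignores the end of a word as a possible successor, which fuller formulations of the method may include.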
For the stem-internal operations, another approach based on the principle of autosegmental
phonology is used. This principle represents phonemic features of a word in different tiers and uses
association lines to maintain their relationships. This approach is used to design algorithms and data
structures required for extraction and representation of stem components. A prototype system, called
Amharic Stems Morphological Analyzer (ASMA), is developed to test the algorithms. Though the two
systems are tested separately, ASMA is designed to work in an integrated manner by accepting as its
input stems identified by Linguistica2001.
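To make the tier idea concrete, the following minimal Python sketch separates a transliterated stem into a consonantal root tier, a vocalic tier, and a CV skeleton, and records association lines as links from tier elements to skeleton slots. It is not ASMA's algorithm or data structure. The function name, the ASCII transliteration ("sebber" standing in for a geminated perfective stem of the root s-b-r), the five-letter vowel set, and the treatment of a doubled consonant as one root element linked to two slots are all assumptions made for this example.

```python
VOWELS = set("aeiou")  # assumed vowel inventory for the ASCII transliteration


def split_tiers(stem):
    """Split a transliterated stem into root, vocalic, and skeleton tiers."""
    skeleton, root, vocalic, links = [], [], [], []
    prev = None
    for slot, ch in enumerate(stem):
        if ch in VOWELS:
            skeleton.append("V")
            vocalic.append(ch)
            links.append(("vocalic", len(vocalic) - 1, slot))
        else:
            skeleton.append("C")
            # A doubled consonant reuses the previous root element, so one
            # root consonant ends up associated with two skeleton slots.
            if ch != prev:
                root.append(ch)
            links.append(("root", len(root) - 1, slot))
        prev = ch
    return {
        "skeleton": "".join(skeleton),
        "root": "".join(root),
        "vocalic": "".join(vocalic),
        "links": links,  # (tier name, index within tier, skeleton slot)
    }


analysis = split_tiers("sebber")
print(analysis["root"], analysis["vocalic"], analysis["skeleton"])
```

For the assumed input, the root tier is "sbr", the vocalic tier is "ee", and the skeleton is "CVCCVC", with the single root element "b" linked to the two adjacent C slots.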
The experiment is conducted using corpora prepared in this study, and the results obtained are
encouraging. Linguistica2001 successfully parses 87% of the words in the test data (433 of 500 words).
This result corresponds to a precision of 95% and a recall of 90%. The second system, ASMA, correctly
analyses 241 (or 94%) of the 255 sample stems.
Keywords
Natural language processing