|Title:||Machine Learning of Complex Morphology: The Case of Amharic Verbs|
|Contributors:||Prof. Baye Yimam
Prof. Michael Gasser
|Keywords:||Machine Learning;Inductive Logic Programming;Amharic Morphology;Computational Morphology;Rule Induction;Incremental Learning;Genetic Algorithm|
|Publisher:||Addis Ababa University|
|Abstract:||This research applies several supervised machine learning approaches to learning Amharic verb morphology. Amharic, an under-resourced African language, has very complex inflectional and derivational verb morphology, with a number of non-concatenative prefix and suffix morphemes. The complexity of Amharic verbs stems from their prefix and suffix structure as well as the templatic stem-internal structure. The language is also complex because of boundary-level as well as stem-internal orthographic alternations. Grammatical features in Amharic verbs can be represented by numerous prefix and suffix morphemes as well as by the template of the stem and the vowel sequence found within the stem. This complex structure proved difficult to capture with a single machine learning approach, which motivated applying different learning methods to different components of the morphological processing task. We have mainly used Inductive Logic Programming (ILP), implemented in CLOG, to learn morphological processing rules from examples. CLOG learns rules as a first-order predicate decision list. The first phase, Stage I, segments affixes from the stem and captures the template structure and grammatical features of the verbs. This stage also learns alternation rules occurring at word boundaries and within the stem. In the second phase, Stage II, we have used incremental learning methods to progressively segment affixes into valid morphemes by referring to the knowledge of previously segmented morphemes. Finally, in Stage III, we have applied genetic algorithms, an evolutionary learning method, to identify morpheme slot classes and to classify morphemes into their respective slots from examples. The training data used to learn the morphological rules were manually prepared and transliterated. 
After training on the example set, the system learned a total of 108 rules for stem-affix extraction and 18 rules for stem-internal alternation. The learning process also captured a number of template and grammatical feature patterns found in the examples. The rules learned from the examples are represented in a human-readable format and were found to be very simple to modify. We collected 1,784 new distinct inflected Amharic verbs to test the performance of the learned rules, and the system correctly analyzed 1,552 of them, an accuracy of 87%. In Stage II, using incremental learning, the system was trained on 140 examples, each consisting of a word, its stem, and codified morphological features, and it was able to learn and segment 6 prefix and 25 suffix morphemes from the extracted affixes. The segmentation achieved a precision of 0.94 and a recall of 0.97 when the system output was compared with the analysis made by a linguist. The morpheme slot model generated by the genetic algorithm program was able to create the slots and assign the morphemes to their appropriate slots with 90.2% accuracy.|
|Appears in Collections:||Thesis - Information Science|
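
The evaluation figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below recomputes the Stage I accuracy from the reported counts (1,552 correct out of 1,784 test verbs) and combines the reported Stage II precision and recall into an F1 score. The F1 value is derived here for convenience and is not reported in the abstract; the function and variable names are illustrative assumptions, not part of the thesis.

```python
def accuracy(correct: int, total: int) -> float:
    """Fraction of test items analyzed correctly."""
    return correct / total


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Counts and scores as stated in the abstract.
stage1_accuracy = accuracy(correct=1552, total=1784)
# Derived value, not reported by the author.
stage2_f1 = f1_score(precision=0.94, recall=0.97)

print(f"Stage I accuracy: {stage1_accuracy:.1%}")
print(f"Stage II F1 (derived): {stage2_f1:.3f}")
```

The recomputed accuracy rounds to the 87% stated in the abstract, so the reported counts and percentage are consistent with each other.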