Morpheme-Based Bi-Directional Ge’ez-Amharic Machine Translation
Date
2018-10-04
Authors
Publisher
Addis Ababa University
Abstract
This study explores the effect of using the morpheme as the translation unit in bi-directional Ge’ez-Amharic machine translation. Using the word as the translation unit is problematic in statistical machine translation between two morphologically rich languages such as Ge’ez and Amharic. At the word level, data scarcity and the lack of a well-prepared corpus are challenges for under-resourced languages. It is also difficult to manage the many surface forms of a single word, which are not specific and lack consistency. At the morpheme level, the sub-parts of words are specific, easier to manage, and consistent across many words of the same class.
To conduct the experiments, a parallel corpus was collected from online sources, including the Old Testament of the Holy Bible and anaphora (Kidase). The corpus also includes manually prepared bitext from Wedase Maryam, Anketse Berhane, yewedesewa melahekete, Kidan and Liton. To make the corpus suitable for the system, preprocessing tasks such as tokenization, cleaning and normalization were performed. The data set contains a total of 13,833 simple and complex sentences, of which 90% are used for training and 10% for testing. To build a language model for each language, we used 12,450 parallel sentences. For both the statistical and rule-based approaches, we used Moses for translation, MGIZA++ for word and morpheme alignment, Morfessor and rules for morphological segmentation, and IRSTLM for language modeling. After the prototype and the corpus were prepared, different experiments were conducted.
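The 90%/10% split described above can be illustrated with a minimal sketch. The abstract does not say whether the sentences were shuffled or how the split was drawn, so the shuffling, the seed, and the function name `split_parallel` are assumptions made only for illustration; the placeholder strings stand in for the real sentence pairs.

```python
import random


def split_parallel(pairs, train_fraction=0.9, seed=0):
    """Shuffle aligned sentence pairs and split them into train and test parts.

    Shuffling with a fixed seed is an assumption; the source only states
    the 90%/10% proportions.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder pairs standing in for the 13,833 aligned sentences.
pairs = [(f"geez-{i}", f"amharic-{i}") for i in range(13833)]
train, test = split_parallel(pairs)
print(len(train), len(test))  # 12450 1383
```

With 13,833 pairs, a 90% share rounds to 12,450 sentences, which matches the number of sentences reported for language modeling.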
Experimental results showed BLEU scores of 15.14% and 16.15% for morpheme-based translation from Ge’ez to Amharic and from Amharic to Ge’ez, respectively. Compared to word-level translation, this is an average improvement of 6.77% for Ge’ez-Amharic and 7.73% for Amharic-Ge’ez, which shows that morpheme-level translation performs better than word-level translation. Using the morpheme as the translation unit, we therefore conducted further experiments with unsupervised and rule-based morpheme segmentation approaches. Rule-based morphological segmentation performed better than unsupervised segmentation, by an average BLEU score of 0.6% and 1.27% for Ge’ez to Amharic and Amharic to Ge’ez, respectively.
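For readers unfamiliar with the metric reported above, the following is a minimal sketch of corpus-level BLEU with a single reference per sentence: clipped n-gram precisions up to 4-grams, combined by geometric mean and scaled by a brevity penalty. It is not the scoring script used in the study, which is not named in the abstract; the function names and the toy tokens are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), one reference per hypothesis, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


# Toy tokens; in a morpheme-based system these would be morpheme segments.
hyps = [["a", "b", "c", "d", "e"]]
refs = [["a", "b", "c", "d", "f"]]
score = corpus_bleu(hyps, refs)
```

Because BLEU depends on how the text is tokenized, scores computed over morphemes and over words are only comparable when both outputs are evaluated on the same unit.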
Alignments of Amharic and Ge’ez text show one-to-one, one-to-many, many-to-one and many-to-many correspondences. In this study, many-to-many alignment is the major challenge, so further research is needed to handle many-to-many alignment, word order and the morphology of the two languages.
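The four alignment types named above can be made concrete with a small sketch. It assumes alignment links are written as space-separated `i-j` index pairs (source position, target position), and it labels each link by how many partners its source and target tokens have. The function name `classify_links` and the example links are hypothetical, and this per-link heuristic is only one possible way to define the categories.

```python
from collections import Counter


def classify_links(alignment):
    """Label each alignment link by the fan-out of its two endpoints.

    `alignment` is a string such as "0-0 1-1 1-2" (assumed format).
    """
    links = [tuple(int(x) for x in pair.split("-")) for pair in alignment.split()]
    src_fanout = Counter(s for s, _ in links)  # targets linked to each source token
    tgt_fanout = Counter(t for _, t in links)  # sources linked to each target token
    labels = {}
    for s, t in links:
        many_targets = src_fanout[s] > 1
        many_sources = tgt_fanout[t] > 1
        if many_targets and many_sources:
            labels[(s, t)] = "many-to-many"
        elif many_targets:
            labels[(s, t)] = "one-to-many"
        elif many_sources:
            labels[(s, t)] = "many-to-one"
        else:
            labels[(s, t)] = "one-to-one"
    return labels


labels = classify_links("0-0 1-1 1-2 2-3 3-3 4-4 4-5 5-5")
for link, label in sorted(labels.items()):
    print(link, label)
```

In this example the link `4-5` is many-to-many because source token 4 also aligns to target 4 and target token 5 is also aligned from source 5, which is the kind of case the abstract identifies as hardest to handle.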
Keywords
SMT, Morpheme Level Alignment, Morfessor, Amharic, Geez