English-Tigrigna Factored Statistical Machine Translation

Date

2014-06-08

Publisher

Addis Ababa University

Abstract

In this paper, English-to-Tigrigna translation was carried out using the statistical machine translation approach. A total of 17,649 sentence pairs were used as a bilingual corpus to develop, train, and test the translation system. Experiments were conducted using Moses with three types of corpus: a baseline corpus, a segmented corpus, and a factored corpus that integrates linguistic knowledge at the word level. Preliminary preprocessing tasks, namely sentence-level segmentation and tokenization, were performed using programs written in Python. In addition, considerable manual cleaning was done where preprocessing required the researcher's judgment. After preprocessing, morphological segmentation, stemming, and POS tagging were performed to prepare the factored corpora. The performance of the system was then tested using the BLEU metric. The results show that segmentation contributed to overall performance: the segmented system outperformed the baseline phrase-based system. When compared against the same segmented reference, the BLEU score for the segmented system is 22.65%, a 1.61% increase over the baseline system, which has a BLEU score of 21.04%. The factored system showed a decrease of 6.15% from the segmented system and 4.53% from the baseline system. The researcher believes that the low performance of the factored system is attributable to the attached POS tags, since the tagger was trained on a small, manually tagged corpus prepared by the researcher.
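
As an illustration of what a factored corpus looks like, the following is a minimal sketch, not the thesis's actual code. It assumes the conventional Moses factored format, in which the factors of each token are joined with a vertical bar (here surface form, stem, and POS tag). The function name to_factored and the example stems and tags are hypothetical and are not taken from the thesis data.

```python
def to_factored(tokens, stems, tags, sep="|"):
    """Join parallel factor lists into one factored corpus line.

    Each output token has the form surface|stem|tag, and tokens are
    separated by single spaces. This layout is an assumption based on
    the usual Moses factored-corpus convention.
    """
    if not (len(tokens) == len(stems) == len(tags)):
        raise ValueError("factor lists must have the same length")

    factored = []
    for word, stem, tag in zip(tokens, stems, tags):
        for factor in (word, stem, tag):
            # A factor must be non-empty and must not contain the
            # separator or whitespace, or the line becomes ambiguous.
            if not factor or sep in factor or any(c.isspace() for c in factor):
                raise ValueError(f"invalid factor: {factor!r}")
        factored.append(sep.join((word, stem, tag)))
    return " ".join(factored)


# Illustrative English example (stems and tags are made up for the sketch).
line = to_factored(
    ["the", "dogs", "barked"],
    ["the", "dog", "bark"],
    ["DT", "NNS", "VBD"],
)
print(line)  # the|the|DT dogs|dog|NNS barked|bark|VBD
```

Validating each factor before joining is a deliberate choice in this sketch: a stray separator or space inside a factor would silently shift every later factor on the line.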

Keywords

Machine Translation; English to Tigrigna translation
