A Bidirectional Tigrigna – English Statistical Machine Translation

Date

2017-06-04

Publisher

Addis Ababa University

Abstract

Machine Translation (MT) is a task in Natural Language Processing (NLP) in which automatic systems translate text from one (source) language to another (target) language while preserving the meaning of the source text. Documents need to be translated between Tigrigna and English, so a mechanism for doing so is required. This study therefore explored the possibility of developing a Tigrigna – English statistical machine translation system and of improving its translation quality by applying linguistic information. An experimental quantitative research method was used. To achieve the objective, corpora were collected from different domains, classified into five sets, and prepared in a format suitable for the development process. Three sets of experiments were conducted: a baseline (a phrase-based machine translation system), a morph-based system (based on morphemes obtained with an unsupervised method), and a post-processed segmented system (based on morphemes obtained by post-processing the output of the unsupervised segmenter). The systems were built with MOSES, a free statistical machine translation framework that automatically trains translation models from a parallel corpus. Since the system is bidirectional, four language models were developed: one for English and three for Tigrigna (one each for the baseline, morph-based and post-processed experiments). Translation models, which assign the probability that a given source language text generates a target language text, were built, and a decoder that searches for the shortest path was used. The BLEU score was used to evaluate the performance of each set of experiments. The post-processed experiment using corpus II outperformed the others, with a BLEU score of 53.35 % for Tigrigna – English and 22.46 % for English – Tigrigna translation.
Because of data scarcity, this research focused on segmenting prepositions and conjunctions only. Future research should therefore aim to further improve the BLEU score by applying semi-supervised segmentation to include the remaining linguistic information.
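As a rough illustration of the evaluation metric named in the abstract, the sketch below computes a simplified sentence-level BLEU score in Python. It is not the scoring script used in the study: it assumes a single reference translation, whitespace tokenization, uniform weights over 1- to 4-grams and no smoothing, and the function names `ngram_counts` and `sentence_bleu` are hypothetical. BLEU is normally reported at corpus level, so the values it produces are not comparable to the scores quoted above.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Simplified BLEU: one reference, uniform weights, no smoothing.

    Returns a value between 0.0 and 1.0 (multiply by 100 for a percentage).
    """
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped matches: a candidate n-gram is credited at most as many
        # times as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            # Without smoothing, any zero precision gives a score of zero.
            return 0.0
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(sentence_bleu("the cat sat on the mat", reference))
    print(sentence_bleu("the cat sat on the mat today", reference))
```

The early return on a zero precision is a deliberate simplification; practical scorers usually apply smoothing or aggregate n-gram counts over the whole test set before taking the geometric mean.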

Keywords

Machine Translation, Statistical Machine Translation, Segmentation, Tigrigna, English, Bidirectional Tigrigna – English Machine Translation
