Incorporating Linguistic features in bi-directional Amharic - English Statistical Machine Translation

Date

2019-02

Publisher

Addis Ababa University

Abstract

In this study, linguistic features of a word have been incorporated to improve bidirectional Amharic-English SMT accuracy using factored phrase-based translation models. With this approach, we used not only the most common linguistic features, such as POS tags and the lemma of a word, but also the morphemes of a token, employing a factored model to improve the quality of translations. Experiments were carried out on a recent data set, a corpus prepared at Addis Ababa University under a thematic research project on collecting parallel corpora for bilingual English-Ethiopian pairs, available to the research community. We used one segment of the corpus, the Amharic-English pairs, a religious (Bible) text of 30,646 sentence pairs. Results show that using morpheme segments on the Amharic side in combination with the lemma and POS tag of a word improves the BLEU scores of the translation significantly. The best results were obtained with these factored phrase-based models on the same data used for an ordinary baseline system, increasing the BLEU score from 9.52 to 15.84 for English to Amharic translation. We also improved the accuracy from a BLEU score of 23.12 to 25.48 for Amharic to English translation. To summarize, this study introduces an approach of applying morpheme segments on the Amharic data and lemmatization on the English data to build an English to Amharic statistical machine translation system. The results of this study are compared to the benchmark systems, which were built with the same data sets, and are found to be significantly higher.
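
To illustrate what "factored" input looks like, the following is a minimal sketch, not the study's actual preprocessing. It assumes the pipe-separated token convention commonly used for factored phrase-based training data (surface|lemma|POS). The function name `to_factored` and the English example sentence with its tags are hypothetical, and a morpheme factor for Amharic would be added as a further column in the same way.

```python
from typing import Sequence


def to_factored(tokens: Sequence[Sequence[str]], sep: str = "|") -> str:
    """Render one sentence in factored form.

    Each token is a sequence of factors (e.g. surface form, lemma, POS tag).
    Factors are joined with `sep`; tokens are joined with single spaces.
    """
    rendered = []
    for factors in tokens:
        for factor in factors:
            # An empty factor, or one containing the separator or a space,
            # would make the line ambiguous to parse, so reject it early.
            if not factor or sep in factor or " " in factor:
                raise ValueError(f"invalid factor: {factor!r}")
        rendered.append(sep.join(factors))
    return " ".join(rendered)


# Hypothetical annotated English sentence: (surface, lemma, POS).
english = [
    ("houses", "house", "NNS"),
    ("were", "be", "VBD"),
    ("built", "build", "VBN"),
]

factored_line = to_factored(english)
print(factored_line)
```

Each token carries its linguistic annotations alongside the surface form, which is what lets a factored model condition translation on lemmas, POS tags, or morphemes rather than on surface words alone.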

Keywords

Incorporating Linguistic features in bi-directional Amharic - English Statistical Machine Translation