Compression of Amharic Text Using Prediction by Partial Match (PPM) Context-Modeling Algorithm

Date

2019-05

Authors

Yalemsew, Abate

Publisher

Addis Ababa University

Abstract

A recent study on entropy estimation of the Amharic language showed that its 16-bit representation in the Unicode Transformation Format (UTF-8) is very high compared to the entropy of the language. The study showed that a minimum of 1.074 bits/symbol and a maximum of 7.981 bits/symbol can be sufficient for transmitting text sources written in Amharic through telecom networks. In digital communication, the source encoding operation produces a compressed representation of an information source so that communication resources such as bandwidth and energy are used efficiently. Practical source encoding approaches in text compression use Statistical Language Models (SLMs) based on Markov processes to model the redundancies exhibited in a language. The Prediction by Partial Match (PPM) context-modeling algorithm is capable of high compression rates and is well suited to multiple-alphabet sources such as textual data. PPM adaptively combines Markov models of different orders to capture dependencies between successive symbols in a text. In this thesis, the PPM algorithm is used to show the advantages gained by context-modeling techniques in Amharic text source encoding and to demonstrate how close practical compression gets to the estimated entropy of the Amharic language. Two versions of the PPM algorithm, namely PPMC and PPMD, were used to model and encode eight source files written in Amharic. It is shown that the optimum order for efficient encoding is order-3 and that an average reduction of 84.2% in file size is achievable. Using both algorithms, an average compression rate of 3.3 bits/symbol is attainable for source encoding and storage applications. Modeling Amharic text sources using context models in general, and PPM in particular, can help to maximize efficiency in communication networks by reducing the average number of bits required for coding text sources.
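The way PPM falls back from longer to shorter contexts can be illustrated with a small sketch. The code below is not the thesis implementation: it is a simplified estimate of the ideal code length, in bits/symbol, that an adaptive model with PPMC-style escape probabilities would assign to a string. It assumes no symbol exclusion, full count updates at every order, and a uniform order -1 model over an alphabet whose size is taken from the text itself. The function name `ppmc_bits_per_symbol` and its parameters are hypothetical.

```python
import math
from collections import defaultdict


def ppmc_bits_per_symbol(text, max_order=3, alphabet_size=None):
    """Estimate adaptive code length of `text` in bits/symbol.

    Simplified PPMC-style model (assumptions): escape probability
    d / (n + d) for a context with n total counts and d distinct
    symbols, no exclusions, uniform fallback at order -1.
    """
    if not text:
        return 0.0
    if alphabet_size is None:
        # Assumption: the alphabet is known in advance; here it is
        # simply taken from the text being measured.
        alphabet_size = len(set(text))

    # counts[context][symbol] = times `symbol` followed `context`
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0

    for i, sym in enumerate(text):
        coded = False
        # Try the longest available context first, then shorter ones.
        for order in range(min(max_order, i), -1, -1):
            seen = counts.get(text[i - order:i])
            if not seen:
                continue  # unseen context: escape is certain, 0 bits
            n = sum(seen.values())
            d = len(seen)
            if sym in seen:
                total_bits -= math.log2(seen[sym] / (n + d))
                coded = True
                break
            total_bits -= math.log2(d / (n + d))  # cost of the escape
        if not coded:
            # Order -1: every symbol of the alphabet is equally likely.
            total_bits += math.log2(alphabet_size)

        # Update the counts of every context order for this symbol.
        for order in range(min(max_order, i) + 1):
            counts[text[i - order:i]][sym] += 1

    return total_bits / len(text)


if __name__ == "__main__":
    sample = "ሰላም ለዓለም። " * 50  # short repeated Amharic phrase
    utf8_bits = 8 * len(sample.encode("utf-8")) / len(sample)
    for k in range(4):
        print(k, round(ppmc_bits_per_symbol(sample, max_order=k), 3))
    print("UTF-8 bits/symbol:", round(utf8_bits, 3))
```

Running the script on a real Amharic corpus rather than a repeated phrase would give a more meaningful comparison between the model orders and the raw UTF-8 cost. A full compressor would also pair these probabilities with an arithmetic coder, which this sketch omits.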

Keywords

Amharic, Entropy, Source Encoding, Context, Modeling, Coding, PPM
