Compression of Amharic Text Using Prediction by Partial Match (PPM) Context-Modeling Algorithm
Date
2019-05
Authors
Publisher
Addis Ababa University
Abstract
A recent study on entropy estimation of the Amharic language showed that its 16-bit
representation in Unicode Transformation Format (UTF-8) uses far more bits per symbol
than the entropy of the language requires. The study showed that a minimum of
1.074 bits/symbol and a maximum of 7.981 bits/symbol can be sufficient for transmission
of text sources written in Amharic through telecom networks. In digital communication, the source
encoding operation produces a compressed representation of an information source for
efficient utilization of communication resources like bandwidth and energy. Practical
source encoding approaches in text compression use Statistical Language Models
(SLMs) based on Markov processes to model the redundancies exhibited in a language.
The Prediction by Partial Match (PPM) context-modeling algorithm is capable of high
compression rates and is well suited for multiple alphabet sources like textual data.
PPM adaptively combines different order Markov models to capture dependencies
between successive symbols in a text. In this thesis, the PPM algorithm is used to show
the advantages gained by context-modeling techniques in Amharic text source
encoding and to demonstrate how close practical compression gets to the estimated
entropy of the Amharic language.
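To illustrate how a PPM-style model blends Markov models of different orders, the following is a minimal Python sketch, not the implementation used in the thesis. It estimates the adaptive code length of a string in bits/symbol using PPMC-style escape probabilities. The function name `ppmc_bits_per_symbol`, the `alphabet_size` default of 65536 for the order -1 fallback, and the simplifications (no symbol exclusion, counts updated at every order, no arithmetic coder) are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def ppmc_bits_per_symbol(text, max_order=3, alphabet_size=65536):
    """Estimate the adaptive code length of `text` in bits per symbol.

    Hypothetical helper: PPMC-style escapes, no exclusion, no real coder.
    """
    # counts[k][context][symbol] = times `symbol` followed that order-k context
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
    total_bits = 0.0
    for i, sym in enumerate(text):
        prob = 1.0
        coded = False
        # Try the longest available context first, then back off to shorter ones.
        for k in range(min(max_order, i), -1, -1):
            seen = counts[k].get(text[i - k:i])
            if not seen:
                continue  # context never observed: fall through at no cost
            n = sum(seen.values())  # total symbol occurrences in this context
            d = len(seen)           # distinct symbols seen in this context
            if sym in seen:
                prob *= seen[sym] / (n + d)  # PPMC symbol probability
                coded = True
                break
            prob *= d / (n + d)              # PPMC escape probability
        if not coded:
            # Order -1 fallback: uniform over an assumed alphabet size.
            prob *= 1.0 / alphabet_size
        total_bits += -math.log2(prob)
        # Simplification: update the counts of every order after each symbol.
        for k in range(min(max_order, i) + 1):
            counts[k][text[i - k:i]][sym] += 1
    return total_bits / len(text) if text else 0.0


if __name__ == "__main__":
    sample = "ሰላም " * 200  # toy, highly repetitive input
    for order in range(4):
        print(order, round(ppmc_bits_per_symbol(sample, max_order=order), 3))
```

On repetitive input such as the toy sample, higher orders should yield a lower estimate because longer contexts make the next symbol nearly deterministic. Real Amharic text behaves less favourably, and the thesis reports order-3 as the optimum for its source files.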
Two versions of the PPM algorithm, namely PPMC and PPMD, were used to model
and encode eight source files written in Amharic. It is shown that the optimum order
for efficient encoding is order-3 and it is possible to achieve an average of
84.2% reduction in file size. Using both algorithms, an average compression rate of
3.3 bits/symbol is attainable for source encoding and storage applications. Modeling
Amharic text sources using context models in general and PPM in particular can help
to maximize efficiency in communication networks by reducing the average number
of bits required for coding text sources.
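The two figures of merit quoted above, compression rate in bits/symbol and percentage reduction in file size, can be computed from a text and its compressed form as in the sketch below. This is an illustration under stated assumptions: a symbol is taken to be one Unicode code point, the original size is taken to be the UTF-8 byte length, and `zlib` stands in for the compressor only because it ships with Python. It is not PPM and does not reproduce the reported results. The name `compression_metrics` is hypothetical.

```python
import zlib


def compression_metrics(text, compressed, encoding="utf-8"):
    """Return (bits per symbol, percent size reduction) for a compressed text.

    Assumptions: one symbol = one Unicode code point; the baseline size is
    the byte length of `text` in the given encoding.
    """
    original = text.encode(encoding)
    bits_per_symbol = 8 * len(compressed) / len(text)
    reduction = 100.0 * (1 - len(compressed) / len(original))
    return bits_per_symbol, reduction


if __name__ == "__main__":
    sample = "ሰላም ለዓለም " * 100  # toy input, far more repetitive than real text
    packed = zlib.compress(sample.encode("utf-8"), 9)  # stand-in compressor
    bps, saved = compression_metrics(sample, packed)
    print(f"{bps:.2f} bits/symbol, {saved:.1f}% smaller")
```

Note that the two metrics depend on different baselines: bits/symbol is independent of the original encoding, while the percentage reduction changes with how many bytes the uncompressed representation spends per character.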
Keywords
Amharic, Entropy, Source Encoding, Context, Modeling, Coding, PPM