Entropy Estimation and Entropy Based Encoding of Written Afaan Oromo for its Efficient Digital Transmission and Storage
Date
2021-02
Publisher
Addis Ababa University
Abstract
According to the Ethiopian population census, the Oromo language is estimated to be spoken
by 36.4% of the local population. The language is also spoken outside Ethiopia, for
instance in a small portion of Kenya. Taking this into account, the language is estimated
to be spoken by around fifty million people. Beyond the spoken form, a considerable
portion of the language's speakers can understand its written form, known as Qubee.
The introduction of Qubee in the mid-nineties opened doors for the language's use in
modern communication systems. From the standpoint of information theory and
communication channels, however, the symbol-encoding schemes in use are found to be
inefficient: the fixed-length encodings used for Latin and Amharic symbols, ASCII-8 and
UTF-16, poorly model written natural language.
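The inefficiency claim can be illustrated with a minimal zeroth-order sketch: a fixed-length code spends the same number of bits on every symbol, while Shannon's source-coding bound depends only on symbol frequencies. The string and figures below are illustrative toys, not the thesis corpus.

```python
from collections import Counter
from math import log2

def shannon_bound(text):
    """Zeroth-order entropy in bits/symbol: the lower bound for any
    symbol-by-symbol code that ignores context."""
    freq = Counter(text)
    n = len(text)
    return -sum((c / n) * log2(c / n) for c in freq.values())

sample = "barnoota barbaachisaa"  # illustrative string only
fixed_cost = 8  # ASCII-8 spends 8 bits on every symbol, common or rare
print(fixed_cost, round(shannon_bound(sample), 2))
```

Any gap between the fixed cost and the entropy bound is redundancy that a variable-length or context-aware code can remove.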
With the expected growth in demand for the language in telecom services in mind, this
thesis mainly aims at estimating the entropy of the Oromo language. The estimate sets
the optimum number of bits per symbol needed to efficiently transmit written Oromo in
communication systems. To achieve this objective, we modeled the source, i.e., written
Oromo, as an Nth-order Markov chain random process. Based on this model, we studied
the distribution of symbols in ten literary works written in Oromo. The study reveals
that the language can be transmitted using 4.31 bits/symbol when modeled as a
first-order Markov chain source, whereas the zero-crossing entropy of the source was
estimated on average at N = 19.5, which gives an entropy estimate of 0.85 bits/symbol
with a redundancy of 89.36%. Additionally, we applied two entropy-based compression
algorithms, namely Huffman and Arithmetic coding, to test the validity of our estimates.
The Huffman algorithm compressed our sample corpora on average by 42.17% to 64.88% for
N = 1 to 5. These compression results confirm our Nth-order estimates of the language's
entropy by approaching their theoretical limits.
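The first-order figure above is the conditional entropy H(X_n | X_{n-1}) estimated from bigram frequencies. A minimal sketch of that estimator, using a toy string rather than the thesis's ten-work corpus:

```python
from collections import Counter
from math import log2

def first_order_entropy(text):
    """Estimate H(X_n | X_{n-1}) in bits/symbol from bigram counts:
    H = -sum over bigrams (a,b) of p(a,b) * log2 p(b|a)."""
    pairs = Counter(zip(text, text[1:]))   # bigram counts
    ctx = Counter(text[:-1])               # context (first-symbol) counts
    total = sum(pairs.values())
    h = 0.0
    for (a, b), n in pairs.items():
        p_pair = n / total   # joint probability p(a, b)
        p_cond = n / ctx[a]  # conditional probability p(b | a)
        h -= p_pair * log2(p_cond)
    return h

sample = "bara barana barataan barate"  # illustrative string only
print(round(first_order_entropy(sample), 3))
```

Extending the context from one symbol to N-1 symbols gives the Nth-order estimates; as N grows the estimate falls toward the source's true entropy, which is why the thesis tracks the curve out to its zero-crossing near N = 19.5.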
Keywords
Entropy, Encoding, Oromo Language