Character Recognition of Bilingual Amharic-Latin Printed Documents


Addis Ababa University


Optical character recognition (OCR) is a system that automatically converts captured images of handwritten, typewritten, or printed text documents into machine-encoded text. More than 80 languages are spoken in Ethiopia, and they are written in either the Amharic script or an adopted Latin script. In such an environment, a document often has to combine text in different languages, written in Amharic and/or Latin characters, in order to reach a larger cross section of people. To prepare the dataset, documents in both scripts were collected from different sources. Character images were collected for 231 Amharic characters and 52 English letters (with capital and small letters merged into the same classes). In total, 49,087 character images covering 257 character classes were prepared to train and test the system. A randomly selected 80% of the dataset was used to train the system, and the remaining 20% was used to test its accuracy. The steps in developing the character recognition system are data acquisition, image binarization, noise removal, skew correction, character segmentation, feature extraction, and character classification, and a number of algorithms were implemented for them. This research work discusses the process of developing an OCR system for bilingual Amharic and Latin script documents using a Convolutional Neural Network (CNN) as the feature extraction and character classification model. In the experiments, a classification accuracy of 99.20% was obtained when the number of neurons is 256 and an adaptive learning rate is used. In the character segmentation stage, an average accuracy of 98.85% was achieved for clear sample documents and 95.86% for unclear sample documents. The overall recognition accuracy is therefore 98.06% and 95.09%, respectively.
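The abstract does not state how the overall recognition accuracy was derived from the per-stage figures. The following is a minimal sketch, assuming the overall figure is the product of the segmentation accuracy and the classification accuracy, and assuming the 52 English letters collapse into 26 case-merged classes. Both assumptions reproduce the reported numbers, but the function and variable names are illustrative and not taken from the thesis.

```python
def overall_accuracy(segmentation_pct: float, classification_pct: float) -> float:
    """Combine two stage accuracies given in percent.

    Assumption: a character is recognized correctly only if it is
    segmented correctly AND classified correctly, so the rates multiply.
    """
    return segmentation_pct * classification_pct / 100.0


# Figures reported in the abstract.
CLASSIFICATION_PCT = 99.20
SEGMENTATION_CLEAR_PCT = 98.85
SEGMENTATION_UNCLEAR_PCT = 95.86

clear = overall_accuracy(SEGMENTATION_CLEAR_PCT, CLASSIFICATION_PCT)
unclear = overall_accuracy(SEGMENTATION_UNCLEAR_PCT, CLASSIFICATION_PCT)

# Class count, assuming upper- and lowercase English letters share a class.
amharic_classes = 231
english_classes = 52 // 2
total_classes = amharic_classes + english_classes

print(f"clear documents:   {clear:.2f}%")
print(f"unclear documents: {unclear:.2f}%")
print(f"character classes: {total_classes}")
```

Under these assumptions the computed values match the reported 98.06%, 95.09%, and 257 classes.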



Bilingual OCR, CNN, Neural Network, Convolutional Neural Network, Ethiopic OCR