Recognition of Real-Life Amharic Document Images
Date
2014-06
Publisher
Addis Ababa University
Abstract
A considerable amount of information is found in hard-copy documents. These documents contain
valuable information that needs to be accessible, easily reachable and searchable by end
users. Optical Character Recognition (OCR) systems play an important role in this respect,
liberating this information by converting the text on paper into electronic form so that
indexing, searching and accessing the content become easy.
The development of such systems for Amharic script is not a new research focus.
However, OCR for Amharic script is still an area that requires further research
before different kinds of document images can be recognized with higher accuracy. This
study explores various recognition techniques with the aim
of enhancing the performance of Amharic OCR systems towards recognizing real-life
documents.
This study applies basic OCR pre-processing methods, namely noise removal and image
thresholding, to documents taken from real life. Two noise filters
(Median and Wiener) and two thresholding algorithms (Otsu and Sauvola) are tested.
The experiments show that the Wiener filter combined with Sauvola thresholding performs best.
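As a rough illustration of the thresholding step, the sketch below implements Sauvola's local threshold, T = m * (1 + k * (s / R - 1)), where m and s are the mean and standard deviation of the gray levels in a window around each pixel. It is not the implementation used in the study: the function name, the window size, k = 0.2 and R = 128 are assumptions chosen for illustration, and the noise filtering step (Median or Wiener) is omitted.

```python
import math


def sauvola_binarize(gray, window=15, k=0.2, R=128.0):
    """Binarize a grayscale image with Sauvola's local threshold.

    gray   : list of rows, each a list of gray levels in 0..255
    window : side length of the square neighbourhood (assumed value)
    k, R   : Sauvola parameters (assumed values, not taken from the study)
    Returns rows of 1 (ink: pixel darker than its local threshold) or 0.
    """
    h, w = len(gray), len(gray[0])
    half = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Gather the neighbourhood, clipped at the image border.
            vals = [gray[j][i]
                    for j in range(max(0, y - half), min(h, y + half + 1))
                    for i in range(max(0, x - half), min(w, x + half + 1))]
            mean = sum(vals) / len(vals)
            std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
            # Sauvola: the threshold drops below the local mean in flat regions.
            threshold = mean * (1.0 + k * (std / R - 1.0))
            out[y][x] = 1 if gray[y][x] < threshold else 0
    return out
```

On a page with uneven illumination, a local threshold of this kind adapts to each neighbourhood, whereas Otsu's method selects a single global threshold for the whole image.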
For segmenting lines, words and characters from document images, a modified
projection profile method is used. The method automatically adjusts the
threshold values for word and character segmentation. Using this method, 98.79%, 95.67%
and 95.61% of the lines, words and characters are correctly segmented in the test set
respectively. Underline detection and removal and size normalization of characters are
also performed. To identify the discriminating features of Amharic characters, a
modified zoning technique is employed.
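The segmentation and feature extraction steps can be sketched in plain form as follows. This is a minimal sketch of ordinary projection-profile segmentation and ordinary zoning, not the modified versions developed in the study, whose details the abstract does not give. The function names, the use of the mean gap width as the automatically derived word threshold, and the 4 x 4 zone grid are all assumptions made for illustration.

```python
def find_runs(profile):
    """Return (start, end) pairs of consecutive non-zero profile entries."""
    runs, start = [], None
    for i, v in enumerate(profile):
        if v > 0 and start is None:
            start = i
        elif v == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(binary):
    """Find text lines in a binary page (1 = ink) from the horizontal projection."""
    return find_runs([sum(row) for row in binary])


def segment_words(binary, top, bottom):
    """Find words in rows top..bottom-1 from the vertical projection.

    The gap threshold is computed from the line itself (mean gap width),
    an assumed stand-in for an automatically adjusted threshold.
    """
    width = len(binary[0])
    profile = [sum(binary[y][x] for y in range(top, bottom)) for x in range(width)]
    runs = find_runs(profile)
    if not runs:
        return []
    gaps = [b[0] - a[1] for a, b in zip(runs, runs[1:])]
    if not gaps:
        return [runs[0]]
    threshold = sum(gaps) / len(gaps)
    words = [list(runs[0])]
    for gap, run in zip(gaps, runs[1:]):
        if gap > threshold:
            words.append(list(run))   # wide gap: a new word starts
        else:
            words[-1][1] = run[1]     # narrow gap: same word continues
    return [tuple(w) for w in words]


def zoning_features(glyph, rows=4, cols=4):
    """Ink density per zone of a size-normalized binary glyph.

    Assumes the glyph is at least rows x cols pixels.
    """
    h, w = len(glyph), len(glyph[0])
    feats = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            ink = sum(glyph[y][x] for y in range(y0, y1) for x in range(x0, x1))
            feats.append(ink / ((y1 - y0) * (x1 - x0)))
    return feats
```

Character segmentation can reuse segment_words on a single word with its own gap threshold, and the zone densities form the feature vector that is passed to the classifier.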
Training and testing are performed using a linear multi-class SVM. For training,
the complete set of Amharic characters is used, prepared in two commonly used
fonts (i.e. Nyala and Visual Geez Unicode). The test results show that combining the features
of the two fonts gives better performance than models built on a single font. Applying
this model yielded an average recognition rate of 98.94% and 88.38% on the training and test
sets, respectively. This is a promising result towards developing an applicable Amharic OCR
system. However, since Amharic script contains highly similar characters and real-life documents
are full of noise, there is a need to explore advanced segmentation algorithms along with
shape- and noise-invariant feature extraction techniques.
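To make the classification step concrete, the sketch below trains a linear multi-class SVM on feature vectors such as zone densities. The abstract states only that a linear multi-class SVM is used; the one-vs-rest decomposition, the subgradient-descent training on the hinge loss, the hyperparameters and the function names are assumptions made for illustration, and the study may have used a different formulation or library.

```python
def score(model, x):
    """Linear decision value w . x + b for one binary classifier."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Train one-vs-rest linear SVMs by subgradient descent on the hinge loss.

    X : list of feature vectors, y : list of class labels.
    epochs, lr and lam (L2 regularization strength) are assumed values.
    Returns a dict mapping each class label to its (weights, bias) pair.
    """
    dim = len(X[0])
    models = {}
    for c in sorted(set(y)):
        w, b = [0.0] * dim, 0.0
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                t = 1.0 if yi == c else -1.0      # this class against the rest
                margin = t * score((w, b), xi)
                # Gradient step on the L2 penalty (weights only, not the bias).
                w = [wj * (1.0 - lr * lam) for wj in w]
                if margin < 1.0:
                    # Hinge loss is active: move the boundary towards the sample.
                    w = [wj + lr * t * xj for wj, xj in zip(w, xi)]
                    b += lr * t
        models[c] = (w, b)
    return models


def predict(models, x):
    """Return the class whose binary classifier gives the highest decision value."""
    return max(models, key=lambda c: score(models[c], x))
```

In a multi-font setting like the one described, the training list would simply contain the feature vectors of each character rendered in both fonts under the same label.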