Recognition of Real-Life Amharic Document Images

Date

2014-06

Publisher

Addis Ababa University

Abstract

Considerable information is held in hard-copy documents. These documents contain valuable information that needs to be accessible, easily reachable and searchable by end users. Optical Character Recognition (OCR) systems play an important role in liberating this information by converting text on paper into electronic form, so that content indexing, searching and access to the resources become easy. The development of such systems for Amharic script is not a new research focus. However, OCR for Amharic script is still an area that requires many more research contributions in order to recognize different document images with higher accuracy. In this study, various recognition techniques are explored with the aim of enhancing the performance of an Amharic OCR system towards recognizing real-life documents. The study applies basic OCR pre-processing methods, namely noise removal and image thresholding, to documents taken from real life. Two noise filters (Median and Wiener) and two thresholding algorithms (Otsu and Sauvola) are tested. The experiments show that the Wiener filter coupled with Sauvola thresholding performs best. For segmenting lines, words and characters from document images, a modified projection profile method is used. This method automatically adjusts the threshold values for word and character segmentation. Using it, 98.79%, 95.67% and 95.61% of the lines, words and characters in the test set, respectively, are correctly segmented. Underline detection and removal and size normalization of characters are also performed. To identify the unique discriminating features of Amharic characters, a modified zoning technique is employed. Training and testing are performed using a linear multi-class SVM.
For training, the complete Amharic character set is used, prepared in two commonly used fonts (Nyala and Visual Geez Unicode). The test results show that a model built on the combined features of the two fonts performs better than the single-font models. This model registered an average recognition rate of 98.94% and 88.38% on the training and test sets, respectively. This is a promising result towards developing an applicable Amharic OCR system. However, since Amharic script contains highly similar characters and real-life documents are full of noise, there is a need to explore advanced segmentation algorithms along with shape- and noise-invariant feature extraction techniques.
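
As an illustration of two techniques named in the abstract, the sketch below shows a plain horizontal projection profile for line segmentation and a plain zoning (ink density per zone) feature extractor. It is not the thesis's modified method: the adaptive threshold adjustment and the zoning modifications are not described here and are not reproduced. The function names, the fixed `min_ink` threshold, the 2x2 zone grid and the binary list-of-lists image format (1 = ink, 0 = background) are all assumptions made for the example.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed format: 1 = ink pixel, 0 = background


def horizontal_profile(img: Image) -> List[int]:
    """Count ink pixels in each row of a binary image."""
    return [sum(row) for row in img]


def segment_runs(profile: List[int], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges where profile >= min_ink.

    min_ink is a fixed, hypothetical threshold; the thesis adjusts its
    thresholds automatically, which this sketch does not attempt.
    """
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value >= min_ink and start is None:
            start = i                      # a text run begins
        elif value < min_ink and start is not None:
            runs.append((start, i))        # the run ends at a gap
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(img: Image) -> List[Tuple[int, int]]:
    """Split a page into text lines using the row-wise projection profile."""
    return segment_runs(horizontal_profile(img))


def zoning_features(glyph: Image, rows: int = 2, cols: int = 2) -> List[float]:
    """Divide a glyph into rows x cols zones and return ink density per zone."""
    h, w = len(glyph), len(glyph[0])
    feats = []
    for r in range(rows):
        for c in range(cols):
            r0, r1 = r * h // rows, (r + 1) * h // rows
            c0, c1 = c * w // cols, (c + 1) * w // cols
            area = (r1 - r0) * (c1 - c0)
            ink = sum(glyph[y][x] for y in range(r0, r1) for x in range(c0, c1))
            feats.append(ink / area if area else 0.0)
    return feats
```

The same `segment_runs` helper could be applied to a column-wise profile of each line to split words and characters, and the resulting zone densities are the kind of fixed-length vector a linear multi-class SVM can be trained on.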
