Optical Character Recognition of Amharic Text: An Integrated Approach
Date
2002-06
Publisher
Addis Ababa University
Abstract
Optical Character Recognition (OCR) is an area of research and development in which a
system is made to recognize the text in document images. Cultural considerations and the
enormous flood of documents have motivated the development of OCR across the world. Unlike
OCR for other scripts, OCR development for Amharic characters started only recently, at SISA.
Some progress has been made in recognizing specific font styles, font sizes, and font types.
However, as the font style, size, or type changes, the recognition accuracy falls.
The purpose of this study is, therefore, to explore the possibilities of developing a versatile
OCR system that is independent of the size of Amharic characters. To this end, different
preprocessing techniques and pattern recognition techniques have been reviewed. Since the
segmentation algorithm that was used by previous studies in the area works well, it is
incorporated in this study with some modifications. Template matching, statistical,
syntactic/structural, and neural network approaches are found to be the most commonly used
pattern recognition techniques, and the pros and cons of each technique are reviewed. To take
advantage of their strengths, a hybrid system of syntactic/structural and neural network approaches is
implemented.
The syntactic/structural approach enables the developed OCR system to extract primitive
structures of characters and generate a unique pattern for each character to be used by the
neural network. The neural network enables the developed OCR system to classify/recognize
the generated patterns, and it can also make predictions for new cases. The network takes the output of
the syntactic/structural approach as an input. With this procedure, the neural network is
trained with VG2000 Agazian font of sizes 10 and 12. The performance of the developed
system is tested with documents written using VG2000 Agazian font of sizes 8, 12, and 14. The
results showed that, with minor differences, the developed OCR system classifies/recognizes
the test cases of different font sizes with more or less the same level of accuracy. Based on the
results, further research areas are also recommended.
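
As an illustration of why structural patterns can make recognition independent of character size, the following is a minimal sketch and not the method used in the study. The abstract does not specify the primitive structures or the network architecture, so everything here is an assumption: the function names (bounding_box, zone_features, scale), the 3 x 3 zoning scheme, and the toy glyph are all hypothetical. The sketch crops a binary character image to its bounding box, splits the box into a fixed grid, and records the fraction of ink in each cell, giving a fixed-length pattern that could serve as classifier input whatever the font size.

```python
def bounding_box(glyph):
    """Return (top, bottom, left, right) bounds of the ink pixels in a binary glyph."""
    rows = [r for r, row in enumerate(glyph) if any(row)]
    cols = [c for c in range(len(glyph[0])) if any(row[c] for row in glyph)]
    if not rows:
        raise ValueError("glyph contains no ink pixels")
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def zone_features(glyph, grid=3):
    """Hypothetical size-independent pattern: ink fraction per cell of a grid x grid zoning."""
    top, bottom, left, right = bounding_box(glyph)
    height, width = bottom - top, right - left
    features = []
    for gr in range(grid):
        r0 = top + gr * height // grid
        r1 = top + (gr + 1) * height // grid
        for gc in range(grid):
            c0 = left + gc * width // grid
            c1 = left + (gc + 1) * width // grid
            area = (r1 - r0) * (c1 - c0)
            ink = sum(glyph[r][c] for r in range(r0, r1) for c in range(c0, c1))
            # An empty cell (glyph smaller than the grid) contributes 0.0.
            features.append(ink / area if area else 0.0)
    return features


def scale(glyph, k):
    """Enlarge a glyph by an integer factor k using pixel replication."""
    return [[px for px in row for _ in range(k)] for row in glyph for _ in range(k)]


# Toy glyph (an assumption for illustration, not an Amharic character).
small = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
large = scale(small, 2)  # same shape at twice the size

pattern_small = zone_features(small)
pattern_large = zone_features(large)
```

In this toy case the two patterns are identical because the enlarged glyph divides evenly into the grid; for real scanned characters the patterns at different sizes would only be approximately equal, which is where a trained classifier such as a neural network would be expected to absorb the remaining variation.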
Keywords
Information Science