Segmentation of Real Life Amharic Documents for Improving Recognition
Date
2015-06-05
Publisher
Addis Ababa University
Abstract
A huge amount of paper-based documents containing valuable information is available in churches,
libraries, caves, and governmental and private institutions in printed, typewritten and handwritten
formats. To make those documents accessible and searchable, Optical Character Recognition
(OCR) systems play a vital role by converting them into digital format. Some researchers
have attempted to develop Amharic OCR systems. However, these systems are not yet applicable to
real-life document images that contain column blocks, graphics, tables, lines, logos and other
shapes. Moreover, the effectiveness of an OCR system is highly dependent on the text segmentation
output. This study explores effective page and text segmentation methods to
improve the applicability and performance of Amharic OCR for real-life documents.
Accordingly, skew correction and page segmentation algorithms based on the Hough Transform,
Morphological Dilation, and Connected Component (CC) Analysis are tested, and accuracies of 90.47%,
92.31%, 96.67% and 71.43% are obtained for detecting tables, graphics, column blocks
and titles, respectively. Three noise filtering and two binarization techniques are tested, and
Wiener filtering coupled with Sauvola binarization is found to perform best. Text segmentation
methods based on projection profile, morphological dilation and CC Analysis are tested on
documents at four noise levels (low, medium, high and very high). Projection profile coupled with
vertical dilation performs best, scoring 100% accuracy in segmenting text lines at the low and
medium noise levels. An image smoothing based method is proposed, and 99.18% accuracy is
registered in extracting lines from documents affected by ink bleed. The vertical projection profile
method is applied to extract words, and accuracies of 99.23%, 96.26%, 87.12% and 54.80% are
registered for the four noise levels, respectively.
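
As a rough illustration of the projection profile idea mentioned above, the following minimal Python sketch splits a binarized page into text lines by counting ink pixels per row. It is not the implementation evaluated in the study: the function name `segment_lines`, the `min_ink` threshold and the toy `page` are assumptions made for this example, and the dilation and smoothing steps are omitted.

```python
def segment_lines(image, min_ink=1):
    """Return (top, bottom) row ranges of text lines in a binary image.

    image: list of rows, each a list of 0 (background) / 1 (ink) pixels.
    min_ink: hypothetical threshold; rows with fewer ink pixels count as gaps.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in image]

    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # entering a text line
        elif ink < min_ink and start is not None:
            lines.append((start, y - 1))   # leaving a text line
            start = None
    if start is not None:                  # text line touching the bottom edge
        lines.append((start, len(profile) - 1))
    return lines


# Toy page: two text lines separated by two blank rows.
page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 1, 0],
]
print(segment_lines(page))  # [(1, 2), (5, 5)]
```

Summing over columns instead of rows gives the vertical profile used for word extraction. On noisy scans, stray pixels can fill the gaps in the profile, which is one plausible reason accuracy drops at the higher noise levels.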
A new method based on CC Analysis is introduced to segment overlapping characters and
to detect and split connected characters. Accuracies of 87.61% and 82.29% are obtained
for the low and medium noise levels, and 50.64% for the high and very high noise levels. When
character segmentation is integrated with the Amharic OCR system, recognition accuracy rates of
79.13% and 59.07% are registered for the proposed method and the vertical projection profile
method, respectively, which is a promising result.
However, since the developed character segmentation technique fails to segment characters with
discontinuities and detects long characters as connected characters in real-life documents, there is
a need to explore noise-tolerant segmentation methods.
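
The connected component idea behind character segmentation can be sketched as follows. This is an illustrative example only, not the method developed in the study: `connected_components` is a plain flood-fill labeller, and `flag_wide_components` with its `factor` parameter is a hypothetical width heuristic introduced here to show how touching characters might be flagged.

```python
from collections import deque
from statistics import median


def connected_components(image):
    """Return bounding boxes (left, top, right, bottom) of 8-connected ink blobs."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            left, top, right, bottom = x, y, x, y
            while queue:
                cy, cx = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return sorted(boxes)  # order boxes left to right


def flag_wide_components(boxes, factor=1.5):
    """Hypothetical heuristic: boxes much wider than the median may be touching characters."""
    widths = [right - left + 1 for left, _, right, _ in boxes]
    limit = factor * median(widths)
    return [box for box, w in zip(boxes, widths) if w > limit]


# Toy text line: two separate glyphs followed by two glyphs joined at the top.
line = [
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
]
boxes = connected_components(line)
print(boxes)                        # [(0, 0, 1, 1), (3, 0, 4, 1), (6, 0, 10, 1)]
print(flag_wide_components(boxes))  # [(6, 0, 10, 1)]
```

A width-only rule of this kind would also flag a genuinely long single character, and a character broken by noise would come out as several components, which mirrors the two failure cases reported above.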
Keywords
Amharic Documents for Improving Recognition