Segmentation of Real Life Amharic Documents for Improving Recognition

Date

2015-06-05

Publisher

Addis Ababa University

Abstract

A huge amount of paper-based documents containing valuable information is available in churches, libraries, caves, and governmental and private institutions, in printed, typewritten and handwritten formats. Optical Character Recognition (OCR) systems play a vital role in making those documents accessible and searchable by converting them into digital format. Some researchers have attempted to develop Amharic OCR systems. However, these systems are not yet applicable to real life document images that contain column blocks, graphics, tables, lines, logos and other shapes. Moreover, the effectiveness of an OCR system is highly dependent on the output of text segmentation. This study explores effective page and text segmentation methods to improve the applicability and performance of Amharic OCR for real life documents. Accordingly, skew correction and page segmentation algorithms based on the Hough Transform, Morphological Dilation, and Connected Component (CC) Analysis are tested, and accuracies of 90.47%, 92.31%, 96.67% and 71.43% are obtained for detecting tables, graphics, column blocks and titles, respectively. Three noise filtering and two binarization techniques are tested, and Wiener filtering combined with Sauvola binarization is found to perform best. Text segmentation methods based on projection profile, morphological dilation and CC Analysis are tested on documents at four noise levels (low, medium, high and very high). Projection profile combined with vertical dilation performs best, scoring 100% accuracy in segmenting text lines at the low and medium noise levels. A method based on image smoothing is proposed, and it achieves 99.18% accuracy in extracting lines from documents affected by ink bleed. The vertical projection profile method is applied to extract words, and accuracies of 99.23%, 96.26%, 87.12% and 54.80% are registered for the four noise levels, respectively. A new method based on CC Analysis is introduced to segment overlapping characters and, in addition, to detect and split connected characters. It achieves accuracies of 87.61% and 82.29% for the low and medium noise levels, and 50.64% for the high and very high noise levels. When integrated with the Amharic OCR system, recognition accuracies of 79.13% and 59.07% are registered for the proposed method and the vertical projection profile method, respectively, which is a promising result. However, since the developed character segmentation technique fails to segment characters with discontinuities, and detects long characters as connected characters in real life documents, there is a need to explore noise tolerant segmentation methods.
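
The projection profile idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the implementation used in the study: it assumes an already binarized page held in a NumPy array with text pixels set to 1, and the function names and the `min_ink` and `min_gap` parameters are hypothetical choices made for illustration only. Rows with no ink separate text lines (horizontal profile), and sufficiently wide runs of empty columns separate words (vertical profile).

```python
import numpy as np


def segment_lines(binary, min_ink=1):
    """Return (top, bottom) row ranges of text lines; bottom is exclusive.

    `binary` is a 2-D array with text pixels = 1 and background = 0.
    `min_ink` (hypothetical parameter) is the minimum number of text
    pixels a row needs in order to count as part of a text line.
    """
    profile = binary.sum(axis=1)  # horizontal projection profile
    is_text = profile >= min_ink
    lines = []
    start = None
    for row, flag in enumerate(is_text):
        if flag and start is None:
            start = row  # a text line begins
        elif not flag and start is not None:
            lines.append((start, row))  # a text line ends
            start = None
    if start is not None:  # line touching the bottom edge
        lines.append((start, len(is_text)))
    return lines


def segment_words(line_img, min_gap=3):
    """Return (left, right) column ranges of words; right is exclusive.

    Uses the vertical projection profile of one text line. Runs of empty
    columns narrower than `min_gap` (hypothetical parameter) are treated
    as gaps between characters, not between words.
    """
    profile = line_img.sum(axis=0)  # vertical projection profile
    cols = np.flatnonzero(profile > 0)  # columns that contain ink
    if cols.size == 0:
        return []
    words = []
    left = prev = int(cols[0])
    for c in cols[1:]:
        c = int(c)
        if c - prev - 1 >= min_gap:  # wide gap: close the current word
            words.append((left, prev + 1))
            left = c
        prev = c
    words.append((left, prev + 1))
    return words
```

On clean scans the empty rows and columns are easy to find, which is consistent with the high accuracies reported for low noise levels. With heavy noise, stray pixels fill the gaps in the profiles, so a sketch like this would need the noise filtering and dilation steps described in the abstract before it could be useful.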

Keywords

Amharic Documents for Improving Recognition

Citation