Page Segmentation in Amharic Document Image Collections

Date

2013-06

Publisher

Addis Ababa University

Abstract

The advancement and accessibility of digital computers, together with the introduction of the Internet and the World Wide Web, have resulted in a massive information explosion all over the world. Large numbers of handwritten, typewritten and printed documents contain information and knowledge from many different areas. To make the information and knowledge embedded in these documents accessible to the public, it is desirable to digitize and organize such documents and to develop retrieval systems for them. In response to this need, researchers are moving towards a recognition-free approach, since optical character recognition (OCR) engines have various limitations. Research has been conducted to develop an Amharic document image retrieval (DIR) system without explicit recognition, one that retrieves information from document images relying on image features only. However, the effectiveness of the system is highly affected by segmentation errors at the word level. Moreover, the system does not work on real-life document images in which images, graphics, logos, tables, etc. are embedded. This study attempts to integrate an effective page segmentation technique that can work on documents containing images, graphics, tables, etc. and that improves word-level segmentation. Accordingly, the following page segmentation algorithms were tested: Hough transform, Connected Components (CC), Horizontal Run Length Smoothing (HRLS), Dilation and Watershed. The performance evaluation showed that the integration of CC and Dilation is the best combination. It achieved an Average Match Score of 0.865 on document images with different levels of noise, 0.93 on typewritten documents, 0.97 on documents containing pictures, 0.97 on documents containing tables and 0.45 on handwritten documents ('kum tshihuf'). On average, an increase of 2.34% in F-Measure was obtained on document images with different levels of noise.
Degraded features of old documents, the slimness of typewritten characters and font size variation had a great impact on the performance of the system and need further attention in future research.
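
To make the CC and Dilation combination named in the abstract more concrete, the following is a minimal Python sketch of the general idea: dilate a binarized page horizontally so that characters of one word merge, then label connected components and take their bounding boxes as word candidates. It is an illustration only and not the implementation evaluated in the study. The function names, the 1 x (2*radius+1) structuring element, the 8-connectivity and the toy `PAGE` array are all assumptions introduced here.

```python
from collections import deque


def dilate_horizontal(img, radius):
    """Binary dilation with a 1 x (2*radius+1) horizontal structuring element.

    `img` is a list of rows of 0/1 values (1 = ink). Assumed input format.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            lo, hi = max(0, x - radius), min(w, x + radius + 1)
            # A pixel becomes ink if any pixel within `radius` columns is ink.
            if any(img[y][lo:hi]):
                out[y][x] = 1
    return out


def connected_components(img):
    """Bounding boxes (top, left, bottom, right) of 8-connected ink regions."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not img[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def segment_words(img, radius=1):
    """Word candidates: components of the horizontally dilated image.

    Boxes are measured on the dilated image, so they extend up to
    `radius` columns beyond the original ink.
    """
    return connected_components(dilate_horizontal(img, radius))


# Toy page (hypothetical): two "words", each made of two one-pixel-wide
# strokes separated by a one-column gap, with a four-column gap between words.
PAGE = [
    [1, 0, 1, 0, 0, 0, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 0, 0, 1, 0, 1],
]

print(len(connected_components(PAGE)))  # 4 separate strokes before dilation
print(segment_words(PAGE, radius=1))    # [(0, 0, 1, 3), (0, 6, 1, 9)]
```

In this sketch the dilation radius decides which gaps are closed: it has to exceed half the gap between characters while staying below half the gap between words. This is one plausible reason why font size variation, which the abstract identifies as a source of error, would affect such an approach.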

Keywords

Amharic Document Image Collections
