Page Column and Paragraph Layouts Segmentation and Reconstruction for Recognizing Real Life Documents
Date
2016-06
Authors
Publisher
Addis Ababa University
Abstract
Nowadays, a huge number of handwritten, typewritten, and printed documents contain valuable
information and knowledge that are still recorded, stored, and distributed in paper format. To make
the information and knowledge embedded in these documents accessible and easily reachable, the
documents must be digitized and organized. In the course of digitization, Optical Character
Recognition (OCR) plays a vital role, since it simplifies the process of converting scanned
images of text into editable digital documents while preserving both the content and the format
of the documents. Various researchers have explored different issues in the course of developing
Amharic OCR. Most previous research focuses on character (text) recognition of the
script. However, real-life document images usually contain not only characters (text) but also
associated non-text elements (graphics, columns, paragraphs, etc.). Consequently, detecting
and reconstructing the non-text elements of a document image during the digitization process is
important for reusing documents.
This study applies dilation, connected component (CC) analysis, analysis of CC width, height, and
area, and a novel modified whitespace analysis page segmentation algorithm to separate
graphics from text, to detect column and paragraph blocks, and to collect information about those
layouts, with the aim of reconstructing the column and paragraph layouts of the original document
image. Based on the stored layout information, the proposed system preserves column blocks at a
rate of 80% and paragraph blocks at a rate of 72.22%. The performance of column and paragraph
layout reconstruction depends heavily on the page segmentation stage: for correctly segmented
column and paragraph blocks, the system reconstructs the layouts with an efficiency of 100%.
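The following is a minimal sketch of the general ideas named above (connected component analysis, size-based separation of text from graphics, and whitespace-based column detection). It is not the algorithm developed in the thesis: the function names, the toy 6 x 12 binary page, and the thresholds `max_text_height`, `max_text_area`, and `min_gap` are all assumptions chosen for illustration, and the dilation step is omitted.

```python
from collections import deque

def connected_components(img):
    """Find 8-connected groups of foreground (1) pixels; return their bounding boxes."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if img[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, bottom, left, right, area = r, r, c, c, 0
            while queue:
                y, x = queue.popleft()
                area += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append({"top": top, "left": left,
                          "height": bottom - top + 1,
                          "width": right - left + 1,
                          "area": area})
    return boxes

def split_text_and_graphics(boxes, max_text_height=3, max_text_area=9):
    """Treat small components as text and large ones as graphics (assumed thresholds)."""
    text, graphics = [], []
    for box in boxes:
        is_text = box["height"] <= max_text_height and box["area"] <= max_text_area
        (text if is_text else graphics).append(box)
    return text, graphics

def column_gaps(img, min_gap=2):
    """Return (start, end) ranges of empty pixel columns lying between content.

    `end` is exclusive; runs touching the left or right page edge are margins
    and are ignored.
    """
    cols = len(img[0])
    empty = [all(row[x] == 0 for row in img) for x in range(cols)]
    gaps, start = [], None
    for x in range(cols + 1):
        is_empty = x < cols and empty[x]
        if is_empty and start is None:
            start = x
        elif not is_empty and start is not None:
            if start > 0 and x < cols and x - start >= min_gap:
                gaps.append((start, x))
            start = None
    return gaps

# Toy page: small "glyphs" on the left, a solid "graphic" block on the right.
ROWS = [
    "110100011111",
    "110100011111",
    "000000011111",
    "101100011111",
    "101100000000",
    "000000011011",
]
PAGE = [[int(ch) for ch in row] for row in ROWS]

boxes = connected_components(PAGE)
text, graphics = split_text_and_graphics(boxes)
gaps = column_gaps(PAGE)
```

On this toy page the large solid block is classified as a graphic and the band of empty pixel columns between the two halves is reported as a column separator. A real scanned page would need thresholds derived from the image itself, for example from typical character size.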
Maintaining the layout of the original document image during character recognition is important
for producing well-structured recognized text. However, the developed column and paragraph layout
segmentation and reconstruction techniques fail to reconstruct column blocks based on the
width of the original document image, and fail to segment paragraph blocks when all lines in
the paragraph have equal end points. Thus, there is a need to explore adaptive page
segmentation techniques and the preservation of column blocks of varying width.
Description
Keywords
Recognizing Real Life Documents