Page Column and Paragraph Layouts Segmentation and Reconstruction for Recognizing Real Life Documents






Addis Ababa University


A huge number of handwritten, typewritten, and printed documents contain valuable information and knowledge that is still recorded, stored, and distributed on paper. To make the information and knowledge in these documents accessible, they must be digitized and organized. Optical Character Recognition (OCR) plays a vital role in digitization, since it simplifies the conversion of scanned text images into editable digital documents while preserving both the content and the format of the originals. Researchers have explored various issues in developing Amharic OCR, and most previous studies focus on character (text) recognition of the script. However, real-life document images usually contain not only characters (text) but also associated non-text elements (graphics, columns, paragraphs, etc.). Consequently, detecting and reconstructing the non-text elements of a document image during digitization is important if documents are to be reused. This study applies dilation, connected component (CC) analysis, analysis of CC width, height, and area, and a novel modified whitespace analysis page segmentation algorithm to separate graphics from text, to detect column and paragraph blocks, and to collect layout information with the aim of reconstructing the column and paragraph layouts of the original document image. Based on the stored layout information, the proposed system maintains 80% of column blocks and 72.22% of paragraph blocks. The performance of column and paragraph layout reconstruction depends heavily on the page segmentation stage: for correctly segmented column and paragraph blocks, the system reconstructs the layouts with 100% efficiency. Maintaining the original document image layout during character recognition is important for producing well-structured recognized text.
However, the developed column and paragraph layout segmentation and reconstruction techniques fail to reconstruct column blocks according to the width of the original document image, and fail to segment paragraph blocks when every line in a paragraph ends at the same point. Thus, there is a need to explore adaptive page segmentation techniques and the preservation of width-variant column blocks.
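
The building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the study's algorithm: it assumes a binarized page stored as a list of rows of 0/1 values, and the function names and thresholds (`connected_components`, `split_text_and_graphics`, `column_gaps`, `max_text_height`, `max_text_width`, `min_gap_width`) are hypothetical. It labels 8-connected components, treats components larger than a character-sized box as graphics, and finds candidate column separators as runs of fully blank pixel columns, which is a plain whitespace test rather than the modified whitespace analysis the study proposes.

```python
from collections import deque


def connected_components(img):
    """Return bounding boxes (top, left, bottom, right) of 8-connected foreground regions."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if img[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def split_text_and_graphics(boxes, max_text_height, max_text_width):
    """Assumed heuristic: a box larger than a character in either dimension is graphics."""
    text, graphics = [], []
    for top, left, bottom, right in boxes:
        height = bottom - top + 1
        width = right - left + 1
        if height > max_text_height or width > max_text_width:
            graphics.append((top, left, bottom, right))
        else:
            text.append((top, left, bottom, right))
    return text, graphics


def column_gaps(img, min_gap_width):
    """Return (start, end) pixel-column ranges that are fully blank and wide enough."""
    cols = len(img[0])
    empty = [all(row[c] == 0 for row in img) for c in range(cols)]
    gaps, start = [], None
    # The trailing False closes a blank run that reaches the right edge.
    for c, is_empty in enumerate(empty + [False]):
        if is_empty and start is None:
            start = c
        elif not is_empty and start is not None:
            if c - start >= min_gap_width:
                gaps.append((start, c - 1))
            start = None
    return gaps


# Toy page: two small "characters" on the left, one large block on the right.
page = [
    [1, 1, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 1, 1, 1, 1],
]
boxes = connected_components(page)
text_boxes, graphic_boxes = split_text_and_graphics(boxes, max_text_height=2, max_text_width=2)
gaps = column_gaps(page, min_gap_width=2)
```

In practice a dilation step would normally be applied before labeling so that neighbouring characters merge into word or line blocks, and the size thresholds would be estimated from the page itself rather than fixed by hand.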



Recognizing Real Life Documents