AAU Institutional Repository

Page Column and Paragraph Layouts Segmentation and Reconstruction for Recognizing Real Life Documents

dc.contributor.advisor Meshesha (Ph.D), Million
dc.contributor.author Andualem, Ayedagne
dc.date.accessioned 2018-11-08T13:17:51Z
dc.date.available 2018-11-08T13:17:51Z
dc.date.issued 2016-06
dc.identifier.uri http://etd.aau.edu.et/handle/123456789/13997
dc.description.abstract A huge amount of handwritten, typewritten and printed documents contain valuable information and knowledge that is still recorded, stored, and distributed in paper format. To make the information and knowledge embedded in these documents accessible, the documents need to be digitized and organized. In the course of digitization, Optical Character Recognition (OCR) plays a vital role, since it simplifies the process of converting scanned images of text into editable digital documents while preserving both the content and the format of the documents. Researchers have explored various issues in the course of developing Amharic OCR. Most previous research focuses on character (text) recognition of the script. However, real-life document images usually contain not only characters (text) but also associated non-text elements (graphics, columns, paragraphs, etc.). Consequently, detecting and reconstructing the non-text elements of a document image during digitization is important for reusing documents. This study applies dilation, connected component (CC) analysis, analysis of CC width, height and area, and a novel modified whitespace analysis page segmentation algorithm to separate graphics from text, to detect column and paragraph blocks, and to collect information about those layouts with the aim of reconstructing the column and paragraph layouts of the original document image. Based on the stored layout information, the proposed system maintains column blocks at a rate of 80% and paragraph blocks at a rate of 72.22%. The performance of column and paragraph layout reconstruction depends heavily on the page segmentation stage. For correctly segmented column and paragraph blocks, the system reconstructs the layouts with an efficiency of 100%. Maintaining the layout of the original document image in character recognition is important for producing well-structured recognized text.
However, the developed column and paragraph layout segmentation and reconstruction techniques fail to reconstruct column blocks based on the width of the original document image, and fail to segment paragraph blocks when all lines in a paragraph have equal end points. Thus, there is a need to explore adaptive page segmentation techniques and the preservation of width-variant column blocks. en_US
dc.publisher Addis Ababa University en_US
dc.subject Recognizing Real Life Documents en_US
dc.title Page Column and Paragraph Layouts Segmentation and Reconstruction for Recognizing Real Life Documents en_US
dc.type Thesis en_US
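
The abstract names connected component (CC) analysis with width, height and area checks, and whitespace analysis, as the building blocks for separating graphics from text and for finding column blocks. The following is a minimal illustrative sketch of those two generic ideas, not the thesis's algorithm: it assumes a binarized page given as a list of rows of 0/1 values, and the function names, the dictionary layout and all thresholds (max_text_height, max_text_width, min_gap) are hypothetical choices made for this example. The whitespace step is a plain vertical projection, not the modified whitespace analysis the thesis proposes.

```python
from collections import deque


def connected_components(grid):
    """Return the bounding box and pixel area of each 8-connected foreground region."""
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right, area = r, c, r, c, 0
            while queue:
                y, x = queue.popleft()
                area += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append({"top": top, "left": left, "bottom": bottom,
                          "right": right, "area": area})
    return boxes


def split_text_and_graphics(boxes, max_text_height=4, max_text_width=4):
    """Treat components larger than the (hypothetical) size limits as graphics."""
    text, graphics = [], []
    for box in boxes:
        height = box["bottom"] - box["top"] + 1
        width = box["right"] - box["left"] + 1
        if height > max_text_height or width > max_text_width:
            graphics.append(box)
        else:
            text.append(box)
    return text, graphics


def find_column_gaps(grid, min_gap=2):
    """Return (start, end) x-ranges of fully empty vertical strips at least min_gap wide."""
    if not grid:
        return []
    cols = len(grid[0])
    empty = [all(row[x] == 0 for row in grid) for x in range(cols)]
    gaps, start = [], None
    # A trailing False closes a run of empty columns that reaches the right edge.
    for x, is_empty in enumerate(empty + [False]):
        if is_empty and start is None:
            start = x
        elif not is_empty and start is not None:
            if x - start >= min_gap:
                gaps.append((start, x - 1))
            start = None
    return gaps
```

On a real scan the size limits would have to be derived from the page itself (for example from the typical character height), and empty strips at the page margins would need to be filtered out before the remaining gaps are read as column separators.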

