Page Segmentation in Amharic Document Image Collections

dc.contributor.advisor: Meshesha, Million (PhD)
dc.contributor.author: Assefa, Gedion
dc.date.accessioned: 2018-11-28T07:23:12Z
dc.date.accessioned: 2023-11-29T04:56:54Z
dc.date.available: 2018-11-28T07:23:12Z
dc.date.available: 2023-11-29T04:56:54Z
dc.date.issued: 2013-06
dc.description.abstract: The advancement and accessibility of digital computers and the introduction of the Internet and the World Wide Web have resulted in a massive information explosion all over the world. A large number of handwritten, typewritten and printed documents contain information and knowledge from many different areas. To make the information and knowledge embedded in these documents accessible to the public, it is desirable to digitize and organize them and to develop retrieval systems for them. In response to this need, researchers are moving towards recognition-free approaches, since optical character recognition (OCR) engines have various limitations. Research has been conducted to develop an Amharic document image retrieval (DIR) system without explicit recognition, one that retrieves information from document images relying on image features only. However, the effectiveness of the system is highly affected by segmentation errors at the word level. Moreover, the system does not work on real-life document images in which images, graphics, logos, tables, etc. are embedded. This study attempts to integrate an effective page segmentation technique that works on documents containing images, graphics, tables, etc. and improves word-level segmentation. Accordingly, the following page segmentation algorithms are tested: Hough transform, Connected Components (CC), Horizontal Run Length Smoothing (HRLS), Dilation and Watershed. The performance evaluation showed that the integration of CC and Dilation is the best combination. An average Match Score of 0.865 was obtained on document images with different levels of noise, 0.93 on typewritten documents, 0.97 on documents containing pictures, 0.97 on documents containing tables and 0.45 on handwritten documents ('kum tshihuf'). On average, F-Measure increased by 2.34% on document images with different levels of noise. Degraded features of old documents, the slimness of typewritten characters and font size variation had a great impact on the performance of the system, and these need further attention in future research. [en_US]
dc.identifier.uri: http://etd.aau.edu.et/handle/123456789/14589
dc.language.iso: en [en_US]
dc.publisher: Addis Ababa University [en_US]
dc.subject: Amharic Document Image Collections [en_US]
dc.title: Page Segmentation in Amharic Document Image Collections [en_US]
dc.type: Thesis [en_US]
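
The abstract reports that combining Dilation with Connected Components (CC) gave the best word-level segmentation. The record does not include the thesis's implementation, so the following is only a minimal sketch of the general idea, under stated assumptions: the page is a binary grid (1 = ink), dilation is horizontal only, and the names `dilate_horizontal`, `connected_components`, `segment_words` and the `radius` parameter are hypothetical choices made for illustration. Dilation closes the narrow gaps between characters of one word, and CC labeling then yields one bounding box per merged region.

```python
from collections import deque


def dilate_horizontal(img, radius):
    """Spread every ink pixel `radius` columns left and right on its row.

    Assumption: a 1 x (2*radius + 1) structuring element; the thesis may
    use a different shape or size.
    """
    h = len(img)
    w = len(img[0]) if h else 0
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                    out[y][nx] = 1
    return out


def connected_components(img):
    """Return bounding boxes (top, left, bottom, right) of 8-connected ink regions."""
    h = len(img)
    w = len(img[0]) if h else 0
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not img[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def segment_words(img, radius=1):
    """Dilate, then label; boxes describe the dilated regions, in reading order."""
    boxes = connected_components(dilate_horizontal(img, radius))
    return sorted(boxes, key=lambda b: (b[0], b[1]))


# Toy row: two "words", each made of two ink pixels one column apart.
page = [[1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1]]
words = segment_words(page, radius=1)
```

In this toy row, CC alone finds four separate character blobs, while dilating first merges them into two word-level regions. The `radius` would need tuning to the inter-character and inter-word spacing of real scans.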

Files

Original bundle
Name: Gedion Assefa.pdf
Size: 2.79 MB
Format: Adobe Portable Document Format