
Please use this identifier to cite or link to this item: http://hdl.handle.net/123456789/4134

Title: Amharic-English Script Identification in Real-Life Document Images
Advisors: Dereje Teferi (Ph.D)
Keywords: Information science
Copyright: Jun-2012
Date Added: 29-Nov-2012
Publisher: AAU
Abstract: Computer technology has enabled people to process, store, retrieve and disseminate information with great flexibility and ease. As a result, a vast amount of information is being digitized. Digital libraries are currently digitizing printed documents in order to offer more people access to larger document collections, and at far greater speed, than physical libraries can. This in turn has created the need for effective document image processing systems, which has led to a number of studies on Optical Character Recognition (OCR) and Document Image Retrieval (DIR) systems. The emergence of English as the universal language has resulted in multi-script documents in many nations that use their own scripts. This poses a serious challenge for traditional document image processing systems, which can process only documents prepared in a single script. To address this issue, a number of studies have been conducted on script identification, and various techniques have been reported. Ethiopia faces the same situation: many historical, legal, newspaper and business documents are prepared using two scripts (English and Amharic). Even though many studies have been conducted on document image processing systems for Amharic, only one study has been conducted on script identification for Amharic-English documents. That study pioneered the subject and proposed feature extraction techniques for Amharic-English script identification. The present research continues that work, aiming to improve the performance of the previously proposed system on Real-Life document images. Real-Life document images pose a wide range of challenges, the two main ones being printing variation (font type, size, etc.) and noise. To this end, the present research investigates four noise removal techniques and 11 feature extraction techniques.
Experiments conducted on clean and Real-Life documents showed that DBF (an adaptive noise removal technique) is effective in suppressing noise while keeping the features intact. In addition, the combination of word-level features selected by the forward sequential feature selection method proved effective, in that it is less sensitive to noise, font type and word length variation. More importantly, the experiments were conducted without normalizing the variations (size, space, etc.) that are common in Real-Life documents, and promising results were obtained. Finally, important recommendations that need further investigation are put forward.
Description: A Thesis submitted to the School of Graduate Studies of Addis Ababa University in partial fulfillment of the requirements for the Degree of Master of Science in Information Science
URI: http://hdl.handle.net/123456789/4134
Appears in: Thesis - Information Science

Files in This Item:

File                  Description   Size      Format
final_revised6.pdf                  3.06 MB   Adobe PDF

Items in the AAUL Digital Library are protected by copyright, with all rights reserved, unless otherwise indicated.

