|Title: ||Amharic-English Script Identification in Real-Life Document Images|
|Authors: ||ABEBAYEHU, SAMUEL|
|Advisors: ||Dereje Teferi (Ph.D)|
|Keywords: ||Information science|
|Copyright: ||Jun-2012 |
|Date Added: ||29-Nov-2012 |
|Abstract: ||Computer technology has enabled humans to process, store, retrieve, and disseminate information with great flexibility and ease. As a result, a vast amount of information is being digitized. Digital libraries are currently digitizing printed documents in order to offer more people access to larger document collections, and at far greater speed, than physical libraries can. This in turn has created the need for effective document image processing systems, which has resulted in a number of studies on Optical Character Recognition (OCR) and Document Image Retrieval (DIR) systems. The emergence of English as a universal language has resulted in multi-script documents in many nations that use their own scripts. This situation poses a serious challenge for traditional document image processing systems, which are capable of processing only documents prepared in a single script. To address this issue, a number of studies have been conducted on script identification, and various techniques have been reported.
Ethiopia faces the same situation: many historical, legal, newspaper, and business documents are prepared using two scripts (English and Amharic). Even though many studies have been conducted on document image processing systems for Amharic, only one study has addressed script identification for Amharic-English documents. That study pioneered the subject and proposed feature extraction techniques for Amharic-English script identification. The present research continues that work, aiming to improve the performance of the previously proposed system on Real-Life document images.
Real-Life document images present a wide range of challenges. The two main ones are printing variation (font type, size, etc.) and noise. To this end, the present research investigates four noise removal techniques and 11 feature extraction techniques. Experiments conducted on clean and Real-Life documents showed that the DBF (an adaptive noise removal technique) is effective in suppressing noise while keeping the features intact. In addition, the combination of features (extracted at word level) selected using the forward sequential feature selection method proved effective, showing less sensitivity to noise, font type, and word length variation. More importantly, the experiments were conducted without normalizing any of the variations (size, space, etc.) that are common in Real-Life documents, and promising results were registered. Finally, recommendations that need further investigation are put forward.|
|Description: ||A Thesis submitted to the School of Graduate Studies of Addis Ababa University in partial fulfillment of the requirements for the Degree of Masters of Science in Information Science|
|Appears in:||Thesis - Information Science|
Items in the AAUL Digital Library are protected by copyright, with all rights reserved, unless otherwise indicated.