Amharic Document Image Retrieval Using Linguistic Features

Yeshambel, Tilahun

dc.contributor.advisor	Assabie, Yaregal(PhD)
dc.contributor.author	Yeshambel, Tilahun
dc.date.accessioned	2018-06-26T05:50:24Z
dc.date.accessioned	2023-11-04T12:22:25Z
dc.date.available	2018-06-26T05:50:24Z
dc.date.available	2023-11-04T12:22:25Z
dc.date.issued	10/21/2011
dc.description.abstract	Modern computers play an important role in processing and managing electronic information in the form of text, images, audio, video, and so on. With the rapid development of computer technology, digital documents have become a popular option for storage, access and transmission. To meet the needs of fast-evolving digital libraries, an increasing number of historical documents, newspapers, books, etc. are being digitized for easy archival and dissemination. Optical Character Recognition (OCR) and Document Image Retrieval (DIR), as part of the information retrieval (IR) paradigm, are the two means of accessing document images that have received attention in the IR community. Amharic has been the official language of Ethiopia since the 19th century, and as a result a great many religious and government documents are written in Amharic. Huge collections of machine-printed Amharic documents are found in almost every institution in the country, and accessing those documents has become increasingly difficult. To address this problem, only a few research works using OCR and DIR methods have been attempted recently. The aim of this research is to develop a system model that enables users to find relevant Amharic document images in a corpus of digitized documents easily, accurately, quickly and efficiently. This work therefore presents the architecture of an Amharic DIR system that allows users to search scanned Amharic documents without the need for OCR. The proposed model was designed after a detailed analysis of the specific nature of the Amharic language. Amharic belongs to the Semitic language family and is morphologically rich. Surface word formation involves prefixation, suffixation, infixation, circumfixation and reduplication. 
In this work, a model for searching Amharic document images is proposed, and word image features are systematically extracted for automatic indexing, retrieval and ranking of document images stored in a database. A new approach is incorporated into the proposed model: an NLP tool, namely an Amharic word generator. Given an Amharic root word, this Amharic-specific surface word synthesizer produces a number of possible surface words. The descriptions of these surface word images are then used for indexing and searching. The system passes through several phases: noise removal, binarization, text line and word boundary identification, word segmentation, resizing to normalize different font types, sizes and styles, feature extraction, and finally matching the query word image against document word images. The proposed method was tested on real-world Amharic documents from sources such as magazines, textbooks and newspapers, with various font styles, types and sizes. Precision-recall evaluation was conducted for sample queries on sample document images, and promising results were achieved.	en_US
dc.identifier.uri	http://etd.aau.edu.et/handle/123456789/3391
dc.language.iso	en	en_US
dc.publisher	Addis Ababa University	en_US
dc.subject	Using	en_US
dc.subject	Linguistic Features	en_US
dc.title	Amharic Document Image Retrieval Using Linguistic Features	en_US
dc.type	Thesis	en_US
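
The abstract lists binarization and text line and word boundary identification among the processing phases. The thesis's own method is not described in this record, so the following is only a minimal sketch of one common way to do those two steps, using a fixed threshold and projection profiles. The function names, the threshold of 128 and the word_gap parameter are illustrative assumptions, not taken from the thesis.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale rows; 0 = black, 255 = white


def binarize(gray: Image, threshold: int = 128) -> Image:
    """Return a 0/1 image where 1 marks ink (pixels darker than threshold).

    A fixed global threshold is an assumption made for this sketch.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) spans of non-zero profile entries.

    Spans separated by fewer than min_gap zero entries are merged.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for s, e in spans:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


def segment_words(binary: Image, word_gap: int = 3) -> List[Tuple[int, int, int, int]]:
    """Return word boxes as half-open (top, bottom, left, right) tuples.

    Text lines come from the row-wise ink profile; within each line,
    words come from the column-wise ink profile, where gaps narrower
    than word_gap columns are treated as spacing between characters.
    """
    boxes = []
    row_profile = [sum(row) for row in binary]
    for top, bottom in runs(row_profile):
        col_profile = [
            sum(binary[r][c] for r in range(top, bottom))
            for c in range(len(binary[0]))
        ]
        for left, right in runs(col_profile, min_gap=word_gap):
            boxes.append((top, bottom, left, right))
    return boxes
```

Each box returned by segment_words(binarize(gray)) could then be cropped, resized to a common size and passed to feature extraction and matching. Those later phases depend on details that this record does not give.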

Files

Original bundle

Name: Tilahun Yeshambel.pdf
Size: 2.02 MB
Format: Adobe Portable Document Format

Collections

Computer Science