Amharic Document Image Retrieval Using Linguistic Features
Date
10/21/2011
Publisher
Addis Ababa University
Abstract
The advent of modern computers plays an important role in processing and managing electronic
information in the form of text, images, audio, video, etc. With the rapid
development of computer technology, digital documents have become popular options for
storage, access and transmission. To meet the needs of today's fast-evolving digital libraries, an
increasing number of historical documents, newspapers, books, etc. are being digitized into
electronic format for easy archiving and dissemination. Optical Character Recognition
(OCR) and Document Image Retrieval (DIR), as part of the information retrieval paradigm, are the
two means of accessing document images that have received attention in the IR community.
Amharic has been the official language of Ethiopia since the 19th century, and as a result many religious
and government documents are written in Amharic. Huge collections of machine-printed Amharic
documents are found in almost every institution of the country. It is observed that
accessing those documents has become more and more difficult. To address this problem, only
a few research works have recently been attempted using OCR and DIR methods.
The aim of this research is to develop a system model that enables users to find relevant Amharic
document images from a corpus of digitized documents in an easy, accurate, fast and efficient
manner. This work therefore presents the architecture of an Amharic DIR system, which allows users to search
scanned Amharic documents without the need for OCR. The proposed model is designed after
a detailed analysis of the specific nature of the Amharic language. Amharic belongs to the
Semitic languages and is a morphologically rich language. Surface word formation involves
prefixation, suffixation, infixation, circumfixation and reduplication.
In this work, a model for searching Amharic document images is proposed, and word image
features are systematically extracted for automatic indexing, retrieval and ranking of
document images stored in a database. A new approach is incorporated into the proposed system
model: it applies an NLP tool, an Amharic word generator. Given an
Amharic root word, this Amharic-specific surface word synthesizer produces a number of possible
surface words. The descriptions of these surface word images are then used for
indexing and searching. The system also passes through several phases:
noise removal, binarization, text line and word boundary identification, word
segmentation and resizing to normalize different font types, sizes and styles, feature extraction,
and finally matching the query word image against document word images. The proposed method
was tested on various real-world Amharic documents from sources such as magazines,
textbooks and newspapers, with various font styles, types and sizes. Precision-recall
evaluation was conducted for sample queries on sample document images, and promising
results were achieved.
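As an illustration of three of the steps named above (binarization, text line boundary identification, and precision-recall evaluation), a minimal Python sketch follows. The abstract does not say which algorithms the thesis uses, so everything here is an assumption chosen for simplicity: a fixed global threshold for binarization, a horizontal projection profile for finding text lines, and set-based precision and recall. The function names `binarize`, `segment_lines` and `precision_recall` are hypothetical and do not come from the original work.

```python
from typing import List, Sequence, Set, Tuple

# A grayscale image as rows of pixel values, 0 (black) to 255 (white).
Image = List[List[int]]


def binarize(gray: Image, threshold: int = 128) -> Image:
    """Mark each pixel as 1 (ink) if darker than the threshold, else 0.

    Assumption: a fixed global threshold; the thesis may use another method.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def segment_lines(binary: Image) -> List[Tuple[int, int]]:
    """Find text lines with a horizontal projection profile.

    Returns (start_row, end_row_exclusive) for each run of consecutive
    rows that contain at least one ink pixel.
    """
    lines: List[Tuple[int, int]] = []
    start = None
    for y, row in enumerate(binary):
        has_ink = sum(row) > 0
        if has_ink and start is None:
            start = y                      # a text line begins
        elif not has_ink and start is not None:
            lines.append((start, y))       # a blank row ends the line
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(binary)))
    return lines


def precision_recall(retrieved: Sequence[str],
                     relevant: Set[str]) -> Tuple[float, float]:
    """Compute set-based precision and recall for a single query."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Applying the same run-detection idea to column sums inside each text line (a vertical projection profile) is one common way to locate word boundaries, although the abstract does not confirm that this is the approach taken.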
Keywords
Using, Linguistic Features