Feature Extraction and Matching in Amharic Document Image Collections
Date
2011-06
Authors
Publisher
Addis Ababa University
Abstract
The ubiquity of digital computers and the boom of the Internet and World Wide Web have
resulted in a massive information explosion worldwide. Many types of information are
uploaded to the Internet, such as text documents, document images and other multimedia files.
Document images facilitate office automation by preserving scanned documents in a
document image database. However, retrieving information from a document image database
is a difficult task for organizations because efficient retrieval schemes are lacking. To
overcome this challenge, researchers have attempted recognition-based and recognition-free
retrieval approaches. Recognition-based retrieval first applies optical character
recognition (OCR) to convert document images into text and then performs text retrieval
using search engines. The recognition-free approach, on the other hand, attempts to search
and retrieve directly from document images, relying on image features.
Due to the limitations of OCR systems, recognition-based retrieval is not effective. Hence,
different researchers have attempted to develop document image retrieval systems
without explicit recognition. Building on this, attempts have been made to develop an
effective Amharic document image retrieval system. As a continuation, the current study
explores and designs feature extraction and matching schemes that are insensitive to word
variants, to differences in font type, size and style, and to degradation.
In doing so, eight feature extraction methods and four matching techniques are tested. Of the
four matching schemes, dynamic time warping is insensitive to differences in font type, size
and style. The eight feature extraction techniques are tested for performance, and the
features are then combined systematically following a best stepwise feature selection method.
The results show that combined features perform better than individual features. Using the
best-performing matching algorithm, stemming is performed in the image domain to handle word
variants. Accordingly, promising experimental results are registered for word variants. The
explored matching, feature extraction and stemming techniques are integrated with the
previous Amharic document image retrieval system and tested on noisy document images.
According to the experiments, the current system outperforms the previous
attempts. In addition, relevant conclusions are drawn and recommendations are
forwarded for future investigation.
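
To illustrate why dynamic time warping can tolerate differences in word width, the following is a minimal sketch and not the implementation used in the study. The abstract does not name the eight features, so the column ink-count profile used here is an assumption chosen for illustration, and the function names column_profile and dtw_distance are hypothetical. The example only demonstrates tolerance to horizontal stretching of a binary word image.

```python
from math import inf


def column_profile(image):
    """Hypothetical feature: number of ink pixels (1s) in each column
    of a binary word image given as a list of rows."""
    return [sum(column) for column in zip(*image)]


def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D feature sequences,
    divided by the combined length so words of different widths are comparable."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] matched again to an earlier b
                cost[i][j - 1],      # b[j-1] matched again to an earlier a
                cost[i - 1][j - 1],  # one-to-one step
            )
    return cost[n][m] / (n + m)


# Toy binary word image (assumed data), a copy stretched to twice the width,
# and an unrelated image of the same size.
word = [[0, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 1, 0, 1]]
stretched = [[pixel for pixel in row for _ in (0, 1)] for row in word]
other = [[1, 0, 0, 1],
         [1, 0, 0, 1],
         [1, 0, 0, 1]]

same = dtw_distance(column_profile(word), column_profile(stretched))
different = dtw_distance(column_profile(word), column_profile(other))
print(same, different)
```

In this toy case the stretched copy aligns with the original at zero cost, while the unrelated image yields a positive cost. A retrieval system would rank candidate word images by such a distance to the query word image.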
Keywords
Information Retrieval