Retrieval from Real-life Amharic Document Images

dc.contributor.advisor: Girma, Melaku
dc.contributor.advisor: Meshesha, Million
dc.contributor.author: Asnake, Biniam
dc.date.accessioned: 2018-11-15T08:34:24Z
dc.date.accessioned: 2023-11-18T12:43:58Z
dc.date.available: 2018-11-15T08:34:24Z
dc.date.available: 2023-11-18T12:43:58Z
dc.date.issued: 2012-06
dc.description.abstract: The bulk of real-life documents written in Ethiopic script contain vital information and knowledge about history, culture, economy, politics, religion and science. This knowledge ought to be shared, and advances in technology and in research on Information Retrieval (IR), Artificial Intelligence (AI) and related fields create the need to digitize these documents and make them available for public use. The two major approaches to retrieving information from document images are recognition-based (optical character recognition, OCR) and recognition-free (document image retrieval without explicit recognition, DIR). The first approach is a long-term process, is error-prone and performs poorly on noisy documents, so document image retrieval without explicit recognition is preferred. A few studies have been conducted to develop a recognition-free document image retrieval system that extracts information from document images relying on image features only. These systems are highly affected by noise in real-life documents, which results from paper aging, folding, and scanning and printing errors. In this study, an attempt is made to integrate effective noise reduction and thresholding techniques to enhance the effectiveness of the system in searching within real-life document images. This study also improves the system's online search process by accepting multiple query terms and retrieving documents in a recall-oriented manner, achieving an F-measure of 77.33%. Combinations of three noise reduction techniques (median, adaptive median and Wiener filters) and three thresholding techniques (Otsu’s, Niblack’s and Sauvola’s) are tested on printed real-life documents affected by low, medium, high and very high noise. Performance analysis shows that the best-performing combination of denoising and thresholding techniques is Wiener filtering with Otsu thresholding.
Finally, the performance of the system is evaluated before and after integrating the selected preprocessing techniques, and an average overall F-measure of 82.37% is registered on documents with low, medium, high and very high levels of noise. The major challenge is segmentation error: the current system either treats multiple separate words as one because of noise, or splits a single word into multiple words when, after noise removal, the space between its characters is large enough for the segmentation algorithm to treat it as a word gap (the segmentation threshold value). [en_US]
dc.identifier.uri: http://etd.aau.edu.et/handle/12345678/14240
dc.language.iso: en [en_US]
dc.publisher: Addis Ababa University [en_US]
dc.subject: Retrieval [en_US]
dc.title: Retrieval from Real-life Amharic Document Images [en_US]
dc.type: Thesis [en_US]
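
The abstract names Wiener filtering followed by Otsu thresholding as the best-performing preprocessing combination. The sketch below illustrates that general idea only; it is not the thesis's implementation. It uses plain Python on a 2D list of grey levels (0 to 255) and assumes dark ink on light paper. The function names, the 3x3 window, the border handling and the noise estimate (the average local variance) are all assumptions made for illustration.

```python
def wiener_filter(img, size=3, noise=None):
    """Adaptive (local-statistics) Wiener filter on a 2D list of grey levels."""
    h, w = len(img), len(img[0])
    r = size // 2
    means = [[0.0] * w for _ in range(h)]
    variances = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Window clipped at the image border (an assumption; other
            # implementations pad with zeros or mirror the image instead).
            win = [img[j][i]
                   for j in range(max(0, y - r), min(h, y + r + 1))
                   for i in range(max(0, x - r), min(w, x + r + 1))]
            m = sum(win) / len(win)
            means[y][x] = m
            variances[y][x] = sum((p - m) ** 2 for p in win) / len(win)
    if noise is None:
        # Assumed noise estimate: the average of all local variances.
        noise = sum(v for row in variances for v in row) / (h * w)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            m, v = means[y][x], variances[y][x]
            if v <= noise:
                # Flat region: the variation is attributed to noise.
                out[y][x] = m
            else:
                # Detailed region: keep part of the deviation from the mean.
                out[y][x] = m + (v - noise) / v * (img[y][x] - m)
    return out


def otsu_threshold(img):
    """Return the grey level that maximizes the between-class variance."""
    hist = [0] * 256
    for row in img:
        for p in row:
            hist[min(255, max(0, int(round(p))))] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_between = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]           # pixels with level <= t
        sum0 += t * hist[t]
        w1 = total - w0         # pixels with level > t
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t


def binarize(img, t):
    """Mark ink as 1 (grey level <= t) and background as 0."""
    return [[1 if p <= t else 0 for p in row] for row in img]


def denoise_and_binarize(img):
    """Chain the two steps: Wiener denoising, then Otsu binarization."""
    smoothed = wiener_filter(img)
    return binarize(smoothed, otsu_threshold(smoothed))
```

A caller would pass a scanned page as a list of rows of grey levels to `denoise_and_binarize` and hand the resulting 0/1 image to later stages such as word segmentation. Niblack's and Sauvola's methods, which the abstract also evaluates, compute a separate threshold per local window rather than one global threshold and are not shown here.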

Files

Original bundle
Name: Biniam Asnake.pdf
Size: 6.89 MB
Format: Adobe Portable Document Format