Browsing by Author "Meshesha, Million"
Now showing 1 - 4 of 4
Item An Automatic Sentence Parser for Oromo Language using Supervised Learning Technique (Addis Ababa University, 2002-06) Megersa, Diriba; Getachew, Mesfin; Meshesha, Million; Engdashet, Haile Eyesus

The goal of Information Retrieval (IR) has been to reduce the complexities of human language and thereby serve users as efficiently as possible. The decisive tool toward that end is Natural Language Processing (NLP), which has many components serving this purpose. Parsing is one such component, improving the precision and recall that IR systems aim for; it is also used in the effort toward machine translation, which lies at the heart of NLP. Since the 1960s, various parsers have been developed for languages with relatively wide national and/or international use. Unfortunately, Oromo has not captured the advantage of such systems, despite being the working language of the State Government of Oromiya and one of the major languages in Ethiopia and Africa (Abebe 2002): there is no system of any sort that parses written texts in this language. This study is therefore an attempt to develop a simple automatic sentence parser for the Oromo language. In the study, the chart algorithm was used with some modification. A morphological analyzer module, which splits words into their root forms and corresponding morphemes, was also developed to facilitate preparing texts in a file to be parsed with appropriate lexical categories. In addition, an unsupervised learning algorithm was designed to guide the parser in predicting unknown and ambiguous words in a sentence. Grammar rules, a lexicon, morphological rules and contextual information were designed on the basis of a review of the linguistic properties of Oromo grammatical categories. This system is the first of its kind for this language.
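Chart parsing of the kind described above can be illustrated with a minimal CKY-style recognizer. The toy grammar, lexicon, and Oromo-like example words below are hypothetical stand-ins for illustration only, not the rules or lexicon actually built in the thesis:

```python
# Minimal CKY chart recognizer over a toy grammar in Chomsky normal form.
# Grammar, lexicon, and example sentences are illustrative assumptions.

GRAMMAR = {              # binary rules: (B, C) -> set of parent categories
    ("NP", "VP"): {"S"},
    ("NP", "V"): {"VP"},   # toy SOV verb phrase: object NP then verb
}
LEXICON = {              # word -> set of lexical categories
    "namni": {"N", "NP"},
    "kitaaba": {"N", "NP"},
    "arge": {"V"},
}

def cky_parse(words):
    n = len(words)
    # chart[i][j] holds the categories spanning words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(LEXICON.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # try every split point
                for b in chart[i][k]:
                    for c in chart[k][j]:
                        chart[i][j] |= GRAMMAR.get((b, c), set())
    return "S" in chart[0][n]

print(cky_parse("namni kitaaba arge".split()))   # prints True
```

This sketch only recognizes whether category `S` spans the whole input; the thesis's parser additionally builds structure, handles morphology, and uses its learning module for unknown words, all of which are omitted here.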
The study adopts an intelligent (rule-based + learning module) approach to develop a prototype: a simple parser for the Oromo language. The thesis, in short, describes the process of automated sentence parsing of free text; that is, it aims to develop a prototype and conduct an experiment with it. The results obtained (95% on the training set and 88.5% on the test set) using the small collection of manually parsed sentences encourage further research, especially toward a full-fledged Oromo sentence parser.

Item A Generalized Approach to Optical Character Recognition (OCR) of Amharic Texts (Addis Ababa University, 2000-05) Meshesha, Million; Biru, Tesfaye (PhD)

These days, research in Optical Character Recognition is popular for its application potential in banks, post offices, insurance companies, and other governmental and non-governmental organizations; other application areas include library automation and natural language processing. As Amharic is the working language of Ethiopia and a means of communication for most governmental and non-governmental organizations, there is a huge collection of documents and document processing that could benefit from an OCR system. To this end, research on an Amharic OCR system has recently been undertaken at SISA. The present research continues that work, aiming to improve the performance of the system under investigation at SISA in recognizing characters written in different font types. A feature-based approach was adopted after a thorough study of the features of Amharic characters. Algorithms for thinning and feature extraction were reviewed from the literature, and some of them were implemented to assess their performance on Amharic text printed in different typefaces.
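The abstract does not name which thinning algorithms were implemented, so as a representative sketch only, here is the widely used Zhang-Suen thinning algorithm, which erodes a binary glyph image to a one-pixel-wide skeleton while preserving connectivity:

```python
# Zhang-Suen thinning on a binary image (lists of 0/1, 1 = foreground).
# A representative sketch; the thesis reviewed thinning algorithms from
# the literature without specifying this one.

def neighbours(img, y, x):
    # P2..P9: the 8 neighbours of (y, x), clockwise from the pixel above
    return [img[y-1][x], img[y-1][x+1], img[y][x+1], img[y+1][x+1],
            img[y+1][x], img[y+1][x-1], img[y][x-1], img[y-1][x-1]]

def transitions(n):
    # number of 0 -> 1 transitions in the circular sequence P2..P9
    return sum((a, b) == (0, 1) for a, b in zip(n, n[1:] + n[:1]))

def zhang_suen(img):
    img = [row[:] for row in img]          # work on a copy
    changed = True
    while changed:
        changed = False
        for step in (0, 1):                # the two alternating sub-passes
            to_zero = []
            for y in range(1, len(img) - 1):
                for x in range(1, len(img[0]) - 1):
                    if img[y][x] != 1:
                        continue
                    n = neighbours(img, y, x)
                    p2, p3, p4, p5, p6, p7, p8, p9 = n
                    if not (2 <= sum(n) <= 6 and transitions(n) == 1):
                        continue
                    if step == 0:
                        ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if ok:
                        to_zero.append((y, x))
            for y, x in to_zero:           # delete only after the full scan
                img[y][x] = 0
                changed = True
    return img
```

Because deletions are applied only after each full scan, the result is independent of scan order; thinning a 3-pixel-thick bar, for example, leaves a 1-pixel-wide line whose topological features (endpoints, junctions) can then be extracted.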
Previously implemented algorithms for segmentation (stage-by-stage segmentation) and feature extraction/detection (a tree-based topological feature extraction technique) are incorporated, with some modification, to complete the Amharic OCR. The system was then tested on sample Amharic documents from actual cases (written in Agafari, Washra and Visual Geez) and the test results obtained for each case are reported. Recommendations are also drawn to highlight areas of further research to improve the current work and incorporate other features into the Amharic OCR system.

Item Optical Character Recognition of Amharic Text: an Integrated Approach (Addis Ababa University, 2002-06) Assabie, Yaregal; Teferi, Dereje (PhD); Meshesha, Million

Item Retrieval from Real-life Amharic Document Images (Addis Ababa University, 2012-06) Asnake, Biniam; Girma, Melaku; Meshesha, Million

The bulk of real-life documents contain vital information and knowledge about history, culture, economy, politics, religion and science, available in written form in Ethiopic script. This knowledge ought to be shared, and advances in technology and in research on Information Retrieval (IR), Artificial Intelligence (AI) and related fields bring the need to digitize documents and make them available for public use. The two major approaches to retrieving information from document images are recognition-based (optical character recognition, OCR) and recognition-free (document image retrieval without explicit recognition, DIR). The first approach is a long-term process, error-prone, and performs poorly on noisy documents, whereas recognition-free document image retrieval is the preferred one. A few studies have been conducted to develop a recognition-free document image retrieval system that extracts information from document images relying on image features only.
Such systems are highly affected by noise in real-life documents, which results from paper aging, folding, scanning and printing errors. In this study, an attempt is made to integrate effective noise reduction and thresholding techniques to enhance the effectiveness of the system in searching within real-life document images. The study also improves the system's online searching by accepting multiple query terms and retrieving documents in a recall-oriented manner, achieving a 77.33% F-measure. Combinations of three noise reduction techniques (median, adaptive median and Wiener filters) and three thresholding techniques (Otsu's, Niblack's and Sauvola's) are evaluated on printed real-life documents plagued by low, medium, high and very high noise. Performance analysis shows that the best-performing combination of denoising and thresholding is Wiener filtering with Otsu thresholding. Finally, the performance of the system is evaluated before and after integrating the selected preprocessing techniques; an average overall performance of 82.37% F-measure is registered across documents with low, medium, high and very high levels of noise. The major remaining challenge is segmentation error: the current system either merges separate words into one because of noise, or, once the noise is removed, splits a single word into several when the space between its characters grows large enough to exceed the segmentation threshold used by the segmentation algorithm.
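Otsu's method, the best-performing thresholding technique reported above, picks the grey level that maximizes the between-class variance of the image histogram. A minimal NumPy sketch of a denoise-then-binarize pipeline follows; a simple 3x3 median filter stands in for the denoising stage here (the study's best pair used Wiener filtering, omitted to keep the sketch dependency-free):

```python
import numpy as np

# Denoise-then-binarize sketch: 3x3 median filter + Otsu's global threshold.
# Illustrative only; the study's best combination used Wiener filtering.

def median3(img):
    # 3x3 median filter via nine shifted views (edges handled by padding)
    p = np.pad(img, 1, mode="edge")
    stack = [p[y:y + img.shape[0], x:x + img.shape[1]]
             for y in range(3) for x in range(3)]
    return np.median(np.stack(stack), axis=0)

def otsu_threshold(img):
    # choose t maximizing between-class variance w0*w1*(m0 - m1)^2
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    total = img.size
    sum_all = np.dot(np.arange(256), hist)
    best_t, best_var = 0, 0.0
    w0, sum0 = 0, 0.0
    for t in range(256):
        w0 += hist[t]                    # pixels at or below t
        if w0 == 0:
            continue
        w1 = total - w0                  # pixels above t
        if w1 == 0:
            break
        sum0 += t * hist[t]
        m0 = sum0 / w0                   # mean of the low class
        m1 = (sum_all - sum0) / w1       # mean of the high class
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize(img):
    den = median3(img.astype(float))
    t = otsu_threshold(den)
    return (den > t).astype(np.uint8)    # 1 = above-threshold class
```

On a bimodal document image (dark ink on a bright page), the returned threshold falls between the two histogram peaks, separating text from background regardless of the absolute intensity levels, which is what makes Otsu's method robust across scans of varying brightness.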