Browsing by Author "Teferi, Dereje(PhD)"
Item Afaan Oromo-English Cross-Lingual Information Retrieval (CLIR): a Corpus Based Approach (Addis Ababa University, 2011-06) Bekele, Daneal; Teferi, Dereje(PhD)
The goal of Cross-Language Information Retrieval (CLIR) is to give users access to information written in a language different from that of their queries: a query is issued in one language and documents are retrieved in another. This is achieved by designing a system in which a query in one language can be compared with documents in another. Afaan Oromo is one of the major languages widely spoken and used in Ethiopia. Although Afaan Oromo has a large number of speakers, little research has aimed at making English documents available to them. This study is, therefore, an attempt to develop an Afaan Oromo-English CLIR system that enables native Afaan Oromo speakers to access and retrieve the vast online information sources available in English by writing queries in their own language. The study describes the development of a corpus-based CLIR system that uses word-based query translation for the Afaan Oromo-English language pair, and its evaluation on a corpus of test documents and queries prepared for this purpose. The approach requires parallel documents, so such documents were collected from Bible chapters, legal documents, and some available religious documents. The system is evaluated with both monolingual and bilingual retrieval runs. In the monolingual run, Afaan Oromo queries are given to the system and Afaan Oromo documents are retrieved; in the bilingual run, the Afaan Oromo queries are translated into English before being given to the system to retrieve English documents. For the bilingual run, the queries are translated using a bilingual dictionary constructed from the collected parallel corpora.
The performance of the system was measured by recall and precision. In the first phase of experimentation, maximum average precision values of 0.421 and 0.304 were obtained for the Afaan Oromo and English documents, respectively. The second phase performed slightly better than the first: maximum average precision values of 0.468 and 0.316 were obtained for the Afaan Oromo and English documents, respectively. Therefore, with large and cleaned parallel Afaan Oromo-English document collections, it is possible to develop CLIR for this language pair.

Item Application of Data Mining Technology to Support Fraud Protection: the Case of Ethiopian Revenue and Custom Authority (Addis Ababa University, 2013-01) Mamo, Daniel; Teferi, Dereje(PhD)
Taxes are important sources of public revenue. The collective consumption of goods and services necessitates putting some of our income into government hands. Although tax collection is the main source of income for the government, it faces difficulties with fraud. Fraud involves one or more persons who intentionally act in secret to deprive the government of income for their own benefit. Fraud is as old as humanity itself and can take an unlimited variety of forms. Fraudulent claims account for a significant portion of all claims received by auditors and cost billions of dollars annually. This study was initiated to explore the potential applicability of data mining technology in developing models that can detect and predict suspected fraud in tax claims, with particular emphasis on the Ethiopian Revenue and Custom Authority. The research first applies a clustering algorithm and then classification techniques to develop the predictive model. The K-Means clustering algorithm is employed to find the natural grouping of the tax claims as fraud and non-fraud. The resulting clusters are then used to develop the classification model.
The classification task is carried out using the J48 decision tree and Naïve Bayes algorithms in order to create a model that best predicts fraud-suspicious tax claims. To collect the data, the researcher used interviews and observation for primary data and database analysis for secondary data. The experiments were conducted following the six-step Cios et al. (2000) KDD process model. For the experiment, the collected taxpayers' dataset was preprocessed to remove outliers, fill in missing values, select relevant attributes, integrate data, and derive attributes. The preprocessing phase took the largest portion of the study time. Different characteristics of the ERCA customers' data were collected from the customs ASYCUDA database. A total of 11080 taxpayer records were used for training the models, while a separate 2200 records were used for testing the performance of the model. The model developed using the J48 decision tree algorithm showed the highest classification accuracy, 99.98%. This model was then tested on the 2200-record test dataset and scored a prediction accuracy of 97.19%. The results show that data mining techniques are valuable for tax fraud detection. Hence, future research directions are pointed out toward an applicable system in the area.

Item Bilingual Script Identification for Optical Character Recognition of Amharic and English Printed Document (Addis Ababa University, 2011-06) Abebe, Sertse; Teferi, Dereje(PhD)
OCR is a document image analysis technique that recognizes the informative content of text documents so that it can be archived in soft copy for different purposes. The technique converts a given image of text to its most probable matching characters in the script of a given language. A line of a multilingual document page may contain words in different languages.
To recognize such a document page, it is necessary to identify the different scripts before running an individual OCR system. In this paper, a system that distinguishes Amharic and English scripts in a document image is presented. The system addresses the language identification problem at the word level. To extract the important feature values from the word images, preprocessing activities such as noise removal, binarization, segmentation, and size and style normalization are performed. The maximum horizontal projection profiles from three selected regions, the extent of the word image, and the ratio of the number of connected components to the word-image width are the features used to discriminate between the two scripts. A Support Vector Machine algorithm is applied to classify new word images. The proposed algorithm is tested with a significant number of words in various font styles and sizes. The results obtained are promising.

Item Optical Character Recognition of Amharic Text: an Integrated Approach (Addis Ababa University, 2002-06) Assabie, Yaregal; Teferi, Dereje(PhD); Meshesha, Million

Item Predicting HIV Infection Risk Factor Using Voluntary Counseling and Testing Data: a Case of African Aids Initiative International (Addis Ababa University, 2012-06) Aweke, Girma; Teferi, Dereje(PhD)
Despite a great deal of effort, the world still has neither a cure nor a vaccine for HIV/AIDS infection. Millions of people suffer from this incurable disease. Fortunately, researchers have succeeded in prolonging and improving the quality of life of those infected with HIV. Nonetheless, it has become increasingly clear that the focus should be on preventing the transmission and acquisition of HIV by educating people to bring about behavioral change.
The widely and freely available voluntary counseling and testing (VCT) center in Addis Ababa plays an enormous role in counseling, promoting, and checking clients' HIV status through clinical laboratory tests. In line with this, the center has been collecting clients' records, kept confidential, for further investigation. The records consist of many attributes that may have a direct or indirect impact on HIV infection. Moreover, identifying HIV infection risk factors, or determinant variables, provides benefits at different levels of society (individual, community, and organizational). Clients do not yet see this benefit, however; the organization simply keeps their records after they are tested. To this end, great efforts have been made to develop models that identify HIV infection risk factors using data mining technology. This research was initiated to identify the determinant risk factors of HIV infection by developing predictive models to support the voluntary counselling and testing service of African AIDS Initiatives International (AAII), provided at Addis Ababa University and its surroundings. A six-step hybrid methodology was followed for predictive modeling of HIV infection risk factors among selected attributes. Three classification techniques, the J48 decision tree, PART, and SMO algorithms, were used to build and evaluate the models. Before experimentation, data preprocessing was performed to remove outliers, fill in missing values, select the best attributes, and discretize and transform the data. The preprocessing phase took a considerable share of the time of this work. A total of 15,396 VCT client records were used for training the models, while a separate 3,000 records were used for testing their performance. The model developed using the PART algorithm showed the best classification accuracy, 96.7%.
The model was evaluated on the testing dataset and scored a prediction accuracy of 95.8%. The results show that data mining techniques are valuable for predicting HIV infection risk factors. Hence, future research directions are put forward toward applicable solutions in the area of the study.

Item The Role of Data Mining Technology in Electronic Transaction Expansion at Dashen Bank S.C (Addis Ababa University, 2011-07) Berhe, Luel; Teferi, Dereje(PhD)
In this study, the application of the J48 decision tree and ANN classification algorithms and the K-means clustering algorithm to CRM, in the case of the EFT of POS service of Dashen Bank S.C., is explored within the framework of the CRISP-DM model. The cardholder customers' data, along with customer book information, were collected, cleansed, integrated, and transformed for testing with the clustering and classification models. The final dataset consists of 11000 records, on which clustering models with k values of 6, 5, and 4 and different seed values were run and evaluated. The cluster model with k = 6 and the default seed value showed better performance. Hence, the output of the best clustering model (i.e., k = 6) was used as input for the decision tree and Artificial Neural Network (ANN) classification models. Different classifications with the J48 decision tree algorithm were tested using 10-fold cross-validation and a split of the dataset into 70% for training and 30% for testing, with the cluster index formed by the cluster model as the dependent variable and the remaining variables as independent variables. Decision tree classification models with minNumObj set to the default, 5, 10, 15, 20, and 25 were tried. Of these, the model with default parameter values showed the maximum overall classification accuracy (99.55%).
Likewise, different MultilayerPerceptron ANN classification models were tested by changing the hidden layer and learning rate parameter values. As a result, the model with default parameter values, which had a classification accuracy of 99.97%, was chosen. Lastly, the decision tree and ANN models were compared in terms of overall classification accuracy, accuracy in classifying high-level customers, and accuracy in classifying low-level/value customers. The ANN model was the best on these evaluation measures and was thus selected as the better classifier for EFT of POS service customers. The result obtained in this study is encouraging, as the classification accuracy is very high. This supports the possible application of data mining in the banking industry in general, and in the EFT of POS service expansion marketing strategy at Dashen Bank S.C. in particular.

Item Syllable-based Text-to-Speech Synthesis (TTS) for Amharic (Addis Ababa University, 2012-06) Shiferaw, Mulat; Teferi, Dereje(PhD)
The goal of Text-to-Speech synthesis is to convert arbitrary input text into intelligible and natural-sounding speech so as to transmit information from a machine to a person. In speech synthesis, the capability of information extraction is crucial in producing high-quality synthesized speech. This paper describes the design of a syllable-based concatenative speech waveform synthesizer for the Amharic language that uses the TD-PSOLA algorithm for prosodic modification and speech waveform analysis/synthesis. This approach is based on decomposing the signal into overlapping frames synchronized with the pitch period. In concatenative corpus-based TTS systems, acoustic units of varying sizes are selected from a large speech corpus and then concatenated to produce speech waveforms.
The speech corpus contains more than one instance of each unit to capture the prosodic and spectral variability found in natural speech; hence, the signal modifications needed on the selected units are minimized if an appropriate unit is found in the unit inventory. The syllable unit is chosen primarily because Amharic is a syllable-centred, Consonant-Vowel (CV) assimilated language. The unique syllable units are added to a syllable repository. Further, concatenation at syllable boundaries can lead to smaller error because the spectrum is similar across different syllable boundaries. The syllable-based approach to speech processing is an interesting alternative to the diphone (triphone)-based approach, especially for syllable-timed languages such as Amharic. The system was implemented and tested using selected Amharic texts. The results give a word accuracy rate of 97.8% for automatic syllabification, which leads to improved prosody and synthesis models as well as speech waveform generation, and average scores of 89.58% and 3.45 for ORT and MOS, respectively, based on users' subjective assessment of the intelligibility and naturalness of the synthesized speech. Subjective listening tests performed on the synthesized speech indicate an improvement in the quality of the synthesised speech.