School of Information Science
Browsing School of Information Science by Author "Abebe Ermias (Ato)"
Now showing 1 - 20 of 21
Item Amharic-English Cross-Lingual Information Retrieval (CLIR): A Corpus-Based Approach (Addis Ababa University, 2009-08) Tesfaye Aynalem; Abebe Ermias (Ato)
Amharic is the official working language of the Federal Democratic Republic of Ethiopia, while English serves as the medium of instruction and communication in educational centers and as a working language in governmental and non-governmental organizations in Ethiopia. Experimenting on the applicability of an Amharic-English cross-language information retrieval system that can break this language barrier is therefore important. This research was conducted to break the language barrier that users face in obtaining and using documents prepared in Amharic and English. The method employed for the experimentation is a corpus-based approach, which requires a large volume of parallel documents prepared in Amharic and English; the documents collected for this research are news articles and legal items. The performance of the system was measured by precision and recall. In the first phase of the experimentation, precision values were very low, the highest being 0.2 and 0.3 for Amharic and English respectively. This was due to the index term list, which could not fully represent the documents used for the experimentation.
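The precision and recall measures used in this evaluation can be computed from the sets of retrieved and relevant documents; the following is a generic sketch with made-up document IDs, not the thesis's own code:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    precision = |retrieved & relevant| / |retrieved|
    recall    = |retrieved & relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative example: 2 of 10 retrieved documents are relevant.
p, r = precision_recall(retrieved=range(10), relevant=[1, 5, 20, 21])
# p == 0.2, r == 0.5
```

Averaging these per-query values over a query set yields headline figures like the 0.2 and 0.3 precision maxima reported above.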
The process of indexing removed important terms from the index list, which resulted in no documents being retrieved for most of the queries. Thus, the index list was modified: all the terms occurring in the corpus, with the exception of stop words, were used. This increased the precision values, the highest being 0.36 and 0.33 for Amharic and English documents respectively. Therefore, with a sufficiently large and cleaned parallel Amharic-English document collection, it is possible to develop cross-language information retrieval for this language pair.
Item Amharic Question Answering for Factoid and List Questions Using Machine-Learning Approaches (Addis Ababa University, 2019-02-04) Getachew Medhanit; Abebe Ermias (Ato)
Question answering (QA) is a system that allows users to ask questions about some topic in natural language and gives exact answers by retrieving them from a collection of documents. Its main aim is to assist humans in getting exact answers to the questions they ask, avoiding the need to go through many documents to find a single answer. There are two types of questions in QA, namely factoid and non-factoid questions: the first comprises what, where, when, and who questions, and the second deals with list, definition, acronym, and how questions. The focus of this study is factoid and list questions. Some research has been conducted previously on question answering; most of it used only the SVM algorithm for question classification, and none of it made use of a named entity recognizer (NER) for answer extraction. In this study an attempt is made to design a list and factoid question answering system using a machine-learning approach and an answer extraction component that makes use of NER. This research is a closed-domain QA system for Amharic that focuses on Ethiopian history. It has three components.
Question classification, which identifies the types of questions, is done using two algorithms, HMM and SVM; passage retrieval is performed by selecting the relevant sentences using sentence-level retrieval; and the answer extraction component selects answers from the top-ranked sentences using a NER developed for this research. Factoid questions are answered by keyword matching and extraction from the question using the NER, and list questions are answered using the co-occurrence of answer types and candidate answers in a text. The study achieved an F-measure of 73% for question classification using the SVM classifier and 65% using the HMM classifier. From these results, we realized that question classification using SVM leads to better answer extraction performance than HMM. In addition, the use of the NER tool helped answer extraction in getting exact answers.
Item Applicability of Data Mining Techniques to Support Voluntary Counseling and Testing (VCT) for HIV: The Case of the Center for Disease Control and Prevention (CDC) (Addis Ababa University, 2009-01) Asmare Biru; Abebe Ermias (Ato)
Data mining is emerging as an important tool in many areas of research and industry. Companies and organizations are increasingly interested in applying data mining tools to increase the value added by their data collection systems. Nowhere is this potential more important than in the healthcare industry. As medical record systems become more standardized and commonplace, data quantity increases, with much of it going unanalyzed. Data mining can begin to leverage some of this data into tools that help health organizations organize data and make decisions. Data related to HIV/AIDS are available in VCT centers. A major objective of this thesis is to evaluate the potential applicability of data mining techniques in VCT, with the aim of developing a model that could help make informed decisions.
Using the datasets collected from OSSA, which is supported by CDC, and CRISP-DM as a knowledge discovery process model, the findings of the research are presented using graphs and tables. For the clustering task, the K-means and EM algorithms were tested using WEKA. The clusters generated by EM were appropriate for the problem at hand in producing similar groups. According to the results of these experiments, it was possible to identify similar groups of VCT clients; gender, marital status, HIV test result, and education showed patterns. For the classification task, decision tree (J48 and Random tree) and artificial neural network (ANN) classifiers were evaluated. Although ANN shows better accuracy than the decision tree classifiers, the decision tree (J48) is appropriate for the datasets at hand and was used to build the classification model. Finally, cluster-derived classification models were tested for their cross-validation accuracy and compared with classification models not derived from clusters. The outcomes of this research will serve users in the domain area, as well as decision makers and planners of HIV intervention programs such as CDC and MOH.
Item Application of Data Mining Techniques for Customer Segmentation in Insurance Business: The Case of Ethiopian Insurance Corporation (Addis Ababa University, 2016-07-02) Merga Gutema; Abebe Ermias (Ato)
The aim of this study is to apply data mining techniques in the insurance business to build models that can segment customers based on their value. The study subject for this research is the Ethiopian Insurance Corporation, which stores life insurance policyholders' data in the LIFE INSIS database located at the Life Addis District. To meet this objective, the CRISP-DM methodology, which involves six steps, was adopted to undertake the data mining process and to address the business problem systematically and iteratively.
During the business understanding phase, business practices of EIC life insurance were assessed using interviews with business and technical experts, and document analysis. Through the data understanding and preparation phases, information on policyholders' personal, demographic, policy coverage, and transactional attributes was taken into account, and the attributes selected were considered for their degree of relevancy to developing a value-based customer segmentation model using DM techniques. Accordingly, from the LIFE INSIS database, 27845 records and 16 attributes were imported into MS Excel. The data used in this study were related to one year (12 months) of customer interactions falling between August 2011 and August 2012. Attributes such as occupation ID, marital status, and sector were removed because they showed high proportions of missing values. Preprocessing tasks such as handling outliers and noisy data, data integration, and data transformation were undertaken, and customers' value was computed from individual policyholders' records indicating their insured value, duration, and the cost incurred to attract them (agent commission). In consultation with experts, 7 attributes and 21622 records from the initial database were included in the final datasets for modeling purposes. To build the customer segmentation models, the K-means clustering and J48 decision tree algorithms of the WEKA implementation were selected to discover useful patterns and to analyse the data. The K-means clustering algorithm was selected because it is capable of developing models that segment customers with similar characteristics, while the J48 decision tree classification technique was applied for its quality and clarity in deciphering the cluster models by assigning each record to the target variable. Besides, the patterns revealed that DT models are straightforward, easy to integrate with business practices, and useful for understanding the revealed clusters.
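The cluster-then-classify pipeline described above (K-means segments, then a decision tree that explains them) can be illustrated in miniature. The thesis used WEKA's K-means and J48; the sketch below substitutes a toy one-dimensional K-means and a single decision-stump split on hypothetical insured values:

```python
def kmeans_1d(values, k=2, iters=20):
    """Tiny 1-D K-means: returns cluster labels and final centers."""
    centers = [min(values), max(values)]  # simple initialization for k=2
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centers[c]))
                  for v in values]
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels, centers

def stump_threshold(values, labels):
    """The split value a one-node decision tree would use to reproduce
    the two cluster labels: the midpoint between the cluster boundaries."""
    lo = max(v for v, l in zip(values, labels) if l == 0)
    hi = min(v for v, l in zip(values, labels) if l == 1)
    return (lo + hi) / 2

# Hypothetical insured values forming two obvious value segments.
insured = [100, 120, 130, 900, 950, 1000]
labels, centers = kmeans_1d(insured)
rule = stump_threshold(insured, labels)
# Records with insured value above `rule` fall in the high-value segment.
```

The same idea scales up: the clusterer assigns segment labels, and the tree learner recovers human-readable split rules (here on insured value; in the thesis, on age and insured_value).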
As a result, the experiments made in building the DT model revealed that attributes such as age and insured_value were automatically selected as the best predictive attributes to split the datasets into sub-segments with homogeneous characteristics based on their value (high or low). The results of the research pointed out that customer segmentation models built using the combination of classification and clustering data mining techniques are necessary for the LAD and marketing department of EIC in order to identify the valuable segments of customers and other factors underlying variations in customer value.
Item Application of Data Mining Techniques to Support Customer Relationship Management at Ethiopian Airlines (Addis Ababa University, 2002-07) Woubishet Henock; Meshesha Million (Ato); Abebe Ermias (Ato); Tadesse Nigussie (Ato)
The airline industry is highly competitive, dynamic, and subject to rapid change. As a result, airlines are being pushed to understand and quickly respond to the individual needs and wants of their customers. Most airlines use frequent flyer incentive programs to win the loyalty of their customers, by awarding points that entitle customers to various travel benefits. Furthermore, these airlines maintain a database of their frequent flyer customers. Customer relationship management (CRM) is the overall process of exploiting customer-related information and using it to enhance the revenue flow from an existing customer. As part of implementing CRM, airlines use their frequent flyer data to get a better understanding of their customer types and behavior. Data mining techniques are used to extract important customer information from available databases. This study is aimed at testing the application of data mining techniques to support CRM activities at Ethiopian Airlines.
The subject of this case study is Ethiopian Airlines' frequent flyer program's database, which contains individual flight activity and demographic information of more than 22,000 program members. The data mining process was divided into three major phases. During the first phase, data was collected from different sources, since the frequent flyer database lacked revenue data, which was essential for the study's goal of identifying profitable customer segments. The data preparation phase was next, where a procedure was developed to compute and fill in missing revenue values; moreover, data integration and transformation activities were performed. In the third phase, which is model building and evaluation, the K-means clustering algorithm was used to segment individual customer records into clusters with similar behaviors. Different parameters were used to run the clustering algorithm before arriving at customer segments that made business sense to domain experts. Next, decision tree classification techniques were employed to generate rules that could be used to assign new customer records to the segments. The results from this study were encouraging, which strengthened the belief that applying data mining techniques could indeed support CRM activities at Ethiopian Airlines. In the future, more segmentation studies using demographic information and employing other clustering algorithms could yield better results.
Item Application of Data Mining Techniques to Support Customer Relationship Management at Ethiopian Airlines (Addis Ababa University, 2002-07) Woubishet Henock; Abebe Ermias (Ato); Meshesha Million; Tadesse Nigussie
The airline industry is highly competitive, dynamic and subject to rapid change. As a result, airlines are being pushed to understand and quickly respond to the individual needs and wants of their customers.
Most airlines use frequent flyer incentive programs to win the loyalty of their customers, by awarding points that entitle customers to various travel benefits. Furthermore, these airlines maintain a database of their frequent flyer customers. Customer relationship management (CRM) is the overall process of exploiting customer-related information and using it to enhance the revenue flow from an existing customer. As part of implementing CRM, airlines use their frequent flyer data to get a better understanding of their customer types and behavior. Data mining techniques are used to extract important customer information from available databases. This study is aimed at testing the application of data mining techniques to support CRM activities at Ethiopian Airlines. The subject of this case study is Ethiopian Airlines' frequent flyer program's database, which contains individual flight activity and demographic information of more than 22,000 program members. The data mining process was divided into three major phases. During the first phase, data was collected from different sources, since the frequent flyer database lacked revenue data, which was essential for the study's goal of identifying profitable customer segments. The data preparation phase was next, where a procedure was developed to compute and fill in missing revenue values. Moreover, data integration and transformation activities were performed. In the third phase, which is model building and evaluation, the K-means clustering algorithm was used to segment individual customer records into clusters with similar behaviors. Different parameters were used to run the clustering algorithm before arriving at customer segments that made business sense to domain experts. Next, decision tree classification techniques were employed to generate rules that could be used to assign new customer records to the segments.
The results from this study were encouraging, which strengthened the belief that applying data mining techniques could indeed support CRM activities at Ethiopian Airlines. In the future, more segmentation studies using demographic information and employing other clustering algorithms could yield better results.
Item The Application of Data Mining to Support Customer Relationship Management at Ethiopian Airlines (Addis Ababa University, 2003-06) Abera Denekew; Getachew Mesfin (Ato); Abebe Ermias (Ato); Wobishet Henok
Airlines are being pushed to understand and quickly respond to the individual needs and wants of their customers due to the dynamic and highly competitive nature of the industry. Most airlines use frequent flyer incentive programs, and maintain a database of their frequent flyer customers, to win the loyalty of their customers by awarding points that entitle customers to various travel benefits. Customer relationship management (CRM) is the overall process of exploiting customer-related data and information, and using it to enhance the revenue flow from an existing customer. As part of implementing CRM, airlines use their frequent flyer databases to get a better understanding of their customer types and behavior. Data mining techniques play a role here by allowing the extraction of important customer information from available databases. This study is aimed at assessing the application of data mining techniques to support CRM activities at Ethiopian Airlines. The subject of this case study, the Ethiopian Airlines frequent flyer program, has a database that contained individual flight activity and demographic information of over 35,000 program members. With the objective of filling the gap left by related research carried out by Henok (2002), this study used the data mining database prepared by Henok (2002).
In the course of using the database to attain the objective of this research, data preparation tasks such as deriving new attributes from the existing original attributes, defining new attributes, and then preparing new data tables were carried out. The data mining process in this research is divided into two major phases. During the first phase, since there was an attempt to use three different data mining software packages, data was prepared and formatted into the appropriate format for each package. The second phase, the model building phase, was addressed in two sub-phases, the clustering sub-phase and the classification sub-phase, which form the major contribution of this study. In the clustering sub-phase the K-means clustering algorithm was used to segment individual customer records into clusters with similar behaviors. In the classification sub-phase, the J4.8 and J4.8 PART algorithms were employed to generate rules that were used to develop the predictive model that assigns new customer records to the corresponding segments. As a final output of this research, a prototype Customer Classification System was developed. The prototype makes it possible to classify a new customer into one of the customer clusters, generate cluster results, search for a customer and find the cluster where the customer belongs, and also provides a description of each customer cluster. The results from this study were encouraging and confirmed the belief that applying data mining techniques could indeed support CRM activities at Ethiopian Airlines.
In the future, more segmentation and classification studies using a larger number of customer records with demographic information and employing other clustering and classification algorithms could yield better results.
Item The Application of Decision Tree for Part of Speech (POS) Tagging for Amharic (Addis Ababa University, 2009-09) Kebede Gebeyehu; Abebe Ermias (Ato)
Automatic understanding of natural languages requires a set of language processing tools. A POS tagger, which assigns the proper part of speech (such as noun, verb, or adjective) to the words in a sentence, is one of these tools. This study investigates the possibility of applying a decision tree based POS tagger to Amharic. The tagger was developed using the J48 decision tree classifier algorithm, which is Weka's implementation of the C4.5 algorithm. In the process, a corpus developed by the ELRC annotation team was used to obtain the required data for training and testing the models. The dataset comprises 1065 news documents, about 210,000 words; a sample of some 800 sentences was selected and used for model development and evaluation. The dataset was processed in line with the requirements of the Weka data mining tool. To support the decision tree classification models, a table containing contextual and orthographic information was constructed semi-automatically and used as the training and testing dataset. The tags of the right and left neighboring words of each word are used as contextual information. Moreover, orthographic information about the word, such as its first and last character, its prefix and suffix, and similar cues within the word, is included in the table to provide useful information about the word to be tagged. Performance tests were conducted at various stages using the 10-fold cross-validation test option.
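A feature table of the kind described, combining contextual tags with orthographic cues, can be sketched as follows; the words and tags are illustrative placeholders, and a classifier such as Weka's J48 would then be trained on the resulting rows:

```python
def pos_features(words, tags, i):
    """Contextual and orthographic features for the i-th word,
    in the spirit of the semi-automatically built table described above."""
    w = words[i]
    return {
        "prev_tag": tags[i - 1] if i > 0 else "<s>",            # left context
        "next_tag": tags[i + 1] if i < len(words) - 1 else "</s>",  # right context
        "first_char": w[0],   # orthographic cues
        "last_char": w[-1],
        "prefix": w[:2],
        "suffix": w[-2:],
    }

# Hypothetical tagged sentence (placeholder words and tags, not ELRC data).
words = ["abebe", "beso", "bela"]
tags = ["N", "N", "V"]
table = [pos_features(words, tags, i) for i in range(len(words))]
# table[1]["prev_tag"] == "N", table[1]["suffix"] == "so"
```

Each dictionary becomes one training row; the decision tree learner then picks the most informative of these features, mirroring the finding that tags beyond the immediate neighbors add noise rather than signal.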
Experimental results show that only the tags of the two successive left and right words provide useful contextual information; contextual information beyond two words provides noise rather than useful information. In the end, an overall correctness (accuracy) of 84.9%, including ambiguous and unknown words, was obtained using the 10-fold cross-validation test option. Even though the accuracy of this study is encouraging, further study to improve the accuracy so as to reach implementation level is recommended.
Item The Application of Interface Agent Technology for Personalized Information Filtering: The Case of ILRIAlerts (Addis Ababa University, 2001-06) Chekol Abebe; Biru Tesfaye (Ato); Abebe Ermias (Ato)
Selective Dissemination of Information (SDI) is a personalized information filtering service that delivers current information based on a predefined user profile. However, this service has been found to be resource-intensive and time-consuming both for the service provider and the user; particularly in developing countries, it is one of the rarely implemented information service components. The user profile is fundamental to information filtering in SDI systems, yet, as previous studies have repeatedly stated, the creation and maintenance of user profiles is the most difficult area. This thesis research investigated the SDI user profile updating problem domain with particular emphasis on the application of interface agent technologies. Towards addressing the stated problem, the SDI user profile has been studied by taking as a case an SDI system known as ILRIAlerts of the International Livestock Research Institute (ILRI) in Addis Ababa. This helped to investigate the various approaches and methods by which agent technology can be applied in improving information filtering in SDI systems. Literature reviews showed that this technology has already been employed in similar systems, particularly ones using Internet information resources.
Based on these findings, agent-oriented modeling of the information filtering process of the SDI system, through methods and procedures for updating the user profile, has been investigated, and some of the basic methods and procedures proposed have been demonstrated by developing a prototype agent. Applying this agent, some small-scale tests have been made on the performance of the proposed methods, taking a few test data from ILRIAlerts user feedback records. The prototype and tests have indicated the applicability of agent methods to SDI information filtering problems.
Item Automatic Stemming for Amharic Text: An Experiment Using the Successor Variety Approach (Addis Ababa University, 2009-01) Mezemir Genet; Abebe Ermias (Ato)
The extensive use of the World Wide Web and the increasing digital availability of information and documents have accelerated the demand for technologies and tools for online data retrieval and extraction. Natural language research, with the aim of quick and reliable online information searching and access, is one major component of current advanced information technology development. In this research, an indexing system was developed and programmed using the successor variety stemming algorithm to find stems for Amharic words. The research set out to discover whether the successor variety stemming technique, with the peak-and-plateau, entropy, and complete word methods, can be used for the Amharic language, and what its limitations would be. In addition, the peak-and-plateau method was compared with the entropy and complete word methods. Stemming is typically used in the hope of improving the accuracy of search and reducing the size of the index. A corpus of 6270 words was obtained from the Ethiopian News Agency (ENA) and Walta Information Center and used to train and test the methods.
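The successor variety method counts, for each prefix of a word, how many distinct letters follow that prefix in the corpus; the peak-and-plateau heuristic cuts the word where this count peaks. A minimal sketch with a made-up English corpus (the thesis, of course, worked on Amharic):

```python
def successor_varieties(word, corpus):
    """Number of distinct characters that follow each prefix of `word`
    across all corpus words."""
    counts = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        followers = {w[i] for w in corpus if w.startswith(prefix) and len(w) > i}
        counts.append(len(followers))
    return counts

def peak_segment(word, corpus):
    """Cut the word after the first peak (a count greater than both
    neighbours) in its successor-variety sequence."""
    sv = successor_varieties(word, corpus)
    for i in range(1, len(sv) - 1):
        if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]:
            return word[: i + 1]
    return word  # no peak: leave the word unsegmented

corpus = ["read", "reads", "reader", "reading", "real", "red"]
stem = peak_segment("reading", corpus)  # the peak after "read" marks the stem
```

Here the variety counts for "reading" are [1, 2, 2, 3, 1, 1, 0]; the peak of 3 after the fourth letter yields the stem "read".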
The experimental results showed that the peak-and-plateau method had a performance of 71.8% accuracy, while the entropy and complete word methods achieved 63.95% and 57.99% accuracy respectively. Based on observations made from the experimental results, the successor variety algorithm with the peak-and-plateau method performed better than the successor variety algorithm with the entropy method.
Item Concatenative Text-to-Speech (TTS) Synthesis for the Amharic Language (Addis Ababa University, 2003-06) Lulseged, Henock; Berhanu Solomon (Ato); Abebe Ermias (Ato); Taddese Kinfe (Ato)
In this study, the potential of developing an Amharic TTS system using the TD-PSOLA algorithm has been investigated. In doing this thesis work, the Delphi programming language and the MATLAB software have been used; additionally, a spectrographic analysis tool called Praat was used for data preparation. All the acoustic speech units were extracted from a corpus recorded at a sampling rate of 11,025 Hz, the whole corpus having been recorded in one session. Two acoustic unit types were extracted from the corpus data: diphones and CV-syllables. CV-syllables are suitable for the Amharic language because most of the symbols in the Amharic writing system represent a CV-syllable, which makes tasks like grapheme-to-phoneme transcription easy. Due to time constraints, only a limited number of CV-syllables and diphones were extracted from the corpus. Testing the performance of TTS systems is a difficult task because there is no single measure to pinpoint the quality of the system. Although no standard test is available, a number of testing methods have been developed; the Open Rhyme Test (ORT) and Mean Opinion Score (MOS) test were used in this work to test performance.
The results obtained from the experiment are promising and indicate the possibility of producing a high-quality TTS system for Amharic using more advanced algorithms than the one used in this work.
Item Developing an Optimization Model for Bandwidth Utilization Based on Network Traffic Analysis: The Case of Addis Ababa University (Addis Ababa University, 2016-03-11) Sinshaw Kalkidan; Abebe Ermias (Ato)
Organizations make rules for their users in order to optimize network usage, and network administrators control the users' network traffic flow over the link. Having full control over the activities occurring on a network is essential for managing bandwidth utilization to achieve the expected service quality. Managing the network is also useful for detecting attacks from malware, intrusions, or any restricted applications accessing the network. Bandwidth, the capacity of the link through which traffic data passes, is the most important resource in networking: the larger the bandwidth, the more network traffic can flow through it. This study presents approaches to determining bandwidth utilization by using network traffic analysis. Monitoring traffic data helps to establish the bandwidth capacity and to filter the applications that crowd the network with traffic. The ingoing and outgoing traffic is analyzed using SolarWinds, which efficiently captures data from continuous streams of network traffic and converts those raw numbers into easy-to-interpret charts and tables that quantify exactly how the corporate network is being used, by whom, and for what purpose. It continuously records ingoing and outgoing traffic from the router, core switch, and firewall devices, then logs the data to generate daily, weekly, and monthly statistics. This study examines the effective use of network bandwidth through network traffic analysis of three campuses of AAU and traffic data recorded from network devices at the university data center, which shows the current bandwidth usage of AAU.
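Utilization figures of the kind such tools report can be derived from interface byte counters sampled at fixed intervals; the following generic sketch uses hypothetical counter values, not AAU measurements:

```python
def utilization(bytes_start, bytes_end, interval_s, link_bps):
    """Average link utilization (%) over one sampling interval,
    computed from cumulative interface byte counters."""
    bits = (bytes_end - bytes_start) * 8   # bytes transferred -> bits
    return 100.0 * bits / (interval_s * link_bps)

# Hypothetical sample: 150 MB transferred in 60 s on a 100 Mbit/s link.
u = utilization(0, 150_000_000, 60, 100_000_000)
# u == 20.0 (the link ran at 20% of capacity on average)
```

Aggregating such per-interval values over days, weeks, and months yields the daily, weekly, and monthly statistics described above.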
As the experimental results show, bandwidth usage in the university needs to be managed to serve the whole university community. Based on the network traffic analysis and the SARG report, a bandwidth optimization model is proposed that considers weekly and monthly bandwidth consumption across different campuses. Therefore, the ICT office should take measures to distribute network bandwidth effectively to the campuses.
Item Development of a Stemming Algorithm for Tigrigna Text (Addis Ababa University, 2011-06) Fisseha Yonas; Abebe Ermias (Ato)
This paper presents the development of a rule-based stemming algorithm for Tigrigna. The algorithm is simple yet highly effective; it is based on a set of steps composed of a collection of rules. Each rule specifies the affixes to be removed, the minimum length allowed for the stem, and a list of exception rules. In the Tigrigna language there are many exceptions to any stemming rule, and the researcher has considered these exceptions in designing the stemmer. A deep study of Tigrigna grammar, as well as analysis of the inflectional and derivational types of affixes of the language, was necessary for this kind of thesis work. The stemmer was designed around a new classification of words according to their affixes, and stemming is performed using a rule-based algorithm that removes affixes. Prior research on the Tigrigna language and Tigrigna stemmers was taken into consideration; it was necessary to conduct this research because past research on Tigrigna stemming is limited. By analyzing the Tigrigna grammatical rules, the researcher decided to follow inflectional and derivational affix removal and designed a new rule-set for the Tigrigna stemmer. The goal of the research was to develop and document a new rule-based stemmer for the Tigrigna language. The Tigrigna stemmer was developed in the Python programming language.
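A rule of the kind described, pairing an affix to remove with a minimum stem length and an exception list, can be sketched in Python as follows; the suffixes and words below are Latin-script placeholders, not actual Tigrigna affixes:

```python
# Each rule: (suffix_to_strip, minimum_stem_length, exception_words).
RULES = [
    ("tat", 3, set()),        # placeholder plural-like suffix
    ("at", 3, {"salat"}),     # placeholder suffix with one exception word
    ("om", 3, set()),         # placeholder possessive-like suffix
]

def stem(word):
    """Apply the first matching rule whose constraints hold."""
    for suffix, min_len, exceptions in RULES:
        if word in exceptions:
            continue  # exception words skip this rule
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            return word[: -len(suffix)]
    return word  # no rule applies: the word is its own stem

# stem("mesarhitat") -> "mesarhi"; "salat" is protected by the
# exception list and returned unchanged.
```

The minimum-length constraint is what prevents overstemming of short words, the dominant error category reported below.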
The researcher tried to follow a simple structure in the algorithm, creating small rule-sets for similar affixes that operate on the input words. The stemmer was evaluated using the error counting method: the system was tested and evaluated by counting actual understemming and overstemming errors using a total of 5437 word variants derived from two data sets. Results show that the stemmer has 85.8% accuracy for the first dataset and 86.3% accuracy for the second dataset, for an average accuracy of 86.1%. The proposed method generates some errors, with an average error rate of about 13.9%. These errors were analyzed and classified into two categories, overstemming and understemming; most of the errors occurred due to overstemming of words.
Item IT Project Management Practice in Ethiopia: The Case of the Banking Sector (Addis Ababa University, 2001-07) Yilkal Mihiret; Jemenah Getachew (Ato); Abebe Ermias (Ato)
Information is considered the new organizational resource in business organizations. It supports problem solving, decision making, idea creation, and motivation, thereby bestowing a competitive edge on business organizations. To this end, information technologies are used for processing and managing information. In addition to accelerating improved management effectiveness and business performance, Information Technology (IT) also helps in achieving efficiency gains. To reap these benefits of using Information Technology, many companies in Ethiopia, including the Commercial Bank of Ethiopia, have undertaken Information Technology investment projects. However, it has been found that some of these benefits have not materialized. The cause for projects with such experience is generally regarded to be poor project management. The research, therefore, aims at exploring whether the project management aspect of the Information Technology project of the Commercial Bank of Ethiopia is one of the causes.
To address the research's objective, a review and assessment of the Information Technology project management practice of the Commercial Bank of Ethiopia has been made. The general approach of the IT project management practice has been reviewed and assessed based on the applicability of the project management processes involved in the nine knowledge areas of the Project Management Body of Knowledge developed by the Project Management Institute. To determine the overall impact of the project management practice on the Information Technology project, an overall evaluation has been made using a 'Success Barometer Rating' tool developed in the Enhanced Management Framework by the Treasury Board Secretariat of the Canadian government. The result of this study indicates that poor project management practice had an impact on the degree of success of the Information Technology project of the Commercial Bank of Ethiopia.

Item Predictive Modeling Using Data Mining Techniques in Support of Insurance Risk Assessment(Addis Ababa University, 2002-06) Hintsay Tesfaye; Abebe Ermias (Ato); Meshesha Million

One of the important tasks that we face in real-world applications is classifying particular situations or events as belonging to a certain class. Risk assessment in insurance policies is one example that can be viewed as a classification problem.
In order to solve these kinds of problems, we must build an accurate classifier system or model, and data mining techniques are powerful tools for addressing such problems. This research describes the development of a predictive model that determines the risk exposure of motor insurance policies. Decision tree and neural network techniques were used in developing the model. Since rejections of policy renewal are rare at Nyala Insurance S.C. (NISCO), where the research was conducted as a case study, policies were classified into one of three possible groups (low, medium, or high risk) on the basis of the annual assessment made by NISCO. Six variables were extracted from the 25 variables used in this study. 940 records (90% of the working dataset) were used to build both the decision tree and neural network models; the remaining 116 (10%) were used to validate the performance of the models. The decision tree model, selected based on the meaningfulness of the rules extracted from it, correctly classified 95.69% of the validation set, with classification accuracies for low-, medium-, and high-risk policies of 98.15%, 94.12%, and 92.86%, respectively. The neural network model correctly classified 92.24% of the validation set; high-risk groups were classified correctly, while low- and medium-risk groups were classified with accuracies of 98.15% and 76.47%, respectively. Some possible explanations for the relatively low performance of the neural network on medium-risk policies are given. In addition, an interesting pattern was found between the two models: some policies misclassified by the decision tree were correctly classified by the neural network, and vice versa.
This is a good indication that a hybrid of the two models may result in better performance.

Item Recognition of Formatted Amharic Text Using Optical Character Recognition (OCR) Techniques(Addis Ababa University, 1998-05) Abebe Ermias (Ato); Biru Tesfaye (Ato)

In this age of ours, information is the driving force behind every human endeavor. Information in computer-processable format is especially valuable, since it can be stored, manipulated, and transferred with a minimum of labor and financial cost. For this, information on paper and in other documents should be converted to computer-processable format. For quite some time now, it has been a practice to develop character recognition systems. Scripts such as Latin, Arabic, Kanji, and Cyrillic have enjoyed a significant amount of research in the area, while other scripts, like Amharic and Kannada, have had little work done on them. The testing of OCR techniques on the Amharic script is a recent phenomenon. Worku Alemu, a 1997 graduate of SISA, was able to adapt an OCR algorithm for the Amharic script. Without applying pre- and post-processing techniques to detect and correct errors, the combination of the segmentation and recognition algorithms he used yielded a significant accuracy level for laser printouts of text at 12-point size in the normal type style of the Washrag font (the main test case). However, his algorithm was not capable of recognizing texts written in different font sizes and styles (such as italics and outline). The current work attempts to further his work by introducing some pre-processing techniques so that his algorithm recognizes texts written in different sizes and styles.

Item Statistical Afaan Oromo Grammar Checker(Addis Ababa University, 2015-02-05) Mideksa Desalegn; Abebe Ermias (Ato)

Natural Language Processing (NLP) is a research area that focuses on developing systems that allow computers to communicate with people using everyday language.
In order to communicate through natural languages, the grammatical correctness of a language is very significant. Therefore, it is very important to have natural language processing applications that recognize the grammatical errors that may occur in natural language texts. A natural language processing application that recognizes the grammatical errors of a language is called a grammar checker. Different approaches can be used to develop a grammar checker for a language: rule-based, statistical, and hybrid approaches. In this study, a statistical Afaan Oromo grammar checker is developed and tested using a prepared dataset. In the statistical approach to grammar checking, two techniques can be used for detecting the grammatical correctness of a given sentence. The first is the token n-gram technique, in which sequences of tokens are extracted; the second is the tag n-gram technique, in which sequences of tags are extracted. In this study, both techniques are used, and their performance is tested on 85 Afaan Oromo sentences. The evaluation results show that the performance of the token n-gram technique in identifying incorrect sentences is a recall of 100%, precision of 78.1%, and F-measure of 89.0%, while the performance of the tag n-gram technique in identifying incorrect sentences is a recall of 86%, precision of 82.6%, and F-measure of 84.3%. On the other hand, the performance of the token n-gram technique in identifying correct sentences is a recall of 60%, precision of 100%, and F-measure of 80%, and the performance of the tag n-gram technique in identifying correct sentences is a recall of 74.2%, precision of 78.2%, and F-measure of 76.4%. There are several reasons for the low performance of the two techniques. The first is the performance of the sentence boundary detector, word splitter, POS tagger, and morphological analyzer modules.
Another reason for the low performance of the two techniques is the quality of the corpus (spelling and spacing errors). This study therefore makes the following recommendations for increasing the performance of the grammar checker: first, using a spelling checker to improve the performance of the POS tagger and morphological analyzer; second, using a good-quality corpus and a well-performing POS tagger and morphological analyzer.

Item Sub-Word Based Amharic Word Recognition: an Experiment Using Hidden Markov Model (HMM)(Addis Ababa University, 2002-06) Tadesse Kinfe; Aberra Daniel (Ato); Abebe Ermias (Ato)

In this study, the potential of the Hidden Markov Model (HMM) for the development of an Amharic speech recognition system has been investigated, and in the course of building the recognizer the popular Hidden Markov Model Toolkit (HTK) was used. In the process of building the recognizer, the speech data were recorded at a sampling rate of 16 kHz, and the recorded speech was then converted into Mel Frequency Cepstral Coefficient (MFCC) vectors for further analysis and processing. Since large-vocabulary systems are envisaged, sub-word modeling is pursued. Sub-word modeling refers to a technique whereby one HMM is constructed for each sub-word unit (phoneme, triphone, syllable, etc.). Phonemes, tied-state triphones, and CV-syllables have been considered as the basic sub-word units and have been used to build phoneme-based, tied-state triphone-based, and CV-syllable-based recognizers, respectively. In this study, an extensible 170-word vocabulary is constructed, and both speaker-dependent and speaker-independent models are built using 15 speakers (8 male and 7 female) in the age range of 20 to 30.
Five untrained speakers who had no involvement in training the models were also used to test the speaker-independent models. The results obtained are promising and have shown the potential of tied-state triphones as good sub-word units for Amharic. In fact, phonemes have also produced encouraging recognition performance. Even though CV-syllables appear to be more convenient for Amharic, this research has not proved that, and the question is left for further research.

Item Unsupervised Corpus Based Approach for Word Sense Disambiguation to Afaan Oromo Words(Addis Ababa University, 2015-06) Gemechu Feyisa; Abebe Ermias (Ato)

This thesis presents research on Word Sense Disambiguation (WSD) for the Afaan Oromo language. A corpus-based approach to disambiguation is employed, in which unsupervised machine learning techniques are applied to a corpus of the Afaan Oromo language to acquire disambiguation information automatically. We tested five clustering algorithms (simple k-means; hierarchical agglomerative with single, average, and complete link; and Expectation Maximization) in the existing implementation of the Weka 3.6.11 package. The "Cluster via classification" evaluation mode was used to train the selected algorithms on the preprocessed dataset. Due to the lack of sense-annotated text for these types of studies, a total of 1500 Afaan Oromo sense examples were collected for seven selected ambiguous words, namely sanyii, karaa, horii, sirna, qoqhii, ulfina, and ifa. Different preprocessing activities, such as tokenization, stop-word removal, and stemming, were applied to the sense example sentences to make them ready for experimentation. Hence, these sense examples were used as a corpus for disambiguation. A standard approach to WSD is to consider the context of the ambiguous word and use the information from its neighboring or collocation words.
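The neighboring-word context just described, i.e. words within a fixed window around the ambiguous word, can be sketched as follows. The window size and example tokens are illustrative assumptions, not the thesis's actual settings:

```python
def context_features(tokens, target, window=3):
    """Collect the words occurring within `window` positions to the left
    or right of each occurrence of the ambiguous target word."""
    features = set()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            features.update(tokens[lo:i] + tokens[i + 1:hi])
    return features
```

Each occurrence's feature set (or a bag-of-words vector built from it) would then serve as the input representation handed to the clustering algorithms.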
The contextual features used in this thesis were co-occurrence features, which indicate word occurrence within some number of words to the left or right of the ambiguous word. For the purpose of evaluating the system, the training dataset was used with standard performance evaluation metrics. The results achieved were encouraging, because the clustering algorithms came close to the accuracy of supervised machine learning approaches on a similar dataset. However, further experiments on other ambiguous words and with different approaches will be needed for a better natural language understanding of the Afaan Oromo language.

Item Unsupervised Machine Learning Approach for Word Sense Disambiguation to Amharic Words(Addis Ababa University, 2011-06) Assemu Solomon; Abebe Ermias (Ato)

Word Sense Disambiguation (WSD) in text is still a difficult problem, as the best supervised methods require laborious and costly manual preparation of tagged training data. This work presents a corpus-based approach to word sense disambiguation that only requires information that can be automatically extracted from untagged text. We use unsupervised techniques to address the problem of automatically deciding the correct sense of an ambiguous word based on its surrounding context. The work is motivated by WSD's use in many crucial applications such as Information Retrieval (IR), Information Extraction (IE), and Machine Translation (MT). For this study, we report experiments on five selected Amharic ambiguous words: አጠና (eTena), መሳል (mesal), መሣሣት (me`sa`sat), መጥራት (metrat), and ቀረጸ (qereSe). For the purposes of this research, an unsupervised machine learning technique was applied to a corpus of Amharic sentences so as to acquire disambiguation information automatically. A total of 1045 English sense examples for the five ambiguous words were collected from the British National Corpus (BNC).
The sense examples were translated to Amharic using an Amharic-English dictionary and preprocessed to make them ready for experimentation. We tested five clustering algorithms (simple k-means; hierarchical agglomerative with single, average, and complete link; and Expectation Maximization) in the existing implementation of the Weka 3.6.4 package. The "Class to cluster" evaluation mode was selected to train the selected algorithms on the preprocessed dataset. The results achieved were encouraging, because the best clustering algorithms were close to the accuracy of supervised machine learning approaches on the same dataset, using the same features. However, further experiments on other ambiguous words and with different approaches will be needed for a better natural language understanding of the Amharic language.
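The clustering setup common to the two WSD studies above (context-feature vectors clustered into as many groups as there are senses, then compared against known sense labels) might be sketched as follows. Both studies used Weka's implementations; the minimal hand-rolled k-means and the feature vectors below are fabricated placeholders used only to illustrate the idea:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: returns a cluster label for each point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(p, centers[c])))
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:
                centers[c] = tuple(sum(vals) / len(vals)
                                   for vals in zip(*members))
    return labels

# Toy bag-of-context-words vectors for four occurrences of one ambiguous
# word (fabricated counts); two underlying senses are assumed.
occurrences = [(2, 0, 1, 0), (3, 1, 0, 0), (0, 0, 2, 3), (1, 0, 3, 2)]
labels = kmeans(occurrences, k=2)
```

In a "classes to clusters" style evaluation, each resulting cluster would be mapped to the majority gold sense among its members and accuracy computed from that mapping.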