Application of Data Mining Technology to Identify Significant Patterns in Census or Survey Data: The Case of 2001 Child Labor Survey in Ethiopia

Date

2003-07

Publisher

Addis Ababa University

Abstract

Knowledge and understanding of a problem is always the first step in identifying effective solutions. Child labor is both a sign and a cause of poverty and should be eliminated as soon as possible. In Ethiopia, there is not much statistical data on child labor practice. To fill this data gap, the FDRE CSA carried out a countrywide child labor survey in 2001. This organization uses very simple statistical tools to show summary figures for the different variables in the 2001 child labor survey database. However, traditional statistical methods are not good enough to discover complex relationships in large databases. The inefficiency of these tools necessitated the development of more powerful methods and techniques for studying relationships and patterns in the large volumes of data collected, for example, for census and survey purposes. In the developed world, government and non-government organizations that have access to censuses and surveys are making use of a relatively new technology, data mining, to identify important patterns and relationships within the data accumulated in large databases. The application of data mining techniques to official data such as the 2001 child labor survey has great potential in supporting good public policy. This research focused on identifying relationships between attributes within the 2001 child labor survey database that can be used to clearly understand the nature of the child labor problem in Ethiopia. The goal of the data mining process in this research was therefore to identify interesting patterns and relationships in the 2001 child labor database. After the identification and understanding of the problem domain and the research objectives, the remaining stages of the research project focused on the following three major phases of the data mining process.
During the first phase, the major tasks were selecting an appropriate data mining tool for attaining the defined data mining goal and selecting the target dataset to be used in model building. The next phase, data cleaning and preparation, involved identifying and correcting mis-transmitted information, consolidating and combining records, transforming data into a form suitable for the selected data mining tool, handling missing attribute values, and selecting relevant attributes for generating meaningful association rules. As a final step of data preparation, the selected dataset was grouped into five classes using the expectation maximization clustering algorithm implemented in Knowledge Studio version 3.0. A dataset of 2398 records with 63 attributes was used for clustering. In the third phase, model building and evaluation, the Apriori algorithm, an association rule algorithm implemented in the Weka software, was used to generate association rules from both the clustered and the non-clustered selected dataset. Different sets of attributes were supplied to Apriori in an effort to generate meaningful rules. The results of this study were encouraging and strengthened the hypothesis that interesting patterns can be generated from census and survey databases by applying one of the data mining techniques: association rule mining.
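
The rule generation step described in the abstract can be illustrated with a minimal, pure-Python sketch of Apriori-style association rule mining. It is not the Weka implementation used in the study, and it omits the clustering step. The records, the attribute=value item names, and the support and confidence thresholds below are all invented for illustration and are not taken from the 2001 child labor survey.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every distinct item seen in the data.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count how many records contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unite frequent itemsets into candidates of size k.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) for rules meeting min_confidence."""
    out = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is always available here.
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    out.append((lhs, itemset - lhs, support, confidence))
    return out


# Hypothetical toy records; each record is a set of attribute=value items.
records = [
    {"residence=rural", "school=no", "work=farm"},
    {"residence=rural", "school=no", "work=farm"},
    {"residence=rural", "school=yes", "work=farm"},
    {"residence=urban", "school=yes", "work=domestic"},
    {"residence=urban", "school=yes", "work=none"},
]

frequent = apriori(records, min_support=0.4)   # thresholds are assumptions
found = rules(frequent, min_confidence=0.9)

for lhs, rhs, support, confidence in found:
    print(sorted(lhs), "=>", sorted(rhs), round(support, 2), round(confidence, 2))
```

To mimic the clustered setting mentioned above, the same two functions could be run separately on the records of each cluster, so that rules are reported per group rather than for the dataset as a whole.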

Keywords

Data mining, knowledge discovery, association rule, apriori algorithm
