Application of Data Mining Technology to Identify Significant Patterns in Census or Survey Data: The Case of 2001 Child Labor Survey in Ethiopia
Date
2003-07
Publisher
Addis Ababa University
Abstract
Knowledge and understanding of a problem is always the first step in identifying effective
solutions. Child labor is both a sign and a cause of poverty and should be eliminated as soon as
possible. In Ethiopia, there is not much statistical data on child labor practices. To fill this data gap,
the FDRE CSA carried out a countrywide child labor survey in 2001. This organization uses very
simple statistical tools to show summary figures for the different variables in the 2001 child
labor survey database. However, traditional statistical methods are not good enough to discover
complex relationships in large databases. The limitations of these tools necessitated
the development of more powerful methods and techniques that can be used to study
relationships and patterns in the large volumes of data collected, for example, for census and
survey purposes. In the developed world, government and non-government organizations that have
access to censuses and surveys are making use of a relatively new technology, data
mining, to identify important patterns and relationships within the data accumulated in
large databases.
The application of data mining techniques to official data such as the 2001 child labor survey has
great potential in supporting good public policy. This research focused on identifying
relationships between attributes within the 2001 child labor survey database that can be used to
clearly understand the nature of the child labor problem in Ethiopia. The goal of the data mining
process in this research was therefore to identify interesting patterns and relationships in the 2001 child
labor database.
After the identification and understanding of the problem domain and the research objectives, the
remaining stages of the research project focused on three major phases of the data
mining process. During the first phase, the major tasks were selecting an appropriate data mining
tool for attaining the defined data mining goal and selecting the target dataset used in model building.
The next phase, data cleaning and preparation, involved identifying and correcting
mis-transmitted information, consolidating and combining records, transforming data from one
form to another suitable for the selected data mining tool, handling missing attribute values, and
selecting relevant attributes for generating meaningful association rules. As a final step of data
preparation, the selected dataset was grouped into five classes using the expectation maximization
clustering algorithm implemented in Knowledge Studio version 3.0. A dataset of 2398 records
with 63 attributes was used for clustering.
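The clustering step can be illustrated with a minimal sketch of expectation maximization for a one-dimensional Gaussian mixture. This is not the Knowledge Studio implementation used in the study: the function names `responsibilities` and `em_cluster`, the starting values, and the toy "hours worked per week" figures are all hypothetical, and the sketch uses one numeric attribute and three clusters rather than the 63 attributes and five classes of the actual dataset.

```python
import math


def responsibilities(values, weights, means, variances):
    """E-step: probability that each mixture component generated each value."""
    table = []
    for v in values:
        dens = [
            w * math.exp(-((v - m) ** 2) / (2 * s)) / math.sqrt(2 * math.pi * s)
            for w, m, s in zip(weights, means, variances)
        ]
        total = sum(dens) or 1.0  # guard against underflow to zero
        table.append([d / total for d in dens])
    return table


def em_cluster(values, k, iterations=30):
    """Fit a k-component 1-D Gaussian mixture and return (labels, means)."""
    n = len(values)
    ordered = sorted(values)
    # Assumed initialisation: means spread evenly over the sorted values,
    # equal weights, and a shared variance scaled down by k squared.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    overall = sum(values) / n
    spread = sum((v - overall) ** 2 for v in values) / n
    variances = [max(spread / k ** 2, 1e-6)] * k
    weights = [1.0 / k] * k

    for _ in range(iterations):
        resp = responsibilities(values, weights, means, variances)
        # M-step: re-estimate each component from its weighted members.
        for j in range(k):
            nj = sum(row[j] for row in resp)
            if nj == 0:
                continue
            weights[j] = nj / n
            means[j] = sum(row[j] * v for row, v in zip(resp, values)) / nj
            var = sum(row[j] * (v - means[j]) ** 2 for row, v in zip(resp, values)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps the density finite

    # Hard assignment: each value goes to its most probable component.
    resp = responsibilities(values, weights, means, variances)
    labels = [max(range(k), key=lambda j: row[j]) for row in resp]
    return labels, means


# Hypothetical weekly working hours for twelve children.
hours = [2, 3, 4, 3, 20, 22, 21, 19, 40, 42, 41, 39]
labels, means = em_cluster(hours, k=3)
print(labels)
print([round(m, 1) for m in means])
```

On this toy input the three groups of values separate into three clusters. The resulting cluster label can then be treated as one more attribute of each record, which is how a clustered dataset can be passed on to association rule mining.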
Apriori is an association rule algorithm that is implemented in the Weka software. In the third
phase, model building and evaluation, the Apriori algorithm was used to generate association
rules from both the clustered and the non-clustered versions of the selected dataset. Different
sets of attributes were given to Apriori in an effort to generate meaningful rules.
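To make the rule generation step concrete, the following is a minimal sketch of the Apriori idea: find itemsets that meet a minimum support, level by level, then derive rules that meet a minimum confidence. It is not Weka's implementation. The function names `apriori` and `association_rules`, the thresholds, and the `attribute=value` records are hypothetical and are not taken from the survey.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every single item that occurs in the data.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    found = []
    for itemset, support in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    found.append((lhs, itemset - lhs, support, confidence))
    return found


# Hypothetical survey records, one set of attribute=value items per child.
records = [
    {"residence=rural", "school=no", "work=farm"},
    {"residence=rural", "school=no", "work=farm"},
    {"residence=rural", "school=yes", "work=farm"},
    {"residence=urban", "school=yes", "work=domestic"},
    {"residence=urban", "school=yes", "work=domestic"},
    {"residence=rural", "school=no", "work=domestic"},
]
frequent = apriori(records, min_support=0.5)
for lhs, rhs, support, confidence in association_rules(frequent, min_confidence=0.8):
    print(sorted(lhs), "=>", sorted(rhs), round(support, 2), round(confidence, 2))
```

Varying which attributes appear in the records, as the study did, changes which itemsets reach the support threshold and therefore which rules are produced.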
The results of this study were encouraging and strengthened the hypothesis that interesting
patterns can be generated from census and survey databases by applying one of the data mining
techniques: association rule mining.
Key words: Data mining, knowledge discovery, association rule, apriori algorithm