Automatic Amharic Text Categorization
Date
2007-03
Publisher
Addis Ababa University
Abstract
Rapid developments in Information and Communication Technology
are making huge amounts of data and information available. Much of
this data is in electronic form (for example, the more than a billion
documents on the World Wide Web). Usually this data does not have a
standard structure like that of a relational database. Much of the
data is unstructured or semi-structured and can generally be
considered a text database.
Text databases are growing at an accelerating rate throughout the
world. As a result, text mining has become an active field of study
that aims to facilitate the extraction of useful and relevant
information from text databases.
Text data in local languages is also increasing fast, requiring
text-processing tools to be available for documents in local
languages. This is also true for Amharic, as can be surmised from
the recent boom in online newspapers, magazines, electronically
stored data, etc.
To facilitate the retrieval of useful and relevant information from
Amharic documents, a number of studies on the automatic
processing of Amharic text have recently been conducted. This
research on Automatic Amharic Text Categorization is an effort
to contribute in this direction.
Automatic classification of text data requires that documents are
represented by feature words. Representing a document by relevant
feature words is an important pre-processing step for automatic
classification; it often determines the efficiency and accuracy of the
classification. Standard pre-processing tools and methods are
therefore very important for automatic classification.
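As a minimal sketch of what representing documents by feature words can look like, the Python example below builds term-frequency vectors over a shared vocabulary. The function names, the whitespace tokenizer, and the English placeholder tokens are assumptions made for illustration; they are not taken from the study, which worked on Amharic news text.

```python
from collections import Counter


def tokenize(text):
    # Whitespace splitting only: a hypothetical stand-in for real tokenization.
    return text.split()


def build_vocabulary(documents, min_count=1):
    # Count every token across all documents and keep those seen often enough.
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    return sorted(tok for tok, c in counts.items() if c >= min_count)


def to_feature_vector(text, vocabulary):
    # One term-frequency entry per vocabulary word, in vocabulary order.
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocabulary]


# Placeholder documents (illustrative only).
docs = ["sport news sport", "economy news"]
vocab = build_vocabulary(docs)
vectors = [to_feature_vector(d, vocab) for d in docs]
```

Each distinct word becomes one dimension of the vector, which is why reducing spelling and affix variation before this step directly reduces the dimensionality of the data.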
Because of the lack of a standard in the Amharic writing system and
the unavailability of Amharic text-processing tools, the focus of the
research was on developing a document pre-processing scheme
that facilitates efficient automatic classification of Amharic
documents.
To this end, much attention was given to processing the source
data by developing and enhancing the following tools. The tools are
specific to the source data - Amharic news documents from ENA.
• A tool to correct word spelling variations, focusing on spelling
variation due to pronunciation differences.
• An enhancement to the suffix and prefix removal tool developed
in a previous study, so that it performs semantic analysis
before stripping affixes from words.
• A tool to correct word variations due to gender marker
suffixes.
• A tool to correct word variations due to number marker
suffixes.
• A tool to merge compound words written as separate words
(when separating them may result in semantic loss).
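The kind of normalization these tools perform can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the lookup tables `VARIANTS` and `COMPOUNDS`, their English placeholder entries, and the function names are all hypothetical, since the abstract does not give the actual rules.

```python
# Hypothetical lookup tables; the study's actual rules are not given in the abstract.
VARIANTS = {"colour": "color"}              # spelling variant -> canonical form
COMPOUNDS = {("new", "york"): "new_york"}   # adjacent word pair -> merged token


def normalize(tokens):
    # Step 1: map each known spelling variant to its canonical form.
    tokens = [VARIANTS.get(t, t) for t in tokens]
    # Step 2: merge known compound words that were written as two tokens.
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in COMPOUNDS:
            merged.append(COMPOUNDS[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def feature_reduction(before, after):
    # Fraction of distinct feature words removed by normalization.
    return 1 - len(set(after)) / len(set(before))


raw = ["colour", "new", "york", "color"]
clean = normalize(raw)
reduction = feature_reduction(raw, clean)
```

In this toy input, four distinct tokens collapse to two. The figure illustrates only how a feature reduction percentage can be computed and says nothing about the reduction achieved in the study.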
The use of these tools (which enabled a 10 to 30% feature reduction),
in addition to other tools and data reduction methods, helped to
analyze the huge source data (69,684 news items after data cleaning)
and to measure classifier performance.
Because of the high dimensionality of the source data, two classifier
types that are suitable for high-dimensional data, Decision Tree
and Support Vector Machine (SVM), were selected for the
research experiment. The open source Weka package was used for the
automatic classification of the preprocessed data. Out of the many
classifier algorithms available in Weka, the Logistic Model Tree (LMT)
and the Library for Support Vector Machines (LibSVM) classifiers were
used for performance testing.
Both the LMT and LibSVM classifiers showed good classification accuracy,
correctly classifying 79.72% and 81.15% of the test instances into the
15 news categories, respectively. However, the computational cost of
the automatic classification was very high - taking several hours on
high-capacity computers (computers with 512 MB RAM and 3.7 GHz
speed).
The classification performance measures indicate the need for
additional work in developing tools and methods for mining
Amharic data.
Keywords
Automatic Amharic Text Categorization