Automatic Amharic Text Categorization

Yohannes Afework

Files

Yohannes Afework.pdf (22.59 MB)

Date

2007-03

Authors

Yohannes Afework

Publisher

Addis Ababa University

Abstract

Rapid developments in Information and Communication Technology are making huge amounts of data and information available. Much of this data is in electronic form (such as the more than a billion documents on the World Wide Web). Usually this data does not have a standard structure like that of a relational database. Much of it is unstructured or semi-structured and can generally be considered a text database. Text databases are growing at an accelerating rate throughout the world. As a result, text mining is an active field of study that aims to facilitate the extraction of useful and relevant information from text databases.

Text data in local languages is also increasing fast, which requires text-processing tools to be available in those languages. This is also true for Amharic, as can be surmised from the recent boom in online newspapers, magazines, data in electronic storage, etc. To facilitate the retrieval of useful and relevant information from Amharic documents, a number of studies on automatic processing of Amharic text have recently been conducted. This research work on automatic Amharic text categorization is an effort to contribute in this direction.

Automatic classification of text data requires that documents be represented by feature words. Representing a document by relevant feature words is an important pre-processing step for automatic classification; it often determines the efficiency and accuracy of the classification. Standard pre-processing tools and methods are therefore very important for automatic classification. Because of the lack of a standard in the Amharic writing system and the unavailability of Amharic text-processing tools, the focus of the research was on developing a document pre-processing scheme that facilitates efficient automatic classification of Amharic documents. To this end, much attention was given to the processing of the source data by developing and enhancing the following tools. The tools are specific to the source data, Amharic news documents from ENA:

• A tool to correct word spelling variations, focusing on spelling variation due to pronunciation differences.
• An enhancement to the suffix and prefix removal tool developed in a previous study, so that it can perform semantic analysis before stripping affixes from words.
• A tool to correct word variations due to gender marker suffixes.
• A tool to correct word variations due to number marker suffixes.
• A tool to merge compound words written as separate words (when separating them may result in semantic loss).

The use of these tools (which enabled 10 to 30% feature reduction), in addition to other tools and data reduction methods, helped to analyze the huge source data (69,684 news items after data cleaning) and measure classifier performance. Because of the high dimensionality of the source data, classifier algorithms suitable for high-dimensional data, namely Decision Tree and Support Vector Machine (SVM) classifiers, were selected for the research experiment. The open-source Weka package was used for the automatic classification of the preprocessed data. Out of the many classifier algorithms available in Weka, the Logistic Model Tree (LMT) and LibSVM (a library for support vector machines) classifiers were used for performance testing. Both the LMT and LibSVM classifiers showed good classification accuracy, correctly classifying 79.72% and 81.15% of the test instances into the 15 news categories, respectively. However, the computational cost of the automatic classification was very high, taking several hours on high-capacity computers (computers with 512 MB RAM and 3.7 GHz speed). The classification performance measures indicate the need for additional work in developing tools and methods for mining Amharic data.
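The feature reduction figure quoted in the abstract can be read as the relative drop in the number of distinct feature words once word variants are mapped to a single form. The following is a minimal sketch of that calculation, not the author's tools: the function names, the variant table, and the English stand-in tokens are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: measures how much a variant-normalization step
# shrinks the feature vocabulary. All names and data here are hypothetical.

def normalize(token, variants):
    """Return the canonical form of a token, or the token itself if unknown."""
    return variants.get(token, token)


def vocabulary(docs, variants=None):
    """Collect the set of distinct feature words across tokenized documents."""
    variants = variants or {}
    return {normalize(tok, variants) for doc in docs for tok in doc}


def feature_reduction(docs, variants):
    """Fraction of distinct features removed by applying the variant table."""
    before = len(vocabulary(docs))
    after = len(vocabulary(docs, variants))
    return (before - after) / before if before else 0.0


# English stand-ins for number-marker variants (assumed example data).
docs = [
    ["minister", "ministers", "meeting"],
    ["meetings", "minister", "report"],
]
variants = {"ministers": "minister", "meetings": "meeting"}

reduction = feature_reduction(docs, variants)
print(f"feature reduction: {reduction:.0%}")
```

In this toy case five distinct words collapse to three. A real table for Amharic would have to encode the spelling, gender, number, and compound-word rules that the abstract describes.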

Keywords

Automatic Amharic Text Categorization

URI

https://etd.aau.edu.et/handle/123456789/6040

Collections

Computer Science
