Information Extraction Model from Amharic News Texts

Date

2010-11

Publisher

Addis Ababa University

Abstract

As the volume of unstructured documents on the web and on intranets keeps growing, tools that can extract relevant data to support decision making are becoming crucial. Information extraction (IE) is concerned with extracting relevant information from text and storing it in a database so that the data is easy to use and manage. As the first comprehensive work on IE from Amharic text, we designed a model that is generic enough to deal with different domains in the Amharic language. The proposed model has four main components: document preprocessing, text categorization, learning and extraction, and post-processing. The document preprocessing component normalizes the document, the text categorization component categorizes the news text, and the learning and extraction component extracts the predefined relevant information from the categorized text. The post-processing component formats the extracted data and saves it to the database. Evaluation techniques commonly used to assess classifier machine learning algorithms are applied to both the IE and the text categorization components. Among the classifier algorithms tested for the text categorization component, the Naïve Bayes algorithm performs best, correctly classifying 92.83% of the 1200 news texts used as a dataset. For the information extraction component, 1422 instances are used for training and testing. Different scenarios are used to evaluate the role of the different features in predicting the category of the candidate texts. Across the scenarios considered and the machine learning algorithms employed, the SMO algorithm correctly classified 94.58% of the instances when all the features are considered, which also yields higher precision and recall for the attributes considered for extraction.
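
To make the text categorization step more concrete, the following is a minimal sketch of a multinomial Naïve Bayes categorizer with add-one smoothing, written in plain Python. It is an illustration only, not the implementation used in this work: the abstract does not say which toolkit or feature set was used, and the class name `NaiveBayesCategorizer`, the helper functions `tokenize` and `accuracy`, and the tiny English toy dataset are all assumptions introduced here. Real Amharic preprocessing would also need language-specific normalization, which the placeholder `tokenize` does not attempt.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Placeholder preprocessing (assumption): lowercase and split on whitespace.
    # Amharic-specific normalization is deliberately left out of this sketch.
    return text.lower().split()


class NaiveBayesCategorizer:
    """Hypothetical multinomial Naive Bayes text categorizer (illustrative only)."""

    def __init__(self):
        self.class_docs = Counter()               # number of documents per category
        self.word_counts = defaultdict(Counter)   # token counts per category
        self.vocab = set()                        # all tokens seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_docs[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_docs.items():
            # Log prior plus sum of log likelihoods with add-one smoothing.
            score = math.log(n_docs / total_docs)
            total_tokens = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_tokens + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, texts, labels):
    # Fraction of texts whose predicted category matches the given label.
    correct = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


# Toy data (made up for illustration, not the 1200-text news dataset).
train_texts = [
    "team wins football match",
    "player scores late goal",
    "bank raises interest rate",
    "market prices rise sharply",
]
train_labels = ["sport", "sport", "economy", "economy"]

model = NaiveBayesCategorizer().fit(train_texts, train_labels)
```

Calling `model.predict("goal in the football match")` on this toy model returns the category whose smoothed log score is highest, and `accuracy(model, texts, labels)` gives the fraction of correctly classified texts, which is the kind of figure the abstract reports for the categorization component.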

Keywords

Amharic Text Information Extraction; Machine Learning Approach to Information Extraction; Amharic Text Categorization; Information Extraction
