Information Extraction Model from Amharic News Texts
Date
2010-11
Publisher
Addis Ababa University
Abstract
As the volume of unstructured documents on the web and on intranets keeps growing, tools that
can extract relevant data to facilitate decision making are becoming crucial. Information
extraction (IE) is concerned with extracting relevant information from text and storing it in a
database for easy use and management. As the first comprehensive work on IE from Amharic text,
we designed a model that is generic enough to deal with different domains in the Amharic
language. The proposed model has four main components: document preprocessing, text
categorization, learning and extraction, and post-processing. The document preprocessing
component normalizes the document. The text categorization component assigns each news text to
a category, and the learning and extraction component extracts the predefined relevant
information from the categorized text. The post-processing component formats the extracted data
and saves it to the database.
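The four components described above can be read as a simple pipeline. The following is a minimal sketch of that flow only, not the implementation used in this work: every function name, the keyword lists, the `count` field, and the in-memory list standing in for the database are hypothetical, and a keyword lookup is swapped in for the trained classifiers. English toy text is used for readability.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Record:
    """One row destined for the database (hypothetical schema)."""
    category: str
    fields: dict = field(default_factory=dict)


def preprocess(text: str) -> str:
    """Stand-in normalization: collapse whitespace runs and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical keyword lists; the source uses trained classifiers instead.
CATEGORY_KEYWORDS = {
    "accident": {"crash", "injured"},
    "sport": {"match", "goal"},
}


def categorize(text: str) -> str:
    """Pick the category whose keywords overlap most with the text."""
    tokens = set(text.lower().split())
    scores = {cat: len(tokens & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def extract(text: str, category: str) -> dict:
    """Stand-in extractor: take the first number as a hypothetical 'count'."""
    match = re.search(r"\d+", text)
    return {"count": int(match.group())} if match else {}


def postprocess(category: str, fields: dict, database: list) -> Record:
    """Format the extracted data as a record and save it."""
    record = Record(category, fields)
    database.append(record)
    return record


def run_pipeline(raw: str, database: list) -> Record:
    """Chain the four components in the order the abstract lists them."""
    text = preprocess(raw)
    category = categorize(text)
    fields = extract(text, category)
    return postprocess(category, fields, database)
```

Calling `run_pipeline("  Bus  crash\n leaves 12 injured ", db)` with an empty list `db` appends one record to `db`. Categorization runs before extraction so that the extractor can, in principle, look for the fields that are relevant to that category.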
Standard evaluation techniques for machine learning classifiers are used to assess both the IE
and the text categorization components. Among the classification algorithms applied in the text
categorization component, the Naïve Bayes algorithm performs best, correctly classifying 92.83%
of the 1200 news texts used as a dataset. For the Information Extraction component, 1422
instances are used for training and testing. Different scenarios are used to evaluate the role
of the different features in predicting the category of the candidate texts. Among the scenarios
we considered and the machine learning algorithms we employed, the SMO algorithm performs best,
correctly classifying 94.58% of the instances when all the features are considered, which also
yields higher precision and recall for the different attributes considered for extraction.
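The figures above are accuracy, precision and recall. As a reminder of how these measures are computed from gold and predicted labels, here is a minimal sketch. The function name `evaluate` and the example labels are hypothetical and do not come from the evaluation in this work.

```python
from collections import Counter


def evaluate(gold, predicted):
    """Return (accuracy, {label: (precision, recall)}) for two label lists."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1   # correct prediction for label g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true label but was missed

    accuracy = sum(tp.values()) / len(gold)

    per_class = {}
    for label in sorted(set(gold) | set(predicted)):
        p_den = tp[label] + fp[label]   # times the label was predicted
        r_den = tp[label] + fn[label]   # times the label was the truth
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        per_class[label] = (precision, recall)
    return accuracy, per_class


# Hypothetical attribute labels, for illustration only.
gold = ["person", "place", "person", "other"]
pred = ["person", "person", "person", "other"]
accuracy, per_class = evaluate(gold, pred)
```

Here three of the four predictions are correct, so `accuracy` is 0.75. The label `person` has perfect recall but a precision of 2/3, because one `place` instance was mislabeled as `person`.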
Key words: Amharic Text Information extraction, Machine Learning Approach to Information
Extraction, Amharic Text Categorization, Information Extraction