The Application of Decision Tree for Part of Speech (POS) Tagging for Amharic
Date
2009-09
Publisher
Addis Ababa University
Abstract
Automatic understanding of natural languages requires a set of language processing tools.
A POS tagger, which assigns the proper parts of speech (such as noun, verb, or adjective) to
words in a sentence, is one of these tools. This study investigates the possibility of
applying a decision tree based POS tagger to Amharic. The tagger was developed using the
J48 decision tree classifier algorithm, which is Weka's implementation of the C4.5 algorithm.
In the process, a corpus developed by the ELRC annotation team was used to obtain the required
data for training and testing the models. The dataset comprises 1065 news
documents (210,000 words). A sample of some 800 sentences was selected and used for
model development and evaluation. The dataset was processed in line with the
requirements of the Weka data mining tool. To support decision tree
classification models, a table that contains the contextual and orthographic information is
constructed semi-automatically and used as the training and testing dataset.
The tags of the right and left neighboring words of each word are used as contextual
information. Moreover, orthographic information about the word, such as its first and last
character, its prefix and suffix, the existence of a numeric digit within the word, and so on, is
included in the table to provide useful information about the word to be tagged.
Performance tests were conducted at various stages using the 10-fold cross-validation test
option. Experimental results show that only the tags of the two successive words to the left and to the right
provide useful contextual information; contextual information beyond two words does not
provide useful information but rather noise. In the end, an overall accuracy of 84.9%, including ambiguous and
unknown words, was obtained using the 10-fold cross-validation
test option. Even though the accuracy of this study is encouraging, further
study to improve the accuracy to an implementation-ready level is recommended.
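To make the feature table described in the abstract more concrete, the following is a minimal Python sketch of how one row per word might be built from a tagged sentence. It is not the study's actual procedure, which was semi-automatic and targeted Weka. The function name `build_feature_rows`, the `NONE` padding tag, the column names, the fixed affix length of two characters, and the digit check are all assumptions made for illustration. The default window of two neighboring tags on each side follows the experimental finding reported above.

```python
from typing import Dict, List, Tuple

# Hypothetical placeholder tag for positions that fall outside the sentence.
PAD_TAG = "NONE"


def build_feature_rows(
    sentence: List[Tuple[str, str]],
    window: int = 2,      # number of neighboring tags used on each side
    affix_len: int = 2,   # assumed prefix/suffix length, not taken from the study
) -> List[Dict[str, str]]:
    """Build one feature row per (word, tag) pair of an annotated sentence.

    Assumes every word is a non-empty string.
    """
    tags = [tag for _, tag in sentence]
    rows = []
    for i, (word, tag) in enumerate(sentence):
        row = {"word": word}
        # Contextual information: tags of the left and right neighbors,
        # read from the annotated sentence.
        for k in range(1, window + 1):
            row[f"left_tag_{k}"] = tags[i - k] if i - k >= 0 else PAD_TAG
            row[f"right_tag_{k}"] = tags[i + k] if i + k < len(tags) else PAD_TAG
        # Orthographic information about the word itself.
        row["first_char"] = word[0]
        row["last_char"] = word[-1]
        row["prefix"] = word[:affix_len]
        row["suffix"] = word[-affix_len:]
        row["has_digit"] = "yes" if any(ch.isdigit() for ch in word) else "no"
        # Class label that the decision tree is trained to predict.
        row["tag"] = tag
        rows.append(row)
    return rows


# Placeholder tokens and tags, not real corpus data.
example = [("abc1", "N"), ("de", "ADJ"), ("fgh", "V")]
rows = build_feature_rows(example)
```

Each resulting dictionary corresponds to one table row. Writing the rows out in a format that Weka accepts, and running J48 on them, is left out of this sketch.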
Description
Keywords
Information Science