A Morphosyntactic Tagset for the Annotation of Texts in Tigrinya
Date
2013-06
Publisher
Addis Ababa University
Abstract
The main purpose of this thesis is to identify and develop a morphosyntactic tagset for annotating texts in Tigrinya, an Ethio-Semitic language with about seven to nine million speakers in Ethiopia and Eritrea (CSA, 2007; CIA, 2012; http://en.wikipedia.org/wiki/Tigrinya_language#cite_ref-2). As far as existing research shows, there are almost no Natural Language Processing (NLP) resources for Tigrinya. The researcher considers it fortunate that work on Tigrinya can begin with a comprehensive morphosyntactic tagset, because such a tagset is the foundation for many NLP applications. We examined the morphosyntactic features of Tigrinya words and assigned tags applicable to these words in Tigrinya texts. The thesis focuses only on developing a morphosyntactic tagset based on the morphological and morphosyntactic features of Tigrinya. The resulting tagset has 18 coarse-grained tags at the higher level and 105 fine-grained tags at the lower level; extending it to still finer-grained features yields 139 tags. We recommend that researchers use the 105 tags for their applications, unless they have a different purpose that calls for the 18 coarse-grained major-category tags, the very fine-grained 139 tags, or an even finer set. Morphosyntactic tagsets add an important level of linguistic information to a document. Annotation with such a tagset is useful as a preprocessing step for parsing and, above all, for developing a POS tagger, which is the basis for many higher-level NLP applications. The beneficiaries of this research are students, researchers, and professionals such as computational linguists and computer scientists who work on NLP applications such as speech recognition, text-to-speech, natural language parsing, information retrieval, lexicography, and machine translation.
Keywords
Ethio-Semitic language having about, seven to nine million speakers in Ethiopia