Browsing by Author "Alemu, Atelach"
Item Automatic Sentence Parsing for Amharic Text an Experiment Using Probabilistic Context Free Grammar (Addis Ababa University, 2002-07)
Alemu, Atelach; Biru, Tesfaye (PhD)
Natural language processing, as a field of scientific inquiry, plays an important role in increasing computers' capability to understand natural languages, the languages in which most human knowledge is recorded. Work in the area of natural language processing tries to design and implement computer programs that can understand natural language and act appropriately on the information contained in the text or utterance. Enabling computers to understand natural language involves extracting meaning from natural language sentences, and one of the steps in this process is sentence parsing. Sentence parsing, also called syntactic parsing, is the process of identifying how words can be put together to form correct sentences, determining what structural role each word plays in the sentence, and determining which phrases are subparts of which other phrases. A sentence parser outputs a parse structure that can be used as a component in many applications, including semantic analysis, machine translation, and information storage and retrieval of textual data. Today, parsers of different kinds (e.g. probabilistic, rule-based) have been developed for languages that are relatively widely used nationally and/or internationally (e.g. English, German, Chinese). The same is not true for Amharic, the working language of the Federal Government of Ethiopia and one of the major languages of Ethiopia (Bender et al., 1976), since, to the best of my knowledge, there are no sentence parsers of any sort that process this language.
This study therefore attempted to develop a simple automatic parser for Amharic sentences, to address the need for systems that automatically process the Amharic language. The study used the Inside-Outside algorithm with a bottom-up chart parsing strategy. Probabilistic context-free grammar was used as the grammatical formalism to represent the phrase structure rules of the language. A small sample corpus was selected from sentences in the language and served as both training and test set. The sample was hand parsed and automatically tagged, and was then used as a corpus to extract the grammar rules and assign probabilities. The thesis, in short, describes the process of automatic sentence parsing using a combination of probabilistic and rule-based reasoning, from manually parsing simple sentences to developing a prototype and conducting an experiment with it.
The results obtained using the small manually parsed corpus seem to encourage further research, especially with the aim of developing a full-fledged Amharic sentence parser.

Item Automatic Sentence Parsing for Amharic Text an Experiment Using Probabilistic Context Free Grammars (Addis Ababa University, 2002-06)
Alemu, Atelach; Birru, Tesfaye
Natural language processing, as a field of scientific inquiry, plays an important role in increasing computers' capability to understand natural languages, the languages in which most human knowledge is recorded. Work in the area of natural language processing tries to design and implement computer programs that can understand natural language and act appropriately on the information contained in the text or utterance. Enabling computers to understand natural language involves extracting meaning from natural language sentences, and one of the steps in this process is sentence parsing. Sentence parsing, also called syntactic parsing, is the process of identifying how words can be put together to form correct sentences, determining what structural role each word plays in the sentence, and determining which phrases are subparts of which other phrases. A sentence parser outputs a parse structure that can be used as a component in many applications, including semantic analysis, machine translation, and information storage and retrieval of textual data. Today, parsers of different kinds (e.g. probabilistic, rule-based) have been developed for languages that are relatively widely used nationally and/or internationally (e.g. English, German, Chinese).
The same is not true for Amharic, the working language of the Federal Government of Ethiopia and one of the major languages of Ethiopia (Bender et al., 1976), since, to the best of my knowledge, there are no sentence parsers of any sort that process this language.

Item Development of Stemming Algorithm for Wolaytta Text (Addis Ababa University, 2003-06)
Lessa, Lemma; Getachew, Mesfin; Alemu, Atelach; Engdashet, Haile Eyesus (PhD)
This study describes the design of a stemming algorithm for the Wolaytta language. To give a solid background for the thesis, the literature on conflation in general and on stemming algorithms in particular was reviewed. Since the nature and characteristics of affixation guide the development of a stemmer, the morphology of the Wolaytta language was studied and described in order to model the language and develop an automatic procedure for conflation. The inflectional and derivational morphologies of the language are discussed. The study indicates that suffixation is the main word formation process in Wolaytta, and attempts to show that the language is morphologically complex and uses extensive concatenation of suffixes. The result of the study is a prototype context-sensitive iterative stemmer for the Wolaytta language. An error counting technique was employed to evaluate the performance of this stemmer. The stemmer was trained on 3537 words (80% of the sample text), and the improved version achieved an accuracy of 90.6% on the training set. Overstemmed and understemmed words on the training set amounted to 8.6% (304 words) and 0.8% (28 words), respectively. When run on the unseen sample of 884 words (20% of the sample text), the stemmer achieved an accuracy of 86.9%. The percentages of errors recorded as understemmed and overstemmed on this unseen test set were 9% and 4.1%, respectively. Moreover, a dictionary reduction of 38.92% was attained on the test set.
The major sources of errors are also reported, with recommendations for further improving the performance of the stemmer and for further research.

Item An Integrated Approach to Automatic Complex Sentence Parsing for Amharic Text (Addis Ababa University, 2003-06)
Gochel, Daniel; Alemu, Atelach
Natural language processing is a research area that is becoming increasingly popular for both academic and commercial reasons. Higher-level NLP systems (e.g., machine translation) can be realized only when the lower-level ones (e.g., part-of-speech tagger, syntactic parser) have been successfully built. This functional dependency exists even among the lower-level NLP systems. A morphological analyzer can be an important component of a part-of-speech (POS) tagger, particularly in dealing with unknown words. A POS tagger, which is a system that uses various sources of information to assign possibly unique POSs to words, can in turn provide the input to a syntactic parser. Writers in the area of NLP argue that if the POS tagger is accurate, this method is an excellent one. This thesis can be taken as an attempt to integrate the ideas and outputs of previously attempted Amharic NLP prototypes towards solving a further problem in the NLP of the language, i.e. automatic parsing of Amharic complex sentences. Syntactic parsing underlies most applications in natural language processing. Parsers are already used extensively in a number of disciplines, such as computer science (for compiler construction, database interfaces, artificial intelligence, etc.) and linguistics (for text analysis, corpus analysis, machine translation, etc.). Although there have been some comprehensive studies of Amharic syntax from a linguistic perspective, attempts to investigate it from a computational point of view are very recent.
In this thesis, Amharic word and phrase classes, sentence formalisms, morphological properties peculiar to complex sentence formation in the language, and attempts to extract the features that enable implementation of an automatic Amharic complex sentence parser are presented. The sample data used in this study was taken from references that are widely used in the teaching and learning of the language. This data was manually analyzed, tagged, and parsed, and then used as a corpus to extract the grammar rules and to assign probabilities. Algorithms that can use the morphological, lexical, and syntactic properties of the language were customized and modified. Experiments were conducted using the training set and the test set. The first experiment was conducted on the part-of-speech tagger to assess its performance when morphological analysis is embedded in it. The result of this experiment showed that the tagger attained accuracies of 98.7% and 94% on the training set and the test set, respectively. The experiments on complex sentence parsing showed an accuracy of 89.6% on the training set and 81.6% on the test set prepared for this purpose.
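
The parsing abstracts above mention using a manually parsed corpus "to extract the grammar rules and to assign probabilities". As an illustration only, one common way to do this is relative-frequency estimation: count each rule in the hand-parsed trees and divide by the count of its left-hand side. The sketch below is not taken from the theses; the nested-tuple tree encoding, the function names `productions` and `estimate_pcfg`, the tag names, and the placeholder words `w1`..`w3` are all assumptions made for the example.

```python
from collections import Counter

def productions(tree):
    """Yield (lhs, rhs) rules from a tree encoded as (label, child, ...).

    A preterminal is encoded as (tag, word), where word is a plain string.
    This encoding is an assumption of this sketch.
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield (label, (children[0],))  # lexical rule: tag -> word
        return
    yield (label, tuple(child[0] for child in children))  # phrase rule
    for child in children:
        yield from productions(child)

def estimate_pcfg(trees):
    """Relative-frequency estimate: P(lhs -> rhs) = count(rule) / count(lhs)."""
    rule_counts = Counter(rule for tree in trees for rule in productions(tree))
    lhs_totals = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in rule_counts.items()}

# Hypothetical two-sentence treebank with placeholder words, not real data.
toy_treebank = [
    ("S", ("NP", ("N", "w1")), ("VP", ("V", "w2"))),
    ("S", ("NP", ("ADJ", "w3"), ("N", "w1")), ("VP", ("V", "w2"))),
]

if __name__ == "__main__":
    for (lhs, rhs), p in sorted(estimate_pcfg(toy_treebank).items()):
        print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```

In this toy treebank the two NP expansions each occur once, so each receives probability 0.5, and the probabilities of all rules sharing a left-hand side sum to 1. The first abstract also names the Inside-Outside algorithm, which re-estimates such probabilities iteratively; that step is not shown here.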