Browsing by Author "Assabie, Yaregal (PhD)"
Now showing 1 - 20 of 104
Item: Accessing Databases Using Amharic Natural Language (Addis Ababa University, 2020-10-06)
Legesse, Beniyam; Assabie, Yaregal (PhD)

Nowadays, the day-to-day activities of human beings are highly dependent on information distributed all over the world. One major source of such information, as a collection of related data, is the database. To extract information from a database, it is necessary to formulate a query in Structured Query Language (SQL), which is understood by the database engine. SQL is not known to everyone, as it requires studying and remembering its syntax and semantics; only professionals trained in SQL can formulate queries to access a database. Human beings, on the other hand, communicate with each other using natural language. It would be easier to access the content of a database using natural language, which in turn contributes to the field of natural language interfaces to databases. Since people in many private and public organizations perform their day-to-day activities in Amharic, and many of them are not skilled in formulating SQL queries, it would be better if there were a mechanism by which users could directly extract information from a database using Amharic. This research accepts questions written in Amharic natural language and converts them to equivalent SQL queries. A dataset consisting of input words tagged with the appropriate output variable is prepared. Features that represent the Amharic questions are identified and given to a classifier for training. A stemmer, a morphological analyzer, and a pre-processor prepare the input question in the format required by the classifier. To identify appropriate query elements, the Query Element Identifier uses a dictionary prepared by applying the concept of semantic free grammar. The query constructor then constructs the required SQL query from the identified query elements. A prototype called the Amharic Database Querying System is developed to demonstrate the idea raised by this research, and testers from different departments with different mother tongues evaluated the performance of the system.

Item: Afaan Oromo Named Entity Recognition Using Neural Word Embeddings (Addis Ababa University, 2020-10-26)
Kasu, Mekonini; Assabie, Yaregal (PhD)

Named Entity Recognition (NER) is one of the canonical examples of sequence tagging, assigning a named-entity label to each word in a sequence. The task is important for a wide range of downstream applications in natural language processing. Two previous attempts have been made at Afaan Oromo NER, which automatically identifies and classifies proper names in text into predefined semantic types such as person, location, organization, and miscellaneous; however, that work relied heavily on hand-designed features. We propose a deep neural network architecture for Afaan Oromo NER based on context encoder and decoder models using Bi-directional Long Short-Term Memory (BiLSTM) and Conditional Random Fields (CRF), respectively. In the proposed approach, we first generate neural word embeddings automatically using skip-gram with negative sampling from an unsupervised corpus of 50,284KB. The generated embeddings represent words as semantic vectors, which are further used as input features for the encoder and decoder models. Likewise, character-level representations are generated automatically using a BiLSTM from a supervised corpus of 768KB. Because of the character-level representations, the proposed model is robust to out-of-vocabulary words. For this study, we manually prepared an annotated dataset of 768KB for Afaan Oromo NER and split it into 80% for training, 5% for testing, and 15% for validation: of the 12,963 named entities prepared in total, roughly 10,370 (80%), 648 (5%), and 1,944 (15%) fall in the training, test, and validation sets, respectively. Experimental results show that the combination of BiLSTM-CRF with pre-trained word embeddings, character-level representations, and regularization (dropout) performs better than other models such as a plain BiLSTM or a BiLSTM-CRF with only character-level representations or only word embeddings. Using the BiLSTM-CRF model with pre-trained word embeddings and character-level representations significantly improved Afaan Oromo NER, with an average F-score of 93.26% and accuracy of 98.87%.
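The F-scores reported for the NER work above are computed over entity spans rather than individual tags. As a minimal sketch of that evaluation, assuming BIO-style tag sequences (the exact tag scheme used in the thesis is not stated here), entity-level precision, recall, and F-score can be derived as follows:

```python
def extract_entities(tags):
    """Collect (type, start, end) spans from a BIO tag sequence."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last open span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                entities.append((etype, start, i))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
        elif tag.startswith("I-") and etype != tag[2:]:
            # an I- tag whose type disagrees with the open span starts a new span
            if start is not None:
                entities.append((etype, start, i))
            start, etype = i, tag[2:]
    return set(entities)

def f_score(gold_tags, pred_tags):
    """Entity-level F-score: a prediction counts only if type and span both match."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

For example, predicting only the person entity in a sentence containing a person and a location gives precision 1.0 but recall 0.5, hence an F-score of 2/3.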
Item: Afaan Oromo Text Summarization Using Word Embedding (Addis Ababa University, 2020-11-04)
Tashoma, Lamesa; Assabie, Yaregal (PhD)

Nowadays we are overloaded with information as technology grows, which makes it a problem to identify which information is worth reading. Automatic Text Summarization emerged to solve this problem: a computer program summarizes text by removing redundant information from the input text and producing a shorter, non-redundant output text. This study deals with the development of a generic automatic text summarizer for Afaan Oromo text using word embedding. Language-specific lexicons such as stop words and a stemmer are used to develop the summarizer. Graph-based PageRank is used to select summary-worthy sentences from the document, and cosine similarity is used to measure the similarity between sentences. The data used in this work were collected from both primary and secondary sources. The Afaan Oromo stop-word list, suffixes, and other language-specific lexicons were gathered from previous work on Afaan Oromo. To develop a Word2Vec model, we gathered Afaan Oromo texts from different sources, including the Internet, organizations, and individuals. For validation and testing, 22 newspaper topics were collected; 13 of them were used for validation while the remaining 9 were employed for testing. The system was evaluated in three experimental scenarios, both subjectively and objectively. The subjective evaluation focuses on the structure of the summary, such as informativeness, coherence, referential clarity, non-redundancy, and grammar; the objective evaluation uses precision, recall, and F-measure. The subjective results are 83.33% informativeness, 78.8% referential integrity and grammar, and 76.66% structure and coherence. On the data we gathered, this work achieved 0.527 precision, 0.422 recall, and 0.468 F-measure. When compared with previous works on the same data used in their work, however, the overall performance of the summarizer was better by 0.648 precision, 0.626 recall, and 0.058 F-measure.

Item: Amharic Information Retrieval Using Semantic Vocabulary (Addis Ababa University, 2019-10-02)
Getnet, Berihun; Assabie, Yaregal (PhD)

With the increase in large-scale data available from different sources, users' need for access to information has become a pressing issue these days. Information retrieval means seeking documents relevant to users' queries, but the way queries are provided, and the way the system responds with relevant results, should be improved for better satisfaction. This can be enhanced by expanding the original queries from semantic lexical resources constructed either manually or automatically from a text corpus; manual construction, however, is tedious and time-consuming when the dataset is huge. The way semantic resources are built also affects retrieval performance. In formal semantics, meaning is built in the symbolic tradition and centered around the inferential properties of languages. It is also possible to construct semantic resources automatically from the distribution of words in unstructured data, applying the notion of unsupervised learning that builds semantics from a high-dimensional vector space; this captures contextual similarity via the angular orientation of word vectors. There have been attempts to enhance information retrieval by expanding queries from semantic resources for non-Ethiopian languages. In this study, we propose Amharic information retrieval using a semantic vocabulary. It is realized through components for text preprocessing, word-space modeling, semantic word-sense clustering, document indexing, and searching. After the Amharic documents are preprocessed, the words are vectorized in a multidimensional space using Word2vec, based on the notion that words surrounding another word can be contextually similar. Based on the angular orientation of the word vectors, the semantic vocabulary is constructed using cosine distance. The preprocessed Amharic documents are also indexed for later retrieval. The user then provides queries, and the system expands the original query from the semantic vocabulary; the reformulated queries are searched against the indexed data, returning more relevant documents to the user. A prototype of the system was developed, and we tested its performance using Amharic documents collected from Ethiopian public media. The semantic vocabulary based on word analogy prediction using the cosine metric is promising: compared against a semantic thesaurus constructed with latent semantic analysis, it improves accuracy by 17.2%. Information retrieval using the semantic vocabulary improves recall by 24.3% for ranked retrieval, and a 10.89% recall improvement was obtained for unranked retrieval.
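The query-expansion step described above can be illustrated with a small sketch. Assuming toy word vectors (in practice these would come from the Word2vec model trained on the Amharic corpus), each query term is expanded with its nearest vocabulary words by cosine similarity; the function names and the threshold parameter here are illustrative, not from the thesis:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_query(query_terms, vectors, k=2, threshold=0.7):
    """Expand a query with the k nearest vocabulary words above a similarity threshold."""
    expanded = list(query_terms)
    for term in query_terms:
        if term not in vectors:
            continue  # out-of-vocabulary terms pass through unexpanded
        scored = sorted(
            ((cosine(vectors[term], vec), word)
             for word, vec in vectors.items() if word != term),
            reverse=True)
        expanded += [w for s, w in scored[:k] if s >= threshold and w not in expanded]
    return expanded
```

With vectors placing "school" and "university" in nearly the same direction and "river" elsewhere, expanding the query ["school"] adds "university" but not "river"; the reformulated term list is then what gets searched against the index.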
Item: Amharic Open Information Extraction (Addis Ababa University, 2020-03-03)
Girma, Seble; Assabie, Yaregal (PhD)

Open Information Extraction (OIE) is the process of discovering domain-independent relations by providing ways to extract unrestricted relational information from natural language text. It has recently received increased attention and has been applied extensively to various downstream applications such as text summarization, question answering, and information retrieval. Although many OIE systems have been developed for various natural language texts, no research has yet been conducted on Amharic Open Information Extraction (AOIE). As the literature has shown, rule-based approaches operating on deeply parsed sentences yield the most promising results for OIE systems; however, to the best of our knowledge, there is no fully implemented deep syntactic parser available for Amharic. Therefore, in this thesis, we propose the development of a rule-based AOIE system that utilizes shallow-parsed sentences. The proposed system has six components: Preprocessing, Morphological Analysis, Phrasal Chunking, Sentence Simplification, Relation Extraction, and Post-processing. In Preprocessing, each word in the input text is labeled with an appropriate POS tag, and well-formed, informative sentences are filtered out for further processing based on the POS tags of their words. The Morphological Analysis component produces morphological information about each word of the input sentences. The Phrasal Chunking component divides the input sentence into non-overlapping phrases based on the POS and morphological tags of words. The Sentence Simplification component segments the sentence into a number of self-contained simple sentences that are easier to process. In Relation Extraction, relation instances are extracted from the simplified sentences, and finally the Post-processing component prints the extracted relations in N-ary format. The proposed method and algorithms were implemented in prototype software and evaluated with a dataset from different domains; in the evaluation, the system achieved an overall precision of 0.88.
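The Relation Extraction step above pulls relation instances out of chunked simple sentences. As a minimal sketch, assuming a hypothetical chunk representation of (label, text) pairs and Amharic's subject-object-verb order (the thesis's actual rules and chunk labels are not reproduced here), a triple can be read off a simple sentence like this:

```python
def extract_relation(chunks):
    """From (label, text) chunks of one simple SOV sentence, take the first NP
    as subject, the final VP as the relation, and remaining NPs as arguments."""
    nps = [text for label, text in chunks if label == "NP"]
    vps = [text for label, text in chunks if label == "VP"]
    if not nps or not vps:
        return None  # no extractable relation in this sentence
    return (nps[0], vps[-1], nps[1:])
```

For a chunked sentence glossed as [NP "the farmer"] [NP "maize"] [VP "planted"], this yields the N-ary tuple ("the farmer", "planted", ["maize"]); a rule-based system would apply many such patterns rather than this single one.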
Item: Amharic Question Classification System Using Deep Learning Approach (Addis Ababa University, 2021-04-14)
Habtamu, Saron; Assabie, Yaregal (PhD)

Questions are used in different applications such as Question Answering (QA), Dialog Systems (DS), and Information Retrieval (IR). However, some questions may be too complex to analyze and process, so systems are expected to have good feature extraction and analysis mechanisms to understand these questions linguistically. Retrieval of wrong answers, inaccuracy of IR, and crowding of the search space with irrelevant candidate answers are some of the problems caused by the inability to appropriately process and analyze questions. Question Classification (QC) aims to solve this by extracting the relevant features from questions and assigning them to the correct class category. Even though QC has been studied for various languages, it has hardly been studied for Amharic. This research studies Amharic QC (AQC), focusing on designing a hierarchical question taxonomy, preparing an Amharic question dataset by labeling sample questions with their respective classes, and implementing an AQC model using a Convolutional Neural Network (CNN), a deep learning approach. The AQC uses a multilabel question taxonomy that integrates coarse-grained and fine-grained categories; this multilabel scheme enables more accurate answer retrieval than a flat taxonomy. We constructed the taxonomy by analyzing our Amharic question dataset and by adapting standard taxonomies from previous studies. We prepared the Amharic questions in three forms: surface, stemmed, and lemmatized. We trained and tested these datasets using a word vectorizer trained on surface words, noting that most interrogative words appear similar even when stemmed or lemmatized. We achieved 97% training and 90% validation accuracy for the surface-form questions, and 40% for the stemmed questions. The word2vec model could not represent the lemmatized questions appropriately, so no results were obtained during training for that form. We also tried extracting features from the questions using different filters separately, which gave an accuracy of 86% while requiring an increasing number of training epochs.

Item: Amharic Sentence Generation from Interlingua Representation (Addis Ababa University, 2016-12-27)
Yitbarek, Kibrewossen; Assabie, Yaregal (PhD)

Sentence generation is a part of Natural Language Generation (NLG), the process of deliberately constructing a natural language text in order to meet specified communicative goals. The major requirement of sentence generation in a natural language is producing full, clear, meaningful, and grammatically correct sentences. A sentence can be generated from different possible sources, including a representation that does not depend on any human language: an Interlingua. Generating a sentence from an Interlingua representation has numerous advantages: since the representation is unambiguous, universal, and independent of both the source and target languages, the generation can be target-language-specific, and likewise the analysis. Among the different Interlinguas, Universal Networking Language (UNL) is commonly chosen in view of its various advantages over the others. Various works have generated sentences from UNL expressions for different languages of the world, but to the best of our knowledge no such work has been done for Amharic. In this thesis, we present an Amharic sentence generator that automatically generates an Amharic sentence from a given input UNL expression. The generator accepts a UNL expression as input and parses it to build a node-net. The parsed UNL expressions are stored in a data structure that can be easily modified in the successive processes. A UNL-to-Amharic word dictionary is also prepared, containing the root forms of Amharic words. The Amharic equivalent root word and the attributes of the nodes in a parsed UNL expression are fetched from the dictionary to update the head word and attributes of the corresponding node. The translated Amharic root words are then locally reordered and marked based on Amharic grammar rules. When the nodes are ready for morphology generation, the proposed system makes use of Amharic morphology datasets to handle the generation of noun, adjective, pronoun, and verb morphology. Finally, function words are inserted into the morphed words so that the output matches a natural language sentence. The proposed system was evaluated on a dataset of 142 UNL expressions. Subjective tests of adequacy and fluency were performed, and a quantitative error analysis was performed by calculating the Word Error Rate (WER). From this analysis, the proposed system generates 71.4% sentences that are intelligible and 67.8% sentences that are faithful to the original UNL expression. The system achieved a fluency score of 3.0 and an adequacy score of 2.9 (each on a 4-point scale), with a word error rate of 28.94%. These scores can be improved further by improving the rule base and the lexicon.
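The generator's first step above parses a UNL expression into relations between nodes. As a rough sketch of that parsing, assuming a simplified relation syntax of the form `rel(head, tail)` with attributes attached to words by `@` (real UNL expressions carry more structure, such as scopes and universal-word restrictions, which this toy parser ignores):

```python
import re

# one relation per match: name, then two comma-separated arguments
UNL_RELATION = re.compile(r"(\w+)\(\s*([^,]+?)\s*,\s*([^)]+?)\s*\)")

def parse_unl(expression):
    """Parse relation lines like 'agt(eat.@past, boy)' into (relation, head, tail)
    triples, the raw material for building a node-net."""
    return [(rel, head, tail) for rel, head, tail in UNL_RELATION.findall(expression)]
```

Triples such as ("agt", "eat.@past", "boy") would then be linked into a node-net keyed on shared head words, after which the dictionary lookup, reordering, and morphology generation stages operate on the nodes.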
A sentence can be generated from different possible sources, including a representation which does not depend in any human languages, which is an Interlingua. Generating a sentence from an Interlingua representation has numerous advantages. Since Interlingua representation is unambiguous, universal and independent of both the source language and the target language, the generation should be target language-specific, and likewise should be the analysis. Among the different Interlinguas’, Universal Networking Language (UNL) is commonly chosen in view of various advantages over the other ones. Various works have been done so far for different languages of the world to generate sentences from UNL expression but to the best of our knowledge there are no works done so far for Amharic language. In this thesis, we present Amharic sentence generator that automatically generates Amharic sentence from a given input UNL expression. The generator accepts a UNL expression as an input and parses to build a node-net from the input UNL expression. The parsed UNL expressions are stored in a data structure which could be easily modified in the successive processes. UNL-to-Amharic word dictionary is also prepared and it contains the root form of Amharic words. The Amharic equivalent root word and attributes of nodes in a parsed UNL expression will be fetched from the dictionary to update the head word and attributes of the corresponding node. Then, the translated Amharic root words will be locally reordered and marked based on the Amharic grammar rules. When the nodes are ready for generation of morphology, the proposed system makes use of Amharic morphology data sets to handle the generation of noun, adjective, pronoun, and verb morphology. Finally, the function words are inserted to the morphed words so that the output matches with a natural language sentence. The evaluation of the proposed system has been performed on dataset of 142 UNL expressions. 
Subjective tests, namely adequacy and fluency tests, have been performed on the proposed system. A quantitative error analysis has also been performed by calculating the Word Error Rate (WER). From this analysis, it has been observed that the proposed system generates 71.4% intelligible sentences and 67.8% sentences that are faithful to the original UNL expression. The system achieved a fluency score of 3.0 and an adequacy score of 2.9 (both on a 4-point scale). Furthermore, the proposed system has a word error rate of 28.94%. These scores can be improved further by improving the rule base and lexicon.
Item Amharic Word Sense Disambiguation Using wordnet(Addis Ababa University, 2015-03) Hassen, Segid; Assabie, Yaregal (PhD)Words can have more than one distinct meaning, and many words can be interpreted in multiple ways depending on the context in which they occur. Automatically identifying the meaning of a polysemous word in a sentence is a fundamental task in Natural Language Processing (NLP), and this ambiguity poses challenges to NLP systems. There have been many efforts on word sense disambiguation for English; however, the effort for Amharic has been very limited. Many NLP applications, such as Machine Translation, Information Retrieval, Question Answering, and Information Extraction, require this task, which occurs at the semantic level. In this thesis, a knowledge-based word sense disambiguation (WSD) method that employs Amharic WordNet is developed. Knowledge-based Amharic WSD extracts knowledge from word definitions and from relations among words and senses. The proposed system consists of preprocessing, morphological analysis, and disambiguation components, besides the Amharic WordNet database.
Preprocessing prepares the input sentence for morphological analysis, and morphological analysis reduces the various forms of a word to a single root or stem. Amharic WordNet contains words along with their different meanings, synsets, and the semantic relations between concepts. Finally, the disambiguation component identifies the ambiguous words in a sentence and assigns each its appropriate sense using Amharic WordNet, based on sense overlap and related words. We have evaluated the knowledge-based Amharic WSD system by conducting two experiments: the first evaluates the effect of Amharic WordNet with and without a morphological analyzer, and the second determines an optimal window size for Amharic WSD. With and without the morphological analyzer, we achieved accuracies of 57.5% and 80%, respectively. In the second experiment, we found that a two-word window on each side of the ambiguous word is sufficient for Amharic WSD. The test results show that the proposed WSD method performs better than previous Amharic WSD methods.
Keywords: Natural Language Processing, Amharic WordNet, Word Sense Disambiguation, Knowledge Based Approach, Lesk Algorithm
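The sense-overlap disambiguation summarized in this abstract is essentially the Lesk algorithm listed in the keywords. A minimal sketch of Lesk-style overlap scoring, with hypothetical English senses and glosses standing in for Amharic WordNet entries:

```python
# Minimal Lesk-style sense overlap: pick the sense whose gloss shares
# the most words with the context window around the ambiguous word.
# The senses and glosses below are hypothetical illustrations, not
# entries from the actual Amharic WordNet.

def lesk(context_words, senses):
    """senses: mapping of sense id -> list of gloss words."""
    context = set(context_words)
    # Score each sense by the size of its gloss/context overlap.
    scores = {sid: len(context & set(gloss)) for sid, gloss in senses.items()}
    return max(scores, key=scores.get)

senses = {
    "bank.finance": ["money", "deposit", "loan", "account"],
    "bank.river": ["river", "slope", "water", "edge"],
}
best = lesk(["deposit", "money", "interest"], senses)
print(best)  # bank.finance
```

With the two-word window found optimal above, `context_words` would hold the two tokens on each side of the ambiguous word after stemming.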
Item Amharic Wordnet Construction Using Word Embedding(Addis Ababa University, 2020-05-29) Getaneh, Mulat; Assabie, Yaregal (PhD)A large amount of data is produced on the web and made available in online data portals. People always need to access, analyze, and organize these data easily. To do so effectively, there must be automated systems that can understand human language as it is spoken, which is possible with natural language processing applications. However, many such applications, including sentiment analysis, information retrieval, question answering, and word sense disambiguation, use WordNet as a resource. Some applications like information retrieval can rely on an electronic thesaurus or dictionary, but the coverage of such resources is limited; WordNet solves this problem and serves as a resource for many other natural language processing applications. A WordNet resource can be constructed from text data using manual, semi-automated, or fully automated methods. The manual method is time-consuming, and semi-automated methods are not effective since the resource includes different relations over a large dataset; using these methods is therefore tiresome. Semi-automated and automated methods can be effective for languages that have sufficient resources such as a thesaurus, a bilingual dictionary, a monolingual text corpus, or an effective machine translator. Automatically constructing a WordNet resource from unlabeled text data is thus the best option for resource-limited languages like Amharic. In this study, we propose automatic Amharic WordNet construction using word embedding. The proposed model includes different tasks.
The first task is text pre-processing, which consists of the pre-processing steps commonly used in many natural language processing applications. We pre-process an Amharic text document and train on it using the gensim word embedding library (word2vec) to generate a word embedding model. The embedding result provides contextually similar words for every word in the training set, and most contextually similar words belong to some relation r. The trained word vector model captures different patterns: after training, we take the model as input and discover the patterns used to extract WordNet relations such as hypernymy/hyponymy, synonymy, and antonymy. Conceptual synonyms of a word are extracted based on cosine similarity. We use an additional distant supervision method to extract near-synonym relations (meanings that exist in a dictionary); for this method, we perform feature extraction based on given sample seed words (synonym pairs). On the other hand, we extract the hypernym/hyponym relation from the trained model by taking advantage of mutual information and measuring similarity (based on cosine distance), whereas antonym relations are extracted from the trained word2vec model based on word analogy. Common evaluation metrics such as recall and precision were used to measure the performance of the proposed model. An Amharic WordNet prototype was developed and used to test the system on the collected Amharic text document. Finally, this study shows a recall of 78.3% and a precision of 53.9%. We also evaluate using Spearman’s correlation and achieve a correlation coefficient of +0.79.
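The cosine-similarity ranking used for synonym extraction can be sketched in plain Python; the three-dimensional vectors below are toy stand-ins for trained word2vec embeddings (real gensim models produce vectors of 100+ dimensions), and the vocabulary is hypothetical:

```python
import math

# Cosine similarity between word vectors, as used to rank candidate
# synonyms. The tiny 3-d vectors are illustrative stand-ins for a
# trained word2vec model, not real Amharic embeddings.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

vectors = {
    "house": [0.9, 0.1, 0.2],
    "home": [0.85, 0.15, 0.25],
    "river": [0.1, 0.9, 0.3],
}

def most_similar(word, vectors, topn=1):
    # Rank every other word by cosine similarity to the query word.
    others = [(w, cosine(vectors[word], v)) for w, v in vectors.items() if w != word]
    return sorted(others, key=lambda p: p[1], reverse=True)[:topn]

print(most_similar("house", vectors))  # "home" ranks above "river"
```

In gensim this ranking is what `model.wv.most_similar(word)` returns for a trained word2vec model.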
Item Amharic-to-Tigrigna Machine Translation Using Hybrid Approach(Addis Ababa University, 10/7/2017) Gebremariam, Akubazgi; Assabie, Yaregal (PhD)Machine Translation is an application of Natural Language Processing that studies the use of computer software to translate one natural language into another in the form of text or speech. Human translators tend to be slower than machines, and it can be hard to get a precise translation that reveals what a text is about without everything being translated word-by-word. In addition, it can be more important to get the result without delay, which is hard to accomplish with a human translator.
Human translation also incurs unwanted expenses in time and cost. Thus, this research develops an Amharic-to-Tigrigna machine translation system using a hybrid approach, i.e., a combination of rule-based and statistical approaches. Although Amharic and Tigrigna belong to the same language family and use a similar sentence structure, they differ in how various types of phrases are constructed. Therefore, the study proposes a syntactic reordering approach that aligns the word order of the source sentence to be more similar to that of the target sentence. Reordering rules are developed that cover both simple and complex Amharic sentences whose word order differs from the Tigrigna one. To the best of the researcher's knowledge, no prior work has been conducted on machine translation between Amharic and Tigrigna. To achieve the objective of the study, a corpus is collected from different domains, prepared in a format suitable for the development process, and split into a training set and a test set. Reordering rules are applied to both sets in a pre-processing step. A single language model is developed, since the system is unidirectional (Amharic-to-Tigrigna). A translation model, which assigns a probability that a given source language sentence generates a target language sentence, is built, and a decoder that searches for the translation with the best probability is used. Two major experiments are conducted and their results recorded: the first, using the statistical approach alone, obtains a BLEU score of 7.02%; the second, using the hybrid approach, obtains a BLEU score of 17.47%.
From these results, it can be concluded that the hybrid approach is better than the statistical approach for Amharic-to-Tigrigna machine translation.
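The BLEU scores reported above (7.02% vs. 17.47%) come from comparing system output against reference translations. A simplified sentence-level sketch using only clipped unigram precision and a brevity penalty (full BLEU averages modified 1- to 4-gram precisions and is typically computed with a toolkit such as NLTK or sacrebleu); the tokens are English placeholders, not Tigrigna:

```python
import math
from collections import Counter

# Simplified sentence-level BLEU-1: clipped unigram precision times a
# brevity penalty. Real BLEU combines modified 1- to 4-gram precisions;
# the token lists are placeholders, not real system output.

def bleu1(candidate, reference):
    cand, ref = Counter(candidate), Counter(reference)
    # Clip each candidate word count by its count in the reference.
    clipped = sum(min(c, ref[w]) for w, c in cand.items())
    precision = clipped / len(candidate)
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * precision

ref = ["the", "hybrid", "system", "translates", "better"]
hyp = ["the", "hybrid", "system", "is", "better"]
print(round(bleu1(hyp, ref), 2))  # 0.8
```

A corpus-level score like the 17.47% above would aggregate clipped counts and lengths over all test sentences before taking the ratio.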