Computer Science
Browsing Computer Science by Title
Now showing 1 - 20 of 363
Item Accessing Databases Using Amharic Natural Language (Addis Ababa University, 10/6/2020) Legesse, Beniyam; Assabie, Yaregal (PhD)

Nowadays, the day-to-day activities of human beings are highly dependent on information distributed in every part of the world. One major source of this information, which is a collection of related data, is the database. To extract information from a database, it is necessary to formulate a structured query language (SQL) statement that is understood by the database engine. SQL is not known by everyone, as it requires studying and remembering its syntax and semantics; only professionals who have studied SQL can formulate queries to access the database. Human beings, on the other hand, communicate with each other using natural language. It would be easier to access the content of the database using that natural language, which in turn contributes to the field of natural language interfaces to databases. Since in many private and public organizations people perform their day-to-day activities in the Amharic language, and many of them are not skilled in formulating structured queries, it would be better if there were a mechanism by which users could directly extract information from the database using Amharic. This research accepts questions written in Amharic natural language and converts them to the equivalent structured query language. A dataset consisting of input words tagged with the appropriate output variable is prepared. Features which represent the Amharic questions are identified and given to the classifier for training purposes. A stemmer, morphological analyzer, and pre-processor prepare the input question in the format required by the classifier. To identify appropriate query elements, the Query Element Identifier uses a dictionary prepared by applying the concept of semantic free grammar.
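As a rough illustration of the query construction step (a hypothetical sketch, not the thesis's implementation; the table and column names are invented):

```python
# Hypothetical sketch: assembling a SQL statement from query elements
# identified by a classifier (table, columns, filter condition).
def build_select(elements):
    cols = ", ".join(elements.get("columns", ["*"]))
    sql = f"SELECT {cols} FROM {elements['table']}"
    if "condition" in elements:
        col, op, val = elements["condition"]
        sql += f" WHERE {col} {op} '{val}'"
    return sql

# An Amharic question about an employee's salary might yield elements like:
elements = {"table": "employee",
            "columns": ["salary"],
            "condition": ("name", "=", "Abebe")}
print(build_select(elements))  # SELECT salary FROM employee WHERE name = 'Abebe'
```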
The query constructor constructs the required SQL query using these identified query elements. A prototype called the Amharic Database Querying System is developed to demonstrate the idea raised by this research. Testers from different departments with different mother tongues tested the performance of the system.

Item Afaan Oromo Automatic News Text Summarizer Based on Sentence Selection Function (Addis Ababa University, 2013-11) Berhanu, Fiseha; Hailemariam, Sebsibe (PhD)

The existence of the World Wide Web and advancements in digital devices have caused an information explosion. Readers are overloaded with lengthy text where a shorter version would suffice. This abundance of information requires efficient tools to handle it. An automatic text summarizer is one of the various tools used to shorten lengthy documents and alleviate this type of problem. This work focuses on developing an efficient extractive Afaan Oromo automatic news text summarizer through systematic integration of features: sentence position, keyword frequency, cue phrases, a sentence length handler, and the occurrence of numbers and events such as times, dates and months in sentences. The data that aid system development, such as abbreviations, synonyms, stop words, suffixes, numbers, and names of times, dates and months, were collected from both secondary and primary sources. In addition, 350 English cue phrases were collected and translated into 729 Afaan Oromo cue phrases. For validation and testing, 33 different newspaper topics were collected; of these, 20 were used for validation while the remaining 13 were employed for testing. The total number of respondents who participated in preparing the validation and testing data corpus was 110. Besides, the open-source C# version of Open Text Summarizer was selected as a tool to develop the system. The system has been evaluated based on seven experimental scenarios, and evaluation was made both subjectively and objectively.
The subjective evaluation focuses on the structure of the summary, such as referential integrity and non-redundancy, coherence and informativeness. The objective evaluation uses metrics like precision, recall and F-measure. The result of the subjective evaluation is 88% informativeness, 75% referential integrity and non-redundancy, and 68% coherence. Because of the added features and the different techniques and experiments applied in this work, the system achieved an F-measure of 87.47%, outperforming the previous work by 26.95%. Keywords: Afaan Oromo, Automatic news text summarizer, Cue Phrase, Sentence Selection Function

Item Afaan Oromo List, Definition and Description Question Answering System (Addis Ababa University, 4/14/2016) Fita, Chaltu; Midekso, Dida (PhD)

Information is very important in our day-to-day activities. Technology plays an important role in satisfying human beings' information needs through the use of the Internet, where people ask questions and a system provides an answer to their query. In search engines, for instance, a user submits a query and the search engine displays links to relevant web pages for each issued query. QA systems emerged as the best solution for getting the required information to the user with the help of information extraction techniques. QA systems have been developed for English, Amharic, Afaan Oromo and other languages. The existing Afaan Oromo QAS was developed for answering factoid-type questions where the answer is a named entity. In this thesis, a QAS is developed for answering list, definition and description questions, which deal with more complex information needs. Document preprocessing, question analysis, document selection and answer extraction are the components used for developing the QAS. Tokenization, case normalization, short word expansion, stop word removal, stemming, lemmatization and indexing are the tasks of pre-processing. Question classification is done using a rule-based approach.
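Such a rule-based classifier can be sketched as follows (a minimal illustration; the keyword patterns below are simplified stand-ins, not the thesis's actual rules):

```python
import re

# Simplified sketch of rule-based question classification: each rule is a
# (pattern, class) pair; the first matching pattern decides the class.
RULES = [
    (re.compile(r"\bjechuun\b"), "definition"),    # "... jechuun ...?" style
    (re.compile(r"\btarreessi\b"), "list"),        # imperative "list ..."
    (re.compile(r"\bibsi\b"), "description"),      # imperative "describe ..."
]

def classify(question):
    for pattern, qclass in RULES:
        if pattern.search(question):
            return qclass
    return "unknown"
```

A question containing none of the clue patterns falls through to "unknown", which in a full system would trigger a default handler.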
The subcomponents of document selection are document retrieval, used for retrieving relevant documents, and document analysis, used for filtering the retrieved documents. The answer extraction component has a sentence tokenizer for tokenizing sentences retrieved from document analysis, and independent subcomponents for definition-description (DDAE) and list (LAE) answers. The DDAE contains a sentence extractor for extracting sentences from the sentence tokenizer; its answer selection algorithm selects the top 6 sentences from the scored and ranked sentences, and finally a sentence ordering algorithm orders the sentences. The LAE contains candidate answer extraction, which extracts answers through rules and gazetteers, and answer selection. The system is tested using evaluation metrics. We used a percentage ratio for evaluating question classification, which classified 98% of questions correctly. The performance of document selection and answer extraction is tested using precision, recall and F-score. The document selection component scored an F-score of 0.767. Finally, the answer extraction component is evaluated with an average F-score of 0.653. Keywords: Afaan Oromo List, Definitional and Descriptional Question Answering, Rule Based Question Classification, Document Filtering, Sentence Extraction, Answer Selection

Item Afaan Oromo Named Entity Recognition Using Hybrid Approach (Addis Ababa University, 2015-03) Sani, Abdi; Midekso, Dida (PhD)

Named Entity Recognition and Classification (NERC) is an essential and challenging task in Natural Language Processing (NLP), particularly for resource-scarce languages like Afaan Oromo (AO). It seeks to classify words which represent names in text into predefined categories like person name, location, organization, date, time, etc. Thus, this paper deals with some attempts in this direction.
Researchers have mostly applied machine learning to Afaan Oromo Named Entity Recognition (AONER), while none have used hand-crafted rules or a hybrid approach for the Named Entity Recognition (NER) task. This thesis work deals with an AONER system using a hybrid approach, which contains machine learning (ML) and rule-based components. The rule-based component has parsing, filtering, grammar rules, whitelist gazetteers, blacklist gazetteers and exact matching components. The ML component has ML model and classifier components. We used the General Architecture for Text Engineering (GATE) Developer tool for the rule-based component and Weka for the ML part. Using the algorithms and rules we developed, we identified Named Entities (NEs) from Afaan Oromo texts, such as names of persons, organizations, locations, and miscellaneous. Feature selection and rules are important factors in the recognition of Afaan Oromo Named Entities. Various rules have been developed, such as a prefix rule, suffix rule, clue word rule, context rule, and first name and last name rule. We used an AONER corpus of size 27,588, which was developed by Mandefro [1]. From this corpus we used 23,000 for training and 4,588 for testing, and obtained an average result of 84.12% Precision, 81.21% Recall and 82.52% F-Score. Keywords: Named Entity Recognition, Named Entities, GATE Developer, Weka, Afaan Oromo

Item Afaan Oromo Named Entity Recognition Using Neural Word Embeddings (Addis Ababa University, 10/26/2020) Kasu, Mekonini; Assabie, Yaregal (PhD)

Named Entity Recognition (NER) is one of the canonical examples of sequence tagging, assigning a named entity label to each word in a sequence. This task is important for a wide range of downstream applications in natural language processing. Two attempts have been conducted for Afaan Oromo NER that automatically identify and classify the proper names in text into predefined semantic types like person, location, organization and miscellaneous.
However, their work relied heavily on hand-designed features. We propose a deep neural network architecture for Afaan Oromo Named Entity Recognition based on context encoder and decoder models, using Bi-directional Long Short Term Memory (BiLSTM) and Conditional Random Fields (CRF) respectively. In the proposed approach, we initially generated neural word embeddings automatically using skip-gram with negative sampling from an unsupervised corpus of size 50,284KB. The generated word embeddings represent words as semantic vectors, which are further used as input features for the encoder and decoder models. Likewise, character-level representations are generated automatically using BiLSTM from the supervised corpus of size 768KB. Because of the use of character-level representations, the proposed model is robust to out-of-vocabulary words. In this study, we manually prepared an annotated dataset of size 768KB for Afaan Oromo Named Entity Recognition. We split this dataset into 80% for training, 5% for testing and 15% for validation. We prepared a total of 12,963 named entities; under this split, approximately 10,370 (80%), 648 (5%) and 1,944 (15%) fall into the respective sets. Experimental results show that the combination of BiLSTM-CRF with pre-trained word embeddings, character-level representations and regularization techniques (dropout) performs better than other models such as Bi-LSTM, or BiLSTM-CRF with only character-level representations or word embeddings. Using the Bi-LSTM-CRF model with pre-trained word embeddings and character-level representations significantly improved Afaan Oromo Named Entity Recognition, with an average of 93.26% F-Score and 98.87% accuracy.

Item Afaan Oromo Search Engine (Addis Ababa University, 2010-11) Guta, Tesfaye; Midekso, Dida (PhD)

The Web is a repository of a huge amount of information, among other sources of information used in the day-to-day activities of human beings.
Moreover, this information may be presented in different languages. Retrieving information from the Web requires the presence of search engines. There are general-purpose search engines like Google, Yahoo, and MSN. These general-purpose search engines are mainly designed for the English language. Their shortcomings are reflected when they are applied to non-English languages such as Afaan Oromo, as they lack the specific characteristics of such languages. This research work produced the design and a prototype of a search engine for Afaan Oromo texts. The search engine mainly consists of three components – crawler, indexer, and query engine – that are optimized for Afaan Oromo. The crawler downloads documents, and filtering of these documents for Afaan Oromo is done by the categorizer subcomponent of the crawler. Next, documents identified as Afaan Oromo are preprocessed and stored in an index for later retrieval. Finally, queries supplied through an interface to the query engine component are preprocessed and checked for a match in the index, and matched documents are displayed through an interface in ranked order. Performance evaluation of the search engine is conducted using a selected set of documents and queries. According to the precision-recall measures employed, 76% precision on the top 10 results and an average precision of 93% are obtained. Experiments on some specific features of the language against the design requirements are also made. Key words: Information Retrieval, Search Engine, Categorizer, Afaan Oromo

Item Afaan Oromo Text Summarization Using Word Embedding (Addis Ababa University, 11/4/2020) Tashoma, Lamesa; Assabie, Yaregal (PhD)

Nowadays we are overloaded by information as technology grows. This causes a problem in identifying which information is worth reading. To solve this problem, Automatic Text Summarization has emerged.
It is a computer program that summarizes text by removing redundant information from the input text and produces a shorter, non-redundant output text. This study deals with the development of a generic automatic text summarizer for Afaan Oromo text using word embedding. Language-specific lexicons like stop words and a stemmer are used to develop the summarizer. Graph-based PageRank is used to select summary-worthy sentences from the document, and cosine similarity is used to measure the similarity between sentences. The data used in this work were collected from both secondary and primary sources. The Afaan Oromo stop word list, suffixes and other language-specific lexicons were gathered from previous works on Afaan Oromo. To develop a Word2Vec model we gathered various Afaan Oromo texts from different sources such as the Internet, organizations and individuals. For validation and testing, 22 different newspaper topics were collected; of these, 13 were used for validation while the remaining 9 were employed for testing. The system has been evaluated based on three experimental scenarios, and evaluation was made both subjectively and objectively. The subjective evaluation focuses on the structure of the summary, such as informativeness, coherence, referential clarity, non-redundancy and grammar. In the objective evaluation we used metrics like precision, recall and F-measure. The result of the subjective evaluation is 83.33% informativeness, 78.8% referential integrity and grammar, and 76.66% structure and coherence. This work also achieved 0.527 precision, 0.422 recall and 0.468 F-measure using the data we gathered.
However, when compared with previous works on the same data used in their work, the overall summarizer achieved 0.648 precision and 0.626 recall, outperforming them by 0.058 F-measure.

Item Afaan Oromo Word Sense Disambiguation Using Wordnet (Addis Ababa University, 11/2/2017) Tesfaye, Birhane; Assabie, Yaregal (PhD)

All human languages have words that can mean different things in different contexts. In the natural language processing community, Word Sense Disambiguation (WSD) has been described as the task of selecting the appropriate meaning (sense) of a given word in a text or discourse, where this meaning is distinguishable from other senses potentially attributable to that word. One of the several approaches proposed in the past is Michael Lesk's 1986 algorithm. This algorithm is based on two assumptions: first, when two words are used in close proximity in a sentence, they must be talking of a related topic; and second, if one sense each of the two words can be used to talk of the same topic, then their dictionary definitions must use some common words. For example, when the words "pine cone" occur together, they are talking of "evergreen trees", and indeed one meaning each of these two words has the words "evergreen" and "tree" in its definition. Thus we can disambiguate neighboring words in a sentence by comparing their definitions and picking those senses whose definitions have the largest number of common words. The main drawback of this algorithm is that dictionary definitions are often very short and just do not have enough words for the algorithm to work well. To overcome this problem, Satanjeev Banerjee (2002) adapted the Lesk algorithm to the semantically organized lexical database called WordNet. Besides storing words and their meanings like a normal dictionary, WordNet also "connects" related words together.
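The gloss-overlap idea behind the Lesk algorithm described above can be sketched as follows (a simplified illustration with toy English glosses, not the thesis's implementation):

```python
def lesk(word, context_words, glosses):
    """Pick the sense whose gloss shares the most words with the glosses
    of the neighboring words (simplified Lesk overlap)."""
    context = set()
    for w in context_words:
        for gloss in glosses.get(w, {}).values():
            context |= set(gloss.split())
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses[word].items():
        overlap = len(set(gloss.split()) & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Toy glosses echoing the "pine cone" example in the text.
glosses = {
    "pine": {"tree": "an evergreen tree with needles",
             "grieve": "to waste away from longing"},
    "cone": {"shape": "a solid with a circular base",
             "fruit": "the fruit of an evergreen tree"},
}
print(lesk("pine", ["cone"], glosses))  # tree
```

The "tree" sense wins because its gloss shares "evergreen" and "tree" (among others) with the glosses of "cone", exactly as the example in the abstract describes.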
To this end, we have developed a WSD system that identifies the sense of an ambiguous Afaan Oromo word by using information from an Afaan Oromo WordNet. The system identifies the sense by checking different types of sense relationships between words that help to identify the sense of a word. The conventional WordNet organizes nouns, verbs, adjectives and adverbs into sets of synonyms called synsets, each expressing a different concept. In contrast to the structure of the conventional WordNet, we used a clue-word-based model of WordNet. The related words for each sense of a polysemous word are referred to as clue words. These clue words are used to disambiguate the correct meaning of the polysemous word in the given context using knowledge-based WSD algorithms. A clue word can be a noun, verb, adjective or adverb, which addresses a limitation of the English WordNet, which has a limited number of cross-POS relations (relations not confined to a single part of speech). The performance of the system is tested using 50 randomly selected polysemous Afaan Oromo words. The WSD based on the clue-word-based WordNet achieved 92%.

Item Aflatoxins, Heavy Metals, and Safety Issues in Dairy Feeds, Milk and Water In Some Selected Areas of Ethiopia (Addis Ababa University, 2/3/2018) Mesfin, Rehrahie; Assefa, Fassil (PhD)

The production of wholesome milk is controlled by the quality and safety of the feed supply. Aflatoxins and heavy metals are some of the major factors that affect the quality of feeds and water sources; they are transferred to and eventually bio-accumulate in livestock species and humans via meat, milk and milk products.
Monitoring dairy production inputs using technical tools, and gathering appropriate information on the perception, experience and indigenous knowledge of stakeholders along the feed and milk chains, are relevant in assessing how the processing, storage and distribution of feeds and water sources affect the safety of milk and milk products. The objective of this study was to determine aflatoxin B1 (AFB1) in feeds and aflatoxin M1 (AFM1) in milk, and the heavy metals cadmium (Cd), lead (Pb), arsenic (As) and chromium (Cr) in feed, water, and milk samples from West Shoa, East Shoa and Hawassa, Ethiopia. A total of 205 samples consisting of 115 concentrate feeds, 45 roughage feeds and 45 milk samples were collected for the detection and quantification of aflatoxin using Enzyme-linked Immunosorbent Assay (ELISA). A total of 90 samples (30 feed, 30 water and 30 milk) were collected for determination of heavy metals using Graphite Furnace Atomic Absorption Spectrophotometry (GFAAS). Stakeholders' perception and experience of handling feeds and water sources were evaluated by interviewing peri-urban farmers, feed processors, feed retailers and urban dairy producers using semi-structured questionnaires and field observations. The results showed that half of the feed samples (81) were free from aflatoxin, and the remaining 79 samples were within the EU standard of 5 μg/kg and the USA standard of 20 μg/kg. The pattern of aflatoxin contamination showed that concentrate feeds were more contaminated (7.67 ± 0.80 μg/kg) than roughage feeds (0.41 ± 0.14 μg/kg); hay (0.72 ± 0.25 μg/kg) was more contaminated than straw (0.05 ± 0.05 μg/kg); and oilseed-cake-based concentrate feeds were more contaminated (13.09 ± 1.12 μg/kg) than concentrate feeds without oilseed cake (2.78 ± 0.66 μg/kg). The average AFB1 of feeds in Bishoftu (9.76 μg/kg) was significantly higher (p<0.05) than at the sampling sites in Holetta (6.33 μg/kg) and Hawassa (1.19 μg/kg).
The AFB1 of feeds handled by dairy producers (9.35 ± 1.04 μg/kg) was significantly higher (p<0.05) than for feed retailers (6.91 ± 1.09 μg/kg) and feed manufacturers (7.50 ± 1.43 μg/kg). The AFM1 of milk ranged from 0 to 0.146 μg/L with an average of 0.054 μg/L; 29% of the milk samples did not contain aflatoxin, 58% had AFM1 levels within the EU permitted limit of 0.05 μg/L, and 42% of the samples were below the USA recommended limit of 0.5 μg/L. The AFB1 and AFM1 levels of milk samples collected from the study locations were in the order Hawassa < Holetta < Bishoftu. With regard to heavy metals, the data showed that concentrations in teff straw in Holetta and Bishoftu were 1543.54 ± 318.70 μg/kg and 1486.92 ± 279.73 μg/kg, respectively. The overall concentration of heavy metals in teff straw was in the order Cr > As > Pb > Cd. The water samples taken from the Mojo areas (Eastern Shoa) showed the highest levels of heavy metals (43.64 μg/L – 86.89 μg/L), with a very high concentration of Cr (300.56 μg/L). In general, the average concentration of heavy metals in livestock water in Eastern Shoa (Akaki to Mojo) (28.08 ± 7.02 μg/L) was significantly higher (p<0.05) than in water collected from Western Shoa (Holetta/Welmera) (1.96 ± 0.28 μg/L), and the levels of the heavy metals were in the order Cr > As > Pb > Cd. With the exception of the pH of water from Mojo Lake (10.37) and the Gelan dye factory (8.9), the water samples collected from the Bishoftu and Holetta areas were within the legal pH limit of 6.5–8.5 for livestock drinking. The overall concentration of heavy metals in cow milk samples was in the order Cr > Cd > Pb > As. The concentrations of Cd and As in milk were within the permissible limits.
However, 60% and 73% of the milk samples from Holetta and Bishoftu respectively exceeded the permissible limit for Pb, and all the milk samples in both study locations exceeded it for Cr, indicating poor quality of milk due to environmental pollution. The data from the stakeholder interviews showed that 91% of the farmers sometimes encountered mold formation in roughage feeds due to a lack of good harvesting and stacking practices. Most of the farmers admitted to feeding lightly moldy feeds to their livestock after diluting them with uncontaminated ones. Most of the respondents (67%) used extremely moldy feeds for firewood, and 33% of the interviewees dumped them into landfills. Farmers recognized two causes of water contamination associated with health and production problems in livestock. Accordingly, farmers from Eastern Shoa (100%) were aware of the effect of industrial effluent as the most important hazard for dairy production, whereas 66% of the farmers from Eastern Shoa and 34% of the respondents from Western Shoa identified leech problems in water bodies in the dry season. Farmers also had indigenous knowledge for tackling the leech problem: 69% of the farmers used a bucket to selectively scoop water from the water body to prevent leeches from being consumed by animals, whereas 50% of the respondents treated animals with chopped tobacco and onion. The majority of the feed processors (64%), feed retailers (82%) and dairy producers (56%) reported that they did not use pallets for storing their concentrate feeds, implying a probability of mold contamination during prolonged storage. Among the respondents, 88% of feed processors, all feed retailers and most (96%) of the dairy producers recognized that wheat bran was the feed ingredient most susceptible to mold. The majority of the feed processors (67%), feed retailers (73%) and dairy producers (58%) stored their concentrate feeds for a short period of about one month.
The majority of the feed processors (74%), feed retailers (87%) and most dairy producers (91%) did not encounter mold formation in their concentrate feed because of the small amount of concentrate feed they hold and the short storage time. To prevent mold formation in concentrate feeds, 64% of the feed processors left enough space between stored feed and the wall. Further research needs to be undertaken along the feed and milk production and distribution chains using other techniques such as HPLC, GC and multi-mycotoxin assays using LC-MS-MS, taking into account the effect on aflatoxin of different storage conditions such as the use of pallets, ventilation, and duration of feed storage. The effect of mold growth in feeds on nutrient composition needs to be investigated. There is also a need for further investigation of heavy metals in soils and fodder feed samples grown in similar study locations.

Item A Framework for Multi-Agent Interbank Payment and Settlement (Addis Ababa University, 2009-11) Addis, Yilebes; Libsie, Mulugeta (PhD)

Interbank payment and settlement systems automate the transfer of funds from one bank to another on the order of a customer. The communication between banks involved in interbank payment and settlement is automated. Moreover, a few agent-based payment systems have tried to simulate the trend of incoming and outgoing payments so as to manage liquidity requirements. However, interbank payment and settlement systems developed so far live with critical problems like gridlock, intraday liquidity management, and interfacing with autonomous legacy systems. Hence this thesis proposes a framework for a Multi-Agent Interbank Payment and Settlement (MAIPS) system, which improves the interbank payment and settlement system and extends its coverage. The proposed framework interfaces autonomous banking systems with the interbank payment and settlement system. Besides, MAIPS provides a solution for intraday liquidity management and gridlock problems through automated interbank lending.
Thus the thesis develops a Multi-Attribute Utility Theory (MAUT) based interbank lending model. In order to secure liquidity through interbank lending, the system floats a bid to borrow liquidity, evaluates bidders' proposals, selects the best lender and agrees with the winner. This interbank lending model is simulated through a prototype called the Multi-Agent Interbank Lending System (MAILS), which is developed using the Java Agent DEvelopment (JADE) Framework and uses the FIPA English Auction Interaction Protocol. Finally the prototype is tested using relevant information so as to clearly visualize the interaction of participating banks and check the correctness of the prototype. The result of this thesis will bring a breakthrough in improving interbank payment and settlement systems. It will also pave the way for multidimensional complex auctions to use decision aid techniques. Keywords: Interbank Payment and Settlement, Cheque Clearance, Multi-Agent System, Gridlock, Intraday Liquidity Management, Collateralized Credit, and Interbank Lending.

Item Amharic Document Categorization Using Itemsets Method (Addis Ababa University, 2013-02) Hailu, Abraham; Assabie, Yaregal (PhD)

Document categorization, or document classification, is the process of assigning a document to one or more classes or categories. Much research has been conducted in the area of Amharic document categorization. The main focus of those studies has been to examine different document categorization techniques and measure their performance; however, the itemsets method has not so far been examined. This study focused on extending the Apriori algorithm, which is traditionally used for knowledge mining in the form of association rules. The research focused on the basic principles of applying the itemsets method to categorize Amharic documents.
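One way to read the itemsets idea (a hedged sketch under our own assumptions, not the thesis's algorithm): treat each training document of a category as a transaction of words, mine frequent word sets per category, and score a new document by how many of each category's frequent itemsets it contains.

```python
from itertools import combinations

def frequent_itemsets(docs, min_support=2, max_size=2):
    """Apriori-flavored sketch: count word sets of size <= max_size
    that occur in at least min_support documents."""
    counts = {}
    for doc in docs:
        words = set(doc.split())
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(words), size):
                counts[combo] = counts.get(combo, 0) + 1
    return {s for s, c in counts.items() if c >= min_support}

def categorize(doc, itemsets_by_category):
    """Assign the category whose frequent itemsets the document covers most."""
    words = set(doc.split())
    return max(itemsets_by_category,
               key=lambda cat: sum(set(s) <= words
                                   for s in itemsets_by_category[cat]))
```

The toy documents below use English stand-ins for Amharic words purely for readability; a real system would mine itemsets from preprocessed Amharic text.

```python
cats = {"sport": frequent_itemsets(["ball goal team", "team ball win"]),
        "politics": frequent_itemsets(["vote law party", "party vote debate"])}
print(categorize("ball team match", cats))  # sport
```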
In addition, all the required tools for carrying out automatic Amharic document categorization using the itemsets method were implemented, and the algorithm was examined. Experimental results show that the itemsets method is an efficient method for categorizing Amharic documents. The effectiveness and accuracy of the method in categorizing Amharic documents are also evaluated and reported. Finally, factors affecting the performance of the proposed system and the importance of preprocessing the training dataset in finding useful information are discussed.

Item Amharic Document Image Retrieval Using Linguistic Features (Addis Ababa University, 10/21/2011) Yeshambel, Tilahun; Assabie, Yaregal (PhD)

The advent of modern computers plays an important role in processing and managing electronic information found in the form of texts, images, audio, video, etc. With the rapid development of computer technology, digital documents have become popular options for storage, access and transmission. With the needs of today's fast-evolving digital libraries, an increasing number of historical documents, newspapers, books, etc. are being digitized into electronic formats for easy archival and dissemination. Optical Character Recognition (OCR) and Document Image Retrieval (DIR), as parts of the information retrieval paradigm, are the two means of accessing document images that have received attention in the IR community. Amharic has been the official language of Ethiopia since the 19th century, and as a result many religious and government documents are written in Amharic. Huge collections of Amharic machine-printed documents are found in almost every institution of the country. It is observed that accessing those documents has become more and more difficult. To address this problem, a very small number of research works have been attempted recently using OCR and DIR methods.
The aim of this research is to develop a system model that enables users to find relevant Amharic document images from a corpus of digitized documents in an easy, accurate, fast and efficient manner. This work therefore presents the architecture of an Amharic DIR system which allows users to search scanned Amharic documents without the need for OCR. The proposed model is designed after a detailed analysis of the specific nature of the Amharic language. Amharic belongs to the Semitic languages and is a morphologically rich language; surface word formation involves prefixation, suffixation, infixation, circumfixation and reduplication. In this work a model for searching Amharic document images is proposed, and word image features are systematically extracted for automatically indexing, retrieving and ranking document images stored in a database. A new approach that applies an NLP tool, an Amharic word generator, is incorporated in the proposed system model. By providing a given Amharic root word to this Amharic-specific surface word synthesizer, a number of possible surface words are produced. The descriptions of these surface word images are then used for indexing and searching. The system also passes through various phases such as noise removal, binarization, text line and word boundary identification, word segmentation and resizing to normalize different font types, sizes and styles, feature extraction, and finally matching the query word image against document word images. The proposed method was tested on real-world Amharic documents from different sources, such as magazines, textbooks and newspapers, with various font styles, types and sizes.
Precision-recall evaluation was conducted for sample queries on sample document images, and promising results were achieved.Item Amharic Information Retrieval Using Semantic Vocabulary(Addis Ababa University, 10/2/2019) Getnet, Berihun; Assabie, Yaregal (PhD)With the increase in large-scale data available from different sources, users' need for effective information retrieval has become a pressing issue. Information retrieval means finding documents relevant to a user's query, and both the way queries are posed and the way the system returns relevant results can be improved for better user satisfaction. One enhancement is to expand the original query using semantic lexical resources constructed either manually or automatically from a text corpus. Manual construction, however, is tedious and time-consuming when the dataset is huge, and the way semantic resources are built also affects retrieval performance. Under formal semantics, meaning is built in the symbolic tradition and centered on the inferential properties of language. It is also possible to construct semantic resources automatically from the distribution of words in unstructured data, applying unsupervised learning to build semantics in a high-dimensional vector space; this captures contextual similarity via the angular orientation of word vectors. Attempts have been made to enhance information retrieval by expanding queries from semantic resources for non-Ethiopian languages. In this study, we propose Amharic information retrieval using a semantic vocabulary. It is realized through components for text preprocessing, word-space modeling, semantic word sense clustering, document indexing, and searching. After the Amharic documents are preprocessed, the words are vectorized in a multidimensional space using Word2vec, based on the notion that words surrounding another word can be contextually similar.
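The notion of contextual similarity via angular orientation can be sketched with cosine similarity over toy embedding vectors. The words, vectors, and helper names below are invented for illustration; in the study itself the vectors would come from a Word2vec model trained on Amharic text:

```python
# Sketch (not the thesis code): build a nearest-neighbour semantic
# vocabulary from word vectors, then expand a query with it.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def semantic_vocabulary(embeddings, top_n=2):
    """Map each word to its top_n most similar words by cosine."""
    vocab = {}
    for word, vec in embeddings.items():
        scored = [(cosine(vec, other_vec), other)
                  for other, other_vec in embeddings.items() if other != word]
        scored.sort(reverse=True)
        vocab[word] = [other for _, other in scored[:top_n]]
    return vocab

def expand_query(query_terms, vocab):
    """Append each query term's semantic neighbours to the query."""
    expanded = list(query_terms)
    for term in query_terms:
        expanded.extend(vocab.get(term, []))
    return expanded
```

The expanded term list is then searched against the index in place of the original query, which is how query expansion recovers documents that use contextually related wording.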
Based on the words' angular orientation, the semantic vocabulary is constructed using cosine distance. After the Amharic documents are preprocessed, they are indexed for later retrieval. The user then provides queries, and the system expands the original query from the semantic vocabulary. The reformulated queries are searched against the indexed data, returning more relevant documents to the user. A prototype of the system was developed, and we tested its performance using Amharic documents collected from Ethiopian public media. The semantic vocabulary based on word analogy prediction using the cosine metric is promising. Compared against a semantic thesaurus constructed with latent semantic analysis, it improves accuracy by 17.2%. Information retrieval using the semantic vocabulary improves recall by 24.3% for ranked retrieval, and by 10.89% for unranked retrieval.Item Amharic Open Information Extraction(Addis Ababa University, 3/3/2020) Girma, Seble; Assabie, Yaregal (PhD)Open Information Extraction is the process of discovering domain-independent relations by extracting unrestricted relational information from natural language text. It has recently received increased attention and has been applied extensively to various downstream applications, such as text summarization, question answering, and information retrieval. Although many Open Information Extraction systems have been developed for various natural languages, no research has yet been conducted on Amharic Open Information Extraction (AOIE). As the literature shows, a rule-based approach operating on deeply parsed sentences yields the most promising results for Open Information Extraction systems. However, to the best of our knowledge, there is no fully implemented deep syntactic parser available for Amharic.
Therefore, in this thesis, we propose a rule-based AOIE system that utilizes shallow-parsed sentences. The proposed system has six components: Preprocessing, Morphological Analysis, Phrasal Chunking, Sentence Simplification, Relation Extraction, and Post-processing. In Preprocessing, each word in the input text is labeled with an appropriate POS tag, and well-formed, informative sentences are filtered out for further processing based on the POS tags of their words. The Morphological Analysis component produces morphological information about each word of the input sentences. The Phrasal Chunking component divides each sentence into non-overlapping phrases based on the POS and morphological tags of words. The Sentence Simplification component segments the sentence into a number of self-contained simple sentences that are easier to process. In Relation Extraction, relation instances are extracted from the simplified sentences, and finally the Post-processing component prints the extracted relations in N-ary format. The proposed method and algorithms were implemented in a prototype and evaluated with a dataset from different domains. In the evaluation, the system achieved an overall precision of 0.88.Item Amharic Question Answering for Definitional, Biographical and Description Questions(Addis Ababa University, 2013-11) Abedissa, Tilahun; Libsie, Mulugeta (PhD)There are enormous amounts of Amharic text data on the World Wide Web. Since Question Answering (QA) can go beyond the retrieval of relevant documents, it is an option for efficient access to such text data. The task of QA is to find an accurate and precise answer to a natural language question from a source text. Existing Amharic QA systems handle fact-based questions that usually take named entities as answers. In this thesis, we focus on a different type of Amharic QA, non-factoid QA (NFQA), to deal with more complex information needs.
The goal of this study is to propose approaches that tackle important problems in Amharic non-factoid QA, specifically biography, definition, and description questions. The proposed QA system comprises document preprocessing, question analysis, document analysis, and answer extraction components. Rule-based and machine learning techniques are used for question classification. The document analysis component retrieves relevant documents and filters them: for definition and description questions, filtering patterns are applied, while for biography questions a retrieved document is retained only if it contains all terms of the target in the same order as in the question. The answer extraction component works in a type-by-type manner. The definition-description answer extractor extracts sentences using manually crafted answer extraction patterns; the extracted sentences are scored and ranked, the answer selection algorithm selects the top 5 non-redundant sentences from the candidate answer set, and finally the sentences are ordered to preserve coherence. The biography answer extractor, on the other hand, summarizes the filtered documents by merging them, and the summary is displayed as an answer after validation. We evaluated our QA system in a modular fashion. N-fold cross-validation was used to evaluate the two question classification techniques: the SVM-based classifier classifies about 83.3% and the rule-based classifier about 98.3% of the test questions correctly. The document retrieval component was tested on two datasets, analyzed by a stemmer and a morphological analyzer respectively; the F-score on the stemmed documents is 0.729, and on the other dataset it is 0.764. Moreover, the average F-score of the answer extraction component is 0.592.
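The precision, recall, and F-score figures reported in these evaluations follow the standard set-based definitions, which can be sketched as follows (the document identifiers in the test are invented for illustration):

```python
# Sketch of standard IR evaluation metrics over retrieved vs. relevant
# document sets: precision = tp/|retrieved|, recall = tp/|relevant|,
# F1 = harmonic mean of the two.

def precision_recall_f1(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

The same harmonic-mean formula underlies the per-component F-scores quoted above; averaging it over a query set gives the reported figures.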
Keywords: Amharic definitional, biographical and description question answering, Rule based question classification, SVM based question classification, Document Analysis, Answer Extraction, Answer Selection.Item Amharic Question Classification System Using Deep Learning Approach(Addis Ababa University, 4/14/2021) Habtamu, Saron; Assabie, Yaregal (PhD)Questions are used in different applications such as Question Answering (QA), Dialog Systems (DS), and Information Retrieval (IR). However, some questions are too complex to be analyzed and processed directly, so systems are expected to have good feature extraction and analysis mechanisms to understand them linguistically. Retrieval of wrong answers, inaccurate IR, and crowding of the search space with irrelevant candidate answers are among the problems caused by the inability to properly process and analyze questions. Question Classification (QC) addresses this by extracting the relevant features from questions and assigning them to the correct class category. Even though QC has been studied for various languages, it has hardly been studied for Amharic. This research studies Amharic QC, focusing on designing a hierarchical question taxonomy, preparing an Amharic question dataset by labeling sample questions with their respective classes, and implementing an Amharic QC (AQC) model using a Convolutional Neural Network (CNN), a deep learning (DL) approach. The AQC uses a multilabel question taxonomy that integrates coarse- and fine-grained categories; this multilabel scheme supports more accurate answer retrieval than a flat taxonomy. We constructed the taxonomy by analyzing our AQ dataset and adopting previously studied standard taxonomies. We prepared the AQs in three forms: surface, stemmed, and lemmatized.
We trained and tested on these datasets using a word vectorizer trained on surface words, noting that most interrogative words look similar even when stemmed or lemmatized. We achieved 97% training and 90% validation accuracy for surface AQs, while the stemmed AQs scored 40%. However, the word2vec model could not represent the lemmatized AQs appropriately, so no results were obtained for them during training. We also tried extracting features from AQs using different filters separately; this gave an accuracy of 86% while requiring a larger number of training epochs.Item Amharic Sentence Generation from Interlingua Representation(Addis Ababa University, 12/27/2016) Yitbarek, Kibrewossen; Assabie, Yaregal (PhD)Sentence generation is a part of Natural Language Generation (NLG), the process of deliberately constructing a natural language text in order to meet specified communicative goals. The major requirement of sentence generation is producing complete, clear, meaningful, and grammatically correct sentences. A sentence can be generated from different possible sources, including a representation that does not depend on any human language: an Interlingua. Generating a sentence from an Interlingua representation has numerous advantages: since the representation is unambiguous, universal, and independent of both the source and target languages, generation needs to be specific only to the target language, and likewise analysis only to the source language. Among the different Interlinguas, Universal Networking Language (UNL) is commonly chosen in view of its various advantages over the others. Various works have generated sentences from UNL expressions for different languages of the world, but to the best of our knowledge no work has been done so far for Amharic.
In this thesis, we present an Amharic sentence generator that automatically generates an Amharic sentence from a given input UNL expression. The generator accepts a UNL expression as input and parses it to build a node-net. The parsed UNL expressions are stored in a data structure that can be easily modified in the subsequent processes. A UNL-to-Amharic word dictionary containing the root forms of Amharic words is also prepared; the Amharic equivalent root word and attributes of each node in a parsed UNL expression are fetched from the dictionary to update the head word and attributes of the corresponding node. The translated Amharic root words are then locally reordered and marked based on Amharic grammar rules. When the nodes are ready for morphology generation, the proposed system uses Amharic morphology datasets to handle the generation of noun, adjective, pronoun, and verb morphology. Finally, function words are inserted among the inflected words so that the output matches a natural language sentence. The proposed system was evaluated on a dataset of 142 UNL expressions. Subjective tests, namely adequacy and fluency tests, were performed, and a quantitative error analysis was carried out by calculating the Word Error Rate (WER). From this analysis, it was observed that the proposed system generates 71.4% intelligible sentences and 67.8% sentences faithful to the original UNL expression. The system achieved a fluency score of 3.0 and an adequacy score of 2.9 (both on a 4-point scale), and a word error rate of 28.94%.
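The Word Error Rate used in such quantitative analyses is the word-level edit distance (substitutions, insertions, and deletions) divided by the reference length. A minimal sketch, with invented example sentences rather than the thesis data:

```python
# Sketch of Word Error Rate via dynamic-programming edit distance
# between the reference and hypothesis word sequences.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

A WER of 28.94% thus means that, on average, roughly 29 word-level edits per 100 reference words separate the generated sentences from the references.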
These scores can be improved further by extending the rule base and lexicon.Item Amharic Sentence to Ethiopian Sign Language Translator(Addis Ababa University, 2014-06) Zegeye, Daniel; Hailemariam, Sebsibe (PhD)Sign languages around the world are usually identified by the country where they are used, such as Ethiopian Sign Language. Mostly, communication among hearing-impaired people involves signs that stand for whole words. However, to make a sign language as complete as a spoken language, hearing-impaired communities around the world use manual alphabets for names, technical terms, and sometimes for emphasis. Just as there are different alphabets for different spoken languages such as Amharic, there are corresponding manual alphabets, or finger spellings, used by deaf people. Sign language in general is thus a tool that deaf communities use to communicate with each other. There is no problem when communication is limited to the deaf, but they struggle to communicate with hearing people due to the language barrier. Using human translators has been the solution for filling this communication gap, especially in Ethiopia, even though it has drawbacks with respect to cost and privacy. Consequently, software that fills the communication gap between deaf and hearing people is a better solution. This thesis contributes a model and system for an Amharic sentence to Ethiopian Sign Language translator, which accepts Amharic sentences, letters, or numbers and outputs a 3D animation of Ethiopian Sign Language based on pre-lingual deaf grammar. The model is based on rule-based machine translation approaches, and the developed system has three basic components: the interface component, the back-end component, and the database component. The first (front-end) component acts as a bridge between the users and the back-end component.
The back-end component has three modules: Amharic text analysis, natural language processing (NLP), and text-to-sign mapping. The Amharic text analysis module analyzes the Amharic sentence and passes a Romanized sentence to the NLP module. The NLP module accepts the Romanized Amharic sentence, performs all language processing, and returns the sentence in EthSL order together with morphological information. The final module (text-to-sign mapping) maps each word to its SiGML (sign script) and sends it to the interface component, where the 3D avatar animation displays the sign. In addition, to enhance the quality of the translator, we use a POS tagger that combines previous work (a naïve Bayes classifier) with a newly created tagger based on a Brill tagging approach. The translator's performance was evaluated at three levels (sentence, letter, and number), and the results were ranked into three categories: correctly translated sentences, understandable sentences, and wrong translations. Results without any errors were considered correctly translated; results that conveyed meaning but not a clear sense were considered understandable; and results that conveyed neither meaning nor sense were considered wrong translations. Finally, the system achieved an accuracy of 58.77%, 75.76%, and 84% at the sentence, letter, and number levels, respectively.Item Amharic Speech Training for the Deaf(Addis Ababa University, 2006-08) Assefa, Daniel; Midekso, Dida (PhD)It has long been believed that deaf persons cannot make audible sounds and can only communicate through sign language. However, deaf people can produce voice and communicate orally unless they are mute by nature. With speech training it is possible for the deaf to learn how to speak and "listen". Speech training can be given manually by a human trainer, but it is a very tiresome task whose demands exceed the capacity of trainers.
The solution proposed for this problem is an automated speech training system, which has already been implemented for other languages. This thesis presents a similar solution for Amharic. Due to limitations in the special equipment and software tools available to us, we propose lip modeling for the articulation of Amharic characters, which is one part of an automated speech training system. We used an analysis-synthesis approach: a real lip is first analyzed during speech, and the output of the analysis is then applied to our lip model to articulate different Amharic characters. The proposed solution is implemented in a prototype developed for selected Amharic characters, and its efficiency was tested with students of the Mekanissa Deaf School. Keywords: Speech Training, Sign Language, Deaf Education, Lip modeling, Talking HeadItem Amharic Text to Ethiopian Sign Language Translation Model Using Factored Phrase Based Statistical Machine Translation Approach(Addis Ababa University, 3/27/2021) Belay, Yoseph; Gizaw, Solomon (PhD)Machine translation automates the translation of text from one natural language to another and is the fastest way to process vast amounts of data and produce usable translations in any language. In this paper, we deal with the design of an Amharic to Ethiopian Sign Language machine translator. Amharic is the official language of Ethiopia; Ethiopian Sign Language is a visual-gestural language used by the Ethiopian Deaf community to communicate and interact. This study presents a factored Amharic to Ethiopian Sign Language statistical machine translation system composed of three main components. The first component is a neural network-based Amharic part-of-speech tagger used as a preprocessor to factorize the words in the parallel corpora.
The second component is a factored statistical machine translator that translates Amharic text into the grammatical structure of Ethiopian Sign Language. The third component is a word-to-video mapper that takes the translated text as input and finds matching Ethiopian Sign Language video clips in the video corpus. We conducted experiments using three different machine translation approaches and compared them with the evaluation result of the proposed system. The first experiment used a standard phrase-based statistical approach as a baseline model; the second used a factored phrase-based approach; and the third used a neural machine translation approach. Our findings demonstrate that the factored phrase-based statistical translation approach effectively improves Amharic to EthSL machine translation. Our proposed factored statistical translator achieves a 35.28 BLEU score, outperforming both the baseline standard phrase-based statistical machine translation model and the neural machine translation model.
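The BLEU metric cited above can be sketched for the single-reference sentence case, assuming up to 4-grams and the standard brevity penalty. This is an illustrative toy implementation with invented example sentences; a real evaluation would use an established toolkit such as sacreBLEU:

```python
# Sketch of sentence-level BLEU: geometric mean of clipped n-gram
# precisions (n = 1..4) times a brevity penalty, for a single reference.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, hypothesis, max_n=4):
    ref, hyp = reference.split(), hypothesis.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_grams, ref_grams = ngrams(hyp, n), ngrams(ref, n)
        clipped = sum(min(c, ref_grams[g]) for g, c in hyp_grams.items())
        total = max(sum(hyp_grams.values()), 1)
        if clipped == 0:
            return 0.0  # any zero n-gram precision zeroes the geometric mean
        log_prec += math.log(clipped / total) / max_n
    # brevity penalty: penalize hypotheses shorter than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_prec)
```

Corpus-level BLEU, as reported in such evaluations, pools the n-gram counts over all sentence pairs before taking the geometric mean, and usually applies smoothing so a single zero precision does not zero the whole score.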