LETEYEQ (ሌጠየቅ) - A Web Based Amharic Question Answering System for Factoid Questions Using Machine Learning Approach

Date

2013-03

Publisher

Addis Ababa University

Abstract

When users need a particular fact and query a search engine for it, they get back a list of addresses and snippets that are "related" to their need, and it is up to them to decide which address to open in the hope that the fact can be found there. Opening an address may present the user with many pages of information, and it is the user's job to read through them and extract the actual fact. Given a collection of documents, a question answering system instead attempts to retrieve correct answers to questions posed in natural language, which relieves users of the task of digging the information out of related pages. Questions come in different types, such as definition, list, acronym, true/false, and factoid questions. Most question answering systems have three major components: question analysis, document (passage) retrieval, and answer extraction.

For languages like English, many question answering systems are available, designed using different approaches. For Amharic, Seid Muhie's [4] TETEYEQ is the pioneering work designed to answer Amharic factoid questions. It aims to answer four kinds of Amharic factoid questions, namely the "Person", "Place", "Time", and "Quantity" question types. Its question analysis component takes a rule based approach, with manually written rules that classify each question into one of these four types, and it classifies 86.9% of the questions correctly. It also uses manually collected Amharic documents as its search space, and its reported overall system performance is 72%.

We have designed a similar system for answering the same four kinds of Amharic factoid questions, using a machine learning approach rather than a rule based one by employing a well-known classification algorithm, the support vector machine (SVM). By doing so, we attained an accuracy of 94.2% in question classification, which outperforms the rule based question classification in TETEYEQ. We integrated a web crawler, a customized version of the open source JSpider crawler, to automatically gather Amharic documents from the web when preparing the search space. The downloaded Amharic documents are then indexed with the open source Lucene indexer to facilitate document retrieval. Hence, our system has two major parts: the search engine part (crawler and indexer) and the question answering part. In addition, our system is designed as a web based system so that end users can interact with it on the web.

By employing the machine learning algorithm in question classification and adopting the answer extraction techniques used in TETEYEQ, we achieved 77% overall system performance, which is better than that of TETEYEQ. Both TETEYEQ and our system achieved this level of performance in the absence of basic natural language processing (NLP) tools for the Amharic language, such as a part of speech (POS) tagger and a named entity recognizer (NER); adding such tools in the future should improve performance further.
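To make the question classification step concrete, the following is a minimal sketch of a linear SVM trained one-vs-rest with stochastic subgradient descent on the hinge loss, using only the Python standard library. It is an illustration under stated assumptions, not the thesis's implementation: the training questions are hypothetical English stand-ins for labelled Amharic questions, the bag-of-words features are an assumed feature set, and the names `features`, `train`, and `classify` and all hyperparameters are invented for this example.

```python
from collections import defaultdict

# The four factoid question types named in the abstract.
LABELS = ["Person", "Place", "Time", "Quantity"]

# Hypothetical English stand-ins for labelled Amharic factoid questions.
TRAIN = [
    ("who founded the city", "Person"),
    ("who wrote the first novel", "Person"),
    ("where is the longest river", "Place"),
    ("where was the emperor born", "Place"),
    ("when was the university established", "Time"),
    ("when did the battle end", "Time"),
    ("how many regions are there", "Quantity"),
    ("how much does the ticket cost", "Quantity"),
]


def features(question):
    """Bag-of-words counts plus a constant bias feature (assumed feature set)."""
    feats = defaultdict(float)
    for token in question.lower().split():
        feats[token] += 1.0
    feats["<bias>"] = 1.0
    return feats


def dot(w, x):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(key, 0.0) * value for key, value in x.items())


def train(data, labels, epochs=200, lr=0.1, lam=0.001):
    """Train one binary linear SVM per label (one-vs-rest).

    Each step applies L2 shrinkage and, when the hinge loss is active
    (margin below 1), moves the weights toward the example.
    """
    weights = {label: defaultdict(float) for label in labels}
    for _ in range(epochs):
        for text, gold in data:
            x = features(text)
            for label in labels:
                w = weights[label]
                y = 1.0 if label == gold else -1.0
                violated = y * dot(w, x) < 1.0
                for key in w:  # L2 regularisation step
                    w[key] *= 1.0 - lr * lam
                if violated:  # hinge-loss subgradient step
                    for key, value in x.items():
                        w[key] += lr * y * value
    return weights


def classify(weights, question):
    """Return the label whose binary SVM gives the highest score."""
    x = features(question)
    return max(weights, key=lambda label: dot(weights[label], x))


weights = train(TRAIN, LABELS)
print(classify(weights, "who built the church"))
```

A learned classifier of this kind picks up the cue words for each type from labelled examples instead of relying on hand-written rules, which is the contrast the abstract draws with TETEYEQ. A real system would need a far larger labelled question set and Amharic-specific tokenization.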

Keywords

Amharic Factoid Question Answering; SVM Based Question Classification; Answer Extraction; Web Based Amharic Question Answering
