Amharic Question Classification System Using Deep Learning Approach

Date

4/14/2021

Publisher

Addis Ababa University

Abstract

Questions are used in different applications such as Question Answering (QA), Dialog Systems (DS), and Information Retrieval (IR). However, some questions are too complex to be analyzed and processed easily, so these systems need a good feature extraction and analysis mechanism to understand questions linguistically. When questions are not processed and analyzed appropriately, systems retrieve wrong answers, IR becomes inaccurate, and the search space is crowded with irrelevant candidate answers. Question Classification (QC) aims to solve this issue by extracting the relevant features from questions and assigning each question to the correct class. Even though QC has been studied for various languages, it has hardly been studied for Amharic. This research studies Amharic QC (AQC), focusing on designing a hierarchical question taxonomy, preparing an Amharic question (AQ) dataset by labeling sample questions with their respective classes, and implementing an AQC model using a Convolutional Neural Network (CNN), a Deep Learning (DL) approach. The AQC model uses a multilabel question taxonomy that integrates coarse-grained and fine-grained categories. Compared to a flat taxonomy, this multilabel scheme helps retrieve answers more accurately. We constructed the taxonomy by analyzing our AQ dataset and by adopting standard taxonomies from previous studies. We prepared the AQs in three forms: surface, stemmed, and lemmatized. We trained and tested on these datasets using a word vectorizer trained on surface words, having noticed that most interrogative words look similar even after stemming and lemmatization. As a result, we achieved 97% training accuracy and 90% validation accuracy on the surface AQs, while the stemmed AQs scored 40%. However, the word2vec model could not represent the lemmatized AQs appropriately, so no results were obtained for them during training. We also tried extracting features from the AQs by applying different filters separately. This gave an accuracy of 86% but required more training epochs.
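
As an illustration of how a taxonomy that combines coarse and fine categories can be scored, the following is a minimal Python sketch. It assumes each label is written as "COARSE:fine"; that label format, the category names, and the helper functions are hypothetical and are not taken from the thesis, whose actual taxonomy is not listed in this abstract.

```python
def split_label(label):
    """Split a hypothetical 'COARSE:fine' label into its two levels."""
    coarse, _, fine = label.partition(":")
    return coarse, fine


def hierarchical_accuracy(gold, predicted):
    """Return (coarse accuracy, fine accuracy) for two parallel label lists.

    A prediction counts at the coarse level when only the coarse part
    matches, and at the fine level when the full label matches.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    pairs = list(zip(gold, predicted))
    coarse_hits = sum(split_label(g)[0] == split_label(p)[0] for g, p in pairs)
    fine_hits = sum(g == p for g, p in pairs)
    return coarse_hits / len(pairs), fine_hits / len(pairs)


# Hypothetical gold and predicted labels, for illustration only.
gold = ["LOCATION:city", "PERSON:individual", "NUMERIC:date", "LOCATION:country"]
pred = ["LOCATION:country", "PERSON:individual", "NUMERIC:date", "PERSON:group"]

coarse_acc, fine_acc = hierarchical_accuracy(gold, pred)
print(f"coarse accuracy: {coarse_acc:.2f}, fine accuracy: {fine_acc:.2f}")
```

Reporting both levels shows when a classifier picks the right broad category but the wrong subcategory, which a flat taxonomy would count simply as an error.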

Keywords

Amharic Question Classification, Deep Learning, CNN, Fine Grain, Coarse Grain Hierarchical Taxonomies, Word2vec
