DNN-HMM Based Isolated-Word Tigrigna Speech Recognition System


Addis Ababa University


Automatic Speech Recognition (ASR) is the process of converting a speech signal into its corresponding textual form. An ASR system consists of Feature Extraction (FE), Acoustic Model (AM), and Decoding modules, and using state-of-the-art methods in each of these modules improves the overall performance of the system. The objective of this thesis is therefore to develop a Tigrigna ASR system using improved algorithms in these modules, with a focus on the AM. Related work exists for many Ethiopian languages, but the only previous work on Tigrigna used a Gaussian Mixture Model (GMM) integrated with Hidden Markov Models (HMM) for the AM. In this work, DNN-HMM is proposed in place of GMM-HMM to improve the performance of Tigrigna ASR systems. The acoustic model is created by training these methods on an appropriate dataset, so a dataset consisting of speech and text data was prepared by the researcher. Since the system is isolated-word based, the text data consists of 163 words collected from a well-known Tigrigna newspaper; the corresponding speech was recorded from a total of 86 different speakers. Mel-Frequency Cepstral Coefficient (MFCC) features are then extracted from the speech data for training the AM. Both methods, GMM-HMM and DNN-HMM, were implemented for the AM to compare their performance. The GMM-HMM was trained first, yielding the AM and a phoneme-to-audio alignment. Once GMM-HMM training was complete, DNN-HMM training followed, using the same input data as the GMM-HMM; in addition, the DNN-HMM uses the phoneme-to-audio alignment obtained from the GMM-HMM as the target output during training. After the AM models were trained, the system was ready for experimentation and performance evaluation.
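In a hybrid DNN-HMM system of the kind described above, the DNN's per-frame state posteriors are divided by the state priors to obtain scaled likelihoods, which replace the GMM emission probabilities during Viterbi decoding; for isolated-word recognition, the word whose HMM scores best is chosen. The following is a minimal hypothetical sketch of that scoring step, not the thesis's implementation; all function names, model shapes, and data here are illustrative:

```python
import numpy as np

def viterbi_score(log_likes, log_trans):
    """Best log-score of a left-to-right HMM over all frames.

    log_likes: (T, S) scaled log-likelihoods per frame and state.
    log_trans: (S, S) log transition matrix (left-to-right).
    """
    T, S = log_likes.shape
    delta = np.full(S, -np.inf)
    delta[0] = log_likes[0, 0]          # decoding starts in the first state
    for t in range(1, T):
        # Best predecessor for each state, then add this frame's likelihoods.
        delta = np.max(delta[:, None] + log_trans, axis=0) + log_likes[t]
    return delta[-1]                     # decoding must end in the last state

def recognize(posteriors, priors, word_models):
    """Pick the word whose HMM gives the best Viterbi score.

    posteriors: (T, S_total) DNN state posteriors p(s | x_t).
    priors: (S_total,) state priors p(s), e.g. from the training alignment.
    word_models: word -> (state_ids, log_trans) for each word HMM.
    """
    # Posterior-to-likelihood conversion: log p(x|s) ~ log p(s|x) - log p(s).
    scaled = np.log(posteriors + 1e-10) - np.log(priors + 1e-10)
    scores = {w: viterbi_score(scaled[:, ids], lt)
              for w, (ids, lt) in word_models.items()}
    return max(scores, key=scores.get)
```

Dividing by the priors is what makes the discriminatively trained posteriors usable as emission scores inside the generative HMM framework.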
The first experiment was conducted during training of the AM models, to select the best values for the training parameters. The overall system was then tested for performance. Two experiments were run using two different speech datasets, clean and noisy, in order to compare the performance of the AM methods in both cases. On the clean speech data, a recognition accuracy of 97.90% was achieved using DNN-HMM and 97.64% using GMM-HMM. Similarly, on the noisy speech data, DNN-HMM achieved 75.46% and GMM-HMM 69.40%.
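The reported figures are plain word-level recognition accuracies, which for an isolated-word task reduce to the fraction of test utterances whose recognized word matches the reference. A trivial sketch of that metric (a hypothetical helper, not the thesis's evaluation script):

```python
def word_accuracy(references, hypotheses):
    """Percentage of isolated-word utterances recognized correctly."""
    assert len(references) == len(hypotheses), "one hypothesis per utterance"
    correct = sum(r == h for r, h in zip(references, hypotheses))
    return 100.0 * correct / len(references)
```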



Acoustic Modelling, Isolated-Word, Deep Neural Network, Gaussian Mixture Models, Hidden Markov Models