Towards Improving the Performance of Spontaneous Amharic Speech Recognition

Date

2015-10-04

Publisher

Addis Ababa University

Abstract

The ultimate goal of automatic speech recognition is to develop a model that converts a speech utterance into a sequence of words. With the objective of transforming Amharic speech into its equivalent sequence of words, this study explored the possibility of improving the performance of an Amharic spontaneous speech recognition system based on hidden Markov models (HMMs). To this end, four experiments were conducted. The first three experiments used a spontaneous speech corpus of 2007 sentences uttered by 36 speakers of different sexes and age groups. This training data contains 9460 unique words and amounts to around 3 hours and 10 minutes of speech. For testing, 104 sentences uttered by 14 speakers, containing 820 unique words, were used. These experiments applied different parameter tuning and used CV-syllables and cross-word tri-phones as recognition units. The fourth experiment increased the corpus size: a speech corpus of 3556 sentences uttered by 60 speakers of different sexes and age groups was used for training. This training data contains 12306 unique words and amounts to around 4 hours and 30 minutes of speech. Performance improved when cross-word tri-phone acoustic models and tri-gram language models were used in recognition. In this system, 58.67% of words were correctly recognized with 49.13% accuracy on the mixed test set, and 46.08% of words were correctly recognized with 32.42% accuracy on the speaker-independent test set. The experimental results show that parameter tuning, changing the sub-word unit to cross-word tri-phones, and using a tri-gram language model increase the performance of the recognizer.
Although the study achieved a performance improvement, there is still a need to control the large variations in the realized speech waveform caused by speaking variability, mood, and environment, which are more pronounced in spontaneous speech than in read speech. In addition, the available spontaneous speech data set is small, so a larger spontaneous speech corpus should be prepared by automatic transcription.
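
The abstract reports two figures per test set: the percentage of words correctly recognized and a lower accuracy. The sketch below illustrates one conventional way such a pair is computed, assuming the common definitions in which percent correct ignores insertions, (N - D - S) / N, while accuracy penalizes them, (N - D - S - I) / N. The abstract does not state which scoring tool or definitions were used, so these formulas, the function names `align_counts` and `scores`, and the toy word sequences are all assumptions for illustration.

```python
def align_counts(ref, hyp):
    """Count (substitutions, deletions, insertions) in a minimum-edit
    alignment of a hypothesis word list against a reference word list."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (total_errors, subs, dels, ins) for ref[:i] vs hyp[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)      # all reference words deleted
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)      # all hypothesis words inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, ins = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, s, d, ins)              # match, no new error
            else:
                diag = (e + 1, s + 1, d, ins)      # substitution
            e, s, d, ins = cost[i - 1][j]
            dele = (e + 1, s, d + 1, ins)          # deletion
            e, s, d, ins = cost[i][j - 1]
            inse = (e + 1, s, d, ins + 1)          # insertion
            cost[i][j] = min(diag, dele, inse)     # fewest total errors wins
    _, s, d, ins = cost[n][m]
    return s, d, ins


def scores(ref, hyp):
    """Return (percent_correct, percent_accuracy) under the assumed definitions."""
    n = len(ref)
    s, d, ins = align_counts(ref, hyp)
    percent_correct = 100.0 * (n - d - s) / n
    percent_accuracy = 100.0 * (n - d - s - ins) / n
    return percent_correct, percent_accuracy


# Hypothetical toy example: one substitution (b -> x) and one insertion (e).
reference = "a b c d".split()
hypothesis = "a x c d e".split()
print(scores(reference, hypothesis))
```

Under these assumed definitions, accuracy can never exceed percent correct, which is consistent with the gap between the two figures reported for each test set.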

Keywords

automatic speech recognition