Machine Learning Approach for Voicing Detection

Addis Ababa University


Voiced/unvoiced/silence speech segment detection is the task of assigning a speech category (voiced, unvoiced, or silence) to each segment of a speech signal. Assigning these categories is an important component of many speech processing systems. Accurate voiced/unvoiced/silence classification is often a prerequisite for higher-level speech processing applications such as speech coding, speech analysis, speech synthesis, automatic speech recognition, noise suppression and enhancement, pitch detection, speaker identification, and the recognition of speech pathologies. Interest in speech segment category discrimination has intensified lately due to the increasing demand for its use in commercial and non-commercial speech-based systems; current personal communication systems such as cellular phones are examples of commercial systems that integrate speech coding and speech recognition capabilities in their operation. In this study, a supervised method of voiced/unvoiced/silence speech segment detection is proposed. A text corpus of 900 sentences was collected from political, economic, sport, and health news, fiction, the Bible, the penal code, and the Federal Negarit Gazeta. These texts were recorded by one male speaker to prepare the speech corpus. Both the text and speech corpora were split into training (66.67%) and test (33.33%) sets. Among the models tested, an ANN-based voiced/unvoiced/silence classifier, specifically an MLP with a single hidden layer of 25 neurons, shows the highest classification performance. The network has 15 neurons on the input layer and 3 neurons on the output layer, matching the number of features in the feature vector and the number of classes, respectively (MLP 15-25-3).
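The MLP 15-25-3 topology described above can be sketched as a minimal forward pass. This is an illustrative assumption, not the thesis implementation: the weights below are random placeholders standing in for trained parameters, and the sigmoid/softmax choices are conventional defaults for a single-hidden-layer classifier.

```python
import numpy as np

# Hypothetical sketch of the MLP 15-25-3 topology: 15 input features,
# one hidden layer with 25 neurons, 3 output classes
# (voiced / unvoiced / silence). Weights are random placeholders,
# not the trained parameters from the study.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(15, 25))  # input -> hidden
b1 = np.zeros(25)
W2 = rng.normal(scale=0.1, size=(25, 3))   # hidden -> output
b2 = np.zeros(3)

def classify_frame(x):
    """Forward pass: sigmoid hidden layer, softmax over the 3 classes."""
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max())
    return p / p.sum()  # class probabilities summing to 1

x = rng.normal(size=15)  # one 15-dim feature vector (energy, ZCR, 13 MFCCs)
probs = classify_frame(x)
label = ["voiced", "unvoiced", "silence"][int(np.argmax(probs))]
```

In practice the weights would be fitted by backpropagation on the labelled training frames; only the layer sizes here are taken from the abstract.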
Energy, ZCR, and 13 MFCCs of the speech signal are used as feature vectors to train and test the selected classifier model. The feature vectors are extracted for 20, 25, 30, and 35 millisecond speech segments to examine the effect of frame length on classification performance. Evaluation of the experiments shows that the best performance with the selected classifier model and feature vector is achieved at a 35 millisecond frame size, where the MLP classifier attains an accuracy of 89.69%. Hence, an MLP with a single hidden layer of 25 neurons is found to outperform the other classifiers tested.
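Two of the 15 features named above, short-time energy and zero-crossing rate, can be computed per frame as sketched below. The 13 MFCCs would typically come from a signal-processing library and are omitted here; the 16 kHz sample rate is an assumption, while the 35 ms frame length follows the best-performing setting reported above.

```python
import numpy as np

def frame_features(signal, sr=16000, frame_ms=35):
    """Per-frame short-time energy and zero-crossing rate.

    Assumes sr=16000 Hz (not stated in the abstract); frame_ms=35
    matches the best-performing frame size reported. The 13 MFCCs
    completing the 15-dim feature vector are not computed here.
    """
    frame_len = int(sr * frame_ms / 1000)  # 560 samples at 16 kHz
    n_frames = len(signal) // frame_len
    feats = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = float(np.sum(frame ** 2))  # short-time energy
        # fraction of adjacent-sample pairs whose sign changes
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        feats.append((energy, zcr))
    return feats

# Toy example: 0.5 s of a 200 Hz sine, a "voiced-like" signal
# (high energy, low zero-crossing rate).
t = np.arange(0, 0.5, 1 / 16000)
feats = frame_features(np.sin(2 * np.pi * 200 * t))
```

Voiced frames tend to show high energy and low ZCR, unvoiced frames low energy and high ZCR, and silence low values of both, which is why these two features complement the MFCCs in the classifier's input.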



Keywords: Voicing Detection; Voiced/Unvoiced/Silence; Machine Learning