Duration Modeling of Phonemes for Amharic Text To Speech System






Addis Ababa University


The naturalness of synthetic speech depends heavily on appropriate modeling of its prosodic aspects. Three prosody components are usually modeled: segmental duration, pitch contour, and intensity. The general goal of duration modeling is to find a computational relation between a set of affecting factors and the segment duration. A number of text-to-speech synthesizers for the Amharic language use synthesis techniques that require prosodic models to produce good-quality synthetic speech. However, for reasons such as the unavailability of adequately large and properly annotated speech databases for Amharic, prosodic models for these synthesizers have not yet been developed. This thesis therefore addresses duration modeling of phonemes for Amharic speech synthesis. Two major tasks have been performed: the development of a concatenative unit selection synthesizer and of a data-driven duration model. The unit selection voice has been built on the Festival speech synthesis framework, using the phone as the basic unit. We used a speech corpus of 1 hour, 16 minutes and 29 seconds, labeled at the phoneme level. After extraction of phonetic, prosodic, and acoustic features, an inventory was constructed for each phone. To synthesize the input text, the synthesizer uses the cluster unit selection algorithm adopted from Festival. At synthesis time, the units that minimize acoustically defined target and join costs are selected from a cluster. To build the duration model, we extracted features that affect the duration of Amharic phones and split the data into a training set (90%) and a test set (10%), consisting of 45,500 and 5,500 segments, respectively. Classification and Regression Trees (CART) were used to build the duration model, and the resulting model was integrated into the synthesizer. To evaluate the performance and effectiveness of the duration model, we conducted objective and subjective tests.
In the objective test, the correlation between actual and predicted durations was 0.3901, and the Root Mean Squared Error (RMSE) of prediction was 0.8403 in the z-score domain. Subjective evaluations were carried out to ascertain whether the duration model improves the quality of the synthesized speech, using the Mean Opinion Score (MOS) evaluation technique. For speech synthesized with the duration model, the MOS was 3.5 for intelligibility and 3.58 for naturalness. For the synthesizer without the duration model, the corresponding scores were 3.31 and 3.33. Keywords: Speech synthesis, duration modeling, unit selection, root mean square error, correlation coefficient
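As an illustration of the objective measures reported in the abstract, the following is a minimal sketch of how a correlation coefficient and an RMSE could be computed on z-score normalized durations. It is not the thesis's evaluation code: the function names, the toy durations, and the use of a single mean and standard deviation for normalization are assumptions made for this example (systems of this kind often normalize per phone instead).

```python
import math
from statistics import mean, pstdev


def to_z_scores(durations, mu, sigma):
    """Map raw durations to z-scores using a given mean and standard deviation."""
    return [(d - mu) / sigma for d in durations]


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(mean(squared_errors))


def pearson(actual, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a, mean_p = mean(actual), mean(predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


if __name__ == "__main__":
    # Hypothetical segment durations in seconds (illustrative only, not thesis data).
    actual_sec = [0.05, 0.08, 0.12, 0.07, 0.10]
    predicted_sec = [0.06, 0.07, 0.10, 0.08, 0.11]

    # Assumption: normalize both sequences with the statistics of the actual durations.
    mu, sigma = mean(actual_sec), pstdev(actual_sec)
    actual_z = to_z_scores(actual_sec, mu, sigma)
    predicted_z = to_z_scores(predicted_sec, mu, sigma)

    print("correlation:", pearson(actual_z, predicted_z))
    print("RMSE (z-score domain):", rmse(actual_z, predicted_z))
```

Because the RMSE here is measured in z-score units rather than seconds, a value such as the 0.8403 reported above is expressed in standard deviations of the duration distribution.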


