Syllable-based Text-to-Speech Synthesis (TTS) for Amharic

Date

2012-06

Publisher

Addis Ababa University

Abstract

The goal of Text-to-Speech (TTS) synthesis is to convert arbitrary input text into intelligible, natural-sounding speech so that information can be transmitted from a machine to a person. In speech synthesis, the capability to extract information from the input is crucial to producing high-quality synthesized speech. This paper describes the design of a syllable-based concatenative speech waveform synthesizer for the Amharic language that uses the TD-PSOLA algorithm for prosodic modification and for speech waveform analysis and synthesis. TD-PSOLA decomposes the signal into overlapping frames synchronized with the pitch period. In concatenative corpus-based TTS systems, acoustic units of varying sizes are selected from a large speech corpus and then concatenated to produce speech waveforms. The speech corpus contains more than one instance of each unit in order to capture the prosodic and spectral variability found in natural speech; hence the signal modification needed on the selected units is minimized when an appropriate unit is found in the unit inventory. The syllable is chosen as the unit primarily because Amharic is a syllable-centred, Consonant-Vowel (CV) assimilated language. The unique syllable units are added to a syllable repository. Further, concatenation at syllable boundaries can lead to smaller error, because the spectrum is similar across different syllable boundaries. A syllable-based approach to speech processing is therefore an interesting alternative to the diphone- or triphone-based approach, especially for syllable-timed languages such as Amharic. The system was implemented and tested on selected Amharic texts.
Automatic syllabification achieved a word accuracy rate of 97.8%, which improves the prosody and synthesis models as well as speech waveform generation. In the subjective assessment by users, the synthesized speech received average scores of 89.58% on the ORT for intelligibility and 3.45 on the MOS for naturalness. The subjective listening tests indicate an improvement in the quality of the synthesized speech.
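
The pitch-synchronous overlap-add idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the implementation described in this work: it assumes a constant, already known pitch period (a real TD-PSOLA system estimates pitch marks from the recording), and the names `hann` and `psola_constant_pitch` are hypothetical. Two-period Hann-windowed frames are cut around analysis pitch marks and overlap-added at synthesis marks spaced by the new period, which changes the pitch while keeping the duration.

```python
import math


def hann(length):
    """Symmetric Hann window with `length` samples (length >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]


def psola_constant_pitch(signal, period, factor):
    """Scale the pitch of `signal` by `factor`, keeping its duration.

    Assumption for illustration: the pitch period (in samples) is
    constant and known, so analysis pitch marks sit every `period` samples.
    """
    if period < 1 or factor <= 0:
        raise ValueError("period must be >= 1 and factor must be > 0")
    new_period = max(1, round(period / factor))
    window = hann(2 * period + 1)            # frame spans two pitch periods
    marks = list(range(period, len(signal) - period, period))
    out = [0.0] * len(signal)
    weight = [0.0] * len(signal)             # summed windows, for normalization
    if not marks:
        return out
    t = marks[0]                             # current synthesis pitch mark
    while t <= marks[-1]:
        # Reuse the analysis frame whose pitch mark is closest to t.
        k = min(marks, key=lambda m: abs(m - t))
        for j in range(-period, period + 1):
            w = window[j + period]
            out[t + j] += signal[k + j] * w
            weight[t + j] += w
        t += new_period
    # Divide by the summed window so overlapping frames do not change the gain.
    return [o / w if w > 1e-9 else 0.0 for o, w in zip(out, weight)]


# Usage: a toy periodic signal with a 50-sample period, pitch raised one octave.
period = 50
toy = [math.sin(2 * math.pi * n / period) + 0.5 * math.sin(4 * math.pi * n / period)
       for n in range(600)]
raised = psola_constant_pitch(toy, period, 2.0)
```

With `factor` equal to 1.0 the synthesis marks coincide with the analysis marks and the sketch reproduces the input between the first and last pitch mark; with `factor` equal to 2.0 the frames are placed every half period, so the output repeats twice as often.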

Keywords

Text-to-speech, concatenative synthesis, syllable, TD-PSOLA, CV-assimilated, prosodic modification, unit selection
