Bidirectional Long-short Term Memory Based Text to Speech Synthesis for Amharic Language

dc.contributor.advisor: Assabie, Yaregal (PhD)
dc.contributor.author: Awel, Mahlet
dc.date.accessioned: 2021-04-13T08:38:11Z
dc.date.accessioned: 2023-11-29T04:06:27Z
dc.date.available: 2021-04-13T08:38:11Z
dc.date.available: 2023-11-29T04:06:27Z
dc.date.issued: 2020-12-07
dc.description.abstract: Text-to-speech (TTS) synthesis is the automatic conversion of written text to spoken language. TTS systems play an important role in natural human-computer interaction. The aim of this work is to develop a bidirectional long short-term memory (BLSTM) based TTS system for the Amharic language. The system has two phases: training and synthesis. In the training phase, the text is first normalized, and linguistic features are then extracted from the normalized text using the Festival tool; these features are the input to the BLSTM-based duration model. Once the duration model is trained, it adds duration information to the extracted linguistic features and feeds the result to the BLSTM-based acoustic model. The WORLD vocoder extracts acoustic frames whose features give a more convenient description of the speech signal, and these frames are also used as input for the acoustic model. The acoustic model is trained to map the input linguistic features and their associated duration features to acoustic features. We prepared 600 speech recordings and their corresponding text transcriptions from an Amharic audio Bible read by a male speaker. This work uses the open-source Merlin speech synthesis toolkit, with the Festival speech synthesis tool as the frontend and the WORLD vocoder. We also prepared a pronunciation dictionary (lexicon) of 2500 words, a phone set, letter-to-sound rules, and a question file set for frontend text processing, based on the phonetic structure of the Amharic language. To test the performance of the system, we performed subjective and objective evaluations. A listening test with 10 volunteers gave mean opinion scores (MOS) of 3.8 for intelligibility and 3.9 for naturalness for the BLSTM model, and 3.65 for intelligibility and 3.7 for naturalness for the DNN model. The mel-cepstral distortion (MCD) of the BLSTM and DNN models is 4.68 and 4.7, respectively. [en_US]
dc.identifier.uri: http://etd.aau.edu.et/handle/123456789/26113
dc.language.iso: en [en_US]
dc.publisher: Addis Ababa University [en_US]
dc.subject: Deep Learning [en_US]
dc.subject: Recurrent Neural Networks [en_US]
dc.subject: Long-Short Term Memory [en_US]
dc.subject: Duration Model [en_US]
dc.subject: Acoustic Model [en_US]
dc.subject: Vocoder [en_US]
dc.subject: Linguistic Features [en_US]
dc.subject: Acoustic Features [en_US]
dc.title: Bidirectional Long-short Term Memory Based Text to Speech Synthesis for Amharic Language [en_US]
dc.type: Thesis [en_US]
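
The abstract reports two kinds of scores: a mean opinion score (MOS) from listeners and a mel-cepstral distortion (MCD) between natural and synthesized speech. The record does not say how the thesis computed either one, so the following is only a minimal sketch under stated assumptions: MOS is taken as the plain average of listener ratings, and MCD uses the commonly cited per-frame definition in dB, skipping coefficient 0 and assuming the two frame sequences are already time-aligned. The function names and the toy values are hypothetical and are not taken from the thesis.

```python
import math


def mean_opinion_score(ratings):
    """Average of listener ratings (assumed here to be on a 1-5 scale)."""
    if not ratings:
        raise ValueError("expected at least one rating")
    return sum(ratings) / len(ratings)


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over frames that are assumed to be time-aligned.

    Each frame is a sequence of mel-cepstral coefficients. Coefficient 0
    (the energy term) is skipped, which is a common convention and an
    assumption here, not something the record states.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("expected two non-empty, equally long frame sequences")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared_diff = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(2.0 * squared_diff)
    return total / len(ref_frames)


# Toy values for illustration only; they are not data from the thesis.
toy_ratings = [4, 4, 3, 4]
toy_reference = [[1.0, 0.5, 0.2], [1.1, 0.4, 0.1]]
toy_synthesized = [[0.9, 0.5, 0.2], [1.0, 0.4, 0.1]]

print(mean_opinion_score(toy_ratings))
print(mel_cepstral_distortion(toy_reference, toy_synthesized))
```

Lower MCD means the synthesized spectra are closer to the natural ones, while higher MOS means listeners rated the speech more favorably, which is why the two measures point in opposite directions when comparing the BLSTM and DNN models.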

Files

Original bundle
Name: Mahlet Awel 2020.pdf
Size: 1.79 MB
Format: Adobe Portable Document Format