Environmental Science
Browsing Environmental Science by Subject "Acoustic Features"
Now showing 1 - 2 of 2
Item Bidirectional Long-short Term Memory Based Text to Speech Synthesis for Amharic Language (Addis Ababa University, 2020-12-07) Awel, Mahlet; Assabie, Yaregal (PhD)
Text-to-speech (TTS) synthesis is the automatic conversion of written text to spoken language, and TTS systems play an important role in natural human-computer interaction. The aim of this work is to develop a bidirectional long short-term memory (BLSTM) based TTS system for the Amharic language. The system has two phases: training and synthesis. In the training phase, the text is first normalized, linguistic features are extracted from the normalized text using the Festival tool, and the extracted features are used as input for the BLSTM-based duration model. The trained duration model adds duration information to the extracted linguistic features, which are then fed to the BLSTM-based acoustic model. The WORLD vocoder extracts acoustic frames whose features describe the speech signal in a compact form; these are used in training the acoustic model, which learns to map the input linguistic and duration features to acoustic features. We prepared 600 speech utterances with their corresponding text transcriptions from an Amharic audio Bible read by a male speaker. For this work, the open-source Merlin speech synthesis toolkit, the Festival speech synthesis system as a frontend, and the WORLD vocoder were used. We also prepared a pronunciation dictionary (lexicon) of 2,500 words, a phone set, letter-to-sound rules, and a question file set for frontend text processing based on the phonetic structure of Amharic. To test the performance of the system, we performed subjective and objective evaluations. A listening test with 10 volunteers gave mean opinion scores (MOS) of 3.8 for intelligibility and 3.9 for naturalness for our BLSTM model, and 3.65 for intelligibility and 3.7 for naturalness for our DNN model; the mel-cepstral distortion (MCD) of the BLSTM and DNN models was 4.68 and 4.7, respectively.
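To make the pipeline concrete, the sketch below shows what a BLSTM-based acoustic model of this kind can look like. It is a minimal illustration written with TensorFlow/Keras rather than the Merlin toolkit the thesis actually uses, and the feature dimensions are hypothetical placeholders, not values taken from the work.

```python
# A minimal sketch of a frame-level BLSTM acoustic model: linguistic +
# duration features in, WORLD acoustic features out. Keras stands in for
# Merlin here; dimensions and layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

N_LINGUISTIC = 420  # assumed size of linguistic + duration feature vector
N_ACOUSTIC = 187    # assumed size of WORLD acoustic feature vector

def build_blstm_acoustic_model() -> tf.keras.Model:
    # One acoustic frame is predicted per input frame, so both recurrent
    # layers return full sequences.
    inputs = layers.Input(shape=(None, N_LINGUISTIC))
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(inputs)
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    outputs = layers.TimeDistributed(layers.Dense(N_ACOUSTIC))(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")  # frame-wise regression loss
    return model
```

The duration model described in the abstract could follow the same pattern, with phone-level linguistic features as input and a low-dimensional duration output.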
Item Enhanced Robustness for Speech Emotion Recognition: Combining Acoustic and Linguistic Information (Addis Ababa University, 2017-10-05) Sinishaw, Hana; Midekso, Dida (PhD)
Affective computing is the area of Artificial Intelligence that focuses on the design and development of intelligent devices that can perceive, process, and synthesize human emotions. Humans interpret emotions in a number of ways, for example by processing spoken utterances, non-verbal cues, facial expressions, and written communication. Changes in our nervous system indirectly alter spoken utterances, which makes it possible for people to perceive how others feel by listening to them speak. These changes can also be interpreted by machines through the extraction of speech features. The field of speech emotion recognition (SER) takes advantage of this capability and has offered many approaches to recognizing affect in spoken utterances. The majority of state-of-the-art SER systems employ complex statistical algorithms to model the relationship between acoustic parameters extracted from spoken language. Studies also show that phrases, word senses, and syntactic relations that convey the linguistic attributes of a language play an important role in improving prediction rates.
Our research focuses on this problem of recognizing affect in spoken utterances and contributes to state-of-the-art systems by adding linguistic knowledge to enhance their efficiency, rather than relying on speech utterances alone. In this work, a speech emotion recognition system is developed for the Amharic language based on acoustic and linguistic features. Classification performance depends on the extracted features: we used a baseline set of 384 acoustic features, and for linguistic analysis of the text we used keyword spotting, negation handling, and sentiment analysis with emotion-generation rules. Combining these features, we achieved an accuracy of 64.2% in identifying Happiness, Surprise, Anger, Sadness, Fear, Disgust, and Neutral emotions.
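The sketch below illustrates one way the acoustic and linguistic features described above could be fused for classification. It is a minimal example assuming scikit-learn; the emotion lexicon and the SVM classifier are illustrative assumptions (the abstract does not name a classifier), and negation handling is omitted for brevity.

```python
# A minimal sketch of early fusion: concatenating a 384-dim acoustic feature
# vector with keyword-spotting counts derived from the transcript. Lexicon
# contents and the SVM back-end are hypothetical, not from the thesis.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

EMOTION_LEXICON = {  # hypothetical keyword lists per emotion class
    "happiness": {"glad", "wonderful", "joy"},
    "anger": {"furious", "hate", "outrage"},
    "sadness": {"grief", "miserable", "cry"},
}

def linguistic_features(transcript: str) -> np.ndarray:
    """Keyword-spotting counts: one count per emotion class in the lexicon."""
    tokens = transcript.lower().split()
    return np.array(
        [sum(tok in keywords for tok in tokens)
         for keywords in EMOTION_LEXICON.values()],
        dtype=float,
    )

def fused_features(acoustic: np.ndarray, transcript: str) -> np.ndarray:
    # Early fusion: append the lexicon-derived counts to the acoustic vector.
    return np.concatenate([acoustic, linguistic_features(transcript)])

classifier = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
# classifier.fit(X_train, y_train), where each row of X_train is built
# with fused_features(acoustic_vector, transcript).
```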