Development of Text-to-Speech Synthesis Model for Afaan Oromoo Using Transformer Neural Network
dc.contributor.advisor | Yaregal Assabie (PhD) | |
dc.contributor.author | Bayisa Bedasa | |
dc.date.accessioned | 2025-08-17T21:57:17Z | |
dc.date.available | 2025-08-17T21:57:17Z | |
dc.date.issued | 2025-03 | |
dc.description.abstract | Text-to-speech (TTS) is the process of converting written text into spoken words. A TTS system analyzes the incoming text, processes its linguistic content, and produces audio output algorithmically. TTS systems are widely used in applications such as virtual assistants, accessibility tools for individuals with visual impairments, and language learning software. Afaan Oromoo is a Cushitic language spoken mainly in Ethiopia and in other parts of Africa, and it serves as an essential means of communication for the Oromo people. Developing a TTS system for Afaan Oromoo is essential for enhancing accessibility and promoting the use of the language in digital environments. This study develops a transformer-based neural network model for Afaan Oromoo TTS. The model has an encoder-decoder architecture: the encoder converts the input text into a contextualized representation, and the decoder generates speech waveforms from this representation. We enhanced the model with multi-head attention mechanisms to capture long-range dependencies in the input text, improving prosody. Additionally, we employed a HiFi-GAN-based vocoder to convert the model's output into high-fidelity audio waveforms, enhancing the overall quality of the synthesized speech. The implementation is carried out in Python. We prepared a 17-hour audio dataset, recorded by a male speaker, with corresponding text transcriptions from the Afaan Oromoo speech corpus. Naturalness and intelligibility were assessed subjectively using the Mean Opinion Score (MOS). The transformer-based architecture outperformed the previous BLSTM-RNN-based model for Afaan Oromoo TTS, which scored 3.77 for intelligibility and 3.76 for naturalness. 
Experimental results indicate that our transformer-based TTS system achieved a MOS of 4.21 for naturalness and 4.23 for intelligibility, reflecting a commendable performance level. Our model also supports prosody modeling through user input parameters to generate deterministic speech, positioning it as a state-of-the-art solution. | |
dc.identifier.uri | https://etd.aau.edu.et/handle/123456789/6883 | |
dc.language.iso | en_US | |
dc.publisher | Addis Ababa University | |
dc.subject | Afaan Oromoo | |
dc.subject | Text-to-Speech | |
dc.subject | Transformer Neural Network | |
dc.subject | Speech Synthesis | |
dc.subject | End-to-End Architecture | |
dc.title | Development of Text-to-Speech Synthesis Model for Afaan Oromoo Using Transformer Neural Network | |
dc.type | Thesis |
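
Note on the evaluation metric: the abstract reports Mean Opinion Scores for naturalness and intelligibility. A MOS is the arithmetic mean of listener ratings, conventionally on a 1-to-5 scale. The following is a minimal Python sketch of that calculation, not the thesis's evaluation code. The function name `mean_opinion_score`, the normal-approximation 95% confidence interval, and the sample ratings are all illustrative assumptions and do not come from the thesis.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for a list of 1-5 listener ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = mean(ratings)
    # Normal-approximation interval; 1.96 is the z-value for 95% coverage.
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings for illustration only; NOT data from the thesis.
example_ratings = [4, 5, 4, 4, 3, 5, 4, 5]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting an interval alongside the mean is a common practice when comparing two systems whose scores are close, since it indicates how much of the gap could be listener variation.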