Development of Text-to-Speech Synthesis Model for Afaan Oromoo Using Transformer Neural Network

Bayisa Bedasa

dc.contributor.advisor	Yaregal Assabie (PhD)
dc.contributor.author	Bayisa Bedasa
dc.date.accessioned	2025-08-17T21:57:17Z
dc.date.available	2025-08-17T21:57:17Z
dc.date.issued	2025-03
dc.description.abstract	Text-to-speech (TTS) is the process of converting written text into spoken words. A TTS system analyzes the incoming text, processes its linguistic content, and produces audio output. Such systems are widely used in virtual assistants, accessibility tools for people with visual impairments, and language learning software. Afaan Oromoo is a Cushitic language spoken mainly in Ethiopia and in other parts of Africa, and it serves as an essential means of communication for the Oromo people. Developing a TTS system for Afaan Oromoo is therefore important for improving accessibility and promoting the use of the language in digital environments. This study presents a transformer-based neural network model for Afaan Oromoo TTS. The model follows an encoder-decoder architecture: the encoder converts the input text into a contextualized representation, and the decoder generates speech from this representation. Multi-head attention mechanisms capture long-range dependencies in the input text, which improves prosody. In addition, a HiFi-GAN-based vocoder converts the model's output into high-fidelity audio waveforms, improving the overall quality of the synthesized speech. The model is implemented in Python. We prepared a dataset of 17 hours of audio recorded by a male speaker, together with the corresponding text transcriptions, from the Afaan Oromoo speech corpus. Naturalness and intelligibility were assessed subjectively using the Mean Opinion Score (MOS). The previous Afaan Oromoo TTS model, based on BLSTM-RNN, scored 3.77 for intelligibility and 3.76 for naturalness.
Experimental results indicate that our transformer-based TTS system achieved a MOS of 4.21 for naturalness and 4.23 for intelligibility, outperforming the BLSTM-RNN model on both measures. Our model also supports prosody modeling with user input parameters to generate deterministic speech, positioning it as a state-of-the-art solution.
dc.identifier.uri	https://etd.aau.edu.et/handle/123456789/6883
dc.language.iso	en_US
dc.publisher	Addis Ababa University
dc.subject	Afaan Oromoo
dc.subject	Text-to-Speech
dc.subject	Transformer Neural Network
dc.subject	Speech Synthesis
dc.subject	End-to-End Architecture
dc.title	Development of Text-to-Speech Synthesis Model for Afaan Oromoo Using Transformer Neural Network
dc.type	Thesis
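
The abstract reports Mean Opinion Scores for naturalness and intelligibility. As a minimal sketch of how such a score is conventionally computed, the Python snippet below averages listener ratings on a 1-5 scale and adds a normal-approximation 95% confidence interval. The function name, the confidence interval, and the sample ratings are illustrative assumptions; they are not taken from the thesis, which does not describe its evaluation procedure in this record.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for a list of 1-5 listener ratings.

    Hypothetical helper for illustration; not the thesis's evaluation code.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = mean(ratings)
    # Normal-approximation interval; width is zero when only one rating exists.
    if len(ratings) > 1:
        half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    else:
        half_width = 0.0
    return mos, half_width


# Made-up ratings for a single utterance (assumption, not thesis data).
naturalness = [4, 5, 4, 4, 5, 4, 3, 5]
mos, ci = mean_opinion_score(naturalness)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

In practice, ratings would be collected per listener and per utterance and averaged separately for each criterion, such as naturalness and intelligibility.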

Files

Original bundle

Name: Bayisa Bedasa 2025.pdf
Size: 879.78 KB
Format: Adobe Portable Document Format

Collections

Computer Science