Amharic Speech Recognition System Using Joint Transformer and Connectionist Temporal Classification with External Language Model Integration
Date
2023-06
Publisher
Addis Ababa University
Abstract
Sequence-to-sequence (S2S) attention-based models are deep neural network models that
have demonstrated remarkable results in automatic speech recognition
(ASR) research. In these models, the Transformer architecture has been
extensively employed to solve a variety of S2S transformation problems, such as machine
translation and ASR. Unlike recurrent neural networks (RNNs), this architecture does not
rely on sequential computation, which gives it a fast iteration rate
during the training phase. However, according to the literature, the Transformer
converges more slowly overall than RNN-based ASR models.
Thus, to accelerate the convergence of the Transformer model, this research proposes
a joint Transformer and connectionist temporal classification (CTC) model for an Amharic
speech recognition system. The research also investigates appropriate recognition units
(characters, subwords, and syllables) for Amharic end-to-end speech recognition systems.
In this study, the accuracy of character- and subword-based end-to-end speech recognition
systems is compared for the target language. For the character-based model
with character-level language model (LM), a best character error rate of 8.84% is reported,
and for the subword-based model with subword-level LM, a best word error rate of 24.61%
is reported. Furthermore, the syllable-based end-to-end model achieves a 7.05% phoneme
error rate and a 13.3% syllable error rate without integrating any LM.
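
The error rates quoted above (character, word, syllable, and phoneme) are conventionally computed as the edit distance between the reference and the recognized sequence, divided by the number of units in the reference. The following is a minimal sketch of that standard definition, not the scoring code used in this research. The function names `edit_distance`, `error_rate`, `cer`, and `wer` are hypothetical, and splitting a string into characters by Unicode code point is an assumption about how character units would be segmented.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of recognition units."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference unit
                curr[j - 1] + 1,     # insertion of a hypothesis unit
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_units: Sequence[str], hyp_units: Sequence[str]) -> float:
    """Edit distance normalized by the reference length."""
    if not ref_units:
        raise ValueError("reference must contain at least one unit")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(ref: str, hyp: str) -> float:
    """Character error rate; assumes one unit per Unicode code point."""
    return error_rate(list(ref), list(hyp))


def wer(ref: str, hyp: str) -> float:
    """Word error rate; assumes words are separated by whitespace."""
    return error_rate(ref.split(), hyp.split())
```

The same `error_rate` function would apply to syllable or phoneme sequences, provided the transcripts are first segmented into those units; that segmentation step is language-specific and is not shown here.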