Amharic Speech Recognition System Using Joint Transformer and Connectionist Temporal Classification with External Language Model Integration

Date

2023-06

Publisher

Addis Ababa University

Abstract

Sequence-to-sequence (S2S) attention-based models are deep neural network models that have produced remarkable results in automatic speech recognition (ASR) research. Among these models, the Transformer architecture has been widely used to solve a variety of S2S transformation problems, such as machine translation and ASR. Unlike recurrent neural networks (RNNs), this architecture does not rely on sequential computation, which gives it a fast iteration rate during training. However, according to the literature, the overall convergence of the Transformer is slower than that of RNN-based ASR models. Thus, to accelerate the convergence of the Transformer model, this research proposes a joint Transformer and connectionist temporal classification (CTC) model for an Amharic speech recognition system. The research also investigates appropriate recognition units (characters, subwords, and syllables) for Amharic end-to-end speech recognition systems. In this study, the accuracy of character- and subword-based end-to-end speech recognition systems is compared for the target language. For the character-based model with a character-level language model (LM), a best character error rate of 8.84% is reported, and for the subword-based model with a subword-level LM, a best word error rate of 24.61% is reported. Furthermore, the syllable-based end-to-end model achieves a 7.05% phoneme error rate and a 13.3% syllable error rate without integrating any LM.
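
The error rates quoted in the abstract (character, word, syllable, and phoneme error rate) are conventionally computed as the edit distance between the reference and hypothesis token sequences, divided by the number of reference tokens. The following is a minimal sketch of that conventional definition, not the evaluation code used in this work. The function names, the whitespace word tokenizer, and the choice to ignore spaces when counting characters are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs: Sequence[str], hyps: Sequence[str],
               tokenize: Callable[[str], List[str]]) -> float:
    """Corpus-level error rate in percent for a given tokenization."""
    errors = 0
    total = 0
    for ref, hyp in zip(refs, hyps):
        ref_tokens, hyp_tokens = tokenize(ref), tokenize(hyp)
        errors += edit_distance(ref_tokens, hyp_tokens)
        total += len(ref_tokens)
    if total == 0:
        raise ValueError("reference transcripts contain no tokens")
    return 100.0 * errors / total


def word_tokens(text: str) -> List[str]:
    # Assumption: words are separated by whitespace.
    return text.split()


def char_tokens(text: str) -> List[str]:
    # Assumption: spaces are not counted as characters.
    return list(text.replace(" ", ""))
```

Calling `error_rate(refs, hyps, word_tokens)` gives a word error rate and `error_rate(refs, hyps, char_tokens)` a character error rate. A syllable or phoneme error rate would follow the same pattern with a tokenizer that splits each transcript into those units.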
