Automatic Classification of Ethiopian Traditional Music Using Audio-Visual Features and Deep Learning
Date
2020-06-06
Publisher
Addis Ababa University
Abstract
Music bridges linguistic and cultural gaps and helps connect people. Ethiopia is a country with more than 80 tribes, each with its own unique musical sound and style of dance. Distinguishing one from another is not an easy task, especially in the era of streaming, where large amounts of music are recorded and released through the Internet every day.
Deep learning, a recent subfield of machine learning, emerged to automate tedious classification tasks that previously required programmers to craft classification rules by hand. Deep learning algorithms learn these rules automatically, simply by looking at the data.
In this work, we address the automatic classification of Ethiopian traditional music by locality of origin using audio-visual features. To achieve this, we use a deep neural network architecture that combines convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The architecture has an audio feature extraction component, composed of a parallel deep CNN and RNN that takes the mel-spectrogram of the audio signal as input, and a video feature extraction component. The video component uses transfer learning to extract visual features from a pre-trained network (VGG-16), then passes these features to a Long Short-Term Memory (LSTM) recurrent network so that sequential information is captured. Features from both components are then merged, and the class of the music video is predicted.
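The sketch below shows one way such an architecture could be wired up in Keras. It is a minimal sketch under stated assumptions, not the thesis's actual implementation: the layer sizes, mel-spectrogram shape, number of sampled frames, and number of output classes (NUM_CLASSES, MEL_BINS, MEL_FRAMES, FRAMES) are illustrative values, not reported hyperparameters.

```python
# Minimal sketch of the described audio-visual architecture in Keras.
# All shapes and layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 4                  # assumption: number of localities
MEL_BINS, MEL_FRAMES = 128, 646  # assumption: mel-spectrogram shape
FRAMES, H, W = 20, 224, 224      # assumption: video frames sampled per clip

# --- Audio branch: parallel CNN and RNN over the mel-spectrogram ---
audio_in = layers.Input(shape=(MEL_BINS, MEL_FRAMES, 1), name="mel_spectrogram")

# CNN path: treats the spectrogram as a 2-D image
x = layers.Conv2D(32, 3, activation="relu", padding="same")(audio_in)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
x = layers.GlobalAveragePooling2D()(x)

# RNN path: treats the spectrogram as a sequence of spectral frames
r = layers.Permute((2, 1, 3))(audio_in)        # -> (time, mel, 1)
r = layers.Reshape((MEL_FRAMES, MEL_BINS))(r)
r = layers.Bidirectional(layers.GRU(64))(r)

audio_feat = layers.Concatenate()([x, r])

# --- Video branch: frozen VGG-16 per frame, then an LSTM over time ---
vgg = tf.keras.applications.VGG16(weights="imagenet",
                                  include_top=False, pooling="avg")
vgg.trainable = False                          # transfer learning: freeze weights

video_in = layers.Input(shape=(FRAMES, H, W, 3), name="video_frames")
v = layers.TimeDistributed(vgg)(video_in)      # per-frame 512-d features
v = layers.LSTM(128)(v)                        # sequential (motion) information

# --- Fusion: merge both modalities and classify ---
merged = layers.Concatenate()([audio_feat, v])
out = layers.Dense(NUM_CLASSES, activation="softmax")(merged)

model = models.Model([audio_in, video_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Freezing the VGG-16 weights keeps the transfer-learned frame extractor fixed while only the RNN and fusion layers are trained, mirroring the transfer-learning setup described above.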
We ran an experiment to evaluate the performance of the proposed system. We collected music data representing Ethiopian traditional music from Internet-based music archives such as YouTube and from personal music collections. After passing the collected data through a pre-processing step, we trained the proposed audio-visual system as well as systems that use only the visual or only the audio features. The video-only classifier achieved 78% accuracy and the audio-only classifier 85%; adding audio features to the video-only classifier increased accuracy by 7 percentage points, bringing the proposed system's performance to 85%.
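For context, the sketch below shows a typical pre-processing step for this kind of pipeline, assuming librosa for the audio track and OpenCV for frame sampling. The thesis does not state the exact sample rate, mel-bin count, frame count, or frame size, so the values and function names here are assumptions.

```python
# Hedged sketch of audio-visual pre-processing: log mel-spectrogram
# extraction plus uniform video frame sampling. Parameter values are
# illustrative assumptions, not the thesis's reported settings.
import numpy as np
import librosa
import cv2

def mel_spectrogram(audio_path, sr=22050, n_mels=128):
    """Load audio and return a log-scaled mel-spectrogram (n_mels x time)."""
    y, sr = librosa.load(audio_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

def sample_frames(video_path, num_frames=20, size=(224, 224)):
    """Uniformly sample num_frames RGB frames, resized for VGG-16 input."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(cv2.resize(frame, size), cv2.COLOR_BGR2RGB)
        frames.append(frame)
    cap.release()
    return np.stack(frames)
```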
Keywords
Deep Learning, CNN, RNN, Transfer Learning, Music Information Retrieval, Music Processing, Dance Recognition